Methods xxx (2014) xxx–xxx


Methods journal homepage: www.elsevier.com/locate/ymeth

Protein–protein interaction predictions using text mining methods
Nikolas Papanikolaou, Georgios A. Pavlopoulos, Theodosios Theodosiou, Ioannis Iliopoulos ⇑
Division of Basic Sciences, School of Medicine, University of Crete, Heraklion 71003, Greece

Article info

Article history:
Received 2 April 2014
Received in revised form 5 September 2014
Accepted 21 October 2014
Available online xxxx

Keywords:
Protein–protein interaction prediction
Text mining
Computational tools

Abstract
Proteins and their interactions play an essential role in most complex biological processes. Understanding their function, both individually and as part of protein complexes, is of great importance. Alongside the many high-throughput experimental approaches for detecting protein–protein interactions, numerous computational methods that aim to predict new interactions have appeared and gained interest. In this review, we focus on text-mining based computational methodologies, which aim to extract information about proteins and their interactions from public repositories such as the literature and various biological databases. We discuss their strengths, their weaknesses and how they complement existing experimental techniques, and we also comment on the biological databases that hold such information and on the benchmark datasets that can be used for evaluating new tools.
© 2014 Published by Elsevier Inc.

⇑ Corresponding author. E-mail address: [email protected] (I. Iliopoulos).

1. Introduction
Proteins are the molecules that facilitate most biological processes in a cell. While most known proteins are characterized by a unique function, many of them act in coordination with others, forming protein networks in order to deliver complex actions. Two proteins, for example, may interact directly through their physical proximity or by being members of the same protein complex [1]. At a systems biology level, the correct identification of Protein–Protein Interactions (PPIs) is of key importance for understanding the complex mechanisms in a cell. Such processes include cell cycle control, differentiation, protein folding, signal transduction, transcription, translation, post-translational modification and transport. Today, in order to better understand such systems, relatively new high-throughput methods are used to reveal protein interaction networks [2,3]. The yeast two-hybrid system (Y2H), or two-hybrid screening, for example, has been used for more than twenty years, mainly to detect binary interactions [4,5]. Other experimental methods for PPI identification include protein microarrays [6] (including reverse phase protein arrays [7]), pull-down assays [8], tandem affinity purification (TAP) [9], immunoaffinity chromatography (affinity purification) in conjunction with mass spectrometry [10], dual polarization interferometry (DPI) [11], microscale thermophoresis [12], phage display [13,14] and protein complex immunoprecipitation (Co-IP) [15]. In addition, some other

methods take advantage of X-ray crystallography [16] and Nuclear Magnetic Resonance (NMR) spectroscopy [17]. While most of the aforementioned high-throughput techniques have proven very valuable in driving a huge growth of experimentally verified PPIs [18], they come with several shortcomings: findings are often fragmentary or inconclusive, and are accompanied by high false positive and false negative rates [19]. In addition, most of the experiments can become quite costly and time consuming [20]. Therefore, algorithmic PPI predictions have become a necessity, as they can provide strong indications about putative PPIs and thus help steer experimental verification in the right direction.

Non-text-mining prediction methods for PPIs vary widely depending on the strategy they follow to infer putative interactions. Accordingly, those methods can be categorized depending on whether the prediction is based on protein sequence, protein structure, genomic context, homology, experimental profiles or literature-derived associations [21]. In the case of sequences, prediction tools use artificial intelligence and machine learning approaches [22,23] to predict protein interactions through their sequence or structural characteristics [24], such as shared binding partners [25], domains [26,27] or neighboring residues [28]. Homology based prediction tools try to detect evolutionary relationships between proteins, taking into account their structures or sequences, as many known protein interactions are conserved across species [29]. These methods are often combined to complement each other and to provide additional physical details about the interactions as more and more structures become available over time [30]. According to the first subcategory of

http://dx.doi.org/10.1016/j.ymeth.2014.10.026
1046-2023/© 2014 Published by Elsevier Inc.

Please cite this article in press as: N. Papanikolaou et al., Methods (2014), http://dx.doi.org/10.1016/j.ymeth.2014.10.026


genomic context based prediction tools, two proteins are assumed to interact when the relative genomic locations of their genes are conserved [31,32]. Others examine gene fusion events [33], taking a fusion as an indication that the corresponding proteins are functionally related, something that has been experimentally verified in many cases [34]. Lastly, many genomic context based prediction tools use phylogenetic profiling and rest on the hypothesis that proteins involved in common pathways co-evolve in a correlated fashion across large numbers of species [35,36].

Text-mining based techniques, on the other hand, try to automate the extraction of interconnected proteins through their co-occurrence in sentences, abstracts or paragraphs within text corpora. This can be done by searching for statistically significant co-occurrences between gene names [37] in public repositories and online resources. Such approaches are very promising as they significantly expand the available proteome coverage, which existing experimental approaches currently achieve only partially [38,39]. More complex Text Mining (TM) methodologies use advanced dictionaries and generate networks by Natural Language Processing (NLP) of text, treating gene names as nodes and verbs as edges, which gives the resulting graphs a semantic interpretation. Notably, even newer developments use kernel methods to predict protein interactions from literature [40,41]. While the available tools follow different concepts for predicting PPIs, a combination of the aforementioned methods, along with meta-methods that combine the results of the presented tools, is preferable [42–44]. This review focuses on PPI extraction through Text Mining methods, as they are gaining importance in a large array of biological fields [45,46].
We discuss the advantages and disadvantages of the tools made available over the past decade and comment on how they perform information extraction, protein entity recognition and linking on various types of textual collections, such as Medline abstracts or other biological databases that contain textual information. We hope that this review will serve as a useful guide for researchers in the field.
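The co-occurrence idea mentioned above can be illustrated with a minimal sketch. It assumes that gene names have already been recognized in each abstract and that a one-sided hypergeometric test is used to score each pair; the reviewed tools use their own scoring schemes, and the function names and toy abstracts below are purely illustrative.

```python
from collections import Counter
from itertools import combinations
from math import comb


def cooccurrence_pvalue(n_docs, n_a, n_b, n_ab):
    """Hypergeometric tail P(X >= n_ab).

    Probability of seeing at least n_ab shared documents if the n_b
    documents mentioning gene B were drawn at random from n_docs
    documents, n_a of which mention gene A.
    """
    upper = min(n_a, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_docs - n_a, n_b - k)
        for k in range(n_ab, upper + 1)
    )
    return tail / comb(n_docs, n_b)


def score_pairs(docs):
    """docs: list of sets of gene names found in each abstract."""
    n_docs = len(docs)
    single = Counter()  # number of documents mentioning each gene
    pair = Counter()    # number of documents mentioning both genes
    for names in docs:
        single.update(names)
        pair.update(combinations(sorted(names), 2))
    return {
        (a, b): cooccurrence_pvalue(n_docs, single[a], single[b], n_ab)
        for (a, b), n_ab in pair.items()
    }


# Toy corpus (hypothetical): each set stands for one tagged abstract.
abstracts = [
    {"BRCA1", "BARD1"},
    {"BRCA1", "BARD1"},
    {"BRCA1", "BARD1", "TP53"},
    {"TP53", "MDM2"},
    {"TP53"},
    {"MDM2"},
]
scores = score_pairs(abstracts)
for genes, p in sorted(scores.items(), key=lambda item: item[1]):
    print(genes, round(p, 3))
```

Pairs with a small tail probability co-occur more often than chance would suggest. On a real corpus the resulting p-values would also need correction for multiple testing, which this sketch omits.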

2. Text mining tools
In this section, we review tools and databases according to the following criteria: first, we select tools or databases that offer,

among other functionalities, PPI predictions based on Text Mining methods. This means that publications that only describe methods or apply an ad hoc PPI prediction approach are not included [47–50]. Furthermore, databases like PIPS [44] and STITCH [51,52], which contain PPI predictions derived from non-Text Mining methods, are also not included. Widely used systems like DIP [53] and MINT [54], which only contain manually curated data, are briefly described below. Second, we focus only on tools that come with a functional web or standalone interface. As a result, tools like SUISEKI [55], one of the first approaches in the field, and AkaneRE [56] are not reviewed. Similarly, databases such as BOND (formerly known as BIND [57]) are not included in the review. Lastly, the review focuses on tools that are freely available and not accompanied by a payment scheme. Therefore, systems like MedScan [58] are not included in the review. Following the aforementioned criteria, we have located nineteen tools. We briefly describe each approach in the following paragraphs and also present tables containing URLs, technical features, key characteristics and quality measures for each system (see Table 1, Supplementary Tables 1 and 2).

BioRAT (Biological Research Assistant for Text mining) [59] is a standalone application. Given a typical PubMed query by the user, BioRAT attempts to locate and download full papers or, if that is not possible, abstracts, starting from PubMed and following links from one web page to another. Informative terms, such as proteins and genes, are then highlighted in the collected corpus. BioRAT implements a general purpose Information Extraction (IE) system, the GATE toolbox [60]. It tries to identify bioentities such as proteins, even when their names resemble common English words, using a ‘part-of-speech’ tagger and dictionaries (called ‘gazetteers’).
Following that, it extracts information using predefined or user-defined semantic templates such as ‘interaction of’ (PROTEIN_1) ‘and’ (PROTEIN_2). PPIs are presented in a table along with the textual information (sentences) that led to their identification. BioRAT was evaluated using DIP [53] subsets.

eFIP (Extracting Functional Impact of Phosphorylation) [61] is a Text Mining system focused on mining protein interaction networks of phosphorylated proteins. It employs several NLP techniques in order to locate abstracts that mention protein phosphorylation alongside indicators of PPIs and evidence of the effect of that phosphorylation on the PPI. To that purpose, it integrates tools previously developed by the authors,

Table 1. Text mining-based PPI prediction tools.
Columns: Name | Type | Non-TM | NER | NLP | Co-occurrence | Meta | Dictionaries/ontologies | Corpus | Results | Scoring/ranking scheme | Benchmarking/evaluation

BioRAT eFIP FACTA+ GeneWays HitPredict hPRINT I2D iHOP IMID Negatome openDMAP PCorral PIE the search Polysearch PPIExtractor PPI Finder PPInterFinder

Standalone Online tool Online tool Online tool Online DB Online DB Online DB Online DB Online DB Online DB Standalone Online tool Online tool Online tool Standalone Online tool Online tool

    U U U U  U     U  

U U U U  U  U  U U U U U U U U

U U  U  U  U U U U U U   U U

U U U   U U U U   U U U U U U

     U U  U        

U  U U U U U U U U    U  U U

XML, table Table Table Table Table/graph Table Table/graph List of sentences Table/graph Downloadable list List Table List List Graph List Table

 U U U U U U U U U  U U U U  

U U U U  U  U  U U  U U U U U

PPLook STRING

Standalone Online DB

 U

U U

U U

U U

 U

U U

PubMed abstracts/full-text PubMed abstracts PubMed abstracts Full text articles – Corpus used by used DBs – PubMed abstracts Not specified PubMed abstracts/full-text Any biomedical corpus PubMed abstracts PubMed abstracts Many DBs PubMed abstracts PubMed abstracts PubMed abstracts, integrates data from DBs Any biomedical corpus PubMed full-text

Graph Graph/table

 U

U U


namely eGRAB [62], for protein-specific document retrieval and Named Entity Recognition (NER), along with RLIMS-P [63] for detecting protein phosphorylation. Additionally, using a rule-based system, it detects PPIs that involve phosphorylated proteins and extracts relations. Finally, a rule-based impact module is used to rank the confidence of a temporal/causal relationship between phosphorylation and interaction. The user may input a protein and retrieve the results as a list of relevant PubMed identifiers (IDs) along with extracted information related to phosphorylation and protein interaction. Users may also input a collection of PubMed publications that can be tagged in the same way. eFIP was evaluated using BioCreative [64] datasets.

FACTA+ (Finding Associated Concepts with Text Analysis) [65] is the latest version of the FACTA [66] tool. FACTA+ can use, among other inputs, one or more proteins as a query and retrieve PubMed abstracts using specific word/concept indexes. It also introduces Pivot Concepts (intermediate concepts that can connect two proteins). Concepts found in the documents are then ranked, depending on the query. FACTA+ employs Named Entity Recognition (NER) and co-occurrence analysis along with the GENIA event corpus [67], which allows results to be refined by type of regulation event (e.g. positive regulation). FACTA+ was benchmarked using data from the BioNLP’09 Shared Task [68]. Additionally, it was manually evaluated by a bioNLP researcher.

GeneWays [69] analyzes interactions between molecular substances in signal transduction pathways by performing NER, disambiguating between synonyms and homonyms, and classifying terms. The tagged text is then passed on to the GENIES NLP parser [70], which, given a sentence or a text, creates complex, machine-readable semantic trees. A module then converts these complex output trees into simple binary statements such as ‘interleukin-2 binds interleukin-2 receptor’.
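Binary statements of this kind can also be produced by much simpler pattern matching, closer in spirit to the semantic templates described for BioRAT than to a full parser such as GENIES. The following minimal sketch assumes a tiny hand-made gazetteer and two hypothetical templates written as regular expressions; it does not reproduce the rules of any reviewed tool.

```python
import re

# Toy gazetteer (hypothetical). Real systems use large curated dictionaries.
PROTEINS = ["interleukin-2 receptor", "interleukin-2", "p53", "MDM2"]

# Longest names first, so "interleukin-2 receptor" wins over "interleukin-2".
NAME = "|".join(
    re.escape(p) for p in sorted(PROTEINS, key=len, reverse=True)
)

# Two illustrative templates; the named groups mark the two protein slots.
TEMPLATES = [
    re.compile(
        rf"interaction of (?P<a>{NAME}) (?:and|with) (?P<b>{NAME})", re.I
    ),
    re.compile(
        rf"(?P<a>{NAME}) (?P<verb>binds|inhibits|phosphorylates) "
        rf"(?:to )?(?P<b>{NAME})",
        re.I,
    ),
]


def extract(sentence):
    """Return (protein_1, relation, protein_2) triples found in a sentence."""
    hits = []
    for template in TEMPLATES:
        for match in template.finditer(sentence):
            # Templates without a verb slot fall back to a generic relation.
            verb = match.groupdict().get("verb") or "interacts with"
            hits.append((match.group("a"), verb.lower(), match.group("b")))
    return hits


print(extract("We show that interleukin-2 binds interleukin-2 receptor."))
print(extract("The interaction of p53 and MDM2 is well studied."))
```

Templates like these are precise but brittle: passive voice, negation and coordination all escape them, which is one reason the tools reviewed here combine templates with taggers, parsers or machine learning.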
The resulting statements are stored in the Interaction Knowledge Base and are built on the GeneWays Ontology. Users can search by using a protein name as input and retrieve the results in a table of interactions along with a confidence score. GeneWays was manually evaluated by an expert in molecular biology.

HitPredict [71] is a database of high confidence PPIs integrated from other PPI databases such as IntAct [72] and BioGRID [73]. Interactions from various databases are combined and assigned a confidence score. Scores are calculated in the form of a likelihood ratio using Bayesian networks on evidence from Pfam [74] domains of proteins, Gene Ontology (GO) [75] terms and homologous interactions in other species. Users can search HitPredict for a protein and see putative PPIs along with the type of supporting evidence. HitPredict was evaluated using sets of known and high confidence interactions from a previous analysis [76]. Some predictions were also experimentally verified.

hPRINT [77] uses eighteen types of evidence from various databases and tools to predict interactions and to distinguish functional associations from physical binding. It shows Text Mining based PPI predictions by bringing in evidence from the STRING and GoGene [78] tools. Non-Text Mining features include genomic neighborhood, gene fusion and phylogenetic profiling. Based on these features, the interactions are classified as physical, functional or non-related. The system is trained on the eighteen features, using manually verified data from the Human Protein Reference Database (HPRD) [79], the Comprehensive Resource of Mammalian protein complexes (CORUM) [80] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) [81]. The Random Forests [82] machine learning method was used to assess the probability of these types of interactions based on the various evidence features.

I2D (Interologous Interaction Database) [83] is the successor to OPHID (Online Predicted Human Interaction Database) [84].
It hosts known and predicted PPIs for six eukaryotes and a virus, and is built on high-throughput experimental data. It combines data


from external databases such as UniGene [85], Swiss-Prot [86] and Entrez [85]. I2D mostly uses non-Text Mining predictive methods, i.e. mapping PPIs from one organism to the respective orthologs of another. These conserved interactions are called interologs. The Text Mining aspect of the predictions involves the use of Gene Ontology (GO) terms to annotate the cellular localization of UniProt [87] proteins and estimate their relatedness. I2D can be queried using one or more proteins and the results are presented in a table or a graph.

iHOP (information Hyperlinked Over Proteins) [88] creates connections between sentences from articles contained in PubMed. These connections are made through automatically annotated gene and protein names contained in these sentences. More specifically, genes and proteins function as hyperlinks between the sentences and the abstracts derived from PubMed. In this way, the PubMed corpus is rendered into a navigable resource. Additionally, experimental data is integrated into the network and contributes to the ranking of each sentence. Associating verbs are highlighted and influence the ranking positively. iHOP currently comprises 40,000 genes from 8 species. Aspects of iHOP were manually evaluated. The web interface allows the user to construct and save an undirected PPI network, based on the information from the sentences it retrieves. It should be noted that many other tools use the API of iHOP today as a back-end service.

IMID (Integrated Molecular Interaction Database) [89] collects PPIs (among other data) from manually curated databases, but also automatically extracts pairs of interacting molecules and the type of interaction, e.g. phosphorylation, from literature using a Bayesian Network method [90]. PPI extraction is performed using dictionaries of keywords related to PPIs and PPI-specific linguistic rules. It implements Bayesian Networks for training, and rules are determined by using manually curated text samples.
The rules are based on 12 linguistic features closely related to language rules that describe PPI relationships. Negatome [91] follows a distinctly different approach from all other systems that are described in this review. Instead of storing PPIs, it stores protein and domain pairs that are unlikely to appear in a PPI. This can be useful for providing training and evaluation sets for other prediction systems. Although the data presented to the user is manually curated, Negatome uses Text Mining methods in order to facilitate the curation. Negatome employs Excerbt [92], a tool that offers NLP, Machine Learning and rule-based approaches, in order to pinpoint negated connections. The connections are assigned a confidence score based on five simple characteristics of the sentences such as the word that indicates the negations. Negatome was manually evaluated. openDMAP [93] is a concept analysis and information extraction system that focuses, among others, on protein transport and protein interactions. It creates complex ontologies that are built on top of Protégé [94], an open-source ontology editor and framework. These ontologies are combined with a complex set of patterns that describe protein transport and protein interactions. In addition it implements NER provided by ABNER [95] and LingPipe [96]. Using a biomedical corpus as input, openDMAP can provide a list of possible protein transportation events or PPIs. openDMAP was evaluated using BioCreative [64] datasets. PCorral (Protein Corral) [97] is a web application which combines UniProt [87] gene and protein names with retrieved Medline abstracts. It analyzes the resulting associations using various Text Mining methods, namely NLP, co-occurrence and tri co-occurrence. The user can query for a list of PubMed identifiers (IDs) or other terms. The system allows for complex searching such as fuzzy search or use of wildcards. 
Results are presented in a table of putative associations, where each association is ranked by the aforementioned Text Mining techniques.


PIE the search [98] is offered by NCBI in place of PIE (Protein Interaction information Extraction system) [99], which used to be a web service for extracting PPIs from PubMed or from other manually provided scientific articles. The service currently only acts on Medline articles. It retrieves PPI-related articles, ranked by confidence scores, using Machine Learning to implement co-occurrence and NLP aspects. PIE the search is trained on BioCreative [64] test sets.

Polysearch [100] is a web tool which exploits a variety of techniques in Text Mining and information retrieval to identify, highlight and rank informative abstracts, paragraphs or sentences, combining text from PubMed and from other databases. It uses a ‘bag-of-words’ approach (i.e. syntax and grammar are ignored and only the frequency of each word is taken into account) for mining text and implements multiple thesauri for tagging bioentities. Polysearch presents to the user relationships between human diseases, genes/proteins, mutations (SNPs), drugs, metabolites, pathways, tissues, organs and sub-cellular localizations. The results are presented in the form of a ranked list. Polysearch was evaluated in several ways: using datasets from iHOP [88], against various tools, and using the SPIES corpus [101].

PPIExtractor [102] is a downloadable tool that accepts a list of PubMed abstracts as input and constructs a PPI network by inferring PPIs. It does so by using Feature Coupling Generalization [103] to perform protein name recognition. The names are then normalized and, following that, PPIExtractor combines feature-based and convolution tree kernels to extract PPIs. These are subsequently used to create a visualization of a PPI network in the form of a graph in which nodes represent the proteins and edges the interactions, with weights that represent the reliabilities of the PPIs. PPIExtractor was benchmarked using the AIMED corpus [104].
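The final step described for PPIExtractor, turning extracted pairs into a weighted network, can be sketched with a plain adjacency dictionary. The sketch assumes proteins as nodes and interactions as undirected weighted edges, and it keeps the highest reliability when a pair is reported more than once; that merging rule and the function name are assumptions made for illustration, not the tool's documented behavior.

```python
from collections import defaultdict


def build_network(extracted):
    """Build an undirected weighted graph from extracted PPIs.

    extracted: iterable of (protein_a, protein_b, reliability) tuples,
    with reliability assumed to lie in [0, 1].
    Returns {protein: {partner: reliability}}.
    """
    graph = defaultdict(dict)
    for a, b, reliability in extracted:
        if a == b:
            continue  # self-interactions are skipped in this sketch
        # Keep the strongest evidence seen so far for this pair.
        best = max(reliability, graph[a].get(b, 0.0))
        graph[a][b] = best
        graph[b][a] = best
    return dict(graph)


# Hypothetical extraction output for illustration.
pairs = [
    ("A", "B", 0.4),
    ("B", "A", 0.9),
    ("A", "C", 0.5),
    ("C", "C", 1.0),
]
network = build_network(pairs)
print(network)
```

An adjacency dictionary is enough for lookups and simple traversal; a dedicated graph library would be a natural choice once layout or network statistics are needed.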
PPI Finder [105] is a web-based tool which extracts human PPIs using PubMed abstracts indexed by gene2pubmed [106] in NCBI Entrez Gene. PPI Finder employs an NLP tool, the GENIA tagger [107]. It identifies co-occurrence frequencies and interaction words, i.e. words that are commonly used to describe PPIs in biomedical literature. These results are integrated with data from HPRD [79], BioGRID [73] and GO [75] and presented in a table format. PPI Finder was evaluated using 29 genes listed in the HPSD database [108] and 100 pairs of PPIs from the HPRD database.

PPInterFinder [109] acts on PubMed abstracts and uses semantic co-occurrence of relation keywords with protein names by employing a relation dictionary. NER is performed with the human-specific protein module called NAGGNER [110], with ProNormz [111] used for normalization. A rule-based approach is then applied in order to extract putative PPIs. In general, the rules are applied to sentences containing more than one protein and examine the relative position and type of other words contained in those sentences. PPInterFinder was evaluated using the AIMED, HPRD50 [112] and IntAct [72] corpora.

PPLook [113] is a standalone tool which extracts PPIs by starting from a query protein and finding sentences that contain the protein of interest. PPLook uses the GENIA Tagger [107] for NER. PPIs are located by a pattern-matching algorithm using a small dictionary of keywords containing terms that define protein relationships, e.g. ‘inhibit’. Results are visualized in a 3D graph

based on OpenGL. Benchmarking was performed with the aid of the GENIA V3.02 Corpus [114].

STRING [115] is an online database that contains experimentally verified and putative PPIs. Putative PPIs are inferred from a variety of evidence sources, most of them non-Text Mining, such as genomic context, experiments, co-expression, literature and public repositories. Several of these processes encompass Text Mining aspects such as NER (for orthographic variations of bioentities), NLP that implements a plethora of modules [116] and co-occurrence (for deriving associations between proteins). As of version 9.1, STRING parses the full text of publications. Currently, STRING covers 5,214,234 proteins from 1133 organisms. STRING was evaluated using KEGG [81] datasets.

It is worth mentioning that evaluating the quality of such tools is not a trivial task, as no gold standard benchmark dataset exists and the currently available datasets that are often used for such purposes vary significantly in content. Therefore, the quality measures reported for the individual tools are not directly comparable. Nevertheless, key features of the tools and any available quality measures such as specificity, sensitivity, precision, recall and F-scores, along with the respective datasets, are collected from the original papers (when available) and summarized in Supplementary Table 1.
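For reference, the precision, recall and F-score figures mentioned above are computed from the overlap between predicted and reference interaction pairs. A minimal sketch follows, assuming undirected pairs (so that A–B and B–A count as the same interaction) and the balanced F1 variant of the F-score; the function name and example pairs are illustrative.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, F1) for predicted vs. reference PPIs.

    Both arguments are iterables of 2-tuples of protein identifiers.
    Pairs are treated as unordered.
    """
    pred = {frozenset(pair) for pair in predicted}
    ref = {frozenset(pair) for pair in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical predictions and reference set.
predicted = [("A", "B"), ("B", "C"), ("D", "A")]
gold = [("B", "A"), ("C", "D"), ("A", "D"), ("E", "F")]
precision, recall, f1 = evaluate(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Because the reference set changes from one benchmark corpus to another, the same predictions can yield very different values, which is why measures reported on different datasets should not be compared directly.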

3. PPI databases
Protein interaction data is stored in specialized public databases which vary in size. Such repositories are often species-specific and hold information about manually validated and/or computationally predicted PPIs. Although curated data from such databases is often available to researchers for free, several databases come with restricted access. In this review, we focus on six freely available databases that contain experimentally verified PPIs and summarize their content in Table 2. The databases are:
(1) The Biological General Repository for Interaction Datasets (BioGRID) [73].
(2) The Molecular INTeraction database (MINT) [54].
(3) The High-quality INTeractomes database (HINT) [117].
(4) The Database of Interacting Proteins (DIP) [53].
(5) The IntAct molecular interaction database (IntAct) [72].
(6) The Human Protein Reference Database (HPRD) [79].
Notably, DIP, IntAct and MINT are active members of the International Molecular Exchange consortium (IMEx; http://imex.sourceforge.net/) [118]. The HPRD database, as its name implies, is a specialized database containing data for Homo sapiens only, whereas the other databases contain PPI data for several organisms. IntAct contains the largest number of proteins, covering many organisms, whereas BioGRID contains the largest number of interactions. The HINT database focuses on high-quality interactions by manually filtering out low-quality and erroneous interactions [117]. A major problem of these repositories is that their model representation differs significantly from database to database and thus

Table 2. Latest releases of public PPI databases (March 2014) and their content.

Database  URL                                          Proteins  Interactions  Publications  Organisms
BioGRID   http://thebiogrid.org                        54,566    506,961       42,259        49
MINT      http://mint.bio.uniroma2.it/mint/Welcome.do  35,553    241,458       5554          441
HINT      http://hint.yulab.org/index.html             19,314    53,110        7769          7
DIP       http://dip.mbi.ucla.edu/dip                  26,453    76,844        6678          618
IntAct    http://www.ebi.ac.uk/intact                  81,795    287,103       12,531        894
HPRD      http://www.hprd.org                          30,047    41,327        453,521       1


Fig. 1. Left: Sum of Google-Scholar citations for the text-mining tools. Right: Google-Scholar citation trends for each text mining tool individually.

the interchange and direct comparison of their data is not straightforward. A solution to this problem is proposed by the Proteomics Standards Initiative (PSI), a part of the IMEx consortium, which developed a standard format for PPI data exchange. This format, called PSI-MI XML [119], is based on eXtensible Markup Language (XML). Additionally, PSI-MI TAB, a tab-delimited format for data interchange, is also available. Although the PSI-MI XML format has existed for several years and is well established, not all PPI databases have adopted it. Even when several databases follow the same format, another recurring problem is that one protein may be represented by different identifiers. For example, BioGRID [73] uses Entrez Gene/LocusLink [106] identifiers whereas IntAct [72] uses UniProt Knowledgebase (UniProtKB) [87] IDs. Widely used tools that automate the mapping between different protein identifiers among databases are the Protein Information Resource (PIR) identifier mapping tool [120], the ID mapping tool of UniProt [121] and iRefIndex [122].

4. PPI benchmark datasets
Benchmark datasets are necessary to evaluate PPI prediction tools irrespective of the methodology they follow. Assembling PPI ‘gold standard’ datasets is not a trivial task, as information and existing knowledge vary from organism to organism. The MIPS database [123], for example, was initially used to evaluate yeast PPI predictions but is not used anymore, as many of the proteins have been found to be ribosomal [124]. Therefore, alternative datasets have been proposed [80,125]. It has been shown that Y2H products are comparable to other experimental data, and even to curated data, despite their error rate [126]. Therefore, Y2H results can potentially be used for benchmarking [127]. For functional associations, KEGG pathways [81], Gene Ontology [75] and Panther [128] repositories are widely used.
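One common way of turning pathway annotations into a functional benchmark is to treat two proteins that share at least one pathway as a positive pair. The sketch below illustrates this with made-up pathway memberships; treating annotated proteins that share no pathway as negatives is a frequent but imperfect assumption, and the function name is hypothetical.

```python
from itertools import combinations


def pathway_gold_standard(pathways):
    """Derive positive and negative protein pairs from pathway membership.

    pathways: dict mapping a pathway name to a list of member proteins.
    Returns (positives, negatives) as sets of frozensets.
    """
    positives = set()
    for members in pathways.values():
        unique = sorted(set(members))
        positives.update(frozenset(p) for p in combinations(unique, 2))

    # Only proteins with at least one annotation are considered at all.
    annotated = sorted(set().union(*pathways.values()))
    all_pairs = {frozenset(p) for p in combinations(annotated, 2)}

    # Assumption: annotated proteins sharing no pathway act as negatives.
    negatives = all_pairs - positives
    return positives, negatives


# Hypothetical pathway memberships for illustration.
toy_pathways = {
    "pathway_1": ["A", "B", "C"],
    "pathway_2": ["C", "D"],
}
positives, negatives = pathway_gold_standard(toy_pathways)
print(len(positives), "positive pairs;", len(negatives), "negative pairs")
```

Restricting negatives to annotated proteins avoids penalizing predictions for poorly characterized proteins, at the cost of a smaller and less representative negative set.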
In cases where more advanced text-mining approaches use machine learning techniques such as Support Vector Machines (SVMs), a negative control set is required to estimate their precision and recall. For this purpose, true literature-based negative control datasets have been generated [129–132].

5. Discussion
Information handling due to the tremendous growth of textual information stored in public biological repositories has become a

true challenge in health sciences. Currently, the PubMed literature database contains over 23 million abstracts, whereas PubMed Central (PMC) holds over 3 million full text publications. Considering the exponential expansion of literature repositories, along with the growth of hundreds of biological databases due to recent advances in sequencing techniques, new approaches to mine the textual information and to identify bioentities and the links between them have become a necessity. As text mining has been gaining ground over the years, we wanted to investigate the impact of the current tools in the field of PPI prediction. To do so, we followed the Google Scholar citation trends of the nineteen reviewed tools (Fig. 1). We chose to track the citations of only the first publication for each tool. It is worth mentioning that, although the number of citations is a decent indicator of popularity, there are text mining papers which cite existing tools only in their introduction or for comparison purposes. This could be misleading when evaluating the impact of these tools in the actual PPI extraction field.

The first Text Mining approaches to predict PPIs appeared at the beginning of the current millennium (e.g. SUISEKI, 2001), when PubMed contained 14 million abstracts, almost two thirds of today’s size. The corresponding vast growth in the total citations of the tools over the years is a clear indicator that Text Mining approaches have gained ground (Fig. 1, left). Similar conclusions can be drawn by observing the citation trends of individual tools (Fig. 1, right). We notice an abundance of new tools from 2008 onwards compared to previous years. Most probably, this is because dictionaries for bioentities and ontologies, Named Entity Recognition (NER) techniques and Natural Language Processing (NLP) started becoming more mature.
In addition, around the same time, text mining competitions started to pursue the goal of automated extraction of PPIs from text, something that can be considered a stimulating factor for the development of such systems (e.g. the BioCreative challenge). While longer lasting tools such as STRING [115], I2D [83] and iHOP [88] are currently widely used in the field (an observation based on the number of citations), newer tools such as Polysearch [100] are catching up rapidly. This trend is expected to continue, given the rising interest in TM-based PPI predictions, as 20 such TM tools appeared in the field during the past decade alone. By approximate extrapolation, one could expect roughly double the number of relevant applications to exist in total within the next ten years.

Please cite this article in press as: N. Papanikolaou et al., Methods (2014), http://dx.doi.org/10.1016/j.ymeth.2014.10.026


Acknowledgments

Funding: This work was supported by the European Commission FP7 programmes INFLA-CARE (EC grant agreement number 223151), ‘Translational Potential’ (EC grant agreement number 285948) and the Greek Ministry of Education and Religious Affairs (Thalis-MIDAS).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.ymeth.2014.10.026.

References

[1] B. Alberts, Cell 92 (1998) 291–294.
[2] E.M. Phizicky, S. Fields, Microbiol. Rev. 59 (1995) 94–123.
[3] A.-C. Gavin, K. Maeda, S. Kühner, Curr. Opin. Biotechnol. 22 (2011) 42–49.
[4] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, Y. Sakaki, Proc. Natl. Acad. Sci. U.S.A. 98 (2001) 4569–4574.
[5] S. Fields, O. Song, Nature 340 (1989) 245–246.
[6] L. Melton, Nature 429 (2004) 101–107.
[7] C.P. Paweletz, L. Charboneau, V.E. Bichsel, N.L. Simone, T. Chen, J.W. Gillespie, M.R. Emmert-Buck, M.J. Roth, E.F. Petricoin III, L.A. Liotta, Oncogene 20 (2001) 1981–1989.
[8] H.G. Vikis, K.-L. Guan, Methods Mol. Biol. (Clifton, NJ) 261 (2004) 175–186.
[9] O. Puig, F. Caspary, G. Rigaut, B. Rutz, E. Bouveret, E. Bragado-Nilsson, M. Wilm, B. Séraphin, Methods (San Diego, Calif) 24 (2001) 218–229.
[10] W.H. Dunham, M. Mullin, A.-C. Gingras, Proteomics 12 (2012) 1576–1590.
[11] G.H. Cross, A.A. Reeves, S. Brand, J.F. Popplewell, L.L. Peel, M.J. Swann, N.J. Freeman, Biosens. Bioelectron. 19 (2003) 383–390.
[12] C.J. Wienken, P. Baaske, U. Rothbauer, D. Braun, S. Duhr, Nat. Commun. 1 (2010) 100.
[13] W.G.T. Willats, Plant Mol. Biol. 50 (2002) 837–854.
[14] G.P. Smith, Science 228 (1985) 1315–1317.
[15] D. Auerbach, S. Thaminy, Proteomics 2 (2002) 611–623.
[16] J. Janin, C. Chothia, J. Biol. Chem. 265 (1990) 16027–16030.
[17] J. Vaynberg, J. Qin, Trends Biotechnol. 24 (2006) 22–27.
[18] M.I. Klapa, K. Tsafou, E. Theodoridis, A. Tsakalidis, N.K. Moschonas, BMC Syst. Biol. 7 (2013) 96.
[19] T. Berggård, S. Linse, P. James, Proteomics 7 (2007) 2833–2842.
[20] M. Küchle, Br. J. Ophthalmol. 76 (1992) 98–100.
[21] J.G. Lees, J.K. Heriche, I. Morilla, J.A. Ranea, C.A. Orengo, Phys. Biol. 8 (2011) 035008.
[22] Y. Guo, M. Li, X. Pu, G. Li, X. Guang, W. Xiong, J. Li, BMC Res. Notes 3 (2010) 145.
[23] C.-Y. Yu, L.-C. Chou, D.T. Chang, BMC Bioinform. 11 (2010) 167.
[24] J. Shen, J. Zhang, X. Luo, W. Zhu, K. Yu, K. Chen, Y. Li, H. Jiang, Proc. Natl. Acad. Sci. 104 (2007) 4337–4341.
[25] F. Pazos, M. Helmer-Citterich, G. Ausiello, A. Valencia, J. Mol. Biol. 271 (1997) 511–523.
[26] X.-W. Chen, M. Liu, Bioinformatics 21 (2005) 4394–4400.
[27] O. Keskin, R. Nussinov, A. Gursoy, Methods Mol. Biol. (Clifton, NJ) 484 (2008) 505–521.
[28] A. Ben-Hur, W.S. Noble, Bioinformatics (Oxford, England) 21 (Suppl. 1) (2005) i38–46.
[29] L.R. Matthews, P. Vaglio, J. Reboul, H. Ge, B.P. Davis, J. Garrels, S. Vincent, M. Vidal, Genome Res. 11 (2001) 2120–2126.
[30] I. Ezkurdia, L. Bartoli, P. Fariselli, R. Casadio, A. Valencia, Brief Bioinform. 10 (2009) 233–246.
[31] M. Strong, P. Mallick, M. Pellegrini, M.J. Thompson, D. Eisenberg, Genome Biol. 4 (2003) R59.
[32] T. Dandekar, B. Snel, M. Huynen, P. Bork, Trends Biochem. Sci. 23 (1998) 324–328.
[33] A.J. Enright, I. Iliopoulos, N.C. Kyrpides, C.A. Ouzounis, Nature 402 (1999) 86–90.
[34] V.J. Promponas, C.A. Ouzounis, I. Iliopoulos, Brief Bioinform. 6 (2012) 443–454.
[35] M. Pellegrini, Proc. Natl. Acad. Sci. U.S.A. 96 (1999) 4285–4288.
[36] J. Sun, Y. Li, Z. Zhao, Biochem. Biophys. Res. Commun. 353 (2007) 985–991.
[37] C. Blaschke, R. Hoffmann, Comp. Funct. Genomics 2 (2001) 310–313.
[38] C.F. Schaefer, K. Anthony, S. Krupa, J. Buchoff, M. Day, T. Hannay, K.H. Buetow, Nucleic Acids Res. 37 (2009) D674–D679 (Database issue).
[39] L. Matthews, G. Gopinath, M. Gillespie, M. Caudy, D. Croft, B. de Bono, P. Garapati, J. Hemish, H. Hermjakob, B. Jassal, A. Kanapin, S. Lewis, S. Mahajan, B. May, E. Schmidt, I. Vastrik, G. Wu, E. Birney, L. Stein, P. D’Eustachio, Nucleic Acids Res. 37 (2009) D619–D622 (Database issue).
[40] D. Tikk, P. Thomas, P. Palaga, J. Hakenberg, U. Leser, PLoS Comput. Biol. 6 (2010) e1000837.
[41] J.R.A. Hutchins, Y. Toyoda, B. Hegemann, I. Poser, J.-K. Hériché, M.M. Sykora, M. Augsburg, O. Hudecz, B.A. Buschhorn, J. Bulkescher, C. Conrad, D. Comartin, A. Schleiffer, M. Sarov, A. Pozniakovsky, M.M. Slabicki, S. Schloissnig, I. Steinmacher, M. Leuschner, A. Ssykor, S. Lawo, L. Pelletier, H. Stark, K. Nasmyth, J. Ellenberg, R. Durbin, F. Buchholz, K. Mechtler, A.A. Hyman, J.-M. Peters, Science 328 (2010) 593–599.
[42] J.-F. Xia, X.-M. Zhao, D.-S. Huang, Amino Acids 39 (2010) 1595–1599.
[43] M.S. Scott, G.J. Barton, BMC Bioinformatics 8 (2007) 239.
[44] M.D. McDowall, M.S. Scott, G.J. Barton, Nucleic Acids Res. 37 (Suppl. 1) (2009) D651–D656.
[45] A.M. Cohen, W.R. Hersh, Brief Bioinform. 6 (2005) 57–71.
[46] R. Rodriguez-Esteban, PLoS Comput. Biol. 5 (2009) e1000597.
[47] M. Miwa, R. Saetre, Y. Miyao, J. Tsujii, Int. J. Med. Inform. 78 (2009) e39–46.
[48] F. Rinaldi, G. Schneider, K. Kaljurand, M. Hess, C. Andronis, O. Konstandi, A. Persidis, Artif. Intell. Med. 39 (2007) 127–136.
[49] H.H.H.B.M. Van Haagen, PLoS ONE 4 (2009) e7894.
[50] H.H.H.B.M. Van Haagen, Proteomics 11 (2011) 843–853.
[51] M. Kuhn, D. Szklarczyk, A. Franceschini, M. Campillos, C. von Mering, L.J. Jensen, A. Beyer, P. Bork, Nucleic Acids Res. 38 (Suppl. 1) (2010) D552–D556.
[52] M. Kuhn, D. Szklarczyk, A. Franceschini, C. von Mering, Nucleic Acids Res. 40 (2011) D876–D880.
[53] L. Salwinski, Nucleic Acids Res. 32 (2004) D449–D451 (Database issue).
[54] L. Licata, L. Briganti, D. Peluso, L. Perfetto, M. Iannuccelli, E. Galeota, F. Sacco, A. Palma, Nucleic Acids Res. 40 (2012) D857–D861 (Database issue).
[55] C. Blaschke, A. Valencia, Genome Inform. 12 (2001) 123–134.
[56] R. Saetre, K. Yoshida, M. Miwa, T. Matsuzaki, Y. Kano, J. Tsujii, IEEE/ACM Trans. Comput. Biol. Bioinform. 7 (2010) 442–453.
[57] G.D. Bader, D. Betel, C.W.V. Hogue, Nucleic Acids Res. 31 (2003) 248–250.
[58] S. Novichkova, S. Egorov, N. Daraselia, Bioinformatics (Oxford, England) 19 (2003) 1699–1706.
[59] D.P.A. Corney, B.F. Buxton, W.B. Langdon, D.T. Jones, Bioinformatics (Oxford, England) 20 (2004) 3206–3213.
[60] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, in: Proc. 40th Annu. Meeting Assoc. Comput. Linguist., Association for Computational Linguistics, Stroudsburg, PA, USA, 2002, pp. 168–175 (ACL ’02).
[61] C.O. Tudor, C.N. Arighi, Q. Wang, C.H. Wu, K. Vijay-Shanker, Database 2012 (2012) bas044.
[62] C.O. Tudor, C.J. Schmidt, K. Vijay-Shanker, BMC Bioinformatics 11 (2010) 418.
[63] Z.Z. Hu, M. Narayanaswamy, K.E. Ravikumar, K. Vijay-Shanker, C.H. Wu, Bioinformatics (Oxford, England) 21 (2005) 2759–2765.
[64] L. Hirschman, A. Yeh, C. Blaschke, A. Valencia, BMC Bioinformatics 6 (Suppl. 1) (2005) S1.
[65] Y. Tsuruoka, M. Miwa, K. Hamamoto, J. Tsujii, S. Ananiadou, Bioinformatics (Oxford, England) 27 (2011) i111–119.
[66] Y. Tsuruoka, J. Tsujii, S. Ananiadou, Bioinformatics (Oxford, England) 24 (2008) 2559–2560.
[67] J.-D. Kim, T. Ohta, J. Tsujii, BMC Bioinformatics 9 (2008) 10.
[68] in: Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, Association for Computational Linguistics, Boulder, Colorado, 2009.
[69] A. Rzhetsky, I. Iossifov, T. Koike, M. Krauthammer, P. Kra, M. Morris, H. Yu, J. Biomed. Inform. 37 (2004) 43–53.
[70] C. Friedman, P. Kra, H. Yu, M. Krauthammer, A. Rzhetsky, Bioinformatics (Oxford, England) 17 (Suppl. 1) (2001) S74–82.
[71] A. Patil, K. Nakai, H. Nakamura, Nucleic Acids Res. 39 (2011) D744–D749 (Database issue).
[72] S. Kerrien, B. Aranda, L. Breuza, A. Bridge, F. Broackes-Carter, C. Chen, M. Duesbury, M. Dumousseau, M. Feuermann, U. Hinz, C. Jandrasits, R.C. Jimenez, J. Khadake, U. Mahadevan, P. Masson, I. Pedruzzi, E. Pfeiffenberger, P. Porras, A. Raghunath, B. Roechert, S. Orchard, H. Hermjakob, Nucleic Acids Res. 40 (2012) D841–D846 (Database issue).
[73] A. Chatr-Aryamontri, B.-J. Breitkreutz, S. Heinicke, L. Boucher, A. Winter, C. Stark, J. Nixon, L. Ramage, N. Kolas, L. O’Donnell, T. Reguly, A. Breitkreutz, A. Sellam, D. Chen, C. Chang, J. Rust, M. Livstone, R. Oughtred, K. Dolinski, M. Tyers, Nucleic Acids Res. 41 (2013) D816–D823 (Database issue).
[74] M. Punta, Nucleic Acids Res. 40 (2012) D290–D301.
[75] M. Ashburner, Nat. Genet. 25 (2000) 25–29.
[76] A. Bossi, B. Lehner, Mol. Syst. Biol. 5 (2009) 260.
[77] A. Elefsinioti, Ö.S. Saraç, A. Hegele, C. Plake, N.C. Hubner, I. Poser, M. Sarov, A. Hyman, M. Mann, M. Schroeder, U. Stelzl, A. Beyer, Mol. Cell. Proteomics MCP 10 (M111) (2011) 010629.
[78] C. Plake, L. Royer, R. Winnenburg, J. Hakenberg, M. Schroeder, Nucleic Acids Res. 37 (2009) W300–W304 (Web Server issue).
[79] T.S. Keshava, R. Goel, K. Kandasamy, S. Keerthikumar, S. Kumar, S. Mathivanan, D. Telikicherla, R. Raju, B. Shafreen, A. Venugopal, L. Balakrishnan, A. Marimuthu, S. Banerjee, D.S. Somanathan, A. Sebastian, S. Rani, S. Ray, C.J. Harrys Kishore, S. Kanth, M. Ahmed, M.K. Kashyap, R. Mohmood, Y.L. Ramachandra, V. Krishna, B.A. Rahiman, S. Mohan, P. Ranganathan, S. Ramabadran, R. Chaerkady, A. Pandey, Nucleic Acids Res. 37 (2009) D767–D772 (Database issue).
[80] A. Ruepp, B. Brauner, I. Dunger-Kaltenbach, G. Frishman, C. Montrone, M. Stransky, B. Waegele, T. Schmidt, Nucleic Acids Res. 36 (2008) D646–D650 (Database issue).
[81] M. Kanehisa, M. Araki, S. Goto, M. Hattori, M. Hirakawa, M. Itoh, T. Katayama, S. Kawashima, S. Okuda, T. Tokimatsu, Y. Yamanishi, Nucleic Acids Res. 36 (2008) D480–D484 (Database issue).
[82] L. Breiman, Mach. Learn. 45 (2001) 5–32.
[83] K.R. Brown, I. Jurisica, Genome Biol. 8 (2007) R95.
[84] K.R. Brown, I. Jurisica, Bioinformatics 21 (2005) 2076–2082.


[85] NCBI Resource Coordinators, Nucleic Acids Res. 41 (2013) D8–D20 (Database issue).
[86] B. Boeckmann, A. Bairoch, R. Apweiler, M.-C. Blatter, A. Estreicher, E. Gasteiger, Nucleic Acids Res. 31 (2003) 365–370.
[87] The UniProt Consortium, Nucleic Acids Res. 42 (2014) D191–D198.
[88] R. Hoffmann, Curr. Protoc. Bioinformatics (2007) (Chapter 1: Unit1.16).
[89] S. Balaji, C. Mcclendon, R. Chowdhary, Bioinformatics (Oxford, England) 28 (2012) 747–749.
[90] R. Chowdhary, J. Zhang, Bioinformatics (Oxford, England) 25 (2009) 1536–1542.
[91] P. Blohm, G. Frishman, P. Smialowski, F. Goebels, B. Wachinger, A. Ruepp, D. Frishman, Nucleic Acids Res. 42 (2014) D396–400.
[92] T. Barnickel, J. Weston, R. Collobert, H.-W. Mewes, V. Stümpflen, PLoS ONE 4 (2009) e6393.
[93] L. Hunter, Z. Lu, J. Firby, W.A. Baumgartner Jr, BMC Bioinformatics 9 (2008) 78.
[94] Protégé, A free, open-source ontology editor and framework for building intelligent systems.
[95] B. Settles, Bioinformatics (Oxford, England) 21 (2005) 3191–3192.
[96] LingPipe 4.1.0.
[97] C. Li, A. Jimeno-Yepes, M. Arregui, H. Kirsch, D. Rebholz-Schuhmann, Database 2013 (2013) bat030.
[98] S. Kim, D. Kwon, S.-Y. Shin, Bioinformatics (Oxford, England) 28 (2012) 597–598.
[99] S. Kim, S.-Y. Shin, I.-H. Lee, S.-J. Kim, R. Sriram, B.-T. Zhang, Nucleic Acids Res. 36 (2008) W411–W415 (Web Server issue).
[100] D. Cheng, C. Knox, N. Young, P. Stothard, S. Damaraju, Nucleic Acids Res. 36 (2008) W399–W405 (Web Server issue).
[101] M. Huang, X. Zhu, Y. Hao, Bioinformatics (Oxford, England) 20 (2004) 3604–3612.
[102] Z. Yang, Z. Zhao, Y. Li, Y. Hu, H. Lin, IEEE Trans. Nanobiosci. 12 (2013) 173–181.
[103] Y. Li, H. Lin, Z. Yang, BMC Bioinformatics 10 (2009) 223.
[104] R.C. Bunescu, R.J. Mooney, in: Proc. Hum. Lang. Technol. Conf. Conf. Empir. Methods Nat. Lang. Process, 2005, pp. 724–731.
[105] M. He, Y. Wang, W. Li, PLoS ONE 4 (2009) e4554.
[106] D. Maglott, J. Ostell, K.D. Pruitt, T. Tatusova, Nucleic Acids Res. 35 (2007) D26–D31 (Database issue).
[107] Y. Tsuruoka, Y. Tateishi, J.-D. Kim, T. Ohta, J. McNaught, S. Ananiadou, J. Tsujii, in: P. Bozanis, E.N. Houstis (Eds.), Advances in Informatics, vol. 3746, Berlin Heidelberg, Springer, 2005, pp. 382–392 (Lecture Notes in Computer Science).
[108] W. Li, M. He, H. Zhou, Hum. Mutat. 27 (2006) 402–407.
[109] K. Raja, S. Subramani, J. Natarajan, Database 2013 (2013) bas052.
[110] K. Raja, S. Subramani, J. Natarajan, in: Proc. Tenth Asia Pac. Bioinform. Conf., Melbourne, Australia, 2012.
[111] K. Raja, S. Subramani, J. Natarajan, in: Proc. Second Int. Conf. Bioinform. Syst. Biol. INCOBS, 2011.
[112] K. Fundel, R. Küffner, R. Zimmer, Bioinformatics (Oxford, England) 23 (2007) 365–371.
[113] S.-W. Zhang, Y.-J. Li, L. Xia, Q. Pan, BMC Bioinformatics 11 (2010) 326.
[114] T. Ohta, Y. Tateisi, J.-D. Kim, in: Proc. Second Int. Conf. Hum. Lang. Technol. Res., Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002, pp. 82–86 (HLT ’02).
[115] A. Franceschini, D. Szklarczyk, S. Frankild, M. Kuhn, M. Simonovic, A. Roth, J. Lin, P. Minguez, P. Bork, C. von Mering, L.J. Jensen, Nucleic Acids Res. 41 (2013) D808–D815 (Database issue).
[116] J. Saric, Bioinformatics (Oxford, England) 22 (2006) 645–650.
[117] J. Das, H. Yu, BMC Syst. Biol. 6 (2012) 92.
[118] S. Orchard, S. Kerrien, S. Abbani, B. Aranda, J. Bhate, S. Bidwell, A. Bridge, L. Briganti, F.S.L. Brinkman, F. Brinkman, G. Cesareni, A. Chatr-aryamontri, E. Chautard, C. Chen, M. Dumousseau, J. Goll, R.E.W. Hancock, R. Hancock, L.I. Hannick, I. Jurisica, J. Khadake, D.J. Lynn, U. Mahadevan, L. Perfetto, A. Raghunath, S. Ricard-Blum, B. Roechert, L. Salwinski, V. Stümpflen, M. Tyers, et al., Nat. Methods 9 (2012) 345–350.
[119] H. Hermjakob, L. Montecchi-Palazzi, G. Bader, J. Wojcik, L. Salwinski, A. Ceol, S. Moore, S. Orchard, U. Sarkans, C. von Mering, B. Roechert, S. Poux, E. Jung, H. Mersch, P. Kersey, M. Lappe, Y. Li, R. Zeng, D. Rana, M. Nikolski, H. Husi, C. Brun, K. Shanker, S.G.N. Grant, C. Sander, P. Bork, W. Zhu, A. Pandey, A. Brazma, B. Jacq, et al., Nat. Biotechnol. 22 (2004) 177–183.
[120] C.H. Wu, L.-S.L. Yeh, H. Huang, L. Arminski, J. Castro-Alvear, Y. Chen, Z. Hu, P. Kourtesis, R.S. Ledley, B.E. Suzek, C.R. Vinayaka, J. Zhang, W.C. Barker, Nucleic Acids Res. 31 (2003) 345–347.
[121] UniProt Consortium, Nucleic Acids Res. 41 (2013) D43–D47 (Database issue).
[122] S. Razick, G. Magklaras, I.M. Donaldson, BMC Bioinformatics 9 (2008) 405.
[123] H.W. Mewes, C. Amid, R. Arnold, D. Frishman, U. Güldener, G. Mannhaupt, M. Münsterkötter, P. Pagel, N. Strack, V. Stümpflen, J. Warfsmann, A. Ruepp, Nucleic Acids Res. 32 (2004) D41–D44 (Database issue).
[124] G.T. Hart, I. Lee, E.R. Marcotte, BMC Bioinformatics 8 (2007) 236.
[125] S. Pu, J. Wong, B. Turner, E. Cho, Nucleic Acids Res. 37 (2009) 825–831.
[126] K. Venkatesan, J.-F. Rual, A. Vazquez, U. Stelzl, I. Lemmens, T. Hirozane-Kishikawa, T. Hao, M. Zenkner, X. Xin, K.-I. Goh, M.A. Yildirim, N. Simonis, K. Heinzmann, F. Gebreab, J.M. Sahalie, S. Cevik, C. Simon, A.-S. de Smet, E. Dann, A. Smolyar, A. Vinayagam, H. Yu, D. Szeto, H. Borick, A. Dricot, N. Klitgord, R.R. Murray, C. Lin, M. Lalowski, J. Timm, et al., Nat. Methods 6 (2009) 83–90.
[127] H. Yu, P. Braun, M.A. Yildirim, I. Lemmens, K. Venkatesan, J. Sahalie, T. Hirozane-Kishikawa, F. Gebreab, N. Li, N. Simonis, T. Hao, J.-F. Rual, A. Dricot, A. Vazquez, R.R. Murray, C. Simon, L. Tardivo, S. Tam, N. Svrzikapa, C. Fan, A.S. de Smet, A. Motyl, M.E. Hudson, J. Park, X. Xin, M.E. Cusick, T. Moore, C. Boone, M. Snyder, F.P. Roth, et al., Science 322 (2008) 104–110.
[128] P.D. Thomas, M.J. Campbell, A. Kejariwal, H. Mi, B. Karlak, R. Daverman, K. Diemer, A. Muruganujan, A. Narechania, Genome Res. 13 (2003) 2129–2141.
[129] P. Smialowski, P. Pagel, P. Wong, B. Brauner, I. Dunger, G. Fobo, G. Frishman, C. Montrone, T. Rattei, D. Frishman, A. Ruepp, Nucleic Acids Res. 38 (2010) D540–D544 (Database issue).
[130] F. Browne, H. Wang, H. Zheng, F. Azuaje, Source Code Biol. Med. 4 (2009) 2.
[131] X. Chen, Nucleic Acids Res. 39 (2011) D750–D754 (Database issue).
[132] R. Sharan, S. Suthram, Proc. Natl. Acad. Sci. U.S.A. 102 (2005) 1974–1979.
