IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 13, NO. 4, DECEMBER 2014


Prediction of Essential Proteins Based on Overlapping Essential Modules

Bihai Zhao, Jianxin Wang*, Senior Member, IEEE, Min Li, Fang-xiang Wu, and Yi Pan, Senior Member, IEEE

Abstract—Many computational methods have been proposed to identify essential proteins by using the topological features of interactome networks. However, the precision of essential protein discovery still needs to be improved. Research shows that the majority of hubs (essential proteins) in the yeast interactome network are essential due to their involvement in essential complex biological modules, and that hubs can be classified into two categories: date hubs and party hubs. In this study, by incorporating gene expression profiles, we propose a new method, named POEM, to predict essential proteins based on overlapping essential modules. In POEM, the original protein interactome network is partitioned into many overlapping essential modules. The frequencies and weighted degrees of proteins in these modules are employed to decide which category a protein belongs to. The comparative results show that POEM outperforms the classical centrality measures Degree Centrality (DC), Information Centrality (IC), Eigenvector Centrality (EC), Subgraph Centrality (SC), Betweenness Centrality (BC), Closeness Centrality (CC), and Edge Clustering Coefficient Centrality (NC), as well as two recently proposed essential protein prediction methods, PeC and CoEWC. Experimental results indicate that the precision of predicting essential proteins can be improved by considering the modularity of proteins and integrating gene expression profiles with network topological features.

Index Terms—Essential modules, essential protein, overlapping.

I. INTRODUCTION

ESSENTIAL proteins, considered to be the foundation of life [1], are indispensable for the survival of an organism. Identification of essential proteins is very important not only for understanding the basic requirements to sustain a life form, but also for the emerging field of synthetic biology, which aims to create a cell with a minimal genome [2]. Furthermore, essential proteins are drug targets for novel antimicrobials [3] due to their indispensability for bacterial cell survival.

Manuscript received November 23, 2013; revised May 19, 2014; accepted July 08, 2014. Date of publication August 07, 2014; date of current version November 25, 2014. This work is supported in part by the National Natural Science Foundation of China under Grant No. 61232001, No. 61370024, No. 61379108 and the Program for New Century Excellent Talents in University under Grant NCET-12-0547. Asterisk indicates corresponding author.
B. Zhao and M. Li are with the Department of Computer Science, School of Information Science and Engineering, Central South University, Changsha 410083, China (e-mail: [email protected]; [email protected]).
*J. Wang is with the Department of Computer Science, School of Information Science and Engineering, Central South University, Changsha 410083, China (e-mail: [email protected]).
F. Wu is with the Department of Mechanical Engineering and Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada (e-mail: [email protected]).
Y. Pan is with the Department of Computer Science, Georgia State University, Atlanta, GA 30302-4110 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNB.2014.2337912

Many experimental methods have been developed to predict and discover essential proteins, such as single gene knockouts [4], RNA interference [5] and conditional knockouts [6]. However, these experimental methods are expensive and time-consuming. Moreover, the speed of genome sequencing far outpaces that of genome-wide gene essentiality studies [7]. As a result, computational approaches offer an alternative for identifying essential proteins. Recent developments in experimental techniques such as yeast two-hybrid [8], tandem affinity purification [9] and mass spectrometry [10] have resulted in the publication of many high-quality, large-scale protein-protein interaction (PPI) data sets, which provide fundamental and abundant data for computational approaches to the prediction of essential proteins.

In recent years, many computational approaches have been developed to identify essential proteins according to their features. One of the most important features of essential proteins is their sequence properties, which are intrinsic features of an individual protein. Sequence features like GC-content, protein size, ORF length and cellular localization [11]–[14] are used for predicting essential proteins. They have been successfully applied for inferring essential genes from the widely studied yeast S. cerevisiae to the less studied yeast strain S. mikatae.

Another important feature of essential proteins is their topological properties in PPI networks. Sequence features provide information about individual proteins, while topological features provide information about interacting protein pairs. A PPI network contains a small number of highly connected nodes (hubs) and a large number of poorly connected nodes (non-hubs). It was found that hubs tend to be essential and evolutionarily conserved to a larger extent than non-hubs [15], [16].
This phenomenon is observed in several species, such as Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster [17]–[19]. As a consequence, a series of centrality measures based on network topological features have been used for identifying essential proteins, such as Eigenvector Centrality (EC) [20], Information Centrality (IC) [21], Degree Centrality (DC) [22], Closeness Centrality (CC) [23], Betweenness Centrality (BC) [24], Subgraph Centrality (SC) [25], Edge Clustering Coefficient Centrality (NC) [26], and so on. Considering functional building blocks within biological networks, Koschützki et al. [27] incorporated functional substructures (motifs) into network centrality analysis and presented a network-motif centrality to rank the vertices of networks. These methods rank proteins in terms of their centrality in PPI networks and use the ranking scores to judge whether a protein is essential. Sporns et al. [28] studied structural and functional motif composition (characteristic network building blocks) and obtained network topologies that resemble real brain networks across a broad spectrum of structural measures, including small-world attributes. The results supported the hypothesis that while brain networks maximize both the number and the diversity of functional motifs, the repertoire of structural motifs is relatively small. Based on multivariate statistical analysis, by integrating degree, closeness, betweenness, k-shell, clustering coefficient, semi-local centrality and eigenvector centrality, and further considering the appearances of nodes in network motifs, Wang et al. [29] developed a new measure to characterize the structurally dominant proteins (SDP) in PPI networks.

Although great progress has been made on computational methods for the identification of essential proteins based on network topologies, identifying essential proteins from topological properties alone is still very challenging. One of the most important factors is that a significant proportion of the PPI networks obtained from high-throughput biological experiments have been found to contain false positives and false negatives [30]. Most centrality methods are sensitive to such noise in PPI networks. To overcome these limitations, biological information has been integrated with network topology to improve the precision of essential protein discovery methods. Hsing et al. [31] developed a new method for predicting highly connected "hub" nodes based on the available interaction data and Gene Ontology (GO) annotations. By using supervised machine learning-based methods, Acencio et al. [14] combined network topological properties with genomic features, such as cellular localization and biological process information, to identify essential proteins.
Kim [33] constructed a new feature space, named CENT-ING-GO, consisting of various centrality measures and GO terms, and proposed a new method to predict essential proteins based on machine learning methods. With the integration of network topology and gene expression, Li et al. [34] proposed a new method called PeC to identify essential proteins. Recently, Zhang et al. [35] proposed a new method, named CoEWC, to discover essential proteins based on the integration of topological properties of PPI networks and the co-expression of interacting proteins. These methods, which integrate network topology and biological information, increase the precision of predicting essential proteins in comparison with centrality measures based only on network topological features.

It is reported that hubs (highly connected nodes) in the yeast interactome network can be classified into two categories: date hubs and party hubs [36], [37]. Research [38] shows that most hubs in the yeast interactome are essential due to their involvement in highly connected biological modules. Proteins in these modules share biological functions that are enriched in essential proteins. Although date hubs and party hubs are mentioned in the CoEWC method, further descriptions of them are missing. In other words, CoEWC does not make a clear distinction between date hubs and party hubs for the identification of essential proteins.

Inspired by the research and discoveries mentioned above, we propose a new method, named POEM, for predicting essential proteins in the yeast interactome network based on overlapping essential modules. As currently available PPI datasets contain many false positives, POEM integrates network topology with gene expression profiles [32] to reduce the negative impact of noise on the prediction of essential proteins. Different from other centrality methods, POEM pays more attention to predicting essential biological modules and to explicitly identifying date hubs and party hubs with a computational method. To evaluate the performance of POEM, we predict essential proteins by using a yeast network: DIP data [40]. Our analysis focuses on Saccharomyces cerevisiae because both the PPI and gene expression data are most complete in this species. Experimental results show that the POEM method outperforms previous centrality measures, DC [22], BC [24], CC [23], SC [25], EC [20], IC [21], and NC [26], as well as two other recent methods that integrate network topological features and gene expression, PeC [34] and CoEWC [35]. We also compare the prediction performance of POEM with other methods based on proteins from Krogan data [41], which is another yeast network compiled from diverse sources of interaction evidence. Results confirm that POEM achieves the best performance on the prediction of essential proteins in Krogan.

II. METHODS

A. Motivations

Hart et al. [42] point out that essentiality is tied not to the protein or gene itself, but to the molecular module to which that protein belongs. Zotenko et al. [38] put forward the notion of ECOBIMs (Essential Complex Biological Modules), which are groups of densely connected proteins with shared biological function that are enriched in essential proteins. Therefore, we think that partitioning the PPI network into large groups of densely connected and functionally related modules should be a good way to discover essential proteins. Han et al. [36] suggest that the yeast protein interaction network is made up of two sorts of hubs, party hubs and date hubs.
This classification reveals a model of organized modularity for the yeast protein-protein interaction network. Modules are connected through mediators or adaptors, which are date hubs. Party hubs represent integral elements within distinct modules. Date hubs participate in more genetic interactions and evolve more rapidly than party hubs. Bertin et al. [37] confirm the distinctions between date and party hubs in a high-quality filtered yeast interactome dataset. This suggests that overlaps should exist among these biological modules.

Based on these findings, we propose a computational method to partition an original PPI network into many densely connected, overlapping biological modules for the prediction of essential proteins. At the same time, we analyze the relationship between the frequencies of proteins in these modules and their essentiality. Fig. 1 shows how the percentage of essential proteins fluctuates under various frequencies of proteins in biological modules derived from CYC2008 [39]. CYC2008 is a benchmark set of functional modules, which consists of 408 modules. In POEM, proteins not contained in any predicted biological module are regarded as nonessential proteins. Proteins whose frequency in modules equals 1 are marked as party hubs, and the rest are date hubs.
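The frequency rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function name `classify_by_frequency` and the toy protein and module names are hypothetical.

```python
from collections import Counter


def classify_by_frequency(proteins, modules):
    """Label each protein by the number of modules that contain it.

    0 modules -> nonessential, 1 module -> party hub, 2+ modules -> date hub.
    """
    freq = Counter(p for module in modules for p in set(module))
    labels = {}
    for p in proteins:
        n = freq.get(p, 0)
        if n == 0:
            labels[p] = "nonessential"
        elif n == 1:
            labels[p] = "party hub"
        else:
            labels[p] = "date hub"
    return labels


# Hypothetical toy data: two overlapping modules that share protein "B".
modules = [{"A", "B", "C"}, {"B", "D", "E"}]
labels = classify_by_frequency(["A", "B", "C", "D", "E", "F"], modules)
```

Here "B" sits in both modules and is labeled a date hub, "A" sits in one module and is a party hub, and "F" is in no module and is labeled nonessential.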

From Fig. 1 we can see that, when the frequency equals zero, the percentage of essential proteins is 13.1%. That is, 13.1% of the proteins not contained in any predicted biological module are essential. When the frequency of proteins in modules equals 1, the percentage of essential proteins is 19.5%, while for frequencies from 6 to 22, more than half of the proteins in these biological modules are essential. As only one protein is contained in 23 different modules, the 0% share of essential proteins at that frequency can be regarded as an accident. These statistical results confirm our hypothesis mentioned above.

Fig. 1. The relationship between the frequencies of proteins in modules and their essentiality.

B. Preliminaries

Proteins in cells are not independent. They interact with each other and make up PPI networks. A PPI network can be modeled as a simple graph $G = (V, E)$, in which a vertex in set $V$ represents a protein and an edge in set $E$ represents an interaction between two distinct proteins. To describe POEM simply and clearly, we first provide the following definitions.

Weighted degree (WD) [43]. Given a weighted PPI network $G = (V, E, W)$ and a vertex $v \in V$, let $N_v = \{u \mid (u, v) \in E\}$ be the set of neighbors of $v$, and let $w(u, v)$ be the weight of edge $(u, v)$. $WD(G, v)$ denotes the weighted degree of $v$ within $G$ and is defined as:

$$WD(G, v) = \sum_{u \in N_v} w(u, v) \qquad (1)$$

For a protein pair of a weighted PPI network, the higher the weight is, the more likely the two proteins interact with each other. The intuition behind the weighting method is simple: if the weight of an interaction reflects its reliability, then the weighted degree should represent the actual interaction network better than the initial binary one. So, the concept of weighted degree is more suitable than the concept of degree to describe the importance of a node within a network.

Average weighted degree (AWD). Now, let us analyze the weighted degree further through an example. Let $u$ and $v$ be vertices of weighted PPI networks $G_1 = (V_1, E_1, W_1)$ and $G_2 = (V_2, E_2, W_2)$, respectively, with $u \in V_1$ and $v \in V_2$. There are 100 vertices connected to $u$, and each edge (interaction) is weighted as 0.01, while $v$ has 10 neighbors and the weight of each edge is 0.1. According to the definition of weighted degree, the weighted degrees of both $u$ and $v$ are 1. However, we think that $v$ is more important and has a closer connection to its network than $u$. In this case, we introduce the concept of average weighted degree to distinguish $u$ and $v$. Given a weighted PPI network $G = (V, E, W)$ and a vertex $v \in V$ with neighbor set $N_v$, $AWD(G, v)$ denotes the average weighted degree of $v$ within $G$ and is defined as:

$$AWD(G, v) = \frac{WD(G, v)}{|N_v|} \qquad (2)$$
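The 100-neighbor versus 10-neighbor example can be checked numerically. The sketch below assumes that the average weighted degree is the weighted degree divided by the number of neighbors, which is the reading suggested by the name and by the example; the dictionary layout and function names are hypothetical.

```python
def weighted_degree(weights, v):
    """Sum of the weights of all edges incident to v.

    `weights` maps each vertex to a dict {neighbor: edge weight}.
    """
    return sum(weights[v].values())


def average_weighted_degree(weights, v):
    """Weighted degree divided by neighbor count (assumed form of AWD)."""
    neighbors = weights[v]
    if not neighbors:
        return 0.0
    return weighted_degree(weights, v) / len(neighbors)


# The example from the text: 100 edges of weight 0.01 vs 10 edges of weight 0.1.
net_u = {"u": {f"n{i}": 0.01 for i in range(100)}}
net_v = {"v": {f"m{i}": 0.1 for i in range(10)}}

wd_u = weighted_degree(net_u, "u")
wd_v = weighted_degree(net_v, "v")
awd_u = average_weighted_degree(net_u, "u")
awd_v = average_weighted_degree(net_v, "v")
```

Up to floating-point rounding, both weighted degrees come out as 1, while the average weighted degrees are about 0.01 and 0.1, so only the second quantity separates the two vertices.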

Aggregation coefficient (AC). In our proposed POEM method, we try to discover essential proteins through the predicted overlapping essential modules. But what is an essential module, and how should it be defined? We suggest that a subgraph representing an essential module should satisfy two simple structural properties: it should contain many reliable interactions between its subunits, and it should be well separated from the rest of the network. So the Aggregation Coefficient is used to partition the original PPI network and generate many essential modules. Given a weighted sub-network $H$ and a vertex $v$ in $H$, let $H'$ be another weighted sub-network, which consists of the neighbors of all vertices in $H$ and their related edges. The Aggregation Coefficient of $v$ in $H$ is defined as [44]: (3)

In this paper, we use the concept of aggregation coefficient to measure whether a subgraph can be marked as an essential module. Actually, the concept of aggregation coefficient is a variation of -module [43].

C. POEM Method

Motivated by the research of Han et al. [36] and Zotenko et al. [38], we propose a new method, named POEM, for the prediction of essential proteins based on overlapping essential modules. In POEM, proteins are classified into three categories: nonessential proteins, date hubs and party hubs. The original PPI network is partitioned into many overlapping modules. Proteins not contained in any module are marked as nonessential proteins. If a protein appears in only one module, it is labeled as a party hub; otherwise, it is labeled as a date hub. Our method consists of three major stages.

The first stage of POEM, vertex weighting, weights all vertices based on the integration of topological properties of the PPI network and gene expression profiles. Edge clustering coefficient (ECC) [45] is widely used to weight edges (interactions) of PPI networks and to identify the modularity of networks.
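For orientation, the plain edge clustering coefficient can be sketched as follows. This assumes the common triangle-based form, the number of common neighbors of the two endpoints divided by min(d_u - 1, d_v - 1); some variants add 1 to the numerator. It is not the AdjustECC used by POEM, and the function name and toy graph are hypothetical.

```python
def edge_clustering_coefficient(adj, u, v):
    """ECC(u, v) = z / min(d_u - 1, d_v - 1).

    z is the number of triangles containing edge (u, v), i.e. the number
    of common neighbors of u and v. `adj` maps vertex -> set of neighbors.
    """
    if v not in adj[u]:
        return 0.0  # no interaction, no edge weight
    z = len(adj[u] & adj[v])
    denom = min(len(adj[u]) - 1, len(adj[v]) - 1)
    return z / denom if denom > 0 else 0.0


# Toy graph: triangle A-B-C plus a pendant vertex D attached to A.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
ecc_ab = edge_clustering_coefficient(adj, "A", "B")
ecc_ad = edge_clustering_coefficient(adj, "A", "D")
```

Edge A-B lies in the only triangle it could lie in, so its coefficient is 1.0, while the pendant edge A-D cannot lie in any triangle and gets 0.0.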
Edges with higher ECC are more likely to be involved in the community structure of networks. Here, however, the adjusted edge clustering coefficient (AdjustECC) is used instead of the edge clustering coefficient, because AdjustECC takes into account the impact of both proteins of a pair on the weight, while ECC amplifies the influence of the protein with high connections. It is a normalization of ECC, where $N_u$ and $N_v$ are the neighborhood sets of $u$ and $v$, respectively. If there is no interaction between $u$ and $v$, AdjustECC$(u, v) = 0$. AdjustECC of edge $(u, v)$ is defined as follows:

(4)

AdjustECC is derived from the statistical observation that many nonessential proteins have high connectivity, while a certain proportion of essential proteins have low connectivity. The effectiveness of AdjustECC heavily depends on the reliability of the PPI network, and PPI networks obtained from high-throughput biological experiments have been found to contain false positives. To improve the precision of essential protein discovery, gene expression information is integrated with network topology. Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These gene products are often proteins. For a protein, the expression levels of its gene at successive time points are collected into a vector. Generally, the cross-correlation (CC) [34] is used to evaluate the probability that two proteins are co-expressed. The CC of a pair of proteins $u$ and $v$ can be calculated as follows:

(5)

In POEM, for a pair of proteins $u$ and $v$, the weight of edge $(u, v)$ is calculated by the following formula:

(6)

After weighting the edges of the PPI network, POEM performs some further processing. First, edges (interactions) with weight equal to zero are removed, after which isolated vertices (proteins) are also deleted. As the cornerstone of our POEM method, overlapping essential modules are derived from the neighborhood graph of each vertex in the PPI network. We believe that not all proteins play an equally important role in identifying essential proteins. So, vertices are sorted in descending order by the average weighted degree in their neighborhood graphs, and then an ordered adjacency matrix is created according to the weighted PPI network.

The second stage, overlapping essential module prediction, takes the adjacency matrix as input and generates many overlapping essential modules through neighborhood graphs. In this stage, the vertex with the highest average weighted degree is selected as a seed, and a neighborhood graph is formed by all neighbors of the seed and their corresponding edges. In particular, for any vertex in the neighborhood graph, if its aggregation coefficient is less than a given threshold, the vertex is removed from the neighborhood graph and loses the opportunity to be a seed. An essential module is formed by the remaining vertices only when the size of the neighborhood graph is over three. This process is repeated for the next highest unseeded vertex in the adjacency matrix. After all vertices of the adjacency matrix have been visited, many overlapping essential modules have been predicted.

The third stage is the computation of ranking scores. For a protein $v$, its ranking score is defined as the sum of the weighted degrees of $v$ in all essential modules. According to the previous description, many proteins lose the opportunity to be a member of any essential module. In other words, not all proteins are contained in essential modules. So, if vertex $v$ is not contained in any essential module, $v$ is identified as a nonessential protein and is assigned a fixed score that is not zero. Why does it not equal zero? According to (4)–(6), the weight of an edge may take a value between -1 and 2. So, it is possible that the sum of the weighted degrees equals zero while the protein is a member of predicted essential modules. Given the set of predicted overlapping essential modules generated at the second stage, the ranking score can be calculated by the following formula:

(7)

Algorithm POEM illustrates the overall framework to predict essential proteins from the PPI network. According to (6), we generate a weighted PPI network from the original PPI network at line 1. Edges whose weights equal zero are removed from the weighted network, and isolated vertices are then also deleted, at line 2. At the last step of the first stage, an adjacency matrix is created at line 3, in which vertices are ordered by their average weighted degree in their neighborhood graphs. For each vertex in the weighted network, we identify preliminary essential modules based on their neighborhood graphs. At line 6, the TAG test means that the vertex has been kicked out from another vertex's neighborhood graph and cannot be selected as a seed. At line 7, a candidate consisting of the seed and its neighbors is formed. If the aggregation coefficient of a member (the seed is excluded) within the candidate is not above the threshold, the member is removed from the candidate and tagged at lines 9–11. The candidate can join the module set as a formal essential module only when its size is greater than three, at lines 12–13. The last stage, ranking proteins, is at lines 15–18. For a protein, the number of occurrences in the predicted essential modules is key to determining its category. When the frequency equals zero, the protein is identified as a nonessential protein; when the frequency equals 1, it is marked as a party hub; otherwise, it is marked as a date hub. All nonessential proteins receive the same fixed score, and the scores of essential proteins, including date hubs and party hubs, are calculated as the sum of their weighted degrees in all predicted essential modules.


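POEM algorithm
Input: A PPI network represented as a graph; the aggregation coefficient threshold
Output: Ranking scores of proteins
1.  Generate a weighted PPI network by Equation (6);
2.  Remove redundant vertices and edges from the weighted network;
3.  Create an adjacency matrix, order vertices by their average weighted degree;
4.  Set the essential module set to empty; // initialization
5.  FOR all vertices v in Matrix DO
6.      IF TAG(v) is set THEN CONTINUE;
7.      Form the candidate from v and its neighbors;
8.      FOR all members u of the candidate, v excluded
9.          IF the aggregation coefficient of u in the candidate is not above the threshold
10.         THEN BEGIN Remove u from the candidate;
11.             Set TAG(u); END
12.     IF Size(candidate) > 3 THEN
13.         Add the candidate to the essential module set;
14. END FOR
15. FOR all proteins v in the weighted network DO
16.     IF v is contained in no essential module THEN assign v the nonessential score;
17.     ELSE assign v the sum of its weighted degrees in all essential modules; END FOR
18. Output the ranking scores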
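The second and third stages can be sketched in Python as follows. This is a hedged illustration under several assumptions, not the authors' implementation: the aggregation coefficient of Equation (3) is not reproduced here, so the code takes it as a caller-supplied function `ac`, and `inside_weight_fraction` is a hypothetical stand-in; duplicate modules are skipped; the aggregation coefficient is evaluated against the candidate as it shrinks; and nonessential proteins get negative infinity as their score.

```python
def predict_modules(weights, ac, threshold):
    """Seed-and-prune sketch of stage two (lines 3-14).

    weights: dict vertex -> {neighbor: edge weight}, symmetric.
    ac: callable(candidate, vertex, weights) -> float, assumed stand-in
        for the aggregation coefficient.
    """
    def awd(v):
        nbrs = weights[v]
        return sum(nbrs.values()) / len(nbrs) if nbrs else 0.0

    order = sorted(weights, key=awd, reverse=True)  # highest AWD first
    tagged, modules = set(), []
    for seed in order:
        if seed in tagged:  # pruned from an earlier candidate, cannot seed
            continue
        candidate = {seed} | set(weights[seed])
        for u in sorted(candidate - {seed}):
            if ac(candidate, u, weights) <= threshold:  # "not above"
                candidate.discard(u)
                tagged.add(u)
        # Assumption: identical modules are kept only once.
        if len(candidate) > 3 and candidate not in modules:
            modules.append(candidate)
    return modules


def rank_proteins(weights, modules, nonessential_score=float("-inf")):
    """Stage three: sum each protein's weighted degree inside its modules."""
    scores = {}
    for v in weights:
        member_of = [m for m in modules if v in m]
        if not member_of:
            scores[v] = nonessential_score  # assumed sentinel value
        else:
            scores[v] = sum(
                w for m in member_of for u, w in weights[v].items() if u in m
            )
    return scores


def inside_weight_fraction(candidate, v, weights):
    """Hypothetical stand-in for AC: share of v's edge weight kept inside."""
    total = sum(weights[v].values())
    inside = sum(w for u, w in weights[v].items() if u in candidate)
    return inside / total if total else 0.0


# Toy network: a 4-clique A-B-C-D, with E weakly tied to D and strongly to F.
weights = {
    "A": {"B": 1.0, "C": 1.0, "D": 1.0},
    "B": {"A": 1.0, "C": 1.0, "D": 1.0},
    "C": {"A": 1.0, "B": 1.0, "D": 1.0},
    "D": {"A": 1.0, "B": 1.0, "C": 1.0, "E": 0.2},
    "E": {"D": 0.2, "F": 1.0},
    "F": {"E": 1.0},
}
modules = predict_modules(weights, inside_weight_fraction, threshold=0.5)
scores = rank_proteins(weights, modules)
```

On this toy network the only module that survives is the clique {A, B, C, D}: E is pruned from D's neighborhood because most of its weight points outside, and the pair {E, F} is too small to form a module. Passing the aggregation coefficient as a parameter keeps the sketch usable with whatever definition of Equation (3) is adopted.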

With respect to computational complexity of the POEM, there are three major stages, as shown in the POEM Algorithm. For the first stage, the time spent on weighting edges of PPI network is , the time for creating an matrix is also . In the second stage, the time for the prediction of overlapping essential modules is . In the last stage, the time spent on computation of ranking scores is . Thus, the time complexity of our algorithm is .

III. RESULTS AND DISCUSSION

A. Experimental Data

The computational analysis is performed using the PPI network of Saccharomyces cerevisiae (baker's yeast), as it has been well characterized by knockout experiments and is widely used in the evaluation of essential proteins. Furthermore, some important studies and discoveries [36]–[38] that form the cornerstone of POEM are based on the yeast interactome network. We first present in detail the results on DIP data [40]; the results using Krogan data [41] are also briefly presented to demonstrate the effectiveness of our proposed method. The DIP dataset, updated to Feb. 18, 2012, consists of 5023 proteins and 22 570 interactions among the proteins. The Krogan dataset consists of 3672 proteins and 14 317 interactions. Self-interactions and repeated interactions are filtered out in both DIP data and Krogan data. The gene expression data of yeast [46], [51] contains 6776 gene products and 36 samples in total, with 4902 genes involved in the DIP network and 3611 genes included in the Krogan network. For proteins which have no corresponding gene expression data, we simply set zero values. A reference set of essential proteins used in our experiments is collected from the following databases: MIPS [47], SGD [48], DEG [1] and SGDP [49]. Among the 1285 essential proteins, 1156 and 929 are present in the DIP network and the Krogan network, respectively.

B. Comparison With Other Centrality Methods

In order to evaluate the performance of the proposed essential protein prediction method, POEM, we make comprehensive comparisons of our method with a representative set of essential protein prediction methods: DC, BC, CC, SC, EC, IC, NC, PeC, and CoEWC. As prior knowledge of some reported essential proteins is required for supervised machine learning methods [14], [33], these methods are not included in the representative set. Proteins are ranked according to the values calculated by each method. After that, the top 100, 200, 300, 400, 500, and 600 of the ranked proteins are selected as candidates for essential proteins. According to the list of known essential proteins, the number of true essential proteins is used to judge the performance of each method.

Fig. 2. Comparison of the number of essential proteins predicted by POEM and nine other competitive centrality methods.

The number of essential proteins detected by POEM and nine other methods from the DIP network is shown in Fig. 2. POEM predicts 2733 essential proteins (hubs) from the DIP network, including 1169 party hubs and 1564 date hubs. As illustrated in Fig. 2, POEM significantly outperforms other methods for the prediction of essential proteins from yeast DIP data.
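The top-k evaluation described above can be sketched as follows. The function name `count_true_positives` and the toy scores are hypothetical, and ties between equal scores are broken by input order here, which is an assumption the text does not address.

```python
def count_true_positives(scores, essential, ks=(100, 200, 300, 400, 500, 600)):
    """Count known essential proteins among the top-k ranked proteins.

    scores: dict protein -> ranking score (higher means more likely essential).
    essential: set of proteins in the reference set of essential proteins.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {k: sum(1 for p in ranked[:k] if p in essential) for k in ks}


# Hypothetical toy ranking with four proteins, two of them essential.
toy_scores = {"P1": 9.0, "P2": 7.5, "P3": 4.0, "P4": 1.0}
toy_essential = {"P1", "P3"}
hits = count_true_positives(toy_scores, toy_essential, ks=(1, 2, 3))
```

For the toy data the counts are 1, 1 and 2 for the top 1, 2 and 3 proteins. Dividing each count by k gives the prediction accuracy at that cutoff.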
By selecting the top 100 proteins, POEM obtains a prediction accuracy of 82%. NC gets the best performance among the seven centrality measures based only on network topological properties. Compared with NC, the prediction accuracy of POEM is improved by 32.26%, 20.63%, 20.11%, 16.38%, 15.64%, and 12.3% from the top 100 to the top 600 proteins, respectively. POEM also outperforms both PeC and CoEWC, which predict essential proteins by integrating network topology and gene expression. In particular, the more candidate proteins are selected, the larger the margin by which POEM predicts more essential proteins than PeC and CoEWC. From Fig. 2 we can see that PeC performs better than CoEWC for the top 100 proteins, while CoEWC outperforms PeC from the top 200 to the top 600 proteins. POEM always achieves the best performance among the three methods that integrate network topology and gene expression. Fig. 2 also shows an interesting phenomenon: EC and SC generate the same ranking order of predicted essential proteins, though they have different ranking scores. For this reason, EC is not included in the later comparisons.

TABLE I OVERLAP AND DIFFERENT PROTEINS PREDICTED BY POEM AND OTHER COMPETITIVE CENTRALITY METHODS RANKED IN TOP 200 PROTEINS

C. Validation With Jackknife Methodology

A comprehensive performance comparison between the proposed essential protein prediction method POEM and NC, PeC and CoEWC (DC, BC, CC, SC, EC, and IC are discarded because of their low numbers of predicted essential proteins) is made by using a jackknife methodology [50], [51]. The experimental results are described in Fig. 3. In Fig. 3, the X-axis represents the proteins ranked from the highest score to the lowest score for each method, while the Y-axis is the cumulative count of essential proteins with respect to the ranked proteins. The areas under the curve (AUC) for POEM and the other existing centrality measures are used for the comparison of their performance. AUC is considered to be the standard method to assess the accuracy of predictive distribution models.
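A jackknife curve of this kind, together with its area, can be sketched as follows. The area is approximated here as the sum of the cumulative counts (unit-width rectangles), which is an assumption about how the reported AUC values were obtained; the function names and toy data are hypothetical.

```python
from itertools import accumulate


def jackknife_curve(scores, essential):
    """Cumulative count of essential proteins along the ranked list.

    Entry i is the number of essential proteins among the top i+1 proteins.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return list(accumulate(1 if p in essential else 0 for p in ranked))


def area_under_curve(curve):
    """Area under the cumulative curve using unit-width rectangles."""
    return sum(curve)


# Hypothetical toy ranking with four proteins, two of them essential.
toy_scores = {"P1": 9.0, "P2": 7.5, "P3": 4.0, "P4": 1.0}
toy_essential = {"P1", "P3"}
curve = jackknife_curve(toy_scores, toy_essential)
auc = area_under_curve(curve)
```

The toy curve is [1, 1, 2, 2] with area 6. A method that ranks essential proteins earlier produces a curve that rises sooner and therefore encloses a larger area.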
The AUC avoids the supposed subjectivity in the threshold selection process, when continuous probability-derived scores are converted to a binary presence-absence variable, by summarizing overall model performance over all possible thresholds. Furthermore, 10 random assortments [50], [51] are also plotted for comparison.

Fig. 3. Jackknife curves of POEM and eight other existing centrality methods using DIP data.

Fig. 3 shows the comparison of POEM, NC, PeC, and CoEWC. As the best of the network topology based methods, NC achieves comparable results up to the top 1100 proteins. As shown in this figure, POEM clearly performs much better than the two methods that identify essential proteins by integrating gene expression data with PPI data. Moreover, all four methods appearing in Fig. 3 achieve better prediction performance than the randomized sorting. In addition, the areas under the curves in Fig. 3 are compared. The AUCs of NC, PeC, CoEWC and POEM are 2.6E+05, 2.68E+05, 2.77E+05, and 2.88E+05, respectively. These comparison results show that POEM is more effective and suitable for the identification of essential proteins.

D. Analysis of the Differences Between POEM and the Eight Centrality Methods

To further understand why POEM outperforms the other centrality measures for the discovery of essential proteins, we compare the proteins ranked in the top 200 by each method, including DC, IC, SC, BC, CC, NC, PeC, CoEWC, and POEM. First, the comparison is made by investigating how many overlapping and different proteins are discovered by POEM and by each of the other eight centrality measures. The numbers of overlapping and different proteins between POEM and each of the other centrality measures are shown in Table I. |Mi ∩ POEM| denotes the number of common proteins detected by both POEM and one of the other eight centrality methods Mi. {Mi-POEM} (or {POEM-Mi}) represents the set of proteins detected by Mi

ZHAO et al.: PREDICTION OF ESSENTIAL PROTEINS BASED ON OVERLAPPING ESSENTIAL MODULES

421

TABLE II INFORMATION OF PROTEINS RANKED IN TOP 200 BY POEM, NC, PeC, AND CoEWC

TABLE III SIZE OF MODULES TO WHICH PROTEINS RANKED IN TOP 200 BY POEM, NC, PeC, AND CoEWC BELONGING

Fig. 4. Comparison of the percentage of essential proteins out of all the different proteins between PeC and eight other methods.

(or POEM), but not by POEM (or Mi). |Mi-POEM| is the number of proteins in set {Mi-POEM}. As described in Table I, the proportion of common proteins detected by both POEM and each of DC, IC, SC, BC, and CC is less than 20%, the proportion shared by POEM and NC is no more than 50%, and the proportion shared by POEM and PeC or CoEWC is less than about 70%. Such a small overlap between the proteins predicted by POEM and those predicted by the eight existing methods shows that POEM is a distinct centrality measure that differs substantially from the others. The fourth column in Table I gives the number of nonessential proteins among the different proteins identified by Mi but not predicted by POEM. Further investigation of these nonessential proteins shows that more than three-quarters of those detected by the five centrality measures based on network topological features (DC, IC, SC, BC, and CC) have very low POEM scores (less than 5.6), and that 47.1%, 50%, and 26.1% of the nonessential proteins predicted by NC, PeC, and CoEWC, respectively, have very low POEM scores (less than 5.6).

In addition, we compare the percentage of essential proteins among the different proteins identified by POEM with that of the other centrality measures. Fig. 4 shows the percentage of essential proteins out of all the different proteins between POEM and the other methods. For these different proteins, the percentage of essential proteins predicted by POEM is consistently higher than that of the eight other methods. SC and PeC are selected as two extreme examples for the analysis because they have the maximum and minimum numbers of different proteins from POEM, respectively. Compared with SC, out of the top 200 proteins, there are 180 different proteins detected by POEM. About 77% of these proteins are essential, whereas only 36% of the different proteins detected by SC but not by POEM are essential. In the other case, there are 59 different proteins between POEM and PeC. Among these different proteins, more than 69% of those identified by POEM are essential, whereas fewer than 56% of those identified by PeC are essential. Similar results are obtained for the remaining centrality measures: DC, IC, BC, CC, NC, and CoEWC.

Since POEM is designed to predict essential proteins by detecting overlapping essential modules, proteins with high POEM ranking scores should be conserved and exhibit modularity. To validate this hypothesis, we select the proteins ranked in the top 200 by POEM, NC, PeC, and CoEWC, respectively. Using the known functional module set CYC2008 [39], which consists of 408 manually annotated modules, the proteins in each list are annotated with the indices of the modules to which they belong. Table II lists the statistics of these proteins. As Table II shows, compared with NC, PeC, and CoEWC, POEM detects more true essential proteins, and more of the proteins ranked in the top 200 by POEM belong to modules with a certain biological function. The fourth column in Table II gives the number of modules that contain proteins ranked in the top 200 by each method. Fewer modules contain proteins ranked in the top 200 by POEM than contain proteins ranked by the other methods. From the third and fourth columns of Table II, the average size of modules for NC, PeC, CoEWC, and POEM is 1.51, 1.74, 1.72, and 3.18, respectively. In addition, we count the sizes of the modules to which the proteins ranked in the top 200 by NC, PeC, CoEWC, and POEM belong. Table III lists the sizes of these modules.
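The module-based summary described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the module set is available as a dictionary from a module index to its member proteins, and that the "average size" is the number of top-ranked proteins found in at least one module divided by the number of modules they occupy. The function name `module_statistics` and the toy identifiers are hypothetical.

```python
from collections import Counter


def module_statistics(top_proteins, modules):
    """Summarise how a ranked protein list is spread over annotated modules.

    top_proteins: iterable of protein identifiers (e.g. a method's top 200).
    modules: dict mapping a module index to the set of its member proteins.
    """
    top = set(top_proteins)
    # Count the top-ranked proteins in each module; skip modules with none.
    hits = {}
    for index, members in modules.items():
        shared = top & set(members)
        if shared:
            hits[index] = len(shared)
    # Proteins that fall into at least one annotated module.
    covered = {p for members in modules.values() for p in top & set(members)}
    n_modules = len(hits)
    # Assumed definition: annotated proteins divided by modules they occupy.
    average = len(covered) / n_modules if n_modules else 0.0
    # How many modules contain exactly k top-ranked proteins.
    histogram = dict(Counter(hits.values()))
    return {
        "proteins_in_modules": len(covered),
        "modules_hit": n_modules,
        "average_per_module": average,
        "hit_histogram": histogram,
    }


# Toy example with hypothetical identifiers (not data from the paper).
toy_modules = {1: {"A", "B", "C"}, 2: {"C", "D"}, 3: {"X"}}
stats = module_statistics(["A", "B", "C", "E"], toy_modules)
print(stats)
```

Running the same function on the top-200 list of each method would give one row per method in the style of Table II, and the histogram corresponds to the kind of size counts reported in Table III.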
As shown in Table III, among the four methods POEM has the fewest modules containing only one protein ranked in the top 200 and the most modules containing more than five proteins ranked in the top 200. For example, 18 of the proteins ranked in the top 200 by POEM belong to


Fig. 5. Jackknife curves of POEM when ACT falls into [0,1).

Fig. 6. Jackknife curves of Naïve_POEM and six other centrality measures using DIP data.

functional module 370. Module 370 is the 19/22S regulator, and its GO term is GO:0008541 (proteasome regulatory particle, lid subcomplex). For NC, PeC, and CoEWC, only 12, 14, and 14 of their top 200 proteins, respectively, belong to module 370. These statistics indicate that the essential proteins predicted by POEM have the strongest modularity and the most connections with each other among all the methods. The results also agree with the premise of POEM that the majority of hubs are essential because of their involvement in essential biological modules rather than as individual proteins. An essential biological module is a group of densely connected proteins with a shared biological function.

E. Effects of Parameter ACT

In POEM, to evaluate the aggregation degree of a neighborhood graph for the detection of essential modules, we employ a user-defined parameter ACT. Because ACT describes the aggregation degree of a subgraph, it follows from the definition of the aggregation coefficient that ACT falls into [0, 1]. To study the effect of ACT on the performance of POEM using the DIP data, we evaluate the prediction accuracy for different values of ACT and plot the corresponding jackknife curves in Fig. 5. POEM achieves the best performance when ACT is set to 0 or 0.1, and the prediction accuracy then decreases as the ACT value increases. In particular, the number of essential proteins predicted by POEM is less than 1200 when the ACT value is above 0.4. The prediction accuracy of POEM is slightly higher with ACT set to 0.1 than with ACT set to 0. We therefore recommend 0.1 as the optimum value for the DIP data.

F. Comparison With Other Centrality Measures

Eigenvector Centrality (EC), Degree Centrality (DC), Closeness Centrality (CC), Betweenness Centrality (BC), Information Centrality (IC), and Subgraph Centrality (SC) are based only on properties extracted from network edges and vertices, while the POEM algorithm also combines the normalized edge-clustering coefficient (AdjustECC) and cross-correlation. For a further and fair comparison of POEM and other centrality

Fig. 7. Comparison results by a jackknife methodology using Krogan data.

measures, we eliminate the impact of AdjustECC and cross-correlation from POEM. That is, the weights of all interactions in the PPI network are set to 1. The revised POEM algorithm is named Naive_POEM. Fig. 6 shows the comparison of Naive_POEM, EC, DC, CC, BC, IC, and SC. From Fig. 6 we can see that Naive_POEM still performs better than the other centrality measures. We can therefore conclude that POEM outperforms the other methods because of the predicted overlapping essential modules. AdjustECC and cross-correlation are used to reduce the negative impact of noise on the prediction.

G. Prediction Performance of POEM Based on Krogan Data

For a comprehensive performance comparison between POEM and other methods, we also predict essential proteins using the Krogan data. On the Krogan PPI network, the ranking scores of proteins are calculated using POEM, NC, PeC, and CoEWC (DC, BC, CC, SC, EC, and IC are omitted because of their low number of predicted proteins). POEM predicts 2500 essential proteins (hubs) from the Krogan network, including 969 party hubs and 1531 date hubs. The jackknife curves of each method and of the 10 random assortments are plotted in Fig. 7. All of these experimental results


indicate that POEM still outperforms other centrality methods using Krogan data.

IV. CONCLUSIONS

Essential proteins are indispensable for cell survival. Identifying essential proteins is very important for improving our understanding of how a cell works. Recent developments in experiments have resulted in the publication of many high-quality, large-scale PPI data sets, which enable us to predict essential proteins using computational approaches. Many network topology-based centrality measures for the discovery of essential proteins have been proposed. However, most of them ignore the modularity of essential proteins and pay more attention to the proteins themselves. In this paper, we propose a method based on overlapping essential modules for the prediction of essential proteins, named POEM. To improve the prediction accuracy, gene expression profiles are integrated with network topological features. In POEM, proteins are classified into three categories (nonessential proteins, date hubs, and party hubs) according to their frequencies in the predicted essential modules. To evaluate the performance of POEM, we have applied our method to two yeast PPI networks. The experimental results show that POEM outperforms the existing methods for the prediction of essential proteins.

REFERENCES

[1] R. Zhang and Y. Lin, “DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes,” Nucleic Acids Res., vol. 37, no. suppl 1, pp. D455–D458, 2009.
[2] J. I. Glass, C. A. Hutchison, H. O. Smith, and J. C. Venter, “A systems biology tour de force for a near-minimal bacterium,” Mol. Syst. Biol., vol. 5, no. 1, pp. 1–3, 2009.
[3] A. E. Clatworthy, E. Pierson, and D. T. Hung, “Targeting virulence: A new paradigm for antimicrobial therapy,” Nature Chem. Biol., vol. 3, no. 9, pp. 541–548, 2007.
[4] J. Wang, P. Wei, and F. X. Wu, “Computational approaches to predicting essential proteins: A survey,” Proteomics Clin. Appl., vol. 7, no. 1, pp. 181–192, 2013.
[5] L. M. Cullen and G. M. Arndt, “Genome-wide screening for gene function using RNAi in mammalian cells,” Immunol. Cell Biol., vol. 83, no. 3, pp. 217–223, 2005.
[6] T. Roemer, B. Jiang, J. Davison, T. Ketela, and K. Veillette et al., “Large-scale essential gene identification in Candida albicans and applications to antifungal drug discovery,” Mol. Microbiol., vol. 50, no. 1, pp. 167–181, 2003.
[7] Y. Lin and R. R. Zhang, “Putative essential and core-essential genes in Mycoplasma genomes,” Sci. Rep., vol. 1, pp. 1–7, 2011.
[8] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki, “A comprehensive two-hybrid analysis to explore the yeast protein interactome,” Proc. Natl. Acad. Sci., vol. 98, no. 8, pp. 4569–4574, 2001.
[9] M. Li, J. Chen, J. Wang, B. Hu, and G. Chen, “Modifying the DPClus algorithm for identifying protein complexes based on new topological structures,” BMC Bioinformat., vol. 9, p. e398, 2008.
[10] Y. Ho, A. Gruhler, A. Heilbut, G. Bader, and L. Moore et al., “Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry,” Nature, vol. 415, no. 6868, pp. 180–183, 2002.
[11] A. M. Gustafson, E. S. Snitkin, S. C. Parker, C. DeLisi, and S. Kasif, “Towards the identification of essential genes using targeted genome sequencing and comparative analysis,” BMC Genomics, vol. 7, no. 1, p. 265, 2006.
[12] J. Zhong, J. Wang, W. Peng, Z. Zhang, and Y. Pan, “Prediction of essential proteins based on gene expression programming,” BMC Genomics, vol. 14, Suppl. 4, S7, 2013.
[13] J. Deng, L. Deng, S. Su, M. Zhang, X. Lin, L. Wei, A. Minai, D. J. Hassett, and L. Lu, “Investigating the predictability of essential genes across distantly related organisms using an integrative approach,” Nucleic Acids Res., vol. 39, pp. 795–807, 2011.


[14] M. L. Acencio and N. Lemke, “Towards the prediction of essential genes by integration of network topology, cellular localization and biological process information,” BMC Bioinformat., vol. 10, no. 1, p. 290, 2009.
[15] X. He and J. Zhang, “Why do hubs tend to be essential in protein networks?,” PLoS Genet., vol. 2, no. 6, p. e88, 2006.
[16] R. R. Vallabhajosyula, D. Chakravarti, S. Lutfeali, and A. Ray, “Identifying Hubs in protein interaction networks,” PLoS ONE, vol. 4, no. 4, p. e5344, 2009.
[17] H. Yu, D. Greenbaum, L. H. Xin, X. Zhu, and M. Gerstein, “Genomic analysis of essentiality within protein networks,” Trends Genet., vol. 20, no. 6, pp. 227–231, 2004.
[18] M. W. Hahn and A. D. Kern, “Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks,” Mol. Biol. Evol., vol. 22, no. 4, pp. 803–806, 2005.
[19] S. Wuchty, “Interaction and domain networks of yeast,” Proteomics, vol. 2, no. 12, pp. 1715–1723, 2002.
[20] P. Bonacich, “Power and centrality: A family of measures,” Amer. J. Sociol., vol. 92, no. 5, pp. 1170–1182, 1987.
[21] K. Stephenson and M. Zelen, “Rethinking centrality: Methods and examples,” Social Netw., vol. 11, no. 1, pp. 1–37, 1989.
[22] H. Jeong, S. P. Mason, and A. L. Barabasi, “Lethality and centrality in protein networks,” Nature, vol. 411, no. 6833, pp. 41–42, 2001.
[23] S. Wuchty and P. F. Stadler, “Centers of complex networks,” J. Theor. Biol., vol. 223, no. 1, pp. 45–53, 2003.
[24] M. P. Joy, A. Brock, D. E. Ingber, and S. Huang, “High-betweenness proteins in the yeast protein interaction network,” BioMed Res. Int., vol. 2005, no. 2, pp. 96–103, 2005.
[25] E. Estrada and J. A. Rodríguez-Velázquez, “Subgraph centrality in complex networks,” Phys. Rev. E, vol. 71, no. 5, pp. 1–9, 2005.
[26] J. Wang, M. Li, H. Wang, and Y. Pan, “Identification of essential proteins based on edge clustering coefficient,” IEEE/ACM Trans. Comput. Biol. Bioinformat., vol. 9, no. 4, pp. 1070–1080, 2012.
[27] D. Koschützki, S. Henning, and S. Falk, “Ranking of network elements based on functional substructures,” J. Theor. Biol., vol. 248, no. 3, pp. 471–479, 2007.
[28] O. Sporns and R. Kötter, “Motifs in brain networks,” PLoS Biol., vol. 2, no. 11, p. e369, 2004.
[29] P. Wang, X. Yu, and J. Lu, “Identification and evolution of structurally dominant nodes in protein-protein interaction networks,” IEEE Trans. Biomed. Circuits Syst., vol. 8, no. 1, pp. 87–97, 2014.
[30] R. Mrowka, A. Patzak, and H. Herzel, “Is There a Bias in Proteome Research?,” Genome Res., vol. 11, no. 12, pp. 1971–1973, 2001.
[31] M. Hsing, K. Byler, and A. Cherkasov, “The use of Gene Ontology terms for predicting highly-connected ‘hub’ nodes in protein-protein interaction networks,” BMC Syst. Biol., vol. 2, no. 1, p. 80, 2008.
[32] J. Wang, X. Peng, M. Li, and Y. Pan, “Construction and application of dynamic protein interaction network based on time course gene expression data,” Proteomics, vol. 13, no. 2, pp. 301–312, 2013.
[33] W. Kim, “Prediction of essential proteins using topological properties in GO-pruned PPI network based on machine learning methods,” Tsinghua Sci. Technol., vol. 17, no. 6, pp. 645–658, 2012.
[34] M. Li, H. Zhang, J. Wang, and Y. Pan, “A new essential protein discovery method based on the integration of protein-protein interaction and gene expression data,” BMC Syst. Biol., vol. 6, no. 1, p. 15, 2012.
[35] X. Zhang, J. Xu, and W. Xiao, “A new method for the discovery of essential proteins,” PLoS ONE, vol. 8, no. 3, p. e58763, 2013.
[36] J. D. Han, N. Bertin, and T. Hao et al., “Evidence for dynamically organized modularity in the yeast protein-protein interaction network,” Nature, vol. 430, no. 6995, pp. 88–93, 2004.
[37] N. Bertin, N. Simonis, D. Dupuy, M. E. Cusick, J. J. Han, H. B. Fraser, F. P. Roth, and M. Vidal, “Confirmation of organized modularity in the yeast interactome,” PLoS Biol., vol. 5, no. 6, p. e153, 2007.
[38] E. Zotenko, J. Mestre, D. P. O’Leary, and T. M. Przytycka, “Why do hubs in the yeast protein interaction network tend to be essential: Reexamining the connection between the network topology and essentiality,” PLoS Comput. Biol., vol. 4, no. 8, p. e1000140, 2008.
[39] X. Tang, Q. Feng, J. Wang, Y. He, and Y. Pan, “Clustering based on multiple biological information: Approach for predicting protein complexes,” IET Syst. Biol., vol. 7, no. 5, pp. 223–230, 2013.
[40] J. Wang, X. Peng, W. Peng, and F. Wu, “Dynamic protein interaction network construction and applications,” Proteomics, vol. 14, no. 4, pp. 338–352, 2014.
[41] W. Peng, J. Wang, B. Zhao, and L. Wang, “Identification of protein complexes using weighted PageRank-Nibble algorithm and core-attachment structure,” IEEE/ACM Trans. Comput. Biol. Bioinformat., 2014, 10.1109/TCBB.2014.2343954.


[42] G. T. Hart, I. Lee, and E. M. Marcotte, “A high-accuracy consensus map of yeast protein complexes reveals modular nature of gene essentiality,” BMC Bioinformat., vol. 8, no. 1, p. 236, 2007.
[43] J. Wang, M. Li, J. Chen, and Y. Pan, “A fast hierarchical clustering algorithm for functional modules discovery in protein interaction networks,” IEEE/ACM Trans. Comput. Biol. Bioinformat., vol. 8, no. 3, pp. 607–620, 2011.
[44] B. H. Zhao, J. Wang, M. Li, F. Wu, and Y. Pan, “Detecting protein complexes based on uncertain graph model,” IEEE/ACM Trans. Comput. Biol. Bioinformat., vol. 11, no. 3, pp. 486–497, 2014.
[45] X. Tang, J. Wang, J. Zhong, and Y. Pan, “Predicting essential proteins based on weighted degree centrality,” IEEE/ACM Trans. Comput. Biol. Bioinformat., vol. 11, no. 2, pp. 407–418, 2014.
[46] M. Li, X. Wu, J. Wang, and Y. Pan, “Towards the identification of protein complexes and functional modules by integrating PPI network and gene expression data,” BMC Bioinformat., vol. 13, no. 19, 2012.
[47] X. Ding, W. Wang, X. Peng, and J. Wang, “Mining protein complexes from PPI Networks using the minimum vertex cut,” Tsinghua Sci. Technol., vol. 6, no. 17, pp. 674–681, 2012.
[48] J. M. Cherry, “SGD: Saccharomyces Genome Database,” Nucleic Acids Res., vol. 26, no. 9, pp. 73–79, 1998.
[49] Saccharomyces Genome Deletion Project [Online]. Available: http://www-sequence.stanford.edu/group/
[50] W. Peng, J. Wang, W. Wang, Q. Liu, and F. Wu, “Iteration method for predicting essential proteins based on orthology and protein-protein interaction networks,” BMC Syst. Biol., vol. 6, no. 1, p. 87, 2012.
[51] M. Li, R. Zheng, H. Zhang, J. Wang, and Y. Pan, “Effective identification of essential proteins based on prior knowledge, network topology and gene expressions,” Methods, vol. 67, no. 3, pp. 325–333, 2014.

Bihai Zhao received the master’s degree in computer science from Central South University, China, in 2005. Currently, he is a Ph.D. student in the School of Information Science and Engineering, Central South University, China. His current research interests include molecular systems biology, data mining, and uncertain data management.

Jianxin Wang (SM’12) received the B.Eng. and M.Eng. degrees in computer engineering from Central South University, China, in 1992 and 1996, respectively, and the Ph.D. degree in computer science from Central South University, China, in 2001. He is the chair of and a professor in the Department of Computer Science, Central South University, Changsha, Hunan, China. His current research interests include algorithm analysis and optimization, parameterized algorithms, bioinformatics, and computer networks.

Min Li received the Ph.D. degree in computer science from Central South University, China, in 2008. She is currently an Associate Professor at the School of Information Science and Engineering, Central South University, Changsha, Hunan, China. Her main research interests include bioinformatics and systems biology.

Fang-xiang Wu received the B.S. degree and the M.S. degree in applied mathematics, both from Dalian University of Technology, Dalian, China, in 1990 and 1993, respectively, the first Ph.D. degree in control theory and its applications from Northwestern Polytechnical University, Xi’an, China, in 1998, and the second Ph.D. degree in biomedical engineering from University of Saskatchewan (U of S), Saskatoon, Canada, in 2004. During 2004–2005, he worked as a Postdoctoral Fellow in the Laval University Medical Research Center (CHUL), Quebec City, Canada. Dr. Wu is currently a Professor of Bioengineering in the Department of Mechanical Engineering and the Graduate Chair of the Division of Biomedical Engineering at the U of S. His current research interests include computational and systems biology, genomic and proteomic data analysis, biological system identification and parameter estimation, and applications of control theory to biological systems. Dr. Wu has published more than 170 technical papers in refereed journals and conference proceedings. Dr. Wu is serving as an Editorial Board Member or Guest Editor of a number of refereed journals and as a Program Committee Chair or Member of several international conferences. He has also reviewed papers for many refereed journals.

Yi Pan (SM’91) received the B.Eng. and M.Eng. degrees in computer engineering from Tsinghua University, China, in 1982 and 1984, respectively, and the Ph.D. degree in computer science from the University of Pittsburgh, Pittsburgh, PA, USA, in 1991. He is the chair of and a professor in the Department of Computer Science at Georgia State University, Atlanta, GA, USA, and a Changjiang Chair professor in the Department of Computer Science at Central South University, China. His research interests include parallel and distributed computing, networks, and bioinformatics. He has published more than 100 journal papers with 50 papers published in various IEEE/ACM journals. In addition, he has published more than 100 papers in refereed conferences. He has also authored/edited 34 books (including proceedings) and contributed many book chapters. He has served as the editor-in-chief or an editorial board member for 15 journals, including six IEEE TRANSACTIONS, and a guest editor for 10 journals, including the IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS and the IEEE TRANSACTIONS ON NANOBIOSCIENCE. He has organized several international conferences and workshops and has also served as a program committee member for several major international conferences such as BIBE, BIBM, ISBRA, INFOCOM, GLOBECOM, ICC, IPDPS, and ICPP. He has delivered more than 10 keynote speeches at many international conferences and is a speaker for several distinguished speaker series. He is listed in Men of Achievement, Who’s Who in Midwest, Who’s Who in America, Who’s Who in American Education, Who’s Who in Computational Science and Engineering, and Who’s Who of Asian Americans.
