Identifying protein complexes based on the integration of PPI network and gene expression data.


Int. J. Bioinformatics Research and Applications, Vol. 11, No. 1, 2015

Weijie Chen, Min Li*, Xuehong Wu and Jianxin Wang

School of Information Science and Engineering, Central South University, Changsha 410083, China

*Corresponding author

Abstract: Identification of protein complexes is crucial for understanding the principles of cellular organisation and for predicting protein functions. In this paper, a novel protein complex discovery algorithm, IPCIPG, is proposed based on the integration of the Protein–Protein Interaction network (PPI network) and gene expression data. IPCIPG is a local search algorithm with two versions: IPCIPG-n for identifying non-overlapping clusters and IPCIPG-o for detecting overlapping clusters. The experimental results on the yeast PPI network show that IPCIPG can identify protein complexes with specific biological meaning more effectively, precisely and comprehensively than six other algorithms: HUNTER, HC-PIN, CMC, SPICi, MCODE and MCL.

Keywords: protein complex; protein–protein interaction network; gene expression.

Reference to this paper should be made as follows: Chen, W., Li, M., Wu, X. and Wang, J. (2015) ‘Identifying protein complexes based on the integration of PPI network and gene expression data’, Int. J. Bioinformatics Research and Applications, Vol. 11, No. 1, pp.30–44.

Biographical notes: Weijie Chen is a Master's student at Central South University, majoring in bioinformatics and data mining.

Min Li is currently an Associate Professor in the School of Information Science and Engineering, Central South University. Her research interests include data mining and bioinformatics.

Xuehong Wu is a Master's student at Central South University. His current research covers bioinformatics and data mining.

Jianxin Wang is currently a Professor and an Associate Dean in the School of Information Science and Engineering, Central South University. His research interests include data mining and bioinformatics.

Copyright © 2015 Inderscience Enterprises Ltd.

1 Introduction

In the post-genome era, analysing and understanding life activity through protein–protein interactions has become one of the hottest research issues (Barabasi and Oltvai, 2004). Within cells, proteins seldom act as single isolated units to perform their functions; instead, they interact with other proteins to form fundamental functional units (Gavin et al., 2002; Gavin et al., 2006). With advances in high-throughput techniques such as yeast two-hybrid, mass spectrometry and protein chip technologies, a large amount of protein–protein interaction data has been catalogued (Xenarios et al., 2002). Identifying protein complexes from a Protein–Protein Interaction network (PPI network) constructed from such data is crucial for predicting protein functions and explaining specific biological processes.

Since 2000, a series of algorithms have been proposed for identifying protein complexes from PPI networks. Girvan and Newman (2002) developed a divisive algorithm, named G-N, which detects community structures in a complex network by separating it. The G-N algorithm calculates the betweenness of each edge and assumes that edges with high betweenness connect different communities. Spirin and Mirny (2003) proposed three algorithms to find protein complexes and functional modules. One enumerates all the maximal cliques, and the others apply Super-Paramagnetic Clustering (SPC) and a Monte Carlo (MC) simulation. Bader and Hogue (2003) developed a method, MCODE, which finds protein complexes by using a seed-extension model. MCODE consists of three stages: vertex weighting, complex prediction and optional post-processing. King et al. (2004) explored a cost function for the best partition of networks. Palla et al. (2005) proposed a clique percolation method, named CPM, to identify overlapping communities in complex networks. Altaf-Ul-Amin et al. (2006) developed a method, DPClus, for detecting non-overlapping protein complexes in PPI networks, and they also identified overlapping protein complexes by extending the functional modules' neighbours in the PPI network with DPClus. Luo et al. (2007) modified the definition of a module by extending the concept of degree from vertices to subgraphs, and developed an agglomerative algorithm, MoNet, by combining the new module definition with the relative edge order generated by the G-N algorithm. Li et al. (2008) introduced ‘distance’ as a parameter to control the identification of protein complexes. Liu et al. (2009) proposed a method named CMC, which assigns each interaction a score based on a reliability measure and detects protein complexes from this weighted graph based on maximal cliques. Wang et al. (2010a) developed a method named HC-PIN, which detects protein complexes by using the clustering coefficient of edges. HC-PIN can be applied to both weighted and unweighted graphs. Wang et al. (2011b) presented a topological algorithm named HKC, which predicts overlapping clusters by using the definitions of highest k-score and cohesion. Ding et al. (2012) explored the minimum vertex cuts of the PPI network to find protein complexes. Chen et al. (2013) introduced a method that uses clique seeds and graph entropy to find protein complexes. Besides the clustering methods introduced above, many other protein complex discovery methods have been published in recent years; these are covered in recent reviews such as Wang et al. (2010b) and Li et al. (2010).

On the whole, protein complex discovery methods can be grouped into three types: methods that detect non-overlapping clusters, methods that identify overlapping clusters, and methods that can detect both non-overlapping and overlapping clusters.
Recently, the third type of method has attracted more and more attention, because researchers hold different opinions on whether non-overlapping or overlapping clusters should be detected, and because non-overlapping and overlapping clusters both have biological meaning on different occasions. Hence, in this paper we develop a new method, IPCIPG, to identify both non-overlapping and overlapping protein complexes.

In IPCIPG, we integrate the PPI network and gene expression data, considering that PPI networks generally contain a certain rate of false positive and false negative interactions. Previous studies have shown that gene expression profiles are among the most important data for reducing the effect of noise in PPI networks. For example, Chin et al. (2010) developed a hub-based algorithm, HUNTER, which detects protein complexes by using the gene expression profile to weight the PPI network and improve the quality of the outcomes. Srihari and Leong (2012) proposed a method that adds the gene expression profile to define ‘static’ and ‘dynamic’ modules in PPI networks. Unlike previous methods, which use gene expression data to evaluate the reliability of PPIs, IPCIPG integrates the gene expression data into the PPI network during the identification of protein complexes. The experimental results show that IPCIPG can efficiently identify non-overlapping and overlapping clusters and that the integration of gene expression data improves the precision of protein complex prediction. We also compared IPCIPG with six other algorithms: HUNTER (Chin et al., 2010), HC-PIN (Wang et al., 2010a), SPICi (Peng and Singh, 2010), CMC (Liu et al., 2009), MCODE (Bader and Hogue, 2003) and MCL (Enright et al., 2002). The experimental results on the data of Saccharomyces cerevisiae show that IPCIPG outperforms all six algorithms in terms of matching with known complexes.

2 Algorithm IPCIPG

A PPI network is generally represented as an undirected simple graph G(V, E), in which each node represents a protein and each edge represents an interaction between two proteins. Many previous studies have revealed that protein complexes in a PPI network generally correspond to dense sub-graphs. In this paper, our target is also to identify protein complexes by detecting dense sub-graphs in a PPI network. Unlike previous topology-based methods, our proposed algorithm IPCIPG is based on the integration of the PPI network and gene expression data for protein complex identification. A protein complex predicted by IPCIPG is a set of proteins that form a multi-molecular machine through the interactions among them at the same time and place. The integration of gene expression data helps to determine whether the proteins in a cluster are co-expressed.

The proposed algorithm IPCIPG is a local search method based on the seed-extension model. IPCIPG has two main steps. The first step is to select a seed node, and the second step is to extend from the seed node by adding new nodes. Two different extension rules are developed for IPCIPG, because identifying non-overlapping clusters and detecting overlapping clusters are both important in the prediction of protein complexes. Corresponding to the two extension rules, IPCIPG has two versions: IPCIPG-n for generating non-overlapping clusters and IPCIPG-o for detecting overlapping clusters. IPCIPG-n and IPCIPG-o weight nodes and select seeds in the same way. In the following subsections, we first introduce how to weight nodes and select seeds, and then describe the extension rules of IPCIPG-n and IPCIPG-o, respectively.



2.1 Weighting nodes and selecting seeds

Traditional seed-extension-based algorithms generally choose high-degree nodes or nodes with large clustering coefficients as seeds. In this paper, we select seeds not only from a topological view but also from a biological perspective. To simplify the description, some related definitions are given as follows.

Definition 1: Edge Clustering Coefficient (ECC) (Wang et al., 2011a). Given an edge (u, v) ∈ E, its edge clustering coefficient ECC(u, v) is defined as the number of triangles to which (u, v) belongs, divided by the number of triangles that might potentially include (u, v):

$$ECC(u, v) = \frac{Z(u, v)}{\min\{\deg(u) - 1, \deg(v) - 1\}} \tag{1}$$

where Z(u, v) denotes the number of triangles built on the edge (u, v), and deg(u) and deg(v) are the degrees of nodes u and v, respectively.

Definition 2: Pearson Correlation Coefficient (PCC). Given n samples of gene expression data, let Exp(u, i) denote the expression level of gene u in the i-th sample under a specific condition. The Pearson correlation coefficient between genes u and v is then defined as:

$$PCC(u, v) = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{Exp(u, i) - \overline{Exp}(u)}{\sigma(u)} \right) \cdot \left( \frac{Exp(v, i) - \overline{Exp}(v)}{\sigma(v)} \right) \tag{2}$$

where $\overline{Exp}(u)$ represents the average expression level of gene u and $\sigma(u)$ denotes the standard deviation of the expression level of gene u. Here, we define the Pearson correlation coefficient of a pair of proteins as equal to the PCC of their corresponding paired genes. For an input graph G(V, E), we assign the weight of an edge (u, v) to be the product of its ECC and its PCC. The weight of each vertex is defined to be the sum of the weights of its incident edges, as shown in formula (3).

$$\omega(u) = \sum_{v \in N_u} ECC(u, v) \times PCC(u, v) \tag{3}$$

where N_u is the set of neighbours of node u. The basic idea behind this definition of node weight is that it characterises the closeness between a node and its neighbours by using ECC and PCC, which together evaluate how densely the node and its neighbours are connected and how strongly they are co-expressed. The larger the product of an edge's ECC(u, v) and its PCC(u, v), the more likely the two proteins u and v are to be in the same cluster. Hence, a node with a higher weight is more likely to be in the same cluster as its neighbours. After all nodes in G are assigned weights, we sort them in non-increasing order of weight and store them in a queue Sq. The nodes with the highest weights are chosen as seeds.
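The weighting scheme of formulas (1) to (3) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the network is assumed to be a dictionary `adj` mapping each protein to the set of its neighbours, `expr` maps each protein to its list of expression values, and the names `ecc`, `pcc`, `edge_weight` and `seed_queue` are hypothetical. The paper does not say how an edge with no possible triangle, a constant expression profile or a missing profile is handled; the sketch assigns a value of 0 in those cases.

```python
from math import sqrt


def ecc(adj, u, v):
    """Edge clustering coefficient of formula (1)."""
    triangles = len(adj[u] & adj[v])  # Z(u, v): common neighbours of u and v
    possible = min(len(adj[u]) - 1, len(adj[v]) - 1)
    # Assumption: ECC is 0 when no triangle could contain the edge.
    return triangles / possible if possible > 0 else 0.0


def pcc(x, y):
    """Pearson correlation of formula (2), using the sample standard deviation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    if sx == 0 or sy == 0:
        return 0.0  # Assumption: a constant profile carries no correlation.
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (n - 1)


def edge_weight(adj, expr, u, v):
    """Weight of edge (u, v): ECC(u, v) times PCC(u, v)."""
    if u not in expr or v not in expr:
        return 0.0  # Assumption: proteins without expression data add nothing.
    return ecc(adj, u, v) * pcc(expr[u], expr[v])


def seed_queue(adj, expr):
    """Node weights of formula (3) and the queue Sq in non-increasing order."""
    weight = {u: sum(edge_weight(adj, expr, u, v) for v in adj[u]) for u in adj}
    return sorted(adj, key=lambda u: weight[u], reverse=True), weight


# Toy network: a co-expressed triangle A-B-C with a pendant node D attached to C.
toy_adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
toy_expr = {"A": [1, 2, 3], "B": [2, 4, 6], "C": [1, 2, 3], "D": [3, 2, 1]}
toy_queue, toy_weight = seed_queue(toy_adj, toy_expr)
```

In the toy network the pendant node D ends up last in the queue, because its only edge lies in no triangle and therefore has weight 0.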


2.2 IPCIPG-n for generating non-overlapping clusters

The description of IPCIPG-n for generating non-overlapping clusters is shown in Figure 1. The inputs of IPCIPG-n include the PPI network, the gene expression data and a parameter minSize used to control the minimal size of the predicted clusters. IPCIPG-n proceeds as follows. The first node in the queue Sq is selected as a seed to grow a new cluster. Once the cluster is completed, all nodes in the cluster are removed from the queue Sq and the first node remaining in Sq is selected as the seed for the next cluster. The previous subsection introduced the method for weighting nodes and described how to select the seed. Next, we discuss how IPCIPG-n generates a new cluster from a seed.

Figure 1  The description of algorithm IPCIPG-n for generating non-overlapping clusters

Algorithm IPCIPG-n
Input: a PPI network G(V, E), gene expression data, parameter minSize;
Output: identified clusters (protein complexes)
(** Weighted Vertex **)
1. for each edge (u, v) ∈ E do
       compute its ECC(u, v) and PCC(u, v)
2. for each node v ∈ V do
       compute the weight of v; tag(v) = 0;   // all nodes can be extended
3. store the nodes in Sq in non-increasing order in terms of their weights
(** Select Seed and Extend Cluster **)
4. while Sq ≠ ∅ do {
       v ← Sq; C = {v};                       // the first node of Sq is the seed
       NC ← the neighbours of C;
       while NC ≠ ∅ do {
           Ne = ∅;
           for each node u ∈ NC do
               if I(u, C) > nC/2 and tag(u) = 0 then Ne = Ne ∪ {u}
           if Ne = ∅ then
               NC = ∅;
               if |C| ≥ minSize then
                   tag all nodes in C with 1; output C; Sq = Sq − C;
           else
               j = {x | Wd(x, C) = max y∈Ne (Wd(y, C))}
               C = C ∪ {j}
               NC ← the neighbours of C;      // update NC
       }
   }

First of all, a seed node is initialised as a single cluster. A cluster C is extended by adding nodes recursively from its neighbours. Whether a neighbour node v can be added to a cluster C is determined by the following three conditions: (a) v must be a Cand-neighbour; (b) v has not been extended to any other cluster and (c) among all the Cand-neighbours, v has the highest priority.



Definition 3: Cand-neighbour. Given a cluster C and its neighbour set NC, a node v ∈ NC is a Cand-neighbour of C if the following condition is satisfied: I(v, C) > nC/2, where I(v, C) is the number of edges connecting the node v and the nodes in C, and nC is the number of nodes included in the cluster C.

The priority of a Cand-neighbour v of C is determined by the value of Wd(v, C). Given a cluster C, Wd(v, C) of a node v ∉ C is the sum of the weights of all the edges between v and C, i.e. $W_d(v, C) = \sum_{u \in C} w(u, v)$. The weight of an edge (u, v), denoted w(u, v), is again the product of its ECC and its PCC. The Cand-neighbour v with the largest value of Wd(v, C) has the highest priority to be extended to the cluster C. Once a new node v is added to the cluster, the cluster is updated. Then, the neighbours of the new cluster are re-constructed and the algorithm IPCIPG-n continues recursively with the new cluster. If none of the neighbours of C that have not yet been extended is a Cand-neighbour, the cluster cannot be extended further. If the cluster size is equal to or larger than the parameter minSize, a complete cluster is generated and output. All the nodes in the complete cluster are tagged with 1 and removed from the queue Sq. Nodes tagged with 1 cannot be extended to other clusters any more.
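The extension rule of IPCIPG-n can be sketched as follows. This is a minimal Python illustration under stated assumptions rather than the authors' code: `adj` maps each node to the set of its neighbours, `edge_w` maps `frozenset((u, v))` to the edge weight w(u, v), and `seeds` is the queue Sq already sorted by node weight. The function names are hypothetical. The sketch also assumes that a seed is taken off the queue even when its cluster is smaller than minSize, and that ties in Wd are broken by node name; the paper does not specify either point.

```python
def grow_cluster(seed, adj, weight, tag):
    """Grow one cluster from a seed using the Cand-neighbour rule."""
    cluster = {seed}
    while True:
        neighbours = set().union(*(adj[x] for x in cluster)) - cluster
        # Cand-neighbours: untagged nodes with I(v, C) > nC / 2.
        cands = [v for v in sorted(neighbours)
                 if tag[v] == 0 and len(adj[v] & cluster) > len(cluster) / 2]
        if not cands:
            return cluster
        # Highest priority: largest Wd(v, C), the summed weight of edges into C.
        best = max(cands,
                   key=lambda v: sum(weight(u, v) for u in adj[v] & cluster))
        cluster.add(best)


def ipcipg_n(adj, edge_w, seeds, min_size=2):
    """Sketch of the non-overlapping variant; returns a list of node sets."""
    def weight(u, v):
        return edge_w.get(frozenset((u, v)), 0.0)

    tag = {v: 0 for v in adj}
    queue = list(seeds)
    clusters = []
    while queue:
        seed = queue.pop(0)  # assumption: the seed always leaves the queue
        cluster = grow_cluster(seed, adj, weight, tag)
        if len(cluster) >= min_size:
            for v in cluster:
                tag[v] = 1  # tagged nodes cannot join another cluster
            clusters.append(cluster)
            queue = [v for v in queue if v not in cluster]
    return clusters


# Toy network: two triangles A-B-C and D-E-F joined by the bridge C-D.
demo_adj = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
demo_w = {frozenset(e): 1.0 for e in
          [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E"), ("D", "F"), ("E", "F")]}
demo_w[frozenset(("C", "D"))] = 0.0
demo_clusters = ipcipg_n(demo_adj, demo_w, ["A", "B", "C", "D", "E", "F"], min_size=3)
```

On the toy network the bridge node D has only one edge into the three-node cluster {A, B, C}, which does not exceed nC/2 = 1.5, so the two triangles are reported as separate clusters.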

2.3 IPCIPG-o for generating overlapping clusters

The main difference between IPCIPG-o and IPCIPG-n lies in their extension rules. IPCIPG-n does not allow any node in the graph G to be grouped into multiple clusters. IPCIPG-o, however, allows nodes that satisfy certain rules to appear in multiple clusters. The basic idea behind the extension rule of IPCIPG-o is that a protein with many neighbours but a very small clustering coefficient tends to be involved in multiple protein complexes, because its neighbours are likely to be part of different clusters. In other words, a protein has a high clustering coefficient if its neighbours belong to one cluster. This is because, according to the definition of a protein complex, proteins in one cluster have a greater chance of interacting with each other, and the clustering coefficient of a protein is determined by how closely its neighbours interact with each other. The clustering coefficient is defined as follows:

Definition 4: Node Clustering Coefficient (NCC) (Wang et al., 2012). Given a node u in the PPI network, its clustering coefficient is defined as the number of edges that exist among the neighbours of u, divided by the number of edges that could potentially exist among the neighbours of u. It is calculated as follows:

$$NCC(u) = \frac{Z_u}{K_u (K_u - 1)} \tag{4}$$

where Z_u denotes the number of edges that exist among the neighbours of u and K_u represents the degree of u. In IPCIPG-o, the nodes which satisfy the following two rules can be extended multiple times:

1  NCC(u) < TNCC: The clustering coefficient of a protein u must be smaller than the given NCC threshold TNCC.

2  Ku > TDeg: The degree of a protein u must be larger than the given degree threshold TDeg.

TNCC ranges from 0 to 1 and its default value is set to 0.5 in this paper. The default value of the parameter TDeg is the average degree of all the nodes in the given PPI network. In IPCIPG-o, the nodes which satisfy the above two rules remain tagged with ‘0’ even if they have been extended before, whereas the nodes which do not satisfy the two rules are tagged with ‘1’ after being extended and cannot be extended any more. The remaining operations of IPCIPG-o are the same as those of IPCIPG-n.
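The two overlap rules and the resulting tagging step might look as follows in Python. This is a hedged sketch, not the authors' implementation: `adj` is assumed to map each node to the set of its neighbours, the names `ncc`, `can_overlap` and `retag_after_output` are hypothetical, and the coefficient is computed with the normalisation K_u(K_u − 1) exactly as printed in formula (4). Nodes of degree below 2 are given a coefficient of 0, which is an assumption.

```python
def ncc(adj, u):
    """Node clustering coefficient with the normalisation printed in formula (4)."""
    k = len(adj[u])  # K_u: degree of u
    if k < 2:
        return 0.0  # assumption: undefined case mapped to 0
    nbrs = adj[u]
    # Z_u: edges among the neighbours of u (each edge is seen from both ends).
    z = sum(len(adj[v] & nbrs) for v in nbrs) // 2
    return z / (k * (k - 1))


def can_overlap(adj, u, t_ncc=0.5, t_deg=None):
    """True if u satisfies both rules and may appear in several clusters."""
    if t_deg is None:
        # Default TDeg: average degree over all nodes of the network.
        t_deg = sum(len(n) for n in adj.values()) / len(adj)
    return ncc(adj, u) < t_ncc and len(adj[u]) > t_deg


def retag_after_output(cluster, adj, tag, t_ncc=0.5, t_deg=None):
    """After a cluster is output, keep overlap-eligible nodes at tag 0."""
    for u in cluster:
        tag[u] = 0 if can_overlap(adj, u, t_ncc, t_deg) else 1


# Toy network: hub H shared by two triangles H-A-B and H-C-D.
hub_adj = {
    "H": {"A", "B", "C", "D"},
    "A": {"H", "B"}, "B": {"H", "A"},
    "C": {"H", "D"}, "D": {"H", "C"},
}
hub_tag = {v: 0 for v in hub_adj}
retag_after_output({"H", "A", "B"}, hub_adj, hub_tag)
```

In the toy network the hub H has degree 4 (above the average of 2.4) and few edges among its neighbours, so it stays available for the second triangle, while A and B are tagged 1.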

3 Experiments and results

To validate the proposed algorithms IPCIPG-n and IPCIPG-o, we apply them to the protein–protein interaction network of Saccharomyces cerevisiae. The PPI network of Saccharomyces cerevisiae is downloaded from the DIP database (Xenarios et al., 2002). After removal of all the self-interactions and repeated interactions, the final network includes 4950 proteins and 21,788 interactions. The gene expression profiles are retrieved from Tu et al. (2005) and contain 6777 gene products and 36 samples in total, with 4858 genes involved in the PPI network of Saccharomyces cerevisiae. In the following subsections, we discuss the experimental results of IPCIPG-n for generating non-overlapping clusters and IPCIPG-o for generating overlapping clusters, and compare the predicted clusters with the known complexes. Six competing algorithms (HUNTER, HC-PIN, SPICi, CMC, MCODE and MCL) are also compared with IPCIPG-n and IPCIPG-o to test their performance in identifying protein complexes.
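The preprocessing step mentioned above, removing self-interactions and repeated interactions, can be illustrated with a short Python sketch. It assumes the raw interactions are available as an iterable of protein-name pairs; the actual DIP file format is not described in the paper, and the function names are hypothetical. Storing the network as sets of neighbours makes repeated interactions collapse automatically.

```python
def build_network(pairs):
    """Build an undirected simple graph from raw interaction pairs."""
    adj = {}
    for u, v in pairs:
        if u == v:
            continue  # drop self-interactions
        # Sets ignore duplicates, so (u, v), (v, u) and repeats count once.
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_count(adj):
    """Number of distinct undirected interactions in the network."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2


# Hypothetical raw list with one reversed duplicate and one self-interaction.
raw_pairs = [("A", "B"), ("B", "A"), ("A", "A"), ("B", "C")]
network = build_network(raw_pairs)
```

A protein that only ever appears in a self-interaction does not enter the network in this sketch, which is an assumption about how such proteins are treated.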

3.1 Identification of non-overlapping and overlapping clusters

We apply the proposed algorithms IPCIPG-n and IPCIPG-o to the PPI network of Saccharomyces cerevisiae. IPCIPG-n produced 1247 non-overlapping clusters with an average size of 2.52 and a maximum size of 32. IPCIPG-o generated 2888 overlapping clusters with an average size of 2.94 and a maximum size of 56 when using TDeg = 7 and TNCC = 0.5. For the complexes predicted by IPCIPG-o with different values of TDeg and TNCC = 0.5, the number of predicted complexes and their average size, overlapping rate and maximum size are shown in Figure 2. From Figure 2, we can see that as TDeg varies, the maximum size and the overlapping rate of the complexes predicted by IPCIPG-o do not change much, the number of predicted complexes decreases slightly and the average size increases slightly. For the complexes predicted by IPCIPG-o with TDeg = 7 and TNCC = 0.5, 959 proteins are involved in multiple complexes. Figure 3 shows two examples of the overlapping complexes produced by IPCIPG-o.

Figure 2  The evaluation index of protein complexes generated by IPCIPG-o with respect to different degree thresholds

Figure 3  Examples of overlapping complexes identified by IPCIPG-o

Notes: The blue nodes represent the overlapping proteins that appear in multiple complexes.

3.2 Comparison with known complexes

A scoring scheme (Li et al., 2008) widely used in previous methods is used here to determine how effectively a predicted cluster (PC) matches a known complex (KC). The overlapping score OS(PC, KC) between PC and KC is calculated by the following formula:

$$OS(PC, KC) = \frac{i^2}{a \times b} \tag{5}$$

where i is the size of the intersection set of PC and KC, a is the size of PC and b is the size of KC. A known complex KC that shares no proteins with a predicted complex PC has OS(PC, KC) = 0, and a known complex KC that perfectly matches a predicted complex PC has OS(PC, KC) = 1. A known complex and a predicted complex are considered a match if their overlapping score is equal to or larger than a specific threshold. The number of known complexes matched by the predicted complexes of IPCIPG-n and IPCIPG-o with respect to different overlapping score thresholds (from 0 to 1 with a 0.1 increment) is shown in Figure 4. To compare with the competing algorithms HUNTER, HC-PIN, SPICi, CMC, MCODE and MCL, the results of these algorithms are also shown in Figure 4. All the compared algorithms use the default parameters or those recommended by their authors. In Figure 4a, the predicted complexes with two or more proteins are considered, and in Figure 4b only the predicted complexes with three or more proteins are considered.

From Figure 4, we can see that the number of known complexes matched by IPCIPG is larger than that of any other algorithm at each overlapping score threshold in the PPI network. The best matching result is obtained when IPCIPG is used to identify overlapping protein complexes. Out of all the 532 known protein complexes, 425 are matched by IPCIPG-o with OS ≥ 0.2. This is about four times that of HC-PIN, about ten times that of HUNTER and about eight times that of MCODE. There are 210 predicted complexes matched by known protein complexes with OS ≥ 0.5 and 41 matched perfectly with OS = 1.0.

Specificity and sensitivity are two other important indicators for evaluating the identified protein complexes. Specificity (Sp) is the fraction of the predicted clusters that are matched by the known complexes. Sensitivity (Sn) is the fraction of the known complexes that are matched by the predicted clusters. They are defined as follows:

$$S_p = \frac{TP}{TP + FP} \tag{6}$$

$$S_n = \frac{TP}{TP + FN} \tag{7}$$

where TP (true positive) represents the number of identified complexes matched by the known complexes with OS(PC, KC) ≥ 0.2, FP (false positive) is the total number of identified complexes minus TP, and FN (false negative) represents the number of known complexes that are not matched.
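The overlapping score of formula (5) and the counts behind formulas (6) and (7) can be computed as in the Python sketch below. It is an illustration under stated assumptions: complexes are represented as Python sets of protein names, the function names are hypothetical, and TP, FP and FN follow the definitions given above (TP and FP count predicted complexes, FN counts unmatched known complexes). Returning 0 when a denominator is 0 is an assumption.

```python
def overlap_score(pc, kc):
    """Overlapping score OS(PC, KC) = i^2 / (a * b) of formula (5)."""
    i = len(pc & kc)
    return i * i / (len(pc) * len(kc))


def specificity_sensitivity(predicted, known, threshold=0.2):
    """Sp and Sn of formulas (6) and (7) for lists of protein sets."""
    # TP: predicted complexes that match at least one known complex.
    tp = sum(1 for pc in predicted
             if any(overlap_score(pc, kc) >= threshold for kc in known))
    fp = len(predicted) - tp
    # FN: known complexes not matched by any predicted complex.
    fn = sum(1 for kc in known
             if not any(overlap_score(pc, kc) >= threshold for pc in predicted))
    sp = tp / (tp + fp) if tp + fp else 0.0
    sn = tp / (tp + fn) if tp + fn else 0.0
    return sp, sn


# Toy data: one good match, one weak overlap and one unrelated prediction.
toy_predicted = [{"A", "B", "C"}, {"D", "E"}, {"X", "Y"}]
toy_known = [{"A", "B", "C", "D"}, {"E", "F", "G", "H", "I", "J"}]
toy_sp, toy_sn = specificity_sensitivity(toy_predicted, toy_known)
```

In the toy data only {A, B, C} reaches the threshold (OS = 9/12 = 0.75 against the first known complex), so TP = 1, FP = 2 and FN = 1, giving Sp = 1/3 and Sn = 1/2.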

Figure 4  The number of known complexes matched by the predicted complexes with respect to different overlapping score thresholds (from 0 to 1 with a 0.1 increment)

(a) The size of each predicted complex is greater than or equal to 2

(b) The size of each predicted complex is greater than or equal to 3

Moreover, a comprehensive evaluation measure, the F-measure, is defined in formula (8) based on the definitions of Sp and Sn.

$$F = \frac{2 \times S_p \times S_n}{S_p + S_n} \tag{8}$$
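As a quick arithmetic check of formula (8), the short Python sketch below recomputes the F-measure from the rounded sensitivity and specificity values reported in Table 1. The function name `f_measure` is hypothetical, and returning 0 when both inputs are 0 is an assumption.

```python
def f_measure(sp, sn):
    """Harmonic mean of specificity and sensitivity, formula (8)."""
    return 2 * sp * sn / (sp + sn) if sp + sn > 0 else 0.0


# (Sn, Sp) pairs as reported in Table 1.
reported = {
    "IPCIPG-n": (0.64, 0.24),
    "IPCIPG-o": (0.85, 0.21),
    "HUNTER": (0.04, 0.84),
}
recomputed = {name: round(f_measure(sp, sn), 2)
              for name, (sn, sp) in reported.items()}
```

For IPCIPG-n, 2 × 0.24 × 0.64 / (0.24 + 0.64) = 0.3072 / 0.88 ≈ 0.349, which rounds to the 0.35 given in Table 1; the same calculation gives 0.34 for IPCIPG-o and 0.08 for HUNTER.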

The comparison results of specificity, sensitivity and F-measure of IPCIPG and six other algorithms: HUNTER, HC-PIN, CMC, SPICi, MCODE, MCL are shown in Table 1.

Table 1  Comparison of the sensitivity, specificity and F-measure of algorithm IPCIPG and other algorithms: HUNTER, HC-PIN, CMC, SPICi, MCODE and MCL

Algorithms                         Complexes (size ≥ 2)  Average size  Perfect matching  Sensitivity  Specificity  F-measure
IPCIPG-n                           1247                  2.52          45                0.64         0.24         0.35
IPCIPG-o (TDeg = 7; TNCC = 0.5)    2888                  2.94          41                0.85         0.21         0.34
HUNTER                             25                    29.4          3                 0.04         0.84         0.08
SPICi                              552                   4.72          9                 0.28         0.24         0.26
HC-PIN (lambda = 1.0 f)            166                   11.08         12                0.11         0.34         0.17
MCODE (f = 0.1; VWP = 0.2)         50                    16.62         1                 0.04         0.46         0.07
MCL (inflation = 2.0)              932                   5.04          16                0.39         0.2          0.26
CMC                                981                   8.9           3                 0.37         0.18         0.24

As shown in Table 1, IPCIPG-n, IPCIPG-o, SPICi and MCL are good at generating small-sized protein complexes and HUNTER, HC-PIN and MCODE tend to generate large-sized protein complexes. From Table 1, we can see that the number of known protein complexes recalled perfectly by IPCIPG-n and that by IPCIPG-o is much higher than that recalled by the other six algorithms. For the comprehensive evaluation method F-measure, IPCIPG-n and IPCIPG-o also outperform the other six algorithms HUNTER, SPICi, HC-PIN, CMC, MCODE and MCL. All the above analyses show that IPCIPG-o and IPCIPG-n perform well on the identification of protein complexes.

3.3 Functional homogeneity and annotation

Previous studies have shown that known protein complexes often have high functional homogeneity (King et al., 2004). Functional homogeneity is generally quantified with a P-value, which denotes the probability that a given set of proteins is enriched by a given functional group merely by chance. Given a protein complex C and a functional group F, the hypergeometric distribution P-value (Boyle et al., 2004) is defined as follows:

$$P\text{-}value = 1 - \sum_{i=0}^{k-1} \frac{\binom{|F|}{i} \binom{|V| - |F|}{|C| - i}}{\binom{|V|}{|C|}} \tag{9}$$

where C contains k proteins in F and the entire PPI network contains |V| proteins. The functional annotations of Saccharomyces cerevisiae proteins are obtained from the MIPS Functional Catalogue (FunCat) database (Mewes et al., 2004). When the algorithm IPCIPG is applied to identify non-overlapping protein complexes, it produces 201 complexes (size ≥ 3) in total, of which 72 complexes contain five or more proteins. Out of these 72 predicted protein complexes, 71 match well with known functional categories with P-value < 0.01 and 70 match well with known functional categories with P-value < 0.001.

The high functional homogeneity of the identified complexes makes it possible to predict protein function based on the known proteins. By assigning each predicted complex to the main functional category with its lowest P-value, the functions of previously unknown proteins can be predicted. For example, a predicted 14-member complex is shown in Figure 5, where the blue node represents the protein of unknown function YMR125W. The main functional category of the complex is ‘splicing’, with the lowest P-value = 4.76E-20. All 13 proteins of known function in the complex have the function ‘splicing’. Hence, we predict that the protein of unknown function YMR125W is also a splicing protein. Similarly, another six-member complex shown in Figure 5 has the function ‘small GTPase mediated signal transduction’. The protein of unknown function YML109W may therefore have the function ‘small GTPase mediated signal transduction’.

Figure 5  Two examples of predicted protein complexes of IPCIPG-o with lowest P-value
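The hypergeometric P-value of formula (9) can be evaluated with exact integer arithmetic, as in the following Python sketch. The function name and argument names are hypothetical. One design choice differs from the printed form: the sketch sums the upper tail (i from k to |C|) instead of subtracting the lower tail from 1. The two are mathematically equal, and the upper-tail form avoids the loss of precision that occurs when a very small P-value is obtained as 1 minus a number close to 1.

```python
from math import comb


def hypergeom_p_value(n_total, n_func, n_cluster, k):
    """P-value of formula (9).

    n_total   -- |V|, proteins in the whole PPI network
    n_func    -- |F|, proteins annotated with the functional group
    n_cluster -- |C|, proteins in the predicted complex
    k         -- proteins of the complex that belong to F
    """
    # Upper tail: probability of seeing k or more annotated proteins by chance.
    upper = sum(comb(n_func, i) * comb(n_total - n_func, n_cluster - i)
                for i in range(k, n_cluster + 1))
    return upper / comb(n_total, n_cluster)


# Small worked case: |V| = 10, |F| = 4, |C| = 3, k = 2.
example_p = hypergeom_p_value(10, 4, 3, 2)
```

For the small case, the numerator is C(4,2)·C(6,1) + C(4,3)·C(6,0) = 36 + 4 = 40 and the denominator is C(10,3) = 120, so the P-value is 1/3.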

3.4 Robustness analysis of algorithm IPCIPG

In this section, we evaluate the robustness of the proposed algorithms IPCIPG-n and IPCIPG-o by using graph alterations. Since PPI networks generally contain a certain fraction of false positives and false negatives, we simulate false positives by adding edges randomly and false negatives by removing edges randomly. For the test PPI network, 1%, 5%, 10%, 20%, 30%, 40% and 50% of edges are randomly added to it or removed from it, respectively. The altered networks are then used to test the robustness of IPCIPG-n and IPCIPG-o. The experimental results are shown in Figure 6. From Figure 6, we can observe that neither IPCIPG-n nor IPCIPG-o is sensitive to false positives: both can recall the known protein complexes effectively from highly noisy networks. They are also only slightly affected by the removal of up to 30% of edges, although there is a fast drop from 40%. Even so, 317 and 369 known complexes are still matched by the predicted complexes of IPCIPG-n and IPCIPG-o, respectively. This analysis shows that IPCIPG-n and IPCIPG-o are robust against false positives and false negatives in the PPI network.
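The graph alterations used in this test can be sketched as follows in Python. This is a hypothetical illustration, not the authors' procedure: the function name `perturb`, the fixed random seed and the choice to express the percentage relative to the original number of edges are all assumptions, since the paper does not specify them.

```python
import random


def perturb(edges, nodes, fraction, mode, seed=0):
    """Return a new edge set with random edges added or removed.

    fraction is taken relative to the original number of edges (assumption).
    mode is "add" (simulated false positives) or "remove" (false negatives).
    """
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    m = round(fraction * len(edge_set))
    if mode == "remove":
        ordered = sorted(edge_set, key=sorted)  # fixed order for reproducibility
        return edge_set - set(rng.sample(ordered, m))
    if mode == "add":
        node_list = sorted(nodes)
        free = len(node_list) * (len(node_list) - 1) // 2 - len(edge_set)
        if m > free:
            raise ValueError("not enough absent edges to add")
        added = set()
        while len(added) < m:
            u, v = rng.sample(node_list, 2)  # two distinct proteins
            e = frozenset((u, v))
            if e not in edge_set:
                added.add(e)
        return edge_set | added
    raise ValueError("mode must be 'add' or 'remove'")


# Toy network: a 4-cycle A-B-C-D.
cycle = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
fewer = perturb(cycle, "ABCD", 0.5, "remove")
more = perturb(cycle, "ABCD", 0.5, "add")
```

Each altered edge set would then be converted back into a network and passed to the clustering step, and the number of matched known complexes recorded for every percentage.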

Figure 6  The robustness analysis of IPCIPG-n and IPCIPG-o against random edge addition and removal

4 Conclusion

In the post-genomic era, the exploration of life activity through protein–protein interactions has become a research hotspot. Although many algorithms based on topological features have been proposed to identify protein complexes, it is still a challenge to identify protein complexes from a PPI network with a high rate of false positives or false negatives. Moreover, some of the current methods can only identify non-overlapping complexes, and methods that can detect both non-overlapping and overlapping complexes are increasingly favoured. Hence, we propose a new algorithm named IPCIPG to detect both non-overlapping and overlapping protein complexes by integrating the PPI network and the gene expression profile. The effectiveness of IPCIPG was tested on the PPI network of Saccharomyces cerevisiae. We also compared our method with six competing algorithms: HUNTER, HC-PIN, SPICi, CMC, MCODE and MCL. The experimental results show that IPCIPG performs much better than these six algorithms. The robustness analysis shows that IPCIPG can identify protein complexes effectively from a noisy PPI network, which also indicates that incorporating the gene expression profile helps to restrain the effect of noise in the PPI network.

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China under Grant No. 61003124, No. 61073036, No. 61232001 and No. 60970095, and by the Program for New Century Excellent Talents in University (NCET-120547).

