YMETH 3355

No. of Pages 9, Model 5G

26 February 2014 Methods xxx (2014) xxx–xxx 1

Contents lists available at ScienceDirect

Methods journal homepage: www.elsevier.com/locate/ymeth

Effective identification of essential proteins based on priori knowledge, network topology and gene expressions


Min Li a,b, Ruiqing Zheng a, Hanhui Zhang a, Jianxin Wang a,⇑, Yi Pan a,c,*

a School of Information Science and Engineering, Central South University, Changsha 410083, China
b State Key Laboratory of Medical Genetics, Central South University, Changsha 410078, China
c Department of Computer Science, Georgia State University, Atlanta, GA 30302-4110, USA


Article info

Article history: Available online xxxx

Keywords: Essential protein; Protein interaction networks; Gene expression

Abstract

Identification of essential proteins is very important for understanding the minimal requirements for cellular life and is also necessary for a series of practical applications, such as drug design. With the advances in high-throughput technologies, a large number of protein–protein interactions are available, which makes it possible to detect proteins’ essentiality at the network level. Considering that most species already have a number of known essential proteins, we propose a new prior knowledge-based scheme to discover new essential proteins from protein interaction networks. Based on the new scheme, two essential protein discovery algorithms, CPPK and CEPPK, were developed. CPPK predicts new essential proteins based on network topology, and CEPPK detects new essential proteins by integrating network topology and gene expression. The performance of CPPK and CEPPK was validated on the protein interaction network of Saccharomyces cerevisiae. The experimental results showed that prior knowledge of known essential proteins was effective for improving the predicted precision. The predicted precision of CPPK and CEPPK clearly exceeded that of 10 previously proposed essential protein discovery methods: Degree Centrality (DC), Betweenness Centrality (BC), Closeness Centrality (CC), Subgraph Centrality (SC), Eigenvector Centrality (EC), Information Centrality (IC), Bottle Neck (BN), Density of Maximum Neighborhood Component (DMNC), Local Average Connectivity-based method (LAC), and Network Centrality (NC). In particular, CPPK achieved an improvement of more than 40% in precision over BC, CC, SC, EC, and BN, and CEPPK performed even better. CEPPK was also compared with four other methods that are not node centralities (EPC, ORFL, PeC, and CoEWC) and was shown to achieve the best results. © 2014 Elsevier Inc. All rights reserved.


1. Introduction


It is well known that proteins are an important component of every cell in the body. Predicting proteins’ functions, discovering their structures, exploring protein–protein interactions, and identifying essential proteins have long been hot topics in the post-genomic era. In particular, essential proteins are necessary for growth, and the deletion of such proteins results in lethality or infertility [1–3]. The identification of essential proteins is not only crucial for understanding the minimal requirements for cellular life [1], but also very important for discovering human disease genes and defending against human pathogens [4–6]. Different experimental methods, such as single gene knockouts [7], RNA interference [8] and conditional knockouts [9], have been implemented for the discovery of essential proteins. However, these experimental methods are very expensive and time-consuming. To break through these experimental constraints, researchers have proposed various computational approaches.

With the accumulation of data derived from small-scale experimental studies and high-throughput techniques, there is a growing awareness that the topological properties of biological networks are useful for the identification of essential proteins. It has been observed in several species, such as Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster [10,11], that proteins that are highly connected with other proteins in the network are more likely to be essential than those selected by chance [12]. Although there is some controversy about whether, why and how highly connected proteins tend to be essential in biological networks [13–16], most researchers have confirmed the correlation between topological centrality and protein essentiality [11,17–21]. In our previous studies [22,23,25], we have shown the feasibility of using network topological features to identify essential proteins from yeast protein interaction networks. In [25], six typical centrality measures were studied: degree centrality [12], betweenness centrality [20], closeness centrality [26], subgraph centrality [30], eigenvector centrality [31], and information centrality [32]. Besides these six centrality measures, other topological features, such as bottlenecks [34], maximum neighborhood component [27], and edge percolated component [27], have also been developed to identify essential proteins in recent years. More recently, the integration of network topology with other information, such as cellular localization and gene expression, has been attracting increasing attention. Acencio et al. [1] demonstrated that network topological features, cellular localization and biological process information were reliable predictors of essential genes.

Considering that most species already have a number of known essential proteins (for example, the DEG database collects essential genes of different species for both prokaryotes and eukaryotes), in this paper we investigate whether new essential proteins can be detected based on prior knowledge of known essential proteins. The intriguing questions are: (1) What are the common characteristics of the known essential proteins? (2) Is there a scheme for identifying new essential proteins based on the known ones? We studied essential proteins and their neighbors in the yeast protein interaction network downloaded from the DIP database [44] and found that about 98% of essential proteins have at least one neighbor that is also an essential protein. That is to say, essential proteins are closely related to each other. In addition, it has been shown that essentiality tends to be a product of a protein complex rather than of an individual protein [40]. Therefore, our goal in this paper was to develop a new scheme to discover new essential proteins based on a small set of known essential proteins.

⇑ Corresponding authors at: School of Information Science and Engineering, Central South University, Changsha 410083, China. E-mail address: [email protected] (Y. Pan).

http://dx.doi.org/10.1016/j.ymeth.2014.02.016
1046-2023/© 2014 Elsevier Inc. All rights reserved.

Please cite this article in press as: M. Li et al., Methods (2014), http://dx.doi.org/10.1016/j.ymeth.2014.02.016

Based on the observation that essential proteins tend to be in the same cluster, we propose a new prior knowledge-based scheme to detect new essential proteins from protein interaction networks. Moreover, two essential protein discovery algorithms, CPPK and CEPPK, are derived from the new scheme. CPPK predicts new essential proteins based on network topology, and CEPPK detects new essential proteins by integrating network topology and gene expression. The performance of CPPK and CEPPK was tested on the well-studied species S. cerevisiae. The experimental results showed that CPPK and CEPPK performed well in identifying new essential proteins and that prior knowledge of known essential proteins was effective for improving the predicted precision. Compared with 10 previous centrality measures, namely Degree Centrality (DC) [12], Betweenness Centrality (BC) [20], Closeness Centrality (CC) [26], Subgraph Centrality (SC) [30], Eigenvector Centrality (EC) [31], Information Centrality (IC) [32], Bottle Neck (BN) [33,34], Density of Maximum Neighborhood Component (DMNC) [27], Local Average Connectivity-based method (LAC) [22], and Network Centrality (NC) [23], CPPK and CEPPK both achieved higher precision in the identification of essential proteins. In particular, the improvement of CPPK over BC, CC, SC, EC, and BN was more than 40%, and CEPPK performed even better.


2. Methods


2.1. Form of study data and problem definition


A protein interaction network is represented by an undirected graph $G = (V, E)$, where $V$ is a set of proteins and $(v_i, v_j) \in E$ if and only if protein $v_i$ is found to interact with protein $v_j$. According to what is known about their essentiality, the proteins in $G$ can be divided into three sets: the essential protein set, the non-essential protein set and the essentiality-unknown protein set. As the essentiality-unknown protein set cannot provide any information, we consider only the first two sets. The problem of predicting essential proteins is to determine each protein’s essentiality: essential or non-essential. In view of the existing known essential proteins, we formulate the prior knowledge-based essential protein discovery problem as follows: given a protein interaction network $G$ and $k$ known essential proteins, predict $n - k$ new essential proteins ($n < |V|$).

As the known essential proteins are the most important factor in addressing the problem defined above, it is necessary to understand the common characteristics of the known essential proteins and to uncover the relationship between them and the rest of the proteins. To study the characteristics of known essential proteins, we downloaded the protein–protein interactions of S. cerevisiae from the DIP database [44] and collected a list of essential proteins of S. cerevisiae from the following databases: MIPS [45], SGD [46], DEG [47], and SGDP [48]. The details are described in Section 3. By testing the essentiality of the neighbors of essential proteins in the yeast network, we found that about 98% of essential proteins have at least one neighbor that is an essential protein. For a protein $v_i \in G$, its neighbors are those proteins that interact with it directly in $G$. That is to say, if enough known essential proteins are given, almost all of the remaining essential proteins can be found among the neighbors of the known essential proteins.
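The neighbor statistic described above can be checked with a few lines of code. The following is a minimal sketch rather than the authors' implementation: the function name, the edge-list input format and the toy data are assumptions made purely for illustration.

```python
def fraction_with_essential_neighbor(edges, essential):
    """Fraction of essential proteins (present in the network) that have
    at least one essential neighbor. `essential` must be a set."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    in_network = [p for p in essential if p in neighbors]
    if not in_network:
        return 0.0
    hits = sum(1 for p in in_network if neighbors[p] & essential)
    return hits / len(in_network)


# Toy data (hypothetical): A and B are essential neighbors of each other,
# E is essential but its only neighbor F is not.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
toy_essential = {"A", "B", "E"}
frac = fraction_with_essential_neighbor(toy_edges, toy_essential)
print(frac)
```

On the real data, `edges` would hold the filtered DIP interactions and `essential` the proteins marked essential in at least one of the four databases.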


2.2. Prior knowledge-based essential protein discovery scheme


Based on the fact that some essential proteins are already known for almost all species and that they provide clues for discovering new essential proteins, we propose a prior knowledge-based essential protein discovery scheme, as shown in Fig. 1. Let $K = \{v_1, v_2, \ldots, v_k\}$ be a set of known essential proteins and $N_{v_i}$ be the set of proteins that are neighbors of protein $v_i$. Then, all the neighbors of proteins in $K$ form the set $N_K = \bigcup_{v_i \in K} N_{v_i}$. There are two types of nodes in $N_K$: essential proteins and non-essential proteins. That is, $N_K = N_{EK} \cup N_{NK}$, with $N_{EK}$ representing the set of essential proteins and $N_{NK}$ representing the set of non-essential proteins. Our goal is to distinguish the different types of proteins in $N_K$ and group them into $N_{EK}$ and $N_{NK}$ correctly.

Fig. 1. A prior knowledge-based essential protein discovery scheme. $K$ is the set of essential proteins; $N_K$ is the set of neighbors of proteins of $K$; $N_{EK}$ is the set of essential candidates and $N_{NK}$ is the set of non-essential proteins in $N_K$.

If the number of known essential proteins in $K$ is more than enough (that is, $n - k$ new essential proteins are included in $N_{EK}$), all the required new essential proteins can be obtained with a single classification. In contrast, if $K$ consists of only a small number of known essential proteins ($n - k > |N_{EK}|$), the $n - k$ new essential proteins cannot be predicted at once. In such a case, we need to update $K$ by adding to it the predicted essential proteins of $N_{EK}$. This update operation can be repeated until enough essential protein candidates are obtained. The key problem of the proposed scheme is how to divide the proteins of $N_K$ into $N_{EK}$ and $N_{NK}$ correctly. However, it is hard to find a classification method accurate enough to separate $N_K$ into $N_{EK}$ and $N_{NK}$ exactly. In this study, we therefore introduce a new approach that selects essential proteins from $N_K$ instead of separating $N_K$ into $N_{EK}$ and $N_{NK}$. To make the update operation more accurate, we choose only one protein from $N_K$ each time. The modified essential protein discovery scheme can be summarized as follows:

Step 1. Get the neighbors of the known essential proteins.
Step 2. For each neighbor of the known essential proteins, compute its probability of being a true essential protein.
Step 3. Choose the protein with the highest probability from the neighbors of the known essential proteins.
Step 4. Add the selected protein to the set of known essential proteins and update their neighbors.
Step 5. Iterate the procedure (going back to Step 2) to select a new candidate from the neighbors until $n - k$ new essential candidates are obtained.
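Steps 1 to 5 amount to a greedy seed-expansion loop with a pluggable scoring function. The sketch below is one possible reading of the scheme, not the authors' code: the name `expand_seeds`, the adjacency-dictionary input, the alphabetical tie-breaking rule and the toy scorer (number of known neighbors) are all assumptions introduced for illustration.

```python
def expand_seeds(neighbors, seeds, n, score):
    """Greedy expansion of a seed set following Steps 1-5.

    neighbors: dict mapping each protein to the set of its neighbors
    seeds:     the k known essential proteins
    n:         target size of the final set (so n - k proteins are predicted)
    score:     callable score(v, known) estimating how likely v is essential
    Returns the newly selected proteins in the order they were chosen.
    """
    known = set(seeds)
    selected = []
    # Step 1: collect the neighbors of the known essential proteins.
    frontier = set()
    for u in known:
        frontier |= neighbors.get(u, set())
    frontier -= known
    while len(known) < n and frontier:
        # Steps 2-3: score every candidate and pick the best one
        # (ties are broken alphabetically; this rule is an assumption).
        best = max(sorted(frontier), key=lambda v: score(v, known))
        # Step 4: add it to the known set and update the neighbor set.
        known.add(best)
        selected.append(best)
        frontier.discard(best)
        frontier |= neighbors.get(best, set()) - known
        # Step 5: loop until n - k candidates have been selected.
    return selected


# Toy network (hypothetical) and a placeholder scorer.
toy_neighbors = {
    "A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"},
    "D": {"B", "E"}, "E": {"D"},
}
count_known = lambda v, known: len(toy_neighbors[v] & known)
predicted = expand_seeds(toy_neighbors, {"A"}, 3, count_known)
print(predicted)
```

Keeping the scorer as a parameter means the same loop can be reused with any of the scoring rules discussed in this paper.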

In the above-mentioned scheme, an effective computing approach is required to determine a neighbor’s probability of being a true essential protein. In the following subsections, we introduce two methods for calculating this probability and develop two corresponding algorithms, CPPK and CEPPK, for predicting essential proteins.

2.3. Algorithm CPPK

Because essential proteins are closely related to each other and essentiality tends to be a product of a cluster rather than of an individual protein [40,41], we use the edge clustering coefficient (ECC) [24] to evaluate how likely two connected proteins are to be in the same cluster. Let $u$ and $v$ be two connected proteins in a graph $G$. The edge clustering coefficient of $(u, v)$ is defined by

\[
\mathrm{ECC}(u, v) = \frac{|N_u \cap N_v| + 1}{\min\{d_u, d_v\}} \tag{1}
\]

where $N_u$ (or $N_v$) is the set of neighbors of vertex $u$ (or $v$) and $d_u$ (or $d_v$) denotes the degree of vertex $u$ (or $v$), i.e., the number of nodes to which $u$ (or $v$) is directly connected in graph $G$. Let $K$ be a set of known essential proteins and $N_K$ be its neighbor set. The co-clustered score of a protein $v \in N_K$ with proteins in $K$ is defined as:

\[
P_c(v, K) = \sum_{u \in K} \mathrm{ECC}(u, v) \tag{2}
\]

Obviously, a protein with a larger co-clustered score has a greater chance of being a true essential protein. The algorithm CPPK is developed based on this idea, as shown in Fig. 2. The inputs of CPPK are a set of protein–protein interactions and $k$ known essential proteins. First, an undirected simple graph $G(V, E)$ is created based on the protein–protein interaction data, and the $k$ known essential proteins are added to a set $K$. Then, the edge clustering coefficients are calculated for all the edges in graph $G$. For all the neighbors of $K$, their co-clustered scores with proteins in $K$ are computed, and the one with the largest score is added to $K$. Once a new protein is added to $K$, its neighbors are updated and their co-clustered scores with proteins in $K$ are re-calculated. The algorithm CPPK iterates with the updated $K$ until $n$ expected essential candidates are obtained.

Fig. 2. The description of algorithm CPPK.

2.4. Algorithm CEPPK

It is well known that most topology-based essential protein discovery methods are sensitive to the reliability of protein–protein interaction data. Hence, after developing the topology-based algorithm CPPK, we developed an improved, more robust algorithm, CEPPK, for the prediction of essential proteins. The basic ideas behind CEPPK are as follows: (1) it has been shown that gene expression can help to evaluate the reliability of protein–protein interactions [52,53,58]; (2) interacting proteins are more likely to be involved in similar biological processes and functions, and thus they are more likely to be co-expressed [51]. Unlike CPPK, the algorithm CEPPK predicts essential proteins based not only on prior knowledge of known essential proteins and network topology, but also on an assessment of how likely two genes are to be co-expressed. Here, the Pearson correlation coefficient (PCC) [38] is used to evaluate how strongly two interacting proteins are co-expressed. Based on the definitions of ECC and PCC, a CC-score is defined to describe the co-clustered and co-expressed score of a protein $v \in N_K$ with proteins in $K$.

Definition 1. (CC-score) Given a known essential protein set $K$ and its neighbor set $N_K$, the co-clustered and co-expressed score of a protein $v \in N_K$ with proteins in $K$ is defined as:

\[
\text{CC-score}(v, K) = \sum_{u \in K} \mathrm{ECC}(u, v) \times \mathrm{PCC}(u, v) \tag{3}
\]

It is clear that a protein has a greater chance of being a true essential protein if it has a larger CC-score. The steps of CEPPK are very similar to those of CPPK. CEPPK also begins by generating an undirected graph $G$ from the protein–protein interaction data. The main differences between CEPPK and CPPK are as follows:

1. The input data are different.
2. The methods for calculating edge weights are different.
3. The methods for selecting the protein to be extended are different.
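The scores used by CPPK and CEPPK can be sketched in plain Python as follows. This is an illustrative reading of Eqs. (1)–(3), not the released implementation. Three assumptions are made here and marked in the comments: the sums run only over members of $K$ that are adjacent to $v$ (ECC is defined on edges), Eq. (3) is read as the product of ECC and PCC, and pairs without expression profiles are skipped. All function names and the toy data are hypothetical.

```python
from math import sqrt


def build_neighbors(edges):
    """Adjacency sets of an undirected simple graph (self-loops dropped)."""
    neighbors = {}
    for u, v in edges:
        if u != v:
            neighbors.setdefault(u, set()).add(v)
            neighbors.setdefault(v, set()).add(u)
    return neighbors


def ecc(neighbors, u, v):
    """Edge clustering coefficient of an existing edge (u, v), Eq. (1)."""
    common = len(neighbors[u] & neighbors[v])
    return (common + 1) / min(len(neighbors[u]), len(neighbors[v]))


def pcc(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant profile contributes nothing
    return cov / (sx * sy)


def co_clustered_score(neighbors, v, known):
    """P_c(v, K), Eq. (2); assumption: only u in K adjacent to v count."""
    return sum(ecc(neighbors, u, v) for u in known & neighbors[v])


def cc_score(neighbors, expression, v, known):
    """CC-score(v, K), Eq. (3), read as a sum of ECC * PCC terms.
    Assumption: pairs lacking an expression profile are skipped."""
    if v not in expression:
        return 0.0
    return sum(
        ecc(neighbors, u, v) * pcc(expression[u], expression[v])
        for u in known & neighbors[v]
        if u in expression
    )


# Toy data (hypothetical).
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"),
             ("B", "D"), ("D", "E"), ("D", "F")]
nb = build_neighbors(toy_edges)
toy_expr = {"B": [1.0, 2.0, 3.0], "C": [3.0, 2.0, 1.0], "D": [2.0, 4.0, 6.0]}
pc_d = co_clustered_score(nb, "D", {"B", "C"})
cc_d = cc_score(nb, toy_expr, "D", {"B", "C"})
print(pc_d, cc_d)
```

In the toy example, protein D is well clustered with both seeds, but its expression is anti-correlated with that of C, so its CC-score drops below its co-clustered score. This is the kind of case in which expression data can change the ranking.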


3. Results and discussion


3.1. Implementation of CPPK and CEPPK


The inputs of CPPK are a protein interaction network and a given set of seeds (generally, the seeds should be known essential proteins). In addition to these, CEPPK takes gene expression data as input. As both CPPK and CEPPK are based on network topology, an accurate protein interaction network is the preferred input. Nevertheless, any protein interaction network is acceptable, because CPPK and CEPPK are robust against noise. Previous methods for predicting, evaluating and purifying protein interaction networks [42,43,25] are recommended for obtaining a relatively clean network. The outputs of CPPK and CEPPK are both predicted essential proteins. Implementations of CPPK and CEPPK are available on the web (http://netlab.csu.edu.cn/bioinfomatics/limin/CPPK/index.html).


3.2. Test data and evaluation methods


To evaluate the performance of the proposed methods, CPPK and CEPPK, we applied them to the discovery of essential proteins of S. cerevisiae, as this species has been well characterized by knockout experiments and is widely used in the evaluation of essential protein prediction. The test data used in this paper are as follows:

Protein–protein interaction data. The protein–protein interactions of S. cerevisiae were downloaded from the DIP database [44]. There are 24,743 interactions among 5093 proteins in total after self-interactions and repeated interactions were filtered out.

Essential proteins. A list of essential proteins of S. cerevisiae was collected from the following databases: MIPS [45], SGD [46], DEG [47], and SGDP [48]. A protein in the yeast protein interaction network is considered essential if it is marked as essential in at least one database. Out of all the 5093 proteins in the yeast network, 1167 proteins are essential, 3591 are non-essential, and the essentiality of the remaining 335 is still unknown.

Gene expression. The gene expression data of S. cerevisiae were retrieved from Tu et al. [49], containing 6777 gene products and 36 samples in total, with 4858 genes involved in the yeast protein interaction network.

To verify the effectiveness of the proposed essential protein discovery methods, we used the following evaluation measures:

Precision. Let $C$ be a set of predicted essential candidates of a method $M_i$ and $V_e$ be the set of true essential proteins in $G$. The predicted precision of the method $M_i$ with respect to $C$ is defined as the fraction of essential proteins among all the predicted proteins:

\[
\mathrm{precision}(M_i, C) = \frac{|C \cap V_e|}{|C|} \tag{4}
\]

where $|C \cap V_e|$ is the number of true essential proteins in the predicted set.

Receiver operating characteristic (ROC) curves and the areas under the curves (AUC). In a ROC curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (1 − specificity) for different cut-off points. The closer the ROC curve follows the upper-left border, the more effective the method is for predicting essential proteins. The areas under the curves (AUC) were also calculated to compare different essential protein discovery methods.
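Both evaluation measures can be computed directly from a ranked list of candidates. The sketch below is illustrative only: the function names are hypothetical, every protein not in the essential set is treated as non-essential, and ties in the ranking are not handled.

```python
def precision_at(ranked, essential, n):
    """Precision of the top-n candidates: |C ∩ V_e| / |C| with C = ranked[:n]."""
    top = ranked[:n]
    if not top:
        raise ValueError("no candidates to evaluate")
    return sum(1 for p in top if p in essential) / len(top)


def roc_auc(ranked, essential):
    """Area under the ROC curve obtained by sweeping the cut-off down a
    ranked list (best candidate first). Proteins outside `essential` are
    treated as non-essential (assumption)."""
    pos = sum(1 for p in ranked if p in essential)
    neg = len(ranked) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need both essential and non-essential proteins")
    tp = 0
    area = 0.0
    for p in ranked:
        if p in essential:
            tp += 1  # the curve moves up by 1/pos
        else:
            area += tp / pos  # the curve moves right by 1/neg at height TPR
    return area / neg


# Toy ranking (hypothetical): A and C are the true essential proteins.
toy_ranked = ["A", "B", "C", "D", "E"]
toy_true = {"A", "C"}
p2 = precision_at(toy_ranked, toy_true, 2)
auc = roc_auc(toy_ranked, toy_true)
print(p2, auc)
```

In the toy ranking, five of the six (essential, non-essential) pairs are ordered correctly, which gives an AUC of 5/6.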


3.3. Compare CPPK with other methods


First, we compared our proposed method CPPK with 10 previously proposed essential protein discovery methods: Degree Centrality (DC) [12], Betweenness Centrality (BC) [20], Closeness Centrality (CC) [26], Subgraph Centrality (SC) [30], Eigenvector Centrality (EC) [31], Information Centrality (IC) [32], Bottle Neck (BN) [33,34], Density of Maximum Neighborhood Component (DMNC) [27], Local Average Connectivity-based method (LAC) [22], and Network Centrality (NC) [23]. The comparison result is shown in Table 1. For each of the 10 methods, proteins are ranked from the highest value to the lowest value. For CPPK, 100 known essential proteins are selected as prior knowledge. The selection is repeated 20 times, and the reported result of CPPK is the average predicted precision over the 20 runs. As shown in Table 1, the predicted precision of CPPK is much higher than that of the other 10 essential protein discovery methods. For example, when predicting 600 essential candidates, the improvements of CPPK over DC, BC, CC, SC, EC, IC, BN, DMNC, LAC, and NC are 48.8%, 69.8%, 63.8%, 69.0%, 69.0%, 48.8%, 84.0%, 40.9%, 22.1%, and 20.9%, respectively.

In Table 1, the 100 known essential proteins are included when the predicted precision is calculated for CPPK. One may ask whether the improvement of CPPK is caused by these 100 known essential proteins themselves. To address this concern, we removed the 100 known essential proteins when calculating the predicted precision of CPPK. Correspondingly, the 100 known essential proteins were also marked and excluded from the predicted candidates of the 10 topology-based essential protein discovery methods. The new comparison of CPPK and the 10 other methods, with the known essential proteins of prior knowledge excluded, is shown in Table 2. From Table 2, we can see that CPPK still performs significantly better than the 10 previously proposed essential protein discovery methods. When predicting 600 essential candidates, the improvements of CPPK over BC, CC, SC, EC, and BN are all more than 40%, those over DC and IC are both more than 26%, that over DMNC is 16.8%, and those over LAC and NC are both about 4.1%.

The above comparison between CPPK and the 10 other essential protein discovery methods (DC, BC, CC, SC, EC, IC, BN, DMNC, LAC, and NC) shows that CPPK outperforms all 10 topology-based methods, which also indicates that prior knowledge of known essential proteins can help to improve the predicted precision.

Moreover, the performance of CPPK and the 10 other essential protein discovery methods was compared in terms of their ROC curves and the areas under the curves (AUC), as shown in Fig. 3. Fig. 3(a) shows the comparison results with the 100 known essential proteins included, and Fig. 3(b) shows the comparison results with the 100 known essential proteins excluded. As shown in Fig. 3(a), the AUC value of CPPK is 0.710, which is higher than that of all the other 10 essential protein discovery methods. In Fig. 3(b), the AUC value of CPPK is 0.682, which is also higher than those of DC, BC, CC, SC, EC, IC, BN, and DMNC and similar to those of LAC and NC. From Fig. 3(a) and (b) we can see that the ROC curves of CPPK are the closest to the upper-left border in both panels.
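The improvement percentages quoted in this subsection appear to be relative gains in precision, that is, (p_CPPK − p_M) / p_M for a compared method M. This reading is an assumption: the text does not state the formula, but it is consistent with the values in Table 2 at 600 candidates. A minimal check, with a hypothetical helper name:

```python
def relative_improvement(p_new, p_base):
    """Relative gain of p_new over p_base, in percent."""
    return 100.0 * (p_new - p_base) / p_base


# Precision values at 600 predicted proteins, taken from Table 2.
table2_at_600 = {"CPPK": 0.479, "DMNC": 0.410, "LAC": 0.460, "NC": 0.460}
gains = {
    method: round(relative_improvement(table2_at_600["CPPK"], p), 1)
    for method, p in table2_at_600.items()
    if method != "CPPK"
}
print(gains)
```

The rounded gains over DMNC, LAC and NC match the 16.8% and 4.1% figures quoted in the text.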


3.4. Compare CEPPK with CPPK


Since CPPK has been shown to perform better than the 10 previously proposed centrality measures (DC, BC, CC, SC, EC, IC, BN, DMNC, LAC, and NC), we compare CEPPK with CPPK directly in this subsection. Fig. 4(a) and (b) illustrate the comparison results of CEPPK and CPPK with the known essential proteins included in and excluded from the calculation of predicted precision, respectively. From Fig. 4(a) and (b) we can see that CEPPK achieved higher precision than CPPK whether or not the known essential proteins of prior knowledge were taken into account. This comparison shows that the integration of gene expression can help to improve the predicted precision.

Table 1. Comparison of CPPK with 10 other previously proposed essential protein discovery methods: DC, BC, CC, SC, EC, IC, BN, DMNC, LAC, and NC. (a)

Number of predicted   Precision
proteins              CPPK   DC     BC     CC     SC     EC     IC     BN     DMNC   LAC    NC
100                   1.000  0.460  0.440  0.410  0.370  0.370  0.440  0.360  0.560  0.590  0.550
200                   0.839  0.410  0.385  0.395  0.385  0.385  0.400  0.380  0.445  0.605  0.630
300                   0.756  0.390  0.373  0.390  0.397  0.397  0.393  0.347  0.453  0.587  0.607
400                   0.683  0.395  0.363  0.380  0.395  0.395  0.403  0.363  0.455  0.570  0.575
500                   0.625  0.404  0.354  0.378  0.384  0.384  0.414  0.350  0.450  0.532  0.558
600                   0.623  0.418  0.367  0.380  0.368  0.368  0.418  0.338  0.442  0.510  0.515

(a) The 100 known essential proteins are included when the precision is calculated.

Table 2. Comparison of CPPK with 10 other previously proposed essential protein discovery methods: DC, BC, CC, SC, EC, IC, BN, DMNC, LAC, and NC. (a)

Number of predicted   Precision
proteins              CPPK   DC     BC     CC     SC     EC     IC     BN     DMNC   LAC    NC
100                   0.678  0.420  0.410  0.320  0.370  0.320  0.390  0.310  0.510  0.550  0.550
200                   0.635  0.360  0.340  0.340  0.350  0.340  0.370  0.340  0.410  0.560  0.580
300                   0.578  0.340  0.330  0.350  0.350  0.350  0.350  0.310  0.400  0.540  0.550
400                   0.532  0.350  0.320  0.350  0.340  0.350  0.360  0.320  0.410  0.510  0.520
500                   0.505  0.370  0.320  0.340  0.340  0.340  0.370  0.330  0.400  0.470  0.500
600                   0.479  0.380  0.330  0.320  0.340  0.320  0.370  0.300  0.410  0.460  0.460

(a) The 100 known essential proteins are removed when the precision is calculated.

Fig. 3. ROC curves and AUC values of CPPK and 10 other essential protein discovery methods: DC, BC, CC, SC, EC, IC, BN, DMNC, LAC, and NC.

Fig. 4. Comparison of prior knowledge based essential protein discovery methods CEPPK, CPPK and CDPK.

The 10 methods compared with CPPK above are all based on topological characteristics and do not use prior knowledge of known essential proteins, so comparing CPPK and CEPPK only with these methods may not be sufficient. For a direct comparison with a method that also uses prior knowledge, we introduce a simple method, CDPK. CDPK is based on the same scheme as CPPK and CEPPK. The main difference between CDPK and the other two algorithms is the rule for selecting new essential candidates. In the algorithm CDPK, the score of a protein $v$ in $N_K$ being an essential protein is calculated by $P_{dc}(v, K) = |e(u, v)|,\ u \in K \cap N_v$, that is, the number of edges connecting $v$ to proteins in $K$. The protein with the largest $P_{dc}$ is added to $K$. Once a new protein is added to $K$, its neighbors are updated and their scores are re-calculated. The predicted precision of CDPK with the known essential proteins of prior knowledge included is shown in Fig. 4(a), and that with the known essential proteins removed from the predicted candidates is shown in Fig. 4(b). In Fig. 4(a) and (b), CPPK and CEPPK both show obvious improvements over CDPK in predicting essential proteins, whether or not the known essential proteins are included in the computation of precision. The positive effect of prior knowledge on CEPPK and CPPK is much clearer than that on CDPK, possibly because CEPPK and CPPK capture the clustering property of essential proteins. Therefore, not only the prior knowledge of known essential proteins but also the selection rules contribute to identifying essential proteins correctly.
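The CDPK selection rule is simple enough to state in a few lines. The sketch below reads $P_{dc}(v, K)$ as the number of edges between $v$ and members of $K$. The function name, the alphabetical tie-breaking and the toy network are assumptions for illustration only.

```python
def cdpk_score(neighbors, v, known):
    """P_dc(v, K): number of known essential proteins adjacent to v."""
    return len(neighbors[v] & known)


# Toy network (hypothetical).
toy_neighbors = {
    "A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"},
    "D": {"B", "E"}, "E": {"D"},
}
known = {"A", "B"}

# Candidates are the neighbors of K that are not already in K.
candidates = set().union(*(toy_neighbors[u] for u in known)) - known
scores = {v: cdpk_score(toy_neighbors, v, known) for v in sorted(candidates)}
best = max(sorted(candidates), key=lambda v: scores[v])
print(scores, best)
```

Unlike the co-clustered score, this rule ignores how tightly the candidate and the seeds share neighbors, which matches the explanation given above for the weaker performance of CDPK.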


3.5. Effect of prior knowledge


After confirming that prior knowledge is useful for predicting essential proteins, we analyzed how many known essential proteins are enough for the prediction. Different numbers of known essential proteins (from 50 to 250 in increments of 50) were selected as prior knowledge. As before, the predicted precision was calculated in two ways when analyzing the effect of the number of known essential proteins: with the known essential proteins of prior knowledge included, and with them excluded. The results for CPPK are shown in Fig. 5(a) and (b), and those for CEPPK are shown in Fig. 6(a) and (b), respectively.


Fig. 5. Effect of different number of known essential proteins selected as prior knowledge on CPPK.

Fig. 6. Effect of different number of known essential proteins selected as prior knowledge on CEPPK.


Fig. 7. The predicted precisions of CPPK and CEPPK with 10-fold cross-validation analysis.

From Fig. 5(a), we can see that the more known essential proteins were available as prior knowledge, the higher the predicted precision achieved by CPPK. This is because the known essential proteins used as prior knowledge themselves contribute to the prediction. Different results were obtained after the known essential proteins of prior knowledge were removed from the predicted candidates, as shown in Fig. 5(b). There was not much difference between the results of CPPK using different values of $k$ (from $k = 50$ to $k = 250$) when the known essential proteins were excluded. Similar results were obtained by CEPPK, as shown in Fig. 6(a) and (b). From the above analysis we can conclude that a small number of known essential proteins is enough for the prediction of new essential proteins.
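
The difference between the two ways of computing precision can be made concrete with a short sketch. The exact counting rule used here is an assumption: when the seeds are included, they are taken to occupy the first positions of the top-$n$ list and to count as correct predictions; when they are excluded, only newly predicted proteins are counted. The function and argument names are hypothetical.

```python
def precision_with_seeds(predicted_new, seeds, essential, n):
    """Precision of a top-n list in which the seeds occupy the first slots.

    Assumes every seed is a known essential protein, so each seed counts
    as a correct prediction.
    """
    new = predicted_new[: max(n - len(seeds), 0)]
    hits = len(seeds) + sum(p in essential for p in new)
    return hits / (len(seeds) + len(new))


def precision_without_seeds(predicted_new, essential, n):
    """Precision over the first n newly predicted proteins only."""
    new = predicted_new[:n]
    if not new:
        raise ValueError("no predictions to evaluate")
    return sum(p in essential for p in new) / len(new)
```

Under this assumed counting rule, a larger seed set raises the first measure even if the new predictions are no better, which is consistent with the explanation given for Fig. 5(a), while the second measure depends only on the quality of the new predictions.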

3.6. Cross validation

In this subsection, 10-fold cross-validation was used to test the effectiveness of CPPK and CEPPK. All the known essential proteins were divided into 10 equal subsets. In standard 10-fold cross-validation, a single subset is retained as the validation data and the remaining nine subsets are used as training data. As CPPK and CEPPK are not machine learning methods and need only a small set of seeds, we instead used one subset as training data (the seed set) and the other nine subsets together as validation data. The cross-validation process was repeated 10 times, with each of the 10 subsets used exactly once as the seed set. The predicted precision was calculated each time, and the 10 predicted precisions were averaged to estimate the effectiveness of CPPK and CEPPK. The 10-fold cross-validation analysis was repeated five times, and the five results of CPPK and CEPPK are shown in Fig. 7. These results further confirmed the effectiveness of using a subset of the known essential proteins to predict new essential proteins.

Fig. 8. Robustness analysis of CPPK and CEPPK by randomly removing a certain proportion of edges (5%, 10%, 15%, 20%, and 25%) from the original protein interaction network.

Fig. 9. Robustness analysis of CPPK and CEPPK by randomly adding a certain proportion of edges (5%, 10%, 15%, 20%, and 25%) to the original protein interaction network.

Fig. 10. Robustness analysis of CPPK and CEPPK by randomly shuffling a certain proportion of edges (5%, 10%, 15%, 20%, and 25%) in the original protein interaction network.
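
The seed-based variant of 10-fold cross-validation described in this subsection, in which one fold serves as the seed set and the other nine as validation data, can be sketched as follows. This is an illustrative sketch only: `predict` stands for any hypothetical callable that takes a seed set and a number of candidates and returns a list of predicted proteins (for example, a CPPK or CEPPK implementation), and the way precision is scored against the validation folds is an assumption.

```python
import random


def inverted_kfold(essential, k=10, seed=0):
    """Yield (seed_set, validation_set) pairs: one fold seeds, k-1 folds validate."""
    items = sorted(essential)
    random.Random(seed).shuffle(items)
    # Striding distributes the shuffled items into k near-equal folds.
    folds = [items[i::k] for i in range(k)]
    for fold in folds:
        seeds = set(fold)
        yield seeds, set(items) - seeds


def cross_validated_precision(essential, predict, n_new, k=10, seed=0):
    """Average precision over the k seed sets.

    `predict(seeds, n_new)` is a hypothetical predictor returning a list
    of proteins; a prediction is counted as correct if it belongs to the
    held-out validation folds.
    """
    scores = []
    for seeds, validation in inverted_kfold(essential, k, seed):
        predicted = predict(seeds, n_new)
        if predicted:
            hits = sum(p in validation for p in predicted)
            scores.append(hits / len(predicted))
    return sum(scores) / len(scores)
```

Repeating the whole procedure with different values of `seed` corresponds to repeating the cross-validation analysis several times.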

3.7. Robustness analysis

One of the advantages of the proposed approach is that it can predict essential proteins for less studied organisms whose protein interaction networks are incomplete. Hence, we simulated false negatives in protein interaction networks by randomly removing a certain proportion of edges from the original network. Considering that most methods for predicting protein–protein interactions are known to generate a non-negligible amount of noise, we also simulated false positives by randomly adding a certain proportion of edges to the original network. Besides random additions and deletions, we also perturbed the network by shuffling the edges among proteins. Proportions of edges (5%, 10%, 15%, 20%, and 25%) were removed from, added to, or shuffled in the original network, respectively. We expected that a limited number of false negatives, false positives, or random perturbations would not substantially affect the predicted precision of the proposed methods. Fig. 8 displays the impact of edge removal on the results of CPPK and CEPPK, Fig. 9 illustrates the impact of edge addition, and Fig. 10 shows the impact of edge shuffling. From Figs. 8–10 we can see that CPPK and CEPPK were only slightly affected by the removal, addition, and shuffling of up to 25% of the edges. The analysis shows that CPPK and CEPPK are robust against high rates of false positives and false negatives in protein interaction networks. Hence, the proposed methods are not only suitable for predicting essential proteins for less studied organisms whose protein interaction networks are incomplete, but can also be used to discover essential proteins from highly noisy protein interaction networks.

3.8. Comparison with methods based on other features

In the above subsections, we compared the proposed methods to 10 node centralities (DC, BC, CC, SC, EC, IC, BN, DMNC, LAC, and NC). In this subsection, we compared the proposed method to an edge percolated component-based method, EPC [57]. EPC was integrated into Hubba [27] by Lin et al. to predict essential proteins. The comparison result is shown in Fig. 11. From Fig. 11 we can see that CEPPK performed much better than EPC. The advantage of EPC is that its predicted precision hardly falls as the number of predicted essential proteins increases. However, the predicted precisions of EPC were all about 0.4 for predicting no more than 600 essential protein candidates, which is much lower than those of CEPPK. Moreover, to show the difference between our newly proposed method CEPPK and our previously proposed method PeC [24], we also give the results of PeC in Fig. 11. Recently, a method called CoEWC (Co-Expression Weighted by Clustering coefficient) was developed by Zhang et al. [56]. CoEWC is also based on the integration of the protein interaction network and gene expression profiles and has been shown to outperform PeC. Hence, we also show the comparison of CEPPK and CoEWC. From Fig. 11 we can see that CEPPK outperformed PeC and CoEWC. CEPPK, PeC and CoEWC used the same protein interaction network and gene expression data. We note, however, that CEPPK will not always achieve better results than PeC and CoEWC. This is because CEPPK is a priori knowledge-based method, and PeC and CoEWC may catch up with it when a large number of essential proteins are predicted.

Fig. 11. The predicted precisions of CEPPK, EPC, PeC, CoEWC and a method using sequence feature (ORFL). For CEPPK, 50 known essential proteins were used as priori knowledge and they were taken into account when computing the predicted precision.

Beyond network-based methods, we further compared CEPPK with a method based on a sequence feature. It has been found that genes with longer open reading frames (ORFs) tend to be essential [54,55]. The ORF length of a gene i denotes the total number of base pairs included in the ORF of gene i. We refer to the method that uses ORF length to predict essential proteins as ORFL. The performance of ORFL is also shown in Fig. 11. From Fig. 11 we can see that the predicted precisions of ORFL were all lower than 0.4 for predicting no more than 600 essential protein candidates, whereas those of CEPPK were all higher than 0.5. CEPPK had a precision of more than 0.75 when predicting no more than 300 essential proteins.
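
As an illustration of the ORFL baseline, the sketch below ranks genes by ORF length in descending order and evaluates the precision of the top-$n$ candidates. The dictionary `orf_length` (gene name to length in base pairs), the function names, and the alphabetical tie-breaking are assumptions made for the example; the text states only that ORF length is used to predict essential proteins.

```python
def orfl_ranking(orf_length):
    """Rank genes from longest to shortest ORF (ties broken by gene name)."""
    return sorted(orf_length, key=lambda g: (-orf_length[g], g))


def precision_at(ranking, essential, n):
    """Fraction of the top-n ranked genes that are known to be essential."""
    top = ranking[:n]
    if not top:
        raise ValueError("no candidates to evaluate")
    return sum(g in essential for g in top) / len(top)
```

Evaluating `precision_at` for increasing values of `n` (for example, 100 to 600) would give a precision curve of the kind plotted for ORFL in Fig. 11.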

4. Conclusion

The identification of essential proteins at the network level is a hot topic in the postgenome era. Based on the fact that most species already have a number of known essential proteins, we proposed a new priori knowledge-based scheme to predict new essential candidates by making use of the existing knowledge. In the new scheme, the relationships among essential proteins and the differences between essential and non-essential proteins were investigated. We found that most essential proteins are neighbors of other essential proteins, and that they tend to be in the same cluster and to be coexpressed. Based on these observations, we proposed two essential protein discovery algorithms, CPPK and CEPPK. CPPK uses only topological characteristics to predict new essential proteins, whereas CEPPK detects new essential proteins by integrating network topology and gene expression. The performance of CPPK and CEPPK was tested on the well-studied species S. cerevisiae. The experimental results showed that CPPK and CEPPK clearly outperformed other previously proposed methods. In terms of predicted precision, the improvements of CPPK over BC, CC, SC, EC, and BN were more than 60% when the known essential proteins used as priori knowledge were counted, and still more than 40% after they were excluded from the calculation of predicted precision. These experimental results show that the priori knowledge of known essential proteins is effective for improving the predicted precision. More importantly, the predicted precision of CEPPK was even higher than that of CPPK for predicting the same number of new essential proteins. CPPK and CEPPK were also shown to be robust against the false positives and false negatives that exist in protein interaction networks. This makes CPPK and CEPPK useful tools for predicting essential proteins for less studied organisms whose protein interaction networks are incomplete, and also effective for identifying essential proteins from protein interaction networks with high false-positive rates. Of course, CPPK and CEPPK still have some limitations: (1) they require a set of known essential proteins and cannot work without any seeds; (2) their good performance depends on essential proteins being closely related to each other. In the future, we will study new strategies to overcome these limitations.

5. Uncited references

[28,29,35,36,37,39,50].

Acknowledgements

The authors would like to thank Mark Michaelson who helped to improve the language quality. This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61370024, 61379108, 61232001, the Program for New Century Excellent Talents in University (NCET-12-0547), China Postdoctoral Science Foundation 2013M531811, the U.S. National Science Foundation under Grant Nos. CCF-0514750, CCF-0646102, and CNS-0831634.

References

[1] M.L. Acencio, N. Lemke, BMC Bioinf. 10 (2009) 290.
[2] E.A. Winzeler, D.D. Shoemaker, A. Astromoff, H. Liang, K. Anderson, et al., Science 285 (1999) 901–906.
[3] R.S. Kamath, A.G. Fraser, Y. Dong, G. Poulin, R. Durbin, et al., Nature 421 (2003) 231–237.
[4] S.J. Furney, M.M. Alba, N. Lopez-Bigas, BMC Genom. 7 (165) (July 2006).
[5] F.A. Kondrashov, A.Y. Ogurtsov, A.S. Kondrashov, Nucleic Acids Res. 32 (5) (Mar 2004) 1731–1737.
[6] L.M. Steinmetz, C. Scharfe, A.M. Deutschbauer, D. Mokranjac, Z.S. Herman, et al., Nat. Genet. 31 (Aug 2002) 400–404.
[7] G. Giaever, A.M. Chu, L. Ni, et al., Nature 418 (6896) (2002) 387–391.
[8] L.M. Cullen, G.M. Arndt, Immunol. Cell Biol. 83 (3) (2005) 217–223.
[9] T. Roemer, B. Jiang, J. Davison, et al., Mol. Microbiol. 50 (2003) 167–181.
[10] H. Yu, D. Greenbaum, H. Xin Lu, X. Zhu, M. Gerstein, Trends Genet. 20 (6) (2004) 227–231.
[11] M.W. Hahn, A.D. Kern, Mol. Biol. Evol. 22 (4) (Dec 2004) 803–806.
[12] H. Jeong, S.P. Mason, A.L. Barabási, Z.N. Oltvai, Nature 411 (6833) (2001) 41–42.
[13] H. Yu et al., Science 322 (5898) (2008) 104–110.
[14] X.L. He, J.Z. Zhang, PLoS Genet. 2 (6) (2006) 0826–0834.
[15] E. Zotenko, J. Mestre, D.P. O’Leary, T.M. Przytycka, PLoS Comput. Biol. 4 (8) (2008) 1–16.
[16] Kang Ning, Hoong Kee Ng, Sriganesh Srihari, et al., BMC Bioinf. 11 (2010) 505.
[17] N.N. Batada, L.D. Hurst, M. Tyers, PLoS Comput. Biol. 2 (7) (2006) e88, http://dx.doi.org/10.1371/journal.pcbi.0020088.
[18] R. Vallabhajosyula, D. Chakravarti, S. Lutfeali, A. Ray, A. Raval, PLoS One 4 (4) (2009) 1–10.
[19] E. Ernesto, Available from , 2005.
[20] M. Joy et al., J. Biomed. Biotechnol. 2 (2005) 96–103.
[21] W. Kim, Tsinghua Sci. Technol. 17 (6) (2012) 645–658.
[22] Min Li, Jianxin Wang, Xiang Chen, Huan Wang, Yi Pan, Comput. Biol. Chem. 35 (2011) 143–150.
[23] Jianxin Wang, Min Li, Huan Wang, Yi Pan, IEEE/ACM Trans. Comput. Biol. Bioinf. 9 (4) (2012) 1070–1080.
[24] Min Li, Hanhui Zhang, Jianxin Wang, Yi Pan, BMC Syst. Biol. 6 (2012) 15.
[25] M. Li, J. Wang, H. Wang, Y. Pan, J. Bioinf. Comput. Biol. 11 (3) (2013) 1341002.
[26] S. Wuchty, P.F. Stadler, J. Theoret. Biol. 223 (2003) 45–53.
[27] Chung-Yen Lin, Chia-Hao Chin, Hsin-Hung Wu, et al., Nucleic Acids Res. (2008) W438–W443.
[28] Keunwan Park, Dongsup Kim, Proteomics 9 (2009) 5143–5154.
[29] Jun Ren, Jianxin Wang, Min Li, Huan Wang, Binbin Liu, in: ISBRA 2011, LNBI 6674, 2011, pp. 12–24.
[30] E. Estrada, J.A. Rodríguez-Velázquez, Phys. Rev. E 71 (5) (2005).
[31] P.F. Bonacich, Am. J. Sociol. 92 (5) (1987) 1170–1182.
[32] K. Stevenson, M. Zelen, Social Netw. 11 (1989) 1–37.
[33] N. Pržulj, D.A. Wigle, I. Jurisica, Bioinformatics 20 (3) (2004) 340–348.
[34] H. Yu, P.M. Kim, E. Sprecher, V. Trifonov, M. Gerstein, PLoS Comput. Biol. 3 (4) (2007) e59, http://dx.doi.org/10.1371/journal.pcbi.0030059.
[35] C. Friedel, R. Zimmer, BMC Bioinf. 7 (2006) 519.
[36] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, D. Parisi, Proc. Natl. Acad. Sci. USA 101 (2004) 2658–2663.
[37] Min Li, Jianxin Wang, Jian’er Chen, in: BMEI 2008, IEEE Press, 2008, pp. 3–7.
[38] http://en.wikipedia.org/wiki.
[39] A.-C. Gavin et al., Nature 415 (6868) (2002) 141–147.
[40] G.T. Hart, I. Lee, E.M. Marcotte, BMC Bioinf. 8 (2007) 236.
[41] X. Ding, W. Wang, X. Peng, J. Wang, J. Tsinghua Sci. Technol. 17 (6) (2012) 674–681.
[42] Z.H. You, Y.K. Lei, J. Gui, et al., Bioinformatics 26 (21) (2010) 2744–2751.
[43] L. Bonetta, Nature 468 (7325) (2010) 851–854.
[44] I. Xenarios, D.W. Rice, L. Salwinski, M.K. Baron, E.M. Marcotte, D. Eisenberg, Nucleic Acids Res. 28 (1) (2000) 289–291.
[45] H.W. Mewes et al., Nucleic Acids Res. 34 (Database issue) (2006) D169–D172.
[46] J.M. Cherry et al., Nucleic Acids Res. 26 (1) (1998) 73–79.
[47] R. Zhang, Y. Lin, Nucleic Acids Res. 37 (Database issue) (2009) 455–458.
[48] Saccharomyces Genome Deletion Project, .
[49] B.P. Tu, A. Kudlicki, M. Rowicka, S.L. McKnight, Science 310 (2005) 1152–1158.
[50] A.G. Holman, P. Davis, J.M. Foster, et al., BMC Microbiol. 9 (243) (2009).
[51] M. Deng, F. Sun, T. Chen, Pacific Symp. Biocomput. 8 (2003) 140–151.
[52] A. Grigoriev, Nucleic Acids Res. 29 (2001) 3513–3519.
[53] H. Ge, Z. Liu, G.M. Church, M. Vidal, Nat. Genet. 29 (4) (2001) 482–486.
[54] A. Gustafson, E. Snitkin, S. Parker, et al., BMC Genom. 7 (2006) 265.
[55] Y. Hwang, C. Lin, J. Chang, et al., Mol. BioSyst. 5 (2009) 1672–1678.
[56] X. Zhang, J. Xu, W. Xiao, PLoS One 8 (3) (2013) e58763.
[57] C.S. Chin, P.S. Manoj, Bioinformatics 19 (2003) 2413–2419.
[58] Y. Zhang, N. Du, K. Li, K. Jia, A. Zhang, Tsinghua Sci. Technol. 18 (5) (2013) 530–540.

Please cite this article in press as: M. Li et al., Methods (2014), http://dx.doi.org/10.1016/j.ymeth.2014.02.016
