HHS Public Access Author manuscript

IASTED Int Conf Comput Syst Biol (2006). Author manuscript; available in PMC 2015 May 27. Published in final edited form as: IASTED Int Conf Comput Syst Biol (2006). 2006 November ; 2006: 68–72.

A NEW CLUSTERING METHOD AND ITS APPLICATION TO PROTEOMIC PROFILING FOR COLON CANCER

Yongbin Ou, Department of Mathematics, West Virginia University, Morgantown, WV 26506-6310

Lan Guo, MBR Cancer Center/Department of Community Medicine, West Virginia University, Morgantown, WV 26506-9300

Cun-Quan Zhang, Department of Mathematics, West Virginia University, Morgantown, WV 26506-6310

Abstract


In this paper, we introduce a new clustering method, quasi-clique merger, and its associated data pretreatment programs. The method constructs non-binary hierarchical trees with a much smaller number of clusters in the output, and overlapping clusters are allowed. We applied this new method to cluster 60 human cancer cell lines (the NCI-60) using the previously identified proteomic determinants for chemosensitivity of 5-Fluorouracil (5-FU). All colon cancer cell lines were aggregated into a single cluster, indicating that the eight proteomic markers are potential diagnostic markers of colon cancer. The results based on the new clustering method have surpassed those based on previous methods on the same datasets.

Keywords

Biological Data Mining; unsupervised hierarchical clustering; overlapping clusters; Microarray Data Analysis; NCI-60; chemosensitivity determinants

1. Introduction


Clustering is one of the most important methods in bioinformatics research, and a variety of clustering algorithms are now available. Although all these approaches have clearly demonstrated their usefulness in applications, a number of important questions remain to be addressed, such as "problems related to robustness, uniqueness, and optimality of linear ordering which complicates the interpretation of the resulting hierarchical relationships", problems of "how to determine the optimal number of clusters" (Lukashin & Fuchs, Bioinformatics 2001 (1)), problems that "none of these algorithms can, in general, rigorously guarantee to produce a globally optimal clustering for non-trivial objective function" (Xu, Olman & Xu, Bioinformatics 2002 (2)) and "there are no completely satisfactory methods for determining the number of population clusters for any type of cluster analysis…" (SAS/STAT User's Guide (3)). In this paper, we introduce a new clustering method called "quasi-clique merger" and its associated data pretreatment programs. One of the most significant differences between the new method and existing methods is that the quasi-clique merger method constructs a much smaller hierarchical tree, which highlights the meaningful clusters (whereas the hierarchical trees produced by most existing hierarchical clustering methods are binary). Another special feature of the new method is multi-membership (also called overlapping clustering), a concept recently introduced by Palla et al. (4), Pereira-Leal et al. (5), and Futschik et al. (6).

Send reprint requests to [email protected].


We applied this new method to cluster 60 human cancer cell lines (the NCI-60) using the previously identified proteomic determinants for chemosensitivity of 5-Fluorouracil (5-FU). All colon cancer cell lines were aggregated into a single cluster, indicating that the eight proteomic markers are potential diagnostic markers of colon cancer. The results based on the new clustering method have surpassed those based on previous methods on the same datasets.

2. The quasi-clique merger algorithm

Graphs (networks) are among the most commonly used models for representing real-valued relationships among a set of input items. Let G=(V,E) be a graph with a weight function w: E(G)→R, where w(e) represents the similarity (closeness) of the items u and v for e=uv.


Clustering is a process that detects all dense subgraphs in G and lists their inclusion relations in a hierarchical structure. The algorithm is as follows.

2-1. Subprograms

For a subgraph C, we define the density of C by
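The displayed formula is missing from this copy. A form consistent with the clique remark that follows, offered here as an assumption rather than as the authors' stated definition, is the average edge weight over all vertex pairs of C:

```latex
d(C) = \frac{2\sum_{e \in E(C)} w(e)}{|V(C)|\,\bigl(|V(C)|-1\bigr)}
```

Under this assumed form, and with weights at most 1, d(C)=1 exactly when every pair of vertices of C is joined by an edge of weight 1.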


For readers familiar with graph theory or computer science: if w(e)=1 for every edge e in C, the subgraph C induces a clique. For a weighted graph, a subgraph C is called a Δ-quasi-clique if d(C) ≥ Δ for some positive real number Δ. A heuristic procedure is applied here to find quasi-cliques with densities at various levels. The core of the algorithm is deciding whether or not to add a vertex to a community. For a vertex v not in C, we define the contribution of v to C by
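The displayed formula is missing from this copy. One natural choice, assumed here for illustration because it is on the same scale as the density d(C), is the average weight of the edges joining v to the vertices of C, with w(uv) taken as 0 when uv is not an edge:

```latex
c(v, C) = \frac{\sum_{u \in V(C)} w(uv)}{|V(C)|}
```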


A vertex v is added to C if c(v,C) > α d(C), where α is a user-specified parameter.

Algorithm Grow(C,G) (grow a community C in G)
while V(G)\V(C)≠Ø
begin
  pick v∈V(G)\V(C) such that c(v,C) is a maximum
  if c(v,C) > α d(C)
  then add v to C
  else return
end

Algorithm Decompose(G,w0) (decompose a graph G into communities using edges with weights at least w0)
compute E0 = {e∈E(G): w(e)≥w0}
for each e=uv∈E0 in decreasing order of w(e)
begin
  if either u or v is not in any community then
  begin
    create a new empty community C and add u, v into C
    Grow(C,G)
  end
end
repeatedly, if |C1∩C2| > β min(|C1|, |C2|) for any two communities C1 and C2,
then merge C1 and C2 into a new community C = C1 ∪ C2
(where β is a user-specified parameter).

Algorithm Contraction(G)
Each community becomes a vertex. The weight of an edge is defined by
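The displayed formula is missing from this copy. A form consistent with the definition of Ec that follows, assumed here rather than recovered from the source, is the average weight of the crossing edges between two communities C1 and C2:

```latex
w(C_1 C_2) = \frac{\sum_{e \in E_c} w(e)}{|E_c|}
```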


where Ec is the set of crossing edges, defined by Ec = {v1v2: v1∈C1, v2∈C2, v1≠v2}.

2-2. The main algorithm

Algorithm Main-Algorithm (produce a hierarchical clustering tree for a graph G)
while E(G)≠Ø
begin
  choose w0 according to some criterion
  Decompose(G,w0)
  Contraction(G)
  store the resulting graph as G
end
trace the movement of each vertex and produce the hierarchical tree

Note: the choice of w0 depends on the weights of the edges in E(G). Usually w0 = γ max w(e).
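As a rough illustration, the Grow and Decompose subprograms of Section 2-1 can be sketched in Python. This is not the authors' program. The source does not show the formulas for d(C) and c(v,C), so `density` (average edge weight over all vertex pairs) and `contribution` (average weight between v and the members of C) are assumptions. The dictionary-based graph representation, the treatment of missing edges as weight 0, and the alphabetical tie-breaking are also assumptions. Contraction and the outer loop of the main algorithm are omitted.

```python
from itertools import combinations


def weight(w, u, v):
    """Weight of edge uv; a missing edge counts as 0 (assumption)."""
    return w.get(frozenset((u, v)), 0.0)


def density(w, C):
    """ASSUMED d(C): average edge weight over all vertex pairs of C."""
    n = len(C)
    if n < 2:
        return 0.0
    total = sum(weight(w, u, v) for u, v in combinations(C, 2))
    return 2.0 * total / (n * (n - 1))


def contribution(w, v, C):
    """ASSUMED c(v, C): average weight between v and the members of C."""
    return sum(weight(w, u, v) for u in C) / len(C)


def grow(w, vertices, C, alpha):
    """Grow(C, G): keep adding the best outside vertex while it qualifies."""
    C = set(C)
    while vertices - C:
        # Sorting first makes tie-breaking deterministic (illustrative choice).
        v = max(sorted(vertices - C), key=lambda x: contribution(w, x, C))
        if contribution(w, v, C) > alpha * density(w, C):
            C.add(v)
        else:
            break
    return C


def decompose(w, vertices, w0, alpha, beta):
    """Decompose(G, w0): seed communities from heavy edges, then merge."""
    communities = []
    strong = sorted((e for e in w if w[e] >= w0), key=lambda e: -w[e])
    for e in strong:
        u, v = tuple(e)
        covered = set().union(*communities)
        if u not in covered or v not in covered:
            communities.append(grow(w, vertices, {u, v}, alpha))
    # Repeatedly merge any two communities that overlap heavily.
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(communities)), 2):
            a, b = communities[i], communities[j]
            if len(a & b) > beta * min(len(a), len(b)):
                communities[i] = a | b
                del communities[j]
                merged = True
                break
    return communities


# Toy graph: a tight triangle a-b-c, a separate pair d-e, one weak link c-d.
edges = {("a", "b"): 0.9, ("a", "c"): 0.9, ("b", "c"): 0.9,
         ("d", "e"): 0.8, ("c", "d"): 0.1}
w = {frozenset(pair): x for pair, x in edges.items()}
communities = decompose(w, {"a", "b", "c", "d", "e"},
                        w0=0.5, alpha=0.7, beta=0.5)
```

On this toy graph the sketch returns two communities, {a, b, c} and {d, e}: the weak link c-d contributes far less than α d(C) to either side, so neither community absorbs the other's vertices.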


3. Data pretreatment

The quasi-clique merger algorithm produces far fewer clusters. This is one of the important features of the new method. However, if the weights of most edges are distributed in a very small interval, the above algorithm may not be able to recognize the small differences and therefore produces only a very small number of clusters. In order to output an appropriate number of clusters, we introduce the following data pretreatment in our processing.

3-1. Input

One of the most common input formats is an m×n matrix A = [aij]. Microarray data are usually in this format. Clustering separates the set of rows into several clusters.


3-2. "Minority rules"

Find the average bj of the j-th column, for every j.

This step maps every average value to zero and therefore highlights the difference between entries above the average and entries below the average.
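A minimal sketch of this step, assuming (the displayed formula is not shown in this copy) that it replaces each entry aij by aij − bj. The function name `center_columns` is illustrative and not from the source.

```python
def center_columns(A):
    """Subtract each column's average b_j from every entry of that column.

    A is a list of m rows, each a list of n numbers (the m x n input matrix).
    ASSUMPTION: the pretreatment is plain column centering.
    """
    m = len(A)
    n = len(A[0])
    # b[j] is the average of column j over all m rows.
    b = [sum(row[j] for row in A) / m for j in range(n)]
    # Entries equal to the column average become 0; others keep their sign
    # relative to the average.
    return [[row[j] - b[j] for j in range(n)] for row in A]


centered = center_columns([[1.0, 2.0], [3.0, 6.0]])
```

After centering, every column of the result sums to zero, so the sign of an entry shows directly whether it lies above or below its column average.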


3-3. Similarity – angles between vectors


Each row is considered as a vector vi, and the angle between two vectors is used to measure their difference. Hence, the similarity between two vectors vi and vj is cos θ, where θ is the angle between the vectors (cos θ is determined by the inner product).

3-4. Difference amplifier

The function f used in our application to proteomic profiling of cancer cell lines is a composite of a rational function (of order −3) and an inverse trigonometric function.
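The angle-based similarity of Section 3-3 can be sketched as follows. The function names are illustrative, and returning 0 for a zero vector (whose angle is undefined) is an assumption. The amplifier f of Section 3-4 is not reproduced, because its exact form is not given in the text.

```python
import math


def cosine_similarity(x, y):
    """cos(theta) between vectors x and y, computed from the inner product."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # ASSUMPTION: treat an all-zero row as similar to nothing
    return dot / (norm_x * norm_y)


def similarity_matrix(rows):
    """Pairwise cosine similarities between the rows of the input matrix."""
    return [[cosine_similarity(r, s) for s in rows] for r in rows]


S = similarity_matrix([[1.0, 0.0], [0.0, 1.0]])
```

The entries of such a matrix would serve as the edge weights w(e) of the graph G on which the clustering algorithm runs, before any amplification.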


4. Application -- proteomic profiling for cancer cell lines

Assessment of an individual’s predisposition to drugs is essential to achieve the goal of personalized medicine. This approach is needed to allow clinicians to choose a treatment option that includes the most effective therapeutic agents for a given patient while avoiding ineffective agents and unnecessary side effects. In a previous study, we explored proteomic contributions to drug sensitivity and predicted the drug responses of 60 human cancer cell lines (NCI-60 panel) to 118 anti-cancer agents by proteomic profiling (7). The protein expression levels were measured in untreated cells. As the focus was on predicting the response to therapy and not analyzing the molecular consequences of therapy, this study provides a basis for predicting drug responses based on protein markers in the tumors of untreated patients.


It is especially challenging to predict chemosensitivity in a clinical context because drug responses reflect the properties intrinsic to both the target cells and host metabolism (8). Our analysis was limited to the intrinsic properties of cells in culture by modeling the response of the NCI-60 panel of 60 human cancer cell lines, which includes lines derived from leukemias, melanomas, and carcinomas of ovarian, renal, breast, prostate, colon, lung, and central nervous system (CNS) origin. These cell lines have been screened previously for the activity of 118 anti-cancer drugs whose mechanisms of action are putatively known (9). Some of these drugs are currently in routine clinical use for cancer treatment; others are in clinical trials or the late stages of drug development.


We investigated the feasibility of predicting drug responses using protein expression levels. Both the proteomic profiles (10) and drug activity database (9) were generated by the National Cancer Institute (NCI) and are available at the NCI website (http://discover.nci.nih.gov/datasets.jsp). The protein expression database was generated by proteomic assays with 52-antibody, reverse-phase, protein lysate microarrays in each individual cell line (10). We sought to identify important protein markers for predicting responses to the 118 anti-cancer agents in each cell line. Classifiers of the complete range of drug responses (sensitive, intermediate, and resistant) were developed, one for each drug evaluated. The chemosensitivity classifiers were designed to be independent of the cells’ tissue origin.


This study identified the protein markers for predicting chemosensitivity of the 118 agents in 60 cancer cell lines. The markers can, in principle, provide a basis to devise the optimal combination of therapies directed specifically to eliminate the cancer cells, while minimizing toxicity to the normal cells (11). In addition, the markers can theoretically portray a unique molecular signature for detection and diagnosis of a cancer (10)–(12). Among the studied drugs, 5-Fluorouracil (5-FU) (NSC 19893) has been included in the treatment combinations for patients with stage III colon cancer (13). Using Random forests in the software package R (http://www.r-project.org/), eight protein markers were identified for the prediction of drug response to 5-FU: CDH1, CDH2, KRT8, ERBB2, MSN, MVP, MAP2K1, and MGMT. All of these proteins, except for KRT8, are involved in the pathogenesis of colon cancer. In order to investigate the feasibility of using these markers to diagnose colon cancer, we performed unsupervised hierarchical clustering on the 60 cancer cell lines using the expression levels of these eight proteins. There were a total of seven colon cancer cell lines in the NCI-60 panel: KM-12, HCT-15, HT29, COLO-205, HCC-2998, HCT-116 and SW-620. All of them were aggregated together by the new clustering method, indicating that the identified protein markers provide a basis not only for detection and diagnosis of colon cancer, but also for devising the optimal therapeutic combination targeted specifically to eliminate the cancer cells. Fig. 1 is the output of the program, in which all colon cancer cell lines are grouped in one cluster. It is noteworthy that our previous research (7) was not able to produce such a cluster using CIMminer (Fig. 2) (http://discover.nci.nih.gov/cimminer/), developed by the National Cancer Institute (14).

5. Conclusion

The new clustering method (quasi-clique merger and its associated pretreatment) introduced in this paper is based on a graph theoretical model. This method is designed for searching and merging cliques or clique-like subgraphs in an input dataset. This method is capable of producing more meaningful clusters than many other methods. The application to 60 human cancer cell lines clearly indicates that the results produced by this method have surpassed those based on previous methods on the same datasets.

Acknowledgements

Y. Ou was supported in part by the West Virginia University Research Corporation; L. Guo was supported in part by NIH under Grant NIH/NCRR P20 RR16440-03; C.-Q. Zhang was supported in part by the National Security Agency under Grant MDA904-01-1-0022 and by WV EPSCoR under Grant EPS2006-37.

References

1. Lukashin AV, Fuchs R. Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics. 2001; 17(5):405–414. [PubMed: 11331234]
2. Xu Y, Olman YV, Xu D. Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics. 2002; 18:536–545. [PubMed: 12016051]
3. SAS OnlineDoc. SAS/STAT User's Guide. Duluth: University of Minnesota; 1999. (http://www.d.umn.edu/math/docs/saspdf/stat/pdfidx.htm)
4. Palla G, Derenyi I, Farkas I, Vicsek T. Uncovering the overlapping community structure of complex networks in nature and society. Nature. 2005; 435(7043):814–818. [PubMed: 15944704]
5. Pereira-Leal JB, Enright AJ, Ouzounis CA. Detection of Functional Modules From Protein Interaction Networks. PROTEINS: Structure, Function, and Bioinformatics. 2004; 54:49–57.
6. Futschik ME, Carlisle B. Noise-Robust soft clustering of gene expression timecourse. Journal of Bioinformatics and Computational Biology. 2005; 3(4):965–988. [PubMed: 16078370]
7. Ma Y, Ding Z, Qian Y, Shi X, Castranova V, Harner EJ, Guo L. Predicting Cancer Drug Response by Proteomic Profiling. Clinical Cancer Research. 2006 (accepted).
8. Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J, Scherf U, Lee JK, Reinhold WC, Weinstein JN, Mesirov JP, Lander ES, Golub TR. Chemosensitivity prediction by transcriptional profiling. Proc. Natl. Acad. Sci. U.S.A. 2001:10787–10792. [PubMed: 11553813]
9. Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Tanabe L, Kohn KW, Reinhold WC, Myers TG, Andrews DT, Scudiero DA, Eisen MB, Sausville EA, Pommier Y, Botstein D, Brown PO, Weinstein JN. A gene expression database for the molecular pharmacology of cancer. Nat. Genet. 2000:236–244. [PubMed: 10700175]
10. Nishizuka S, Charboneau L, Young L, Major S, Reinhold WC, Waltham M, Kouros-Mehr H, Bussey KJ, Lee JK, Espina V, Munson PJ, Petricoin E III, Liotta LA, Weinstein JN. Proteomic profiling of the NCI-60 cancer cell lines using new high-density reverse-phase lysate microarrays. Proc. Natl. Acad. Sci. U.S.A. 2003:14229–14234. [PubMed: 14623978]
11. Munagala K, Tibshirani R, Brown PO. Cancer characterization and feature set extraction by discriminative margin clustering. BMC Bioinformatics. 2004:21. [PubMed: 15070405]
12. Nishizuka S, Chen ST, Gwadry FG, Alexander J, Major SM, Scherf U, Reinhold WC, Waltham M, Charboneau L, Young L, Bussey KJ, Kim S, Lababidi S, Lee JK, Pittaluga S, Scudiero DA, Sausville EA, Munson PJ, Petricoin EF III, Liotta LA, Hewitt SM, Raffeld M, Weinstein JN. Diagnostic markers that distinguish colon and ovarian adenocarcinomas: identification by genomic, proteomic, and tissue array profiling. Cancer Res. 2003:5243–5250. [PubMed: 14500354]
13. Benson AB III. Adjuvant Chemotherapy of Stage III Colon Cancer. Semin. Oncol. 2005:74–77.
14. Weinstein JN, Myers TG, O'Connor PM, Friend SH, Fornace AJ Jr, Kohn KW, Fojo T, Bates SE, Rubinstein LV, Anderson NL, Buolamwini JK, van Osdol WW, Monks AP, Scudiero DA, Sausville EA, Zaharevitz DW, Bunow B, Viswanadhan VN, Johnson GS, Wittes RE, Paull KD. An information-intensive approach to the molecular pharmacology of cancer. Science. 1997:343–349. [PubMed: 8994024]


Fig. 1.


Fig. 2.
