Int. J. Bioinformatics Research and Applications, Vol. 11, No. 1, 2015

TDAC: co-expressed gene pattern finding using attribute clustering

Tahleen A. Rahman and Dhruba K. Bhattacharyya*
Department of Computer Science & Engineering, Tezpur University, Assam, India
Email: [email protected]
Email: [email protected]
*Corresponding author

Abstract: A number of clustering methods introduced for the analysis of gene expression data to extract potential relationships among genes are studied and reported in this paper. An effective unsupervised method (TDAC) is proposed for simultaneous detection of outliers and biologically relevant co-expressed patterns. The effectiveness of TDAC is established in comparison with competing algorithms over six publicly available benchmark gene expression datasets in terms of both internal and external validity measures. The main attractions of TDAC are: (a) it does not require discretisation, (b) it is capable of identifying biologically relevant co-expressed gene patterns as well as outlier gene(s), (c) it is cost-effective in terms of time and space, (d) it does not require the number of clusters a priori, and (e) it is free from the restrictions of using any proximity measure.

Keywords: cluster; outlier; core; neighbour; connected; gene; co-expressed.

Reference to this paper should be made as follows: Rahman, T.A. and Bhattacharyya, D.K. (2015) ‘TDAC: co-expressed gene pattern finding using attribute clustering’, Int. J. Bioinformatics Research and Applications, Vol. 11, No. 1, pp.45–71.

Biographical notes: Tahleen A. Rahman received her BTech degree in Computer Science and Engineering from Tezpur University in 2012. She is now pursuing an MS. Dhruba K. Bhattacharyya received his PhD in Computer Science from Tezpur University in 1999. He is a Professor in the Computer Science & Engineering Department at Tezpur University. His research areas include data mining, network security and bioinformatics. He has published 200+ research papers in leading international journals and conference proceedings. In addition, he has written/edited nine books. He is a Programme Committee/Advisory Body member of several international conferences/workshops.

This paper is a revised and expanded version of a paper entitled ‘TDAC: co-expressed gene pattern finding using attribute clustering’ presented at the 2nd International Conference on Soft Computing for Problem Solving (SocProS 2012), Jaipur, India, 28–30 December 2012.

Copyright © 2015 Inderscience Enterprises Ltd.


1 Introduction

DNA microarray technology has revolutionised the monitoring of the expression levels of thousands of genes. Finding the hidden patterns in this huge volume of gene expression data requires computationally efficient methods and thus offers a huge opportunity for the understanding of functional genomics. However, the large number of genes and the complexity of biological relationships greatly increase the challenge of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. Clustering techniques have been found capable of addressing this challenge. These techniques help in identifying the inherent natural structures and the interesting patterns in the dataset. Clustering groups genes with similar expression patterns (co-expressed genes) into the same cluster and helps in understanding gene function, gene regulation, cellular processes, and sub-types of cells. Many clustering algorithms have been developed and applied to gene expression data. The existing approaches for gene expression data clustering fall into three categories. (a) Gene-based clustering, which groups together co-expressed genes that indicate co-function and co-regulation and reveals the natural data structures; here, genes are treated as the objects, while the samples are treated as the features. (b) Sample-based clustering, where samples are generally related to various diseases or drug effects within a gene expression matrix. Only a small subset of genes, whose expression levels strongly correlate with the class distinction and rise and fall coherently, exhibiting fluctuations of a similar shape under a subset of conditions, participate in any relevant cellular process; these are called the informative genes. The remaining genes are regarded as noise in the data, as they are irrelevant to the sample of interest. (c) Subspace clustering, which attempts to find subsets of objects such that the objects emerge as a cluster in a subspace formed by a subset of the features; the subsets of features for different subspace clusters can differ. Genes and samples are treated symmetrically, so that either genes or samples can be regarded as objects or features. Subspace clustering techniques are further classified into two sub-categories, i.e., biclustering and triclustering. Biclustering attempts to cluster the gene expression data both row-wise and column-wise simultaneously. An exhaustive survey of biclustering techniques can be found in Jiang et al. (2004). Triclustering, on the other hand, aims to mine biologically relevant coherent clusters over a Gene-Sample-Time (GST) domain for gene expression datasets. It mines arbitrarily positioned and overlapping clusters and, depending on different parameter values, it mines a diverse variety of clusters. Triclustering typically relies on a graph-based approach to mine all valid clusters and to merge or delete clusters having large overlaps. A survey of triclustering techniques can be found in Mahanta et al. (2011). All three categories, namely gene-based, sample-based, and subspace clustering, have different challenges. In this paper, we investigate the problem of identifying coherent patterns using a gene-based approach. Most clustering algorithms are invariably dependent on input parameters such as the number of clusters, a cut-off value, the minimum number of points, etc., to keep the biological relevance of the genes intact. Also, most existing techniques employ discretisation of the expression data as a preprocessing step.
In this paper, we investigate the problems associated with discretisation and propose a cost-effective attribute clustering method for finding co-expressed gene patterns that does not require discretisation. To avoid the restrictions caused by the use of any proximity measure while expanding a cluster, it exploits the regulation information computed over the expression values. An outlier detection algorithm for the identification of outlier or non-conforming gene(s) is also presented. The major attractions of the proposed TDAC are: (a) it is capable of simultaneously identifying the biologically relevant co-expressed patterns as well as rare or outlier genes, (b) it is free from discretisation and prior estimation of the number of clusters, and (c) it is free from the restrictions of using a specific proximity measure.

1.1 Contributions

Following are our contributions:

•	a selected survey on the existing missing value estimation techniques;
•	a selected survey on the existing relevant proximity measures and their pros and cons;
•	a selected survey on the existing relevant discretisation techniques and their pros and cons;
•	an attribute clustering technique for biologically relevant co-expressed pattern finding as well as for identification of rare or non-conforming gene instances;
•	internal and external validation of the cluster results in terms of homogeneity, separability, silhouette index, p-value and Q-value.

1.2 Organisation

The remainder of the paper is organised as follows. Section 2 reports a limited survey of the existing clustering methods for gene expression data analysis. Our observations on the effectiveness of these algorithms are also reported in this section. In Section 3, we provide the background of our work and describe the proposed method. Section 4 presents the experimental results and, finally, the conclusions and future work are reported in Section 5.

2 Related work

The existing approaches for gene clustering fall into five distinct categories: partitioning, hierarchical, density based, model based and graph based. Next, we briefly discuss some of the existing clustering methods category-wise (Jiang et al., 2003).

1	Partitioning approach: k-means (McQueen et al., 1967) is a typical partition-based clustering algorithm which divides the data into a pre-defined number of clusters in order to optimise a predefined criterion. Its major advantages are its simplicity and speed, which allow it to run on large datasets. However, it may not yield the same result with each run of the algorithm. It is often incapable of handling outliers and is not suitable for detecting clusters of arbitrary shapes. A Self-Organising Map (SOM) (Kohonen, 1995) is more robust than k-means for clustering noisy data. It requires the number of clusters and the grid layout of the neuron map as user input. Specifying the number of clusters in advance is difficult in the case of gene expression data. Moreover, partitioning approaches are restricted to data of lower dimensionality, with inherent well-separated clusters of high density. But gene expression datasets may be high dimensional and often contain intersecting and embedded clusters. QT (quality threshold) clustering (Heyer et al., 1999) is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times. The distance between a point and a group of points is computed using complete linkage, i.e., as the maximum distance from the point to any member of the group (Eisen et al., 1998).

2	Hierarchical approach: A hierarchical structure can also be built based on the SOM, such as the Self-Organising Tree Algorithm (SOTA) (Dopazo and Carazo, 1997). Recently, several new algorithms, such as those of Herrero et al. (2001) and Tomida et al. (2002), have been proposed based on the SOM algorithm. These algorithms can automatically determine the number of clusters and dynamically adapt the map structure to the distribution of the data. Herrero et al. (2001) extend the SOM with a binary tree structure. At first, the tree contains only a root node connecting two neurons. After a training process similar to that of the SOM algorithm, the dataset is segregated into two subsets. Then the neuron with less coherence is split into two new neurons. This process is repeated level by level, until all the neurons in the tree satisfy some coherence threshold. Another example of a SOM extension is Fuzzy Adaptive Resonance Theory (Fuzzy ART) (Tomida et al., 2002), which provides approaches to measure the coherence of a neuron (e.g., a vigilance criterion). The output map is adjusted by splitting existing neurons or adding new neurons to the map, until the coherence of each neuron in the map satisfies a user-specified threshold. The drawbacks of k-means are the lack of prior knowledge of the number of gene clusters in a gene expression dataset, which results in varying results across successive runs since the initial clusters are selected randomly, and the fact that the quality of the attained clustering has to be assessed. The drawback of SOM is that it is not always effective, since the main interesting patterns may be merged into only one or two clusters and cannot be identified. The Unweighted Pair Group Method with Arithmetic Mean (UPGMA), presented in Eisen et al. (1998), adopts an agglomerative method to graphically represent the clustered dataset. However, it is not robust in the presence of noise. In Alon et al. (1999), the genes are split through a divisive approach, called the Deterministic-Annealing Algorithm (DAA). The Divisive Correlation Clustering Algorithm (DCCA) (Bhattacharya and De, 2008) uses Pearson's correlation as the similarity measure: all genes in a cluster have the highest average correlation with the genes in that cluster. Hierarchical clustering not only groups together genes with similar expression patterns but also provides a natural way to graphically represent the dataset, allowing a thorough inspection. However, a small change in the dataset may greatly change the hierarchical dendrogram structure. The drawbacks of this method are its high computational complexity, lack of robustness, vagueness of termination criteria and failure with a large number of genes as datasets grow in complexity.

3	Density based approach: A density-based cluster can be defined as a region of the gene space in which the local density is higher than in its surrounding region. To identify such a region, we need to calculate local densities of genes in the space. The density of genes is governed by two factors: (a) the typical distances among the genes, and (b) the number of neighbours of a gene, indicative of the dimension in which the points are embedded. Density-based clustering algorithms identify dense areas in the object space. Clusters are hypothesised as high-density areas separated by sparsely dense areas. A kernel density clustering method for gene expression profile analysis is reported in Shu et al. (2003). It assumes no parametric statistical model and does not rely on any specific probability distribution. Hyper-spherical uniform kernels of variable radius are used and density estimates of the data points are found. The method is robust and less sensitive to outliers. However, accurate density estimation and assignment of cluster membership require multiple data points in near neighbourhoods, and thus density estimation is less accurate when the cluster size is small. In Jiang et al. (2003), the authors propose the Density-based Hierarchical Clustering method (DHC), which uses a density-based approach to identify co-expressed gene groups from gene expression data. It considers clusters as high-dimensional dense areas where the genes are attracted to each other. DHC uses a two-level hierarchical structure (an attraction tree and a density tree) to organise the cluster structure of the dataset. The attraction tree reflects relationships among genes in the dense area. Each node in the attraction tree represents a gene and its parent is its attractor. The highest-density gene becomes the root of the tree. The attraction tree becomes complicated for large datasets and hence the cluster structure is summarised in a density tree. Each node of the density tree represents a dense area. Initially the whole dataset is considered a single dense area represented by the root node of the density tree. This dense area is then split into several sub-dense areas based on some criteria, where each sub-dense area is represented by a child node of the root node. The sub-dense areas are further split until each sub-dense area contains a single cluster. DHC is suitable for detecting highly connected clusters but is computationally expensive and is dependent on two global parameters. An alternative is to define the similarity of points in terms of their shared nearest neighbours, an idea first introduced by Jarvis and Patrick (1973). In Chung et al. (2004), a k-nearest-neighbour based density estimation technique has been exploited. The density-based algorithm proposed by Chung et al. (2004) works in three phases: density estimation for each gene, rough clustering using core genes, and cluster refinement using border genes. The density of a gene is calculated as the sum of similarities among its k nearest neighbours. Core genes are high-density genes and the method proceeds by clustering core genes to form the rough clusters. Once the rough clusters are formed, the border genes are assigned to the most relevant cluster. In Syamala et al. (2006), the authors present a density and shared nearest neighbour based clustering method. The similarity measure used is Pearson's correlation and the density of a gene is given by the sum of its similarities with its neighbours.
The shared nearest neighbours of the dense genes are found and merged into the same cluster. The merging is done efficiently using a data structure called the P-tree (Perrizo, 2001). In Bhattacharya and De (2008), RDClust, a density-based clustering algorithm, is presented for clustering gene expression data using a two-objective function. RDClust uses regulation information as well as a suitable dissimilarity measure to cluster genes into regions of higher density separated by sparser regions. The density-based approach gives clusters of good quality but suffers from input parameter dependency and high computational complexity with increasing dimensionality.

4	Model based approach: The Expectation Maximisation (EM) algorithm (Dempster et al., 1977) is a model-based algorithm that discovers good values for its parameters iteratively. It can handle various shapes of data, but can be very expensive since a large number of iterations may be required. In Travis and Huang (2009), a signal shape similarity method is used to cluster genes using a Variational Bayes Expectation Maximisation algorithm (Beal and Ghahramani, 2003). An important advantage of the model-based approach is that it provides an estimated probability that a data object belongs to a particular cluster. Gene expression data are typically highly connected; there may be instances in which a single gene has a high correlation with two totally different clusters. Thus, the probabilistic feature of model-based clustering is particularly suitable for gene expression data. However, model-based clustering relies on the assumption that the dataset fits a specific distribution, which may not be true in many cases.

5	Graph based approach: Among the graph-based algorithms, the CLuster Identification via Connectivity Kernels (CLICK) method (Sharan et al., 2003) is suitable for subspace and high-dimensional data clustering. CLICK is robust to outliers and does not make assumptions about the number or structure of clusters. Although CLICK does not need the number of clusters a priori, the algorithm may generate a large number of clusters because of the use of a homogeneity parameter. Ben-Dor introduced the idea of corrupted clique graphs and used the concepts of a clique graph and divisive clustering in his algorithm, Cluster Affinity Search Technique (CAST) (Ben-Dor et al., 1999). A clique graph is an undirected graph formed by the union of disjoint complete sub-graphs, where each clique represents a cluster. The model assumes that there is a true biological partition of the genes into disjoint clusters based on the functionality of the genes (Ben-Dor et al., 1999). The genes (objects) form sub-graphs or cliques where intra-clique genes are completely similar and inter-clique genes are completely dissimilar. CAST takes as input the pair-wise similarities between genes and an affinity threshold t. The algorithm searches through the clusters one at a time, adding genes to or removing genes from a cluster depending on whether a connectivity condition is satisfied. CAST does not require a user-defined number of clusters and handles outliers efficiently, but it faces difficulty in determining a good threshold value. In CAST, the size and number of clusters produced are directly affected by the fixed user-defined affinity threshold t. Hence, a priori domain knowledge of the dataset is required. To overcome this problem, E-CAST (Bellaachia et al., 2002) calculates the threshold value dynamically based on the similarity values of the objects that are yet to be clustered; the threshold is computed at the creation of each cluster. Another significant approach to overcome this issue can be found in Priyadarshini et al. (2010). Graph-theoretic approaches can be considered relevant to gene expression data mining as they are capable of discovering intersected clusters. However, they sometimes generate non-realistic cluster patterns. A simplified CAST-style sketch is given after this list.
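As a concrete illustration of the affinity-threshold idea behind CAST, the following Python sketch grows clusters one at a time from a pairwise similarity matrix. It is a deliberate simplification of Ben-Dor et al.'s algorithm (average affinity in place of the exact CAST bookkeeping); the function name, threshold t, pass limit and synthetic data are assumptions for illustration only.

```python
# Simplified CAST-style clustering: genes join an open cluster while their
# average affinity to it stays above t, and drop out when it falls below t.
import numpy as np

def cast_like(sim, t=0.5, max_passes=20):
    unassigned = set(range(sim.shape[0]))
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        cluster = {seed}
        for _ in range(max_passes):            # bounded refinement passes
            changed = False
            # add unassigned genes whose average affinity to the cluster is >= t
            for g in list(unassigned):
                if sim[g, list(cluster)].mean() >= t:
                    cluster.add(g); unassigned.remove(g); changed = True
            # remove genes whose average affinity has dropped below t
            for g in list(cluster):
                others = list(cluster - {g})
                if len(cluster) > 1 and sim[g, others].mean() < t:
                    cluster.remove(g); unassigned.add(g); changed = True
            if not changed:
                break
        clusters.append(sorted(cluster))
    return clusters

rng = np.random.default_rng(4)
expr = rng.normal(size=(100, 10))              # hypothetical 100 genes x 10 conditions
clusters = cast_like(np.corrcoef(expr), t=0.3) # Pearson correlation as affinity
print('clusters found:', len(clusters))
```

With random data the sketch simply partitions the genes by the chosen affinity threshold; its purpose is only to show how the size and number of clusters follow directly from t.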


2.1 Discussion

Following are the observations based on our selective survey:

•	The clustering algorithms are useful in identifying groups of co-expressed genes and in discovering coherent expression patterns.
•	Most clustering algorithms are invariably dependent on multiple input parameters, such as the number of clusters, a proximity threshold, a neighbourhood threshold, etc., and the cluster results generated by these algorithms are largely influenced by these parameters.
•	Most available gene expression datasets possess a significant number of missing values and, due to the incapability of most existing techniques in handling such noisy datasets, their results are largely affected.
•	The majority of the clustering techniques depend on a choice of proximity measure; however, selection of an appropriate proximity measure for handling high-dimensional gene expression data is a challenging task. The major limitations of the existing proximity measures are: the curse of dimensionality, being largely influenced by the presence of noisy or outlier gene(s), significant loss of information due to preprocessing, incapability of capturing the true biological requirement(s), etc.
•	The performance of most existing clustering algorithms is established to be superior in terms of internal validity measures; however, their performance is often found poor from the point of view of external validity measures, i.e., p-value and Q-value.

Therefore, the development of a clustering technique that is free from the restrictions imposed by discretisation and proximity measures, does not require the number of clusters as an a priori input parameter, and is able to detect biologically relevant clusters of arbitrary shape is of utmost importance.

3 TDAC: the proposed attribute clustering method

TDAC is basically a three-step method. In step 1, the gene expression data matrix, i.e., G_{m×n} of order m×n, is normalised to have mean 0 and standard deviation 1. In step 2, we find the condition-wise neighbourhood for each expression value based on regulation information and proximity with the neighbouring expression values. Figure 1 shows three cases of matching between a pair of genes at a particular condition (say, C_a) based on proximity, regulation or both. To find the neighbourhood based on expression value proximity, we use a linear density-based clustering that works on the L1 norm with reference to β, a user-defined threshold. Similarly, to find the similarity between a pair of genes based on regulation information, we use the angular deviation (i.e., +ve, −ve or neutral) computed based on the arccos formula given in Sarma et al. (2011). This step identifies the core gene groups for each condition based on the regulation information and the proximity-based neighbourhood information with reference to β. Step 3 performs two major tasks, i.e., identification of (a) outlier genes and (b) co-expressed gene groups. A co-expressed gene group is a subset of genes having common neighbours (at least two) over at least k conditions. Here, we assume that to form a co-expressed gene group or cluster, there must be at least two neighbour genes (a neighbour of a gene is defined in Definition 1) over at least k conditions. An outlier gene is a gene that fails to satisfy the neighbourhood condition over (n − k) or more conditions (see Definition 7).

Table 3 gives a general comparison of discretisation methods reported in the literature, along with their pros and cons.

Table 3	Discretisation methods: a general comparison

Mid-range: expression values above the mid-range value are assigned 1, else 0. Advantages: easy to implement. Limitations: (1) affected by outliers; (2) cannot detect −ve regulation.

Max − X% Max: a fixed cut-off is used w.r.t. the maximal expression value to remove X% of this value; expression values greater than (100 − X)% of the maximum are assigned 1, else 0. Advantages: easy to implement. Limitations: (1) may be affected by very high-valued outliers; (2) cannot detect −ve regulation; (3) estimation of X is difficult.

Top X%: genes having expression levels in the top X% of the highest values are assigned 1, else 0. Advantages: easy to implement. Limitations: (1) cannot detect −ve regulation; (2) may be affected by high-valued outliers; (3) estimation of X is difficult.

Standard deviation and average: a parameter α is used to tune the standard deviation so that values with little change can be assigned zero. Advantages: no effect of outliers; generates finer cluster results. Limitations: (1) with smaller α, genes with the least change from the average may be mapped to 0; (2) estimation of α is difficult; (3) cannot detect −ve regulation.

Equal Frequency Principle (EFP): divides the sorted z-score normalised data into k intervals containing approximately the same number of expression values. Advantages: easy to implement. Limitations: (1) outlier sensitive; (2) often found unrealistic due to distribution variation; (3) cannot detect −ve regulation; (4) appropriate estimation of k is difficult.

Equal width principle: each interval is of equal width. Advantages: easy to implement. Limitations: (1) outlier sensitive; (2) cannot detect −ve regulation; (3) estimation of the number of intervals is required.

Row k-means: clusters the adjacent values from the same row into the same interval and, for discretisation, uses the expression values with regulatory function. Advantages: very small expression value change ranges fall into the same interval. Limitations: appropriate estimation of k is required.

Column k-means: clusters the adjacent values from the same column into the same interval. Advantages: bins are different for different conditions. Limitations: (1) appropriate estimation of k without prior domain knowledge is difficult; (2) cannot detect −ve regulation.

Bi-k-means: implements k-means both at row and column level, and then combines the two results. Advantages: can detect −ve regulation within a limited expression change; reflects the expression value changes within and between genes. Limitations: (1) estimation of k is difficult; (2) may not be cost effective for larger datasets.

Transitional state discrimination (TSD): uses two symbols. Advantages: effective for time-series data. Limitations: (1) no discrimination between values with little or no change and values with more change; (2) cannot detect −ve regulation.

Erdal et al.'s (2004) method: threshold based, by Erdal et al. (2004). Advantages: capable of detecting both +vely and −vely regulated genes. Limitations: estimation of α is difficult.

Kwon et al.'s (2003) method: advantages: capable of detecting both +vely and −vely regulated genes. Limitations: cannot detect outliers.

Constant slope boundary method: bins values using constant slope boundaries.

Ji and Tan's (2004) method: applies binning of the variation matrix with a threshold; focuses on the variation tendency of expression values. Advantages: can detect +ve and −ve regulation; can handle noise. Limitations: (1) a two-step process; (2) influenced by outliers; (3) threshold estimation is difficult.
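To make the comparison in Table 3 concrete, the sketch below implements two of the simpler schemes (the mid-range rule and a standard-deviation-and-average rule) in Python. The exact formulations in the cited papers may differ; the thresholds, function names and the synthetic matrix used here are assumptions for illustration.

```python
# Two illustrative discretisation schemes from Table 3.
import numpy as np

def discretise_mid_range(expr):
    # 1 if the value exceeds the per-gene mid-range, else 0
    # (as the table notes, -ve regulation cannot be represented this way)
    mid = (expr.max(axis=1, keepdims=True) + expr.min(axis=1, keepdims=True)) / 2.0
    return (expr > mid).astype(int)

def discretise_std_avg(expr, alpha=1.0):
    # one plausible reading of the 'standard deviation and average' row:
    # values within alpha * std of the gene's mean are treated as 'no change' (0),
    # values further above the mean as expressed (1)
    mu = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, keepdims=True)
    return (expr > mu + alpha * sd).astype(int)

rng = np.random.default_rng(5)
expr = rng.normal(size=(5, 8))                 # hypothetical 5 genes x 8 conditions
print(discretise_mid_range(expr))
print(discretise_std_avg(expr, alpha=0.5))
```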

3.2 Proximity measure

Over the decades, several proximity measures have been introduced to quantify the similarity or dissimilarity between a pair of gene expression profiles. The results of a clustering algorithm are largely influenced by the choice of proximity measure, which depends on the type of data to be handled, its dimensionality, the size of the attribute domains, etc. So, one has to be careful in selecting the proximity measure for an application. Clustering-based gene expression data analysis basically exploits a proximity measure to compare genes from an organism under different developmental time points, conditions or treatments. For a gene expression dataset with n conditions, each gene is represented by an n-dimensional observation vector known as its gene expression profile. A proximity (similarity or dissimilarity) measure is a real-valued function that assigns a proximity value to any two such expression profiles. A clustering algorithm aims to identify co-expressed genes or samples, i.e., genes having similar expression profiles based on an appropriate proximity measure. A comparison of some of the most widely used proximity measures (Cha, 2007; Bandyopadhyay and Bhattacharyya, 2011) for gene expression data is given in Table 4. The L1 or Manhattan distance is scale variant and cannot detect negative correlation. Euclidean distance gives the distance between two genes but does not capture the correlation between them. Pearson's Correlation Coefficient (PCC), on the other hand, retains the correlation information between two genes as well as the regulation information. However, since it uses the mean values while computing the correlation between genes, a single outlier can aberrantly affect the result. Spearman's rank correlation is not affected by outliers; however, there is information loss w.r.t. regulation since it works on ranked data. Cosine distance has the same limitations as found in the case of the L1 and L2 distances. Cross-correlation 1 and 2 are basically variants of PCC, so the demerits of PCC are also carried by these dissimilarity measures. BioSim is a newly introduced similarity measure based on the regulation angle information, which can handle both +ve and −ve regulation cases. Thus, it can also be observed from Table 4 that choosing an appropriate distance measure for gene expression data is a difficult task.

Table 4	Comparison of various proximity measures

Measure: Proximity (D/S); Pre-process; Scale invariance; Regulation; Outlier robustness
City block/Manhattan distance: D; yes; no; none; no
Euclidean distance: D; yes; no; none; no
Minkowski: D; yes; no; none; no
Chebyshev: D; yes; no; none; no
Pearson Correlation Coefficient: S; yes; on centred data; +ve only; no
Spearman Rank Correlation: S; yes; yes; +ve only; yes
Cosine: S; yes; no; +ve only; no
Cross-correlation 1: D; yes; on centred data; +ve only; no
Cross-correlation 2: D; yes; on centred data; +ve only; no
Root mean square: D; yes; no; +ve only; no
Kullback-Leibler: D; yes; no; +ve only; no
BioSim: S; yes; no; both +ve and −ve; yes
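The following sketch illustrates the outlier-sensitivity argument made above by comparing Euclidean distance, Pearson's correlation and Spearman's rank correlation on a pair of synthetic profiles, with and without a single injected outlier. The profiles and the outlying value are assumptions for illustration.

```python
# Compare three proximity measures on a clean and an outlier-contaminated profile pair.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from scipy.spatial.distance import euclidean

g1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
g2 = g1 + 0.1                                  # almost identical profile
g2_out = g2.copy()
g2_out[3] = 40.0                               # inject one outlying expression value

for name, other in [('clean', g2), ('with outlier', g2_out)]:
    print(name,
          'euclidean=%.2f' % euclidean(g1, other),
          'pearson=%.2f' % pearsonr(g1, other)[0],
          'spearman=%.2f' % spearmanr(g1, other)[0])
```

A single contaminated value inflates the Euclidean distance and pulls the Pearson correlation away from 1, while the rank-based Spearman correlation changes far less, which is the trade-off summarised in Table 4.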

3.3 Definitions and lemmas

Here, we introduce some definitions and lemmas based on the density notions available in Ester et al. (1996), which provide the theoretical basis of the proposed method.

Definition 1 (Neighbour): A gene g_i at a particular condition C_a is a neighbour of another gene g_j if their regulation with reference to the previous condition (i.e., C_{a−1}) matches and the dissimilarity between their expression values at C_a is within a user-defined threshold β, i.e., Dist(g_i^{C_a}, g_j^{C_a}) ≤ β, where Dist refers to a dissimilarity measure. In other words, g_j ∈ N(g_i^{C_a}).

Definition 2 (k-neighbour): The k-neighbour of a gene g_i over any k conditions is the set of genes {g_j} such that each g_j is repeatedly a neighbour of g_i over the same combination of k conditions. Mathematically, k-neighbour(g_i) = ∩_{a=1}^{k} N(g_i^{C_a}).

Definition 3 (Core gene): A gene g_i at condition C_a is a core gene if the number of its neighbours is greater than or equal to a user-defined threshold γ, i.e., |N(g_i^{C_a})| ≥ γ. Here, we have chosen γ = 2.

Definition 4 (Border gene): A gene g_i at condition C_a is a border gene if
1	g_i is a neighbour of a core gene g_j, and
2	|N(g_i^{C_a})| < γ.

Definition 5 (Connected): A pair of genes (g_i, g_j) is said to be connected if
1	either g_i or g_j is core or border to the other over at least any k conditions,
2	there exists a gene g_k which is core and g_i, g_j ∈ k-neighbour(g_k), or
3	both g_i and g_j are core genes and they are mutually neighbours of each other, i.e., either g_i ∈ k-neighbour(g_j) or g_j ∈ k-neighbour(g_i).

Definition 6 (Cluster): A cluster is a set S of connected genes, where |S| ≥ 2 and for any two genes (g_i, g_j) ∈ S the conditions for connectedness (Definition 5) are true.

Definition 7 (Noise): A gene g_i is an outlier gene if the neighbour condition (Definition 1) for g_i is false over any (n − k) conditions. In other words, a gene g_i is an outlier gene if there exists no other gene g_j with which the connectedness condition (Definition 5) is true.

Lemma 1: If genes g_i and g_j are connected, then they are coherent over at least k conditions.

Proof: Let g_i and g_j be non-coherent but connected. Now, as per Definition 5, (i) g_i and g_j are either core or border over at least k conditions, and (ii) g_i and g_j are similar over at least k conditions in terms of regulation and proximity. These two conditions are sufficient to establish that g_i and g_j are coherent over at least k conditions, which contradicts the assumption; hence the proof.

Lemma 2: If gene g_i is an outlier gene, there cannot be another gene g_j with which it is connected.

Proof: Let g_i be an outlier gene and let another gene g_j ∈ C_i, a cluster. Now assume g_i and g_j are connected. As per Definition 5, g_i should also belong to C_i, which is a contradiction; hence the proof. Alternatively, if g_i and g_j are connected, then as per the conditions given in Definition 6, they should form a cluster, which again contradicts the assumption that g_i is an outlier; hence the proof.

Lemma 3: Two genes g_i and g_j are not connected if they belong to two disjoint clusters.

Proof: Let g_i ∈ C_i and g_j ∈ C_j, where C_i and C_j are disjoint. Assume g_i and g_j are connected. As per Definition 5, if g_i and g_j are connected, then they are neighbours and belong to the same cluster. This is a contradiction; hence the proof.
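A minimal sketch of Definitions 1 and 3, assuming a z-score normalised expression matrix, an L1 threshold β and γ = 2 as in the text. The regulation match is simplified here to comparing the sign of the change from the previous condition rather than the arccos-based angle of Sarma et al. (2011); the function name and data are illustrative.

```python
# Condition-wise neighbour and core-gene detection (Definitions 1 and 3, simplified).
import numpy as np

def core_genes_at_condition(expr, a, beta=0.1, gamma=2):
    """expr: z-score normalised genes x conditions matrix; a >= 1 is the condition index."""
    reg = np.sign(expr[:, a] - expr[:, a - 1])   # +ve, -ve or neutral regulation vs C_{a-1}
    val = expr[:, a]
    m = expr.shape[0]
    neighbours = {i: [] for i in range(m)}
    for i in range(m):
        for j in range(m):
            if i != j and reg[i] == reg[j] and abs(val[i] - val[j]) <= beta:
                neighbours[i].append(j)          # Definition 1: regulation match + L1 proximity
    cores = [i for i in range(m) if len(neighbours[i]) >= gamma]   # Definition 3
    return cores, neighbours

rng = np.random.default_rng(6)
expr = rng.normal(size=(50, 6))
expr = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
cores, nbrs = core_genes_at_condition(expr, a=1, beta=0.2)
print('core genes at condition 1:', cores)
```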

4 Algorithm

TDAC operates on a pre-processed gene expression dataset for simultaneous identification of both outlier and co-expressed genes based on regulation information and attribute/condition-level proximity. The proposed TDAC is free from the restrictions of using (a) discretisation and (b) a specific proximity measure. As shown in Figure 2, based on the regulation information and the expression-level proximity for each condition computed over the pre-processed gene matrix, a fast attribute clustering technique, attrib_clus, identifies the core genes based on the concepts given in Definitions 1, 3 and 4. Here, attrib_clus finds core genes based on the regulation information using the arccos expression given in Sarma et al. (2011) and finds expression-level dissimilarity using the L1 norm. However, it is free from the restriction of using any specific proximity measure. Based on regulation, core genes and their connectivity information, and by using the concepts given in Definitions 2 and 6, TDAC can identify the co-expressed gene groups as well as the outlier genes with reference to a given user-defined threshold. The basic steps of TDAC for finding co-expressed gene groups are stated next; a rough end-to-end sketch follows the steps.

Pre-process Gmn with z-score normalisation to obtain Gm' n .

2

Apply attrib_clus() on Gm' n to obtain core_gene groups for each condition: (a) Find neighbour gene(s) for a given gene gi at condition, say Ca based on regulation information (with reference to its previous condition, i.e. Ca1 ) and L1 proximity with reference to  . (b) Identify gi as core at Ca if satisfies the condition given in Definition 3.

3

Identify the co-expressed gene groups across n conditions: (a) Find genes which are core over at least k conditions. (b) From the subset of neighbour genes obtained, find genes which have the same nearest neighbours across at least k conditions. (c) Find the set of k-neighbour for each gene based on the subsets obtained above. (d) Each gene in the list of genes obtained along with its respective nearest k-neighbours are assigned cluster ids and form clusters. (e) If there are common nearest neighbours between two such genes, i.e. some gene is assigned to more than one cluster, then the respective clusters are merged

4

Call TDOGI() to identify the outlier gene(s); (a) To identify the outlier genes, the proposed outlier identification sub-routine, referred here as TDOGI works on the results generated by attrib_clus(). It uses another sub-routine called neighbour_count() to compute the number of neighbour genes of a given gene gi . It also uses the user-defined threshold k, which is the minimum number of conditions for two genes to become coexpressed.
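The rough end-to-end sketch below follows the four steps above under simplifying assumptions: the regulation check uses the sign of the change from the previous condition, and step 3's merging of groups with common neighbours is approximated by a union-find over gene pairs that are core and neighbours over at least k conditions. It is an illustration of the flow, not the authors' implementation; all names and data are assumptions.

```python
# Simplified TDAC flow: normalise, build condition-wise neighbourhoods, then connect.
import numpy as np
from collections import defaultdict

def condition_neighbours(expr, a, beta):
    reg = np.sign(expr[:, a] - expr[:, a - 1])
    val = expr[:, a]
    m = expr.shape[0]
    nbrs = [set() for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            if reg[i] == reg[j] and abs(val[i] - val[j]) <= beta:
                nbrs[i].add(j); nbrs[j].add(i)
    return nbrs

def tdac_sketch(expr, beta=0.2, gamma=2, k=None):
    # step 1: z-score normalise each gene profile
    expr = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
    m, n = expr.shape
    k = k or int(0.6 * n)
    # step 2: condition-wise neighbourhoods and core counts
    per_cond = [condition_neighbours(expr, a, beta) for a in range(1, n)]
    core_count = np.zeros(m, dtype=int)
    shared = defaultdict(int)                    # (i, j) -> #conditions as neighbours
    for nbrs in per_cond:
        for i in range(m):
            if len(nbrs[i]) >= gamma:
                core_count[i] += 1
            for j in nbrs[i]:
                if i < j:
                    shared[(i, j)] += 1
    # step 3: connect genes that are core and neighbours over >= k conditions (union-find)
    parent = list(range(m))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]; x = parent[x]
        return x
    for (i, j), c in shared.items():
        if c >= k and core_count[i] >= k and core_count[j] >= k:
            parent[find(i)] = find(j)
    clusters = defaultdict(list)
    for g in range(m):
        if core_count[g] >= k:
            clusters[find(g)].append(g)
    return [sorted(v) for v in clusters.values() if len(v) >= 2]

rng = np.random.default_rng(7)
expr = rng.normal(size=(60, 10))
print(tdac_sketch(expr, beta=0.3))
```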


4.1 TDOGI: an effective outlier gene identification algorithm

Input: G'_{m×n}, α, β, k
Output: outlier genes
Steps:
1	Read G'_{m×n}.
2	For each condition do
(a)	Apply attrib_clus() on G'_{m×n} to find possible core gene groups with reference to β.
(b)	Identify gene g_i as core if it satisfies the condition given in Definition 3.
3	If the number of neighbours given by neighbour_count() for any gene g_i is less than α over more than (n − k) conditions, then output g_i as an outlier gene.
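A minimal sketch of the TDOGI idea under the same simplifying assumptions as above: count condition-wise neighbours for each gene (regulation match plus L1 proximity within β) and flag genes whose neighbour count stays below α on more than (n − k) conditions. The thresholds, function name and data are illustrative.

```python
# Simplified TDOGI-style outlier gene identification.
import numpy as np

def tdogi_sketch(expr, beta=0.2, alpha=2, k=None):
    m, n = expr.shape
    k = k or int(0.6 * n)
    low_count = np.zeros(m, dtype=int)            # conditions with too few neighbours
    for a in range(1, n):
        reg = np.sign(expr[:, a] - expr[:, a - 1])
        val = expr[:, a]
        for i in range(m):
            # neighbours of gene i at condition a (excluding itself)
            nb = np.sum((reg == reg[i]) & (np.abs(val - val[i]) <= beta)) - 1
            if nb < alpha:
                low_count[i] += 1
    return np.where(low_count > (n - k))[0]       # indices of outlier genes

rng = np.random.default_rng(8)
expr = rng.normal(size=(40, 8))
expr = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
print('outlier genes:', tdogi_sketch(expr, beta=0.15))
```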

4.2 Dependency on α, β and k

The proposed TDAC and its sub-routine TDOGI depend on three user-defined thresholds, i.e., α, β and k. In our experimentation over six publicly available datasets, we considered α = 2, i.e., the minimum number of co-expressed genes required to form a cluster. Similarly, for the minimum number of conditions (of any combination) threshold, i.e., for k ≥ 60% of the total number of conditions, we achieved better performance for all six datasets. Hence, the actual dependency is on the threshold β only. The appropriate value of β depends on the number of instances in a dataset and the distribution of the data. In our experimentation, we obtained the best results for the following β values for these datasets: (i) Dataset 1: 0.05–0.08, (ii) Dataset 2: 0.12–0.14, (iii) Dataset 3: 0.3–0.4, (iv) Dataset 4: 0.2–0.5, (v) Dataset 5: 0.05–0.07 and (vi) Dataset 6: 0.02–0.04.
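Since β depends on the size and distribution of the data, one possible data-driven heuristic, purely an assumption for illustration and not part of the published method, is to set β to a low percentile of the condition-wise pairwise L1 differences of the normalised expression values:

```python
# Hypothetical heuristic for choosing beta: a low percentile of the absolute
# pairwise differences of z-score normalised expression values, averaged over
# conditions.  This is an illustrative assumption, not part of TDAC.
import numpy as np

def suggest_beta(expr, percentile=5):
    expr = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
    diffs = []
    for a in range(expr.shape[1]):
        col = expr[:, a]
        d = np.abs(col[:, None] - col[None, :])
        diffs.append(np.percentile(d[np.triu_indices_from(d, k=1)], percentile))
    return float(np.mean(diffs))

rng = np.random.default_rng(9)
expr = rng.normal(size=(200, 10))
print('suggested beta:', round(suggest_beta(expr), 3))
```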

4.3 Complexity analysis

We have analysed the complexity of TDAC in three major steps. For condition-wise attribute clustering the complexity will be O(mn). For identification of groups of co-expressed patterns based on neighbour counts with respect to α, the complexity will be O(pn), where p is the average number of core genes identified for each condition. Finally, for outlier identification, it will be O(qn), where q is the average number of non-core genes identified for each condition, with q < m.

1	Co-expressed patterns can also be found over more than k conditions. Such patterns will be the descendants of the first-level nodes. As we traverse down the tree with an increasing number of conditions, starting from k up to n (the total number of conditions), finer co-expressed patterns (with very high homogeneity values) will result.

2	The present TDAC is capable of finding only +vely co-regulated co-expressed patterns for any gene expression or gene time-series dataset. However, with a little modification to Step 2 of the main algorithm, the present TDAC can be extended to handle +vely, −vely, as well as mixed-regulation co-expressed patterns. While considering the regulation factor, rather than depending on the regulation type (i.e., +ve, −ve or neutral), one can use the regulation angle interval in both the +ve and −ve directions in a similar fashion, and thereby easily extend the present TDAC to handle all three types of regulation.

6 Conclusions and future work

This paper presents a method for simultaneous identification of biologically relevant co-expressed patterns and outlier genes based on regulation information and attribute clustering. The proposed TDAC is advantageous in comparison to its competing algorithms because (i) it is free from the limitations due to discretisation and the use of a specific proximity measure, (ii) it generates co-expressed patterns of high biological relevance in terms of both internal and external validity measures, such as homogeneity and p-value, and (iii) it is also capable of identifying outlier genes. Work is in progress to extend the present TDAC towards handling +vely, −vely and mixed (i.e., both +vely and −vely) correlated gene patterns for the construction of a Co-expression Network (CEN). Work is also in progress on a gene topology study of the yeast co-expression network constructed using the extended TDAC.

References

Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D. and Levine, A.J. (1999) ‘Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide array’, Proceedings of the National Academy of Sciences, Vol. 96, No. 12, pp.6745–6750.
Bandyopadhyay, S. and Bhattacharyya, M. (2011) ‘A biologically inspired measure for coexpression analysis’, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 8, No. 4, pp.929–942.
Beal, M.J. and Ghahramani, Z. (2003) ‘The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures’, Proceedings of the 7th Valencia International Meeting on Bayesian Statistics, Vol. 63, No. 4, pp.453–464.
Bellaachia, A., Portnoy, D., Chen, A.G. and Elkahloun, Y. (2002) ‘E-CAST: a data mining algorithm for gene expression data’, Proceedings of BIOKDD02: Workshop on Data Mining in Bioinformatics (with SIGKDD02 Conference), p.49.
Ben-Dor, A., Shamir, R. and Yakhini, Z. (1999) ‘Clustering gene expression patterns’, Journal of Computational Biology, Vol. 6, Nos. 3/4, pp.281–297.
Berriz, F.G., King, O.D., Bryant, B., Sander, C. and Roth, F.P. (2003) ‘Characterizing gene sets with FuncAssociate’, Bioinformatics, Vol. 19, pp.2502–2504.


Bhattacharya, A. and De, R. (2008) ‘Divisive correlation clustering algorithm (DCCA) for grouping of genes: detecting varying patterns in expression profiles’, Bioinformatics, Vol. 24, No. 11, pp.1359–1366.
Cha, S.H. (2007) ‘Comprehensive survey on distance/similarity measures between probability density functions’, International Journal of Mathematical Models and Methods in Applied Science, Vol. 1, No. 4.
Cho, R.J., Campbell, M., Winzeler, E., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T., Gabrielian, A., Landsman, D. and Lockart, D. (1998) ‘A genome-wide transcriptional analysis of the mitotic cell cycle’, Molecular Cell, Vol. 2, No. 1, pp.65–73.
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P.O. and Herskowitz, I. (1998) ‘The transcriptional program of sporulation in budding yeast’, Science, Vol. 282, pp.699–705.
Chung, S., Jun, J. and McLeod, D. (2004) ‘Mining gene expression datasets using density based clustering’, Technical Report IMSC-04-002, USC/IMSC, University of Southern California.
Das, R., Bhattacharyya, D.K. and Kalita, J.K. (2009) ‘Clustering gene expression data using a regulation based density clustering’, International Journal of Recent Trends in Engineering, Vol. 2, pp.76–78.
Das, R., Bhattacharyya, D.K. and Kalita, J.K. (2011) ‘A new approach for clustering gene expression time series data’, International Journal of Bioinformatics Research and Applications, pp.310–328.
Dempster, A., Laird, N. and Rubin, D. (1977) ‘Maximum likelihood from incomplete data via the EM algorithm’, Journal of the Royal Statistical Society, Vol. 39, No. 1, pp.1–38.
Dopazo, J. and Carazo, J.M. (1997) ‘Phylogenetic reconstruction using an unsupervised neural network that adopts the topology of a phylogenetic tree’, Journal of Molecular Evolution, Vol. 44, pp.226–233.
Eisen, M., Spellman, P., Brown, P. and Botstein, D. (1998) ‘Cluster analysis and display of genome-wide expression patterns’, Proceedings of the National Academy of Sciences, Vol. 95, pp.14863–14868.
Erdal, S., Ozturk, O., Armbruster, D., Ferhatosmanoglu, H. and Ray, W.C. (2004) ‘A time series analysis of microarray data’, Proceedings of the 4th IEEE Symposium on Bioinformatics and Bioengineering, pp.366–374.
Ester, M., Kriegel, H.P., Sander, J. and Xu, X. (1996) ‘A density based algorithm for discovering clusters in large spatial databases’, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining.
Garcia, S., Luengo, J., Saez, J.A., Lopez, V. and Herrera, F. (2012) ‘A survey of discretization techniques: taxonomy and empirical analysis in supervised learning’, IEEE Transactions on Knowledge and Data Engineering.
Herrero, J., Valencia, A. and Dopazo, J. (2001) ‘A hierarchical unsupervised growing neural network for clustering gene expression patterns’, Bioinformatics, Vol. 17, pp.126–136.
Heyer, L.J., Kruglyak, S. and Yooseph, S. (1999) ‘Exploring expression data: identification and analysis of co-expressed genes’, Genome Research, Vol. 9, No. 11.
Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T., Lee, J., Trent, J.M., Staudt, L.M., Hudson, J.J., Boguski, M.S., Lashkari, D., Shalon, D., Botstein, D. and Brown, P.O. (1999) ‘The transcriptional program in the response of the human fibroblasts to serum’, Science, Vol. 283, pp.83–87.
Jarvis, R.A. and Patrick, E.A. (1973) ‘Clustering using a similarity measure based on shared nearest neighbors’, IEEE Transactions on Computers, Vol. 11.
Ji, L. and Tan, K. (2004) ‘Mining gene expression data for positive and negative co-regulated gene clusters’, Bioinformatics, Vol. 20, No. 16, pp.2711–2718.
Jiang, D., Pei, J. and Zhang, A. (2003) ‘DHC: a density-based hierarchical clustering method for time series gene expression data’, Proceedings of BIBE2003: 3rd IEEE International Symposium on Bioinformatics and Bioengineering, Bethesda, Maryland, USA.


Jiang, D., Tang, C. and Zhang, A. (2004) ‘Cluster analysis for gene expression data: a survey’. Available online at: www.cse.buffalo.edu/DBGROUP/bioinformatics/papers/survey.pdf
Kim, H., Golub, G.H. and Park, H. (2005) ‘Missing value estimation for DNA microarray gene expression data: local least squares imputation’, Bioinformatics, pp.187–198.
Kohonen, T. (1995) Self-Organizing Maps, Springer-Verlag, Heidelberg, Germany.
Kwon, A., Hoos, H. and Ng, R. (2003) ‘Inference of transcriptional regulation relationships from gene expression data’, Bioinformatics, Vol. 19, No. 8, pp.905–912.
Mahanta, P., Ahmed, H.A., Bhattacharyya, D.K. and Kalita, J.K. (2011) ‘Triclustering in gene expression data analysis: a selected survey’, Proceedings of IEEE NCETACS, pp.1–6.
McQueen, J.B. (1967) ‘Some methods for classification and analysis of multivariate observations’, in Le Cam, L.M. and Neyman, J. (Eds): Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, pp.281–297.
Perrizo, W. (2001) ‘Peano count tree technology’, Technical Report NDSUCSOR-TR-01-1, North Dakota State University, Fargo, North Dakota, USA.
Priyadarshini, G., Chakraborty, B., Das, R., Bhattacharyya, D.K. and Kalita, J.K. (2010) ‘Highly coherent pattern identification using graph-based clustering’, Proceedings of the 7th Annual Biotechnology and Bioinformatics Symposium (BIOT-2010), Lafayette, LA.
Rousseeuw, P. (1987) ‘Silhouettes: a graphical aid to the interpretation and validation of cluster analysis’, Journal of Computational and Applied Mathematics, Vol. 20, pp.153–165.
Sarma, S., Sarma, R. and Bhattacharyya, D.K. (2011) ‘An effective density based hierarchical clustering technique to identify coherent patterns from gene expression data’, Proceedings of PAKDD 2011, Advances in Knowledge Discovery and Data Mining, Vol. 6634, pp.225–236.
Sharan, R. and Shamir, R. (2000) ‘CLICK: a clustering algorithm with applications to gene expression analysis’, Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, pp.307–316.
Sharan, R., Maron-Katz, A. and Shamir, R. (2003) ‘CLICK and EXPANDER: a system for clustering and visualizing gene expression data’, Bioinformatics, Vol. 19, No. 14, pp.1787–1799.
Shu, G., Zeng, B., Chen, Y.P. and Smith, O.H. (2003) ‘Performance assessment of kernel density clustering for gene expression profile data’, Comparative and Functional Genomics, Vol. 4, pp.287–299.
Syamala, R., Abidin, T. and Perrizo, W. (2006) ‘Clustering microarray data based on density and shared nearest neighbor measure’, Computers and their Applications, pp.360–365.
Tomida, S., Hanai, T., Honda, H. and Kobayashi, T. (2002) ‘Analysis of expression profile using fuzzy adaptive resonance theory’, Bioinformatics, Vol. 18, No. 8, pp.1073–1083.
Travis, J.H. and Huang, Y. (2009) ‘Clustering of gene expression data based on shape similarity’, EURASIP Journal on Bioinformatics and Systems Biology.
Wang, K., Wang, B. and Peng, L. (2009) ‘CVAP: validation for cluster analyses’, Data Science Journal, Vol. 8, pp.88–93.
Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith, S., Barker, J.L. and Somogyi, R. (1998) ‘Large-scale temporal gene expression mapping of central nervous system development’, PNAS, Vol. 95, No. 1, pp.334–339.
