IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 18, NO. 3, MAY 2014

799

Temporal and Spatial Patterns of Gene Profiles during Chondrogenic Differentiation G. Skreti, E. S. Bei, K. Kalantzaki, and M. Zervakis, Member, IEEE

Abstract—Clustering analysis based on temporal profile of genes may provide new insights in particular biological processes or conditions. We report such an integrative clustering analysis which is based on the expression patterns but is also influenced by temporal changes. The proposed platform is illustrated with a temporal gene expression dataset comprised of pellet culture-conditioned human primary chondrocytes and human bone marrow-derived mesenchymal stem cells (MSCs). We derived three clusters in each cell type and compared the content of these classes in terms of temporal changes. We further considered the induced biological processes and the gene-interaction networks formed within each cluster and discuss their biological significance. Our proposed methodology provides a consistent tool that facilitates both the statistical and biological validation of temporal profiles through spatial gene network profiles. Index Terms—Chondrogenic differentiation, gene networks, mixed clustering, spatial profile, temporal profile.

I. INTRODUCTION HE interest on temporal dynamics of gene expression increases dramatically with the development of methods to deal with large-scale genomic data. The genetic progression of an organism or a disease enables the study of complex biological problems and facilitates the evaluation of therapeutic protocols based on the consideration of the dynamics of molecular mechanisms and response to drugs. Therefore, the consideration of the temporal profile of genes in a microarray experiment becomes of particular importance and several research attempts have been developed aiming to capture the temporal dynamics of gene expression. In particular, the development of gene-clustering algorithms that also detect temporal profiles is becoming increasingly important. Regression fitting methods have been proposed for describing the temporal profiles [1], [2] and statistical bootstrap methods have been developed for assigning genes to candidate profiles [3]. A ranked-based consideration of the temporal profile has been proposed in [4], [5], where the time instances are ranked on the basis of the corresponding expression values.

T

Manuscript received April 30, 2013; revised November 28, 2013, February 5, 2014 and February 7, 2014; accepted February 9, 2014. Date of publication February 11, 2014; date of current version May 1, 2014. The initial results of the paper were presented at the IEEE 2012 Annual International Conference of Engineering in Medicine and Biology Society. This work is partly supported by the “OASYS” and “OncoSeed” projects funded by the NSRF 2007–13 of the Greek Ministry of Development. The authors are with the Department of Electronic and Computer Engineering, Technical University of Crete, Chania 73100, Greece ([email protected], [email protected], [email protected], [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JBHI.2014.2305770

Furthermore, biclustering methods have been developed to discover local expression patterns that are consistent in a subgroup of conditions or time instances [6]–[8]. Most of the developed methods consider clustering criteria based on some form of coded-shape behavior as opposed to the traditional algorithms that are heavily based on gene-expression patterns. In ranked-based methods, the candidate profiles capture only the shape by characterizing the type of temporal progression (increase or decrease) and the points of maximum or minimum expression. In biclustering algorithms, the temporal behavior of genes is encoded by symbols and the coherent expression patterns of subgroups of genes in specific time instances (biclusters) are characterized as strings of symbols. In many biological processes (BPs), however, not only the temporal behavior but also the expression profile itself is important for grouping genes as significant for a particular process or condition. Both the temporal shape and the expression profile must be considered for the clustering of gene profiles [8]. In our study, we attempt to develop tools appropriate for such purposes of clustering based on a dual criterion. More specifically, we develop a criterion based on expression profile, which is also influenced by the coded-shape trend. Following the clustering process, we address two issues related to the comparison of partitions that encode temporal trends. The first relates to the derivation of correspondences in the two partitions, while the second addresses the validity criterion for comparing the compactness of clusters in these partitions. The methodology expands on a brief version presented in [8]. In order to assess the results of clustering, we focus on an enrichment analysis to enable the study of major BPs involved in the derived clusters. Through this analysis, we can consider biological similarities and differences of matched clusters of MSC and chondrocyte cells. We further consider the gene networks formed within each cluster and discuss their biological significance. Gene networks enable the spatial representation of genes with specific expression and temporal profiles. Using such gene networks, we can follow the successive steps of gene/protein interactions that occur in chondrocytes and MSCs during chondrogenic differentiation. The algorithmic tools of our development are presented in Section II, whereas the experimental setup and results are presented and discussed in Section III. Important conclusions are summarized in Section IV.

II. METHODS AND PROCEDURES A. Expression Profile and Coded-Shape Temporal Behavior For each gene, the expression pattern over time is encoded in a N dimensional vector x similar to [8], where N represents the

2168-2194 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information.

800

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 18, NO. 3, MAY 2014

extent of the time course of interest. The graphical representation of expression values xi , i = 1, . . . ,N defines the temporal profile of each gene. Furthermore, the difference of expressions between two consecutive time points defines the shape of temporal changes of each gene. We utilize a coded form of this shape, which encodes in a ternary form the changes from one to the next time point. Let δi = xi+1 − xi ,i = 1, . . . ,N − 1. Then, by binning the difference values in the set {−1, 0, 1} via a threshold Δ, we create the coded shape vector v of dimensionality N −1. More specifically, we define the coding operator c{.} that derives the coded-shape values: ⎧ ⎫ ⎨ −1 if δi ≤ −Δ ⎬ if − Δ ≤ δi ≤ +Δ . vi = c {δi } = 0 (1) ⎩ ⎭ +1 if δi ≥ +Δ The coding procedure is similar to that in [9], [10], but is based on thresholding around zero-levels; it also resembles the discretization process in [6], [7], with the difference that it employs ternary values instead of symbols. Thus, in our case, a coded-shape string is composed of digits that can be used in an arithmetic computations rather than symbols. Suppose we aim at C clusters St , t = 1, . . . ,C, with the mean vector of cluster St denoted as μt . The coded-shape of this vector is also derived by the coding operator c{.} and is denoted as mt = c μt . Let a member of St with expression pattern xj ∈ St , which happens to have a coded-shape pattern vj . We define the expression pattern of a member of St as wj ∈ St and its coded-shape pattern as vj . The size N −1 of the shape patterns and the ternary discretization of values result in a limited number of code vectors equal to M = 3N −1 . Thus, each cluster may contain a number of coded-shape patterns, numbered from 1, . . . ,M [8]. The clustering process derives C clusters of expression vectors based on both expression and coded-shape values. In our consideration, the compactness of a cluster depends on the consistency of these coded patterns [8]. B. Clustering Methodology The proposed methodology for clustering expression profiles with the influence of their shape profiles is based on a modification of the self-organizing map (SOM) methodology composed of three parts [8]. The first step derives a number of clusters based on the SOM organization of the expression vectors into a two-dimensional network of Q × Q nodes. Then, at a second step, the nodal weight-patterns of SOM are organized into C clusters, based on a K-means iterative approach with a criterion that is influenced by shape. Finally, each sample is assigned to one of C classes based on the distance of its expression vector from the mean of each cluster. Our contribution to this methodology is the criterion used in the second step of organizing the SOM nodes. More specifically, the first step derives a number (Q × Q) of clusters based on the expression profiles. We proceed in a second grouping that organizes nodes more closely, deriving a small number of clusters. At this point, we propose to use information about the temporal trend in order to favor regrouping of clusters with similar performance [8]. This is graphically depicted in Fig. 1, where a new pattern is to be classified in

Fig. 1. Strategy of influence through the shape profile; clusters that behave different from the test pattern in the “shape” space are pushed further apart from this pattern.

one of the existing two clusters. In this example, the expression vector is two dimensional (two time points), giving only one temporal difference. Thus, the shape factor is positive above the diagonal and negative below. According to the proposed criterion, we attempt to positively influence the class on the same side (C1) as the test pattern, or push the pattern away from a class (such as C2) residing on the opposite side. The overall clustering approach is summarized in the following: 1) First step: Self-Organizing Feature Map a) Assign random weight values for each node in the network, wij , i,j = 1, . . . ,Q. b) Upon presenting an input pattern x, calculate the distance between x and each neuron ij represented by the weight vector, wij , identifying the winning neuron as arg minj {||x − wij ||}, where . is the Euclidean norm. c) Adjust the weights of neighborhood of the winner neuron by: wij (t + 1) = wij (t) + η (t) K(i,j,t) {x − wij (t)}, where η(t) is the learning rate at epoch t, and K(i,j,t) is a suitable neighborhood function, in this case of a Gaussian nature. d) Repeat steps 2–3 until convergence, when the absolute squared weight changes is smaller than 0.02 over 2500 epochs. (In our application, we use 10×10 = 100 nodes and an extra stopping criterion restricts the number of repetitions to 500 the number of nodes, equal to 50000). 2) Second step: K-means Clustering of SOM Nodes The K-means algorithm aims to further organize the groups formed by SOM into C clusters based on the minimization of a distance criterion between the samples (SOM nodes) and the cluster means, so as to minimize the overall variance of each cluster. The K-means scheme operates on an iterative form based on the expectation-maximization (EM) solution. The proposed criterion in this step is defined as the overall l2 norm of expression-vector differences:

Q=

C

t=1 x j ∈S t

aj wj − μt 22

(2)

SKRETI et al.: TEMPORAL AND SPATIAL PATTERNS OF GENE PROFILES DURING CHONDROGENIC DIFFERENTIATION

where wj is a sample (expression vector of a SOM node) in the class t with class mean μt and αj is a weighting factor depending on the coded shape of wj compared to that of the class mean μt . 1 More specifically, aj = + 1 is the logistic func1+e −6 ( r j −1 ) tion with range [1, 2] evaluated for rj . This factor reflects the coded difference rj = vj − mt 1 as the l1 norm of the coded-shape differences corresponding to the sample and the class mean, where vj = c{wj } and mt = c{μt }. Notice that rj ranges in [0, 2], with 0 and 2 being the cases of no difference and max difference in all digits of the code vector. In case of no difference, the shape factor αj becomes 1, whereas in max difference in all digits this factor becomes 2, thus increasing the expression-based distance. Considering this criterion, the EM optimization proceeds in discrete steps towards the evaluation of new cluster means and the reassignment of samples in classes. At each step, for certain samples assigned in a cluster, the sample mean is recomputed as 1

aj wj . (3) μt = |St |

within and across clusters in each partition. For the purpose, we may use the shape codes of genes as ternary strings of size N −1 numbered from 1 to M = 3N −1 . Finally, the clusters are biologically enhanced and validated by means of the gene networks resulting from the clustered genes. We propose a method for directed network construction based on kernels [8]. 1) Cluster Matching Based on Shape Profile: Each cluster Ci has been assigned a number Li of samples, each with a corresponding coded-shape vector. The distribution of these shape vectors results in a code histogram for numbered codes 1, . . . ,M with probabilities p [1], . . . ,p[M ]. The proposed similarity metric is based on the difference of code histograms. In the matching process, each cluster of one partition is mapped to the best matching cluster of the other partition. Let two clusters, one with code probabilities p [1], . . . ,p[M ] from one partition and the other with q [1], . . . ,q[M ] from the second partition. The similarity of the two clusters is the summed absolute distance of the two probabilities [8]:

Dc =

x t ∈S t

Subsequently, samples are evaluated and reassigned based on the minimum distance from the class means, i.e.,   mint a j,t wj − μ t 22 , with the recomputed shape factors a j,t for the sample j and the tested mean vector of the class t [8]. The basic steps are as follows: a) Set the number of clusters C. b) Perform random assignment of samples in C classes with initial shape factors aj,t = 1. c) Compute the class means μ t based on the shapeweighted average. d) Compute the new shape factors a j,t and reassign samples in clusters based on the minimum weighted distance for the class means. e) Repeat steps 3–4 until convergence (max number of iterations). 3) Third step: Assignment of initial expression values into clusters Notice that the second step organizes the nodes of the SOM network; the initial expression vectors still need to be assigned in one of the C clusters. This assignment is performed based on the minimum l2 norm of the difference between each expression vector xi and the class means μt computed from the second step [8]. C. Criteria for Partition Evaluation In computational biology, there is an often need to compare two (or more) partitions. In our particular formulation utilizing the temporal profiles of genes, we need to find correspondences between the partitions, but also need to compare the quality of each partition in terms of compactness and discrimination of its clusters. For the first aspect, we need to compare one cluster of the first partition with all clusters of the other partition. The second task needs to consider the distribution of samples (genes)

801

M

|p [i] − q[i]| .

(4)

i=1

The matching of clusters from the two partitions is based on the smaller distance criterion. 2) Validity Index Based on Shape Profile: The proposed index is based on the compactness of each code histogram as well as the distance between pairs of histograms. Due to the probability distribution and the weighted implications of the histogram, the larger histogram bins play much more important role in comparison than smaller bins. Consequently, we now rank the probabilities in descending order to obtain the ranked probabilities p1 , . . . ,pM . Note that ties in ranking are resolved in favor of the minimum Hamming distance from the previous string. Thus, we assign the following terminology [8]: Numbered probabilities p [1] , p [2] , . . . ,p[M ] Ranked probabilities p1 , p2 , . . . ,pM p1 = max {p [i]} and pk = maxk {p [i]} i

i

is the kth higher ranked probability. Suppose we have two clusters C1 and C2 of size L1 and L2 , respectively. The first has ranked probabilities pi corresponding to strings si , where the second has qi corresponding to strings ti , i = 1, . . . ,M, respectively. In order to build an inner-cluster compactness index, we will use the Hamming distance between two strings in descending ranked order, weighted by the probability of the second string. This represents the distance between two strings obtained from the most to least significant ones. The within-cluster distance signified by the most significant string is zero. Considering the next-ranked string, the code word distance introduced is the difference of codes in as many cases as signified by the probability of this string. In general, for the ith-ranked string, the distance introduced can be computed from the code word distance from the previous string weighted by the probability of the ith string. Thus, for within cluster distance,

802

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 18, NO. 3, MAY 2014

we have Q (C1 ) =

N −1

d (si ,si+1 ) pi+1

(5)

d (ti ,ti+1 ) qi+1

(6)

i=1

Q (C2 ) =

N −1

i=1

where d(.,.) signifies the Hamming distance between two strings. In order to introduce across-cluster distances, we first consider the differences of clusters for each string. Thus, for each numbered code, we calculate Numbered probabilities r[i] = |p [i] − q[i]| Ranked probabilities r1 ,r2 , . . . ,rN , where r1 = max {r[i]} and rk = maxk {r[i]} i

i

is the kth higher ranked probability. For possible comparisons of bounds, we note that r1 ≤ max{p1 ,q1 } and r1 ≥ max |p1 − q1 |. We have now defined a histogram of the difference or probability intersection of the two clusters C1 and C2 , whose compactness index can be defined as before: Q (C1 , C2 ) =

N −1

d (zi , zi+1 ) ri+1 { ≥ 1 − r1 }.

(7)

i=1

These indexes reflect the distance of digitized code words signified by their probabilities. Their values are within a range [1, K] scaled by (1 − r1 ). The minimum value of the index is zero and is obtained when the histogram has only one point with probability one. The maximum value is attained when the histogram involves two code words with probability 0.5 each and distance equal to the maximum value of K. Thus, the ratio of within cluster to across-cluster metrics signifies a relevant validity index for each partition P of CP clusters referred to as ranked shape index (RSI), which accounts for shape similarities, as [8]:

 Cp Q (Ci )+Q (Cj ) 1

max , i, j = 1, . . . , Cp . Cp i=1 j = i Q (Ci ,Cj ) (8) 3) Gene-Network Construction: Through the gene-interaction networks, we can examine common biological outcomes implied by group of genes in each cluster and highlight those genes and processes of particular interest. Network construction approaches [11], [12] model the genetic associations between molecules highlighting hub-genes of biological importance. In the same context, biological networks [13]–[16] model directly the associations between genes so as to reveal the pairwise relations encrypted in the experimental data. In this section, we introduce an approach for modeling the direct genetic interactions based on kernel density estimation (KDE), as an attempt to model the nonlinear effect of gene interactions. KDE [17]–[20] is a nonparametric approach that estimates the probability density function of a random variable. Assume that a generic network is developed based on a limited genomic independent identically distributed dataset X= (x1 ..xn ), where xi denotes the sample i of gene X. KDE allows the estimation

RSI {P } =

of probability density f h (X) of X as follows:   n x − xi 1

 fh (X) = K nh i=1 h −1

(9)

2

1 e 2 u is a vector of symmetric positive defiwhere K(u) = 2π nite Gaussian function, n is dataset’s size of the gene X and h > 0 is a smoothing parameter, the bandwidth that controls the extent of the kernel [19]. Genes interacting with each other can be linked and organized in a network form. The gene expression over a population provides valuable information on a gene’s activity, which can be correlated with other genes as to provide a metric of organization in the network structure. Under the assumption that genes and gene-products share similarities in datasets, the problem of network construction is reduced to the examination of independence between nodes Xi and Xk , through the joint and marginal probabilities [11]:

fh (Xi , Xk ) =fh (Xi ) · fh (Xk )

(10)

where fh indicates the probability density estimate of each gene according to (9) and the right-hand side is computed by pointby-point multiplication. The comparison of the two parts of (10) can be performed through the cross correlation test, where high correlation indicates independence of the two nodes, thus low connectivity. In contrast, small correlation indicates differences between the two parts of (10) and dependence of the two nodes, thus demonstrating high interaction between Xi and Xk . The latter justifies the connection between candidate nodes since they share common activity characteristics. The determination of edge directionality in the aforementioned undirected networks is often based on the Bayesian information criterion (BIC) [21], [22]. For each node, the number of connected edges is counted and all directly connected nodes are forming a subnetwork. For each subnetwork, the BIC score is computed on both possible orientations of the edge under examination. Finally, the edge is oriented in favor of the direction with the lowest BIC value. III. RESULTS A. Experimental Setup We applied our platform on a dataset comprised of temporal gene expression patterns of pellet culture-conditioned human primary chondrocytes and human bone marrow-derived MSCs introduced by Bernstein et al. in [23]. The microarray dataset consists of human primary chondrocytes that were obtained from osteoarthritic knee joints during alloarthroplastic procedures from five female patients, and MSCs that were isolated from iliac crest bone marrow from five healthy female donors. After cultivation of both cell types in pellet culture under chondrogenic conditions, gene expression analysis data were obtained from one pooled sample (from all five patients) of each group at four different time points (days 0, 3, 7, 14). Gene expression values were compared between chondrocytes and MSCs at various time points (days 0, 3, 7, 14).

SKRETI et al.: TEMPORAL AND SPATIAL PATTERNS OF GENE PROFILES DURING CHONDROGENIC DIFFERENTIATION

803

Fig. 2. Expression profiles in three classes for the cases of MSC (right part) and chondrocyte (CHD) (left part) tissue; proposed mixed-criterion clustering (upper part), traditional expression-based clustering (lower part).

B. Implication of Clustering Schemes and Validity Indices The appropriate number of clusters has been considered both manually and using the concept of validity indices. In manual consideration, we examined the shape of temporal profiles and also the properties of clusters formed in terms of expression values and temporal changes of their samples (temporal profiles). Using the proposed mixed clustering approach, the problem under consideration induces three types of gene trends (clusters), which are illustrated by their temporal profiles in Fig. 2 (upper part), for the cases of MSCs (right part) and chondrocytes (left part). The plots in this figure illustrate the prominent shapes of each cluster expressed by their code probabilities, which are indicative of the overall trends in these clusters. More specifically, we plot the expression pattern over time (four days) for the two most often appearing trends in each cluster (according to Fig. 3). In Fig. 2 (lower part), we similarly present the clusters obtained from the expression profiles alone for MSCs (right part) and chondrocytes (left part). The traditional valid-

ity indices, like Davis–Bouldin (DB) or Dunn’s index, fail to provide a consistent argument regarding the preferred number of clusters. In particular, for the mixed clustering scheme the DB index, which somewhat resembles the philosophy of RSI at the level of expressions alone, achieves the lowest value for two clusters for both MSCs and chondrocyte tissue. For the rankedbased clustering, DB fluctuates and is minimized for quite large numbers of clusters. Finally, for the clustering based on only expression values, the DB index is minimized for three and four classes in the case of MSCs and chondrocytes, respectively [8]. In contrast, the proposed RSI index shows robust performance and indicates an optimal number of three clusters in most approaches considered. More specifically, for clustering based on expression values and the proposed mixed clustering influenced by shape, the RSI index is minimized at three classes for both the MSCs and chondrocytes datasets. The situation is slightly different for the ranked-based clustering, with the RSI obtaining smaller values at two classes. In this case, the RSI index is slightly fluctuating for increasing number of classes and

804

Fig. 3.

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 18, NO. 3, MAY 2014

Class histograms of coded strings, matched for the cases of chondrocytes (CHD) (upper part) and MSC (lower part). TABLE I RSI INDEX FOR VARIOUS APPROACHES AND TISSUE TYPES

is actually minimized at eight classes, where each prominent coded-shape pattern appears separately in a single class. The RSI index is illustrated in Table I, along with the corresponding DB index (in parentheses) [8]. It is worth mentioning from Table I [8] that the RSI index obtains smaller values for the combined clustering scheme, whereas the DB index for the traditional clustering approach based on expression values. By means of the prominent shapes

appearing in the clusters of the proposed methodology in Fig. 2 (upper part), we can visually conclude that the proposed method results in more tight clusters than the traditional expressionbased clustering and this is better reflected on the RSI than the DB index (with the RSI obtaining smaller values). In essence, the traditional approach based on expression (Fig. 2-lower part) operates on the basis of a symmetric distance function (l2 norm) so that it is forced to include symmetric patterns as in the case of cluster 2. On the other hand, the proposed approach engages the temporal trend, so that it only favors similar-shape performance over time. This issue of symmetry is also encoded into traditional validity indices, such as the DB, through the use of symmetric distances from the expression pattern of the class centroid. The proposed approach manages to achieve the smallest RSI index for both the MSCs and chondrocytes. The histograms of the resulting clusters are illustrated in Fig. 3. Overall, the clusters formed demonstrate compactness in both the patterns of expression profiles and the histograms of the shape-coded strings (see Figs. 2 and 3, respectively). Elaborating on Fig. 3, each cluster involves strings of similar shape, whereas each string mostly appears in one cluster. With the aid of the shape codes, the clustering algorithm manages to avoid splitting one shape string into many clusters, while it also preserves the original expression-based relations of samples. The matched clusters in MSCs and chondrocytes not only have similar content in terms of shape codes but also contain comparable numbers of samples from each string. C. Biological Comparison of MSC and Chondrocyte Clusters It is interesting at this point to compare the clusters formed for MSC and chondrocyte tissue using the proposed clustering

SKRETI et al.: TEMPORAL AND SPATIAL PATTERNS OF GENE PROFILES DURING CHONDROGENIC DIFFERENTIATION

Fig. 4.

805

Visualization of clusters in 2-D; MSC (right part) and chondrocytes (left part).

approach. A visual map of the three clusters based on the first two principal directions of PCA, is depicted in Fig. 4. It appears that the two human sample cultures induce similar concentrations of genes in their corresponding three clusters. The interference (mixing) of classes exists for both MSCs and chondrocytes, even though it appears at different proportionality. Thus, our study shows good similarity of classes in MSC and chondrocyte data, which verifies the results of the original study in Bernstein et al. [23]. Nevertheless, the cross-class samples need further consideration, since class interference is quite heavy, beyond the tolerance of errors due to measurement noise. In a further attempt to compare the two cell types on a deeper biological basis, we examine the temporal profiles of common genes of MSCs and chondrocytes. In each pair of matched clusters, we focus on common genes, whose temporal profile belongs to the two most prominent code shapes in the respective class, from Fig. 3. According to this process, the number of selected genes becomes variable, i.e., 598 genes in cluster 1, 1402 in cluster 2, and 260 genes in cluster 3 (see Appendix A). WebGestalt was used for enrichment analysis of these gene sets and evaluation of the most significant gene-ontology (GO) terms for the category of BP, molecular function (MF), and cellular component (CC) (see Appendix C) as well as of the exploration of enriched terms in regards to KEGG pathways, WikiPathways, TransFac, miR, and cytogenetic analyses (Appendixes D, E, F, G, and H, respectively) [24]. WebGestalt performs the hypergeometric test for the enrichment of aforementioned terms in the selected genes, followed by the Benjamini & Hochberg method for multiple test adjustment (adjP ). The results from the GO analysis revealed enriched generic as well as specific terms for each of the clusters (see Appendix C). For instance, cluster 1 involves genes that participate in the generic cellular macromolecule metabolic process/cellular nitrogen compound metabolic process and in the specific process of RNA splicing; in the generic cellular biosynthetic process/heterocycle biosynthetic process, and in the specific process of regulation of transcription, DNA-dependent. Similarly,

cluster 2 consists of genes with roles in the generic singleorganism process, cell–cell signaling, cell–cell adhesion, detection of stimulus and in the specific processes of G-protein coupled receptor signaling pathway, regulation of inhibitory postsynaptic membrane potential, ion transport, and synaptic transmission. Cluster 3 involves genes that take part in generic processes such as CC organization or biogenesis at cellular level, cell cycle and in the specific processes of nuclear division, CC disassembly involved in execution phase of apoptosis, and M phase of mitotic cell cycle. In the category of MFs, the enriched terms include RNA binding, ubiquitin binding, chromatin binding in cluster 1, and signaling receptor activity such as GABA-A receptor activity and G-protein coupled receptor activity, as well as sequence-specific DNA binding transcription factor activity, and voltage-gated ion channel activity in cluster 2. Finally, in the category of CCs, the enriched terms involve intracellular membrane-bounded organelle, nucleus, nuclear lumen, nucleolus, intracellular nonmembrane-bounded organelle, ribonucleoprotein complex, and histone deacetylase complex in cluster 1; intrinsic to membrane, cell periphery, extracellular region, synapse, and cation channel complex in cluster 2; intracellular organelle part, cytoplasm, kinetochore, organelle envelope, chromosome, mitochondrion, and microtubule associated complex in cluster 3. The KEGG analysis determined significant enriched terms in cluster 1 (protein processing in endoplasmic reticulum, endocytosis, Wnt signaling pathway, MAPK signaling pathway, and apoptosis); in cluster 2 (metabolic pathways, cell adhesion molecules, glycolysis/gluconeogenesis, fatty acid metabolism, PPAR signaling pathway, and regulation of actin cytoskeleton); and in cluster 3 [citrate cycle (TCA cycle), DNA replication, pyruvate metabolism, and cell cycle] (see Appendix D). The Wikipathways enrichment analysis revealed also significant terms such as apoptosis, TGF beta signaling pathway, senescence and autophagy, and mRNA processing for cluster 1; GPCRs, Class A Rhodopsin-like, G protein signaling pathways, and neural crest differentiation for cluster 2; and Wnt

806

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 18, NO. 3, MAY 2014

Signaling pathway and pluripotency, mitochondrial LC-fatty acid beta-oxidation, cholesterol biosynthesis, regulation of actin cytoskeleton, and DNA damage response for cluster 3 (see Appendix E). Taken together, using the WebGestalt for the GO, KEGG, and Wikipathways analyses, we found an enrichment in many terms that are consistent with the findings reported in [23] (see Appendixes C, D, and E). The detailed results of all three analyses and also of TransFac, miR, and Cytogenic band analyses can be found in the Appendixes C–H. The latter analyses may add important insights in our study, but they are not utilized here. Following our findings for the first cluster, both MSCs and chondrocytes involve enriched major BPs (see Appendix C), such as metabolism (adjP = 6.31 × 10−11 ) and gene expression (adjP = 1.36 × 10−09 ). These BPs are evolving with time in the dynamic multistage process of chondrogenesis [25], thus explaining the changing performance of this class illustrated in Fig. 2 (upper-left column), which is also supported by recent gene expression and proteomic studies using human bone marrow derived MSCs [26], [27] and dedifferentiated human articular chondrocytes [28]. Regarding the enrichment of the second cluster, both MSCs and chondrocytes are involved in signal transduction and cell–cell signaling. The major enriched BP observed in this cluster is the G-protein coupled receptor signaling (adjP = 7.02 × 10−10 ) (see Appendix C), which is strongly associated to endochondral bone growth and remodeling and has been reported to activate in short time intervals (minutes to hours) [29], [30]. This can partially explain the lack of significant change over long time in the temporal profiles observed in Fig. 2 (upper-middle column). Finally, the third cluster involves CC organization (adjP = 0.0132), and cell cycle (adjP = 0.0139) (see Appendix C), which have been associated to chondrogenic differentiation [25], [31]. These two elements are characterized from both active and quiescent periods and are most responsible for the converging trend of the two cell types observed in Fig. 2 (upper-right column). The evaluation of human MSCs cell-cycle and differentiation in micromass culture has shown that cell cycle phases do not significantly change after day 7, thus supporting our temporal profile for this cell type [32]. Overall, the matched gene-clusters of the two cell types (MSCs and chondrocytes) express similar temporal trends, which can be explained by the enrichment analysis of the genes responsible for this performance. Nevertheless, the interactions of these genes in particular cellular processes during chondrogenic differentiation might be different in each cell type, as demonstrated by the subsequent gene-network analysis. D. Analysis of Biological Networks Within Clusters In order to validate the biological properties of the extracted MSC and chondrocyte clusters, we proceed with network construction to highlight genes of biological importance. In addition, we examine if the extracted network structure is consistent with the expression profiles of Fig. 2, through a higher level of process interpretation. Toward this direction, we focus on the same a subset of common genes was isolated for each group of

TABLE II NUMBER OF CONNECTIONS PER CLUSTER FOR STRONG INTERACTION

the MSC and chondrocyte genes for both cell types, as in the previous enrichment analysis of the mixed clustering methodology; 598 genes for the first cluster, 1402 for the second, and 260 genes for the third cluster. The KDE algorithm was separately applied on the gene expressions for each of the six clusters so as to identify data-induced similarities and differences between the MSC and chondrocyte genes in same and across separate clusters, respectively. The KDE approach as presented in Section II-C3 is based on correlations of pairs of genes within each cluster. In our experimental problem, it derives many different genetic structures at each correlation level. For each produced network per cluster, we isolate only the strongest connections in order to focus on significant gene interactions. This is achieved for small levels of cross-correlation between the two members of (10), since smaller values of correlation indicate stronger dependence between the genes. Therefore, for each of the six clusters, we consider only a range of correlation levels th, and Table II shows the number of connections per cluster for strong interaction (th ≤ 0.1), and somewhat weaker limits (0.3 ≤ th ≤ 0.7). For correlation levels smaller than th = 0.1,we make the following observations pertaining to strong interactions. 1) The subnetwork structures of MSCs and chondrocytes are similar. 2) Only a subset of the studied genes per cluster has strong interconnections with other genes, with the prominent relations observed in cluster 1 and 2 of MSCs, and in cluster 3 of both chondrocytes and MSCs. 3) For the case of chondrocytes in clusters 1 and 2, only few genes are strongly interconnected (in fact, the genes in the first chondrocyte cluster do not interact even at higher levels of correlation). 4) For both cell types in cluster 2, the generated interactions are formed from a limited number of interacting genes. Overall, these strong interactions observed in MSC clusters are explained through the in vitro chondrogenesis, which is expected in a pellet culture system as described in the work of Bernstein et al. [23]. Furthermore, the lack of strong interaction in chondrocytes likely implies that they have already reached their differentiation state and remain as persistent cartilage [25]. The subsequent consideration of gene interactions in the interval (0.3 ≤ th ≤ 0.7) from low to higher correlation levels, where each cell type in each cluster tends to reach the total number of observed genes (i.e., cluster 1–598, cluster 2–1402, cluster

SKRETI et al.: TEMPORAL AND SPATIAL PATTERNS OF GENE PROFILES DURING CHONDROGENIC DIFFERENTIATION

807

Fig. 5. “Intersection genes” of MSCs and chondrocytes for cluster 1 (a), cluster 2 (b), and cluster 3 (c). In cluster 3, all common genes but four are included in the intersection set.

3–260), may uncover valuable biological information regarding the affected annotated genes. For example, the annotation according to GO terms reveal BPs such as signal transduction (Wnt signaling network—an important regulator of chondrocyte development) in MSCs and response to stress in chondrocytes for Cluster 1, while indicates discrete phases of cell cycle (e.g., mitosis) in MSCs and chondrocytes for Cluster 3 (see Appendix C). Thus, the use of these networks at varying thresholds can unfold detailed subprocesses during differentiation of both MSCs and chondrocytes. At the level of many interactions, the networks generated from all observed genes show a complex structure that is unique for each cell type in each cluster (see Appendix B). In this case, we observe that only in cluster 3, the MSC and chondrocyte network structures have similar formations, with many “intersection genes” (as explained in Appendix B). In an attempt to further investigate the common network structures between the MSC and chondrocyte genes, we compute the intersections between matched clusters. Fig. 5 presents the subnetwork intersections (at a level of weaker interactions as to involve all genes) from the MSC and chondrocyte networks,

for the three cluster pairs. The intersections are presented in groups of common network degree, which as a topology metric reflects the number of links of each node with all other nodes in the network. Especially in biological networks, the degree is an important metric that highlights genes with high connectivity playing an important role in disease development [33]. We summarize the following observations. 1) The intersection of MSCs and chondrocytes includes only one pair of genes for cluster 1, eleven genes for cluster 2, and almost all (256 from 260) genes for cluster 3 (see Fig. 5). 2) Distinct “intersection genes” with high degree (hub genes) play significant role in network structure in each cell type of cluster 1 and 2: a) in cluster 1, two transcription factors (ATF2activating transcription factor 2, HES4-hairy and enhancer of split 4), and the EMC9 (ER membrane protein complex subunit 9) appear to have many interactors in chondrocytes, whereas JMY (junction mediating and regulatory protein, p53 cofactor) and

808

IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 18, NO. 3, MAY 2014

TNFAIP8 (tumor necrosis factor, alpha induced protein 8) play important roles in MSCs, b) in cluster 2, the GPR98 (G protein-coupled receptor 98) seems to be a crucial gene in MSCs, whereas a similar role is assumed by FAM221A (family with sequence similarity 221, member A) in chondrocytes. Both these molecules appear to be quite important and control the entire differentiation mechanisms of cluster 2. Recall that G protein-coupled receptor signaling changes in small time intervals, which cannot be captured by the sampling intervals in our dataset, thus explaining the lack of significant changes in cluster 2 for both MSCs and chondrocytes. 3) The large number of “intersection genes” in cluster 3 does not actually imply similar interaction forms. The common interactions of the two network structures in cluster 3 form only the 1/5th of the total interactions. This observation implies that the same generic cell-cycle operation (universal process according to Goldar et al. [34]) is achieved by different gene interactions in the two cell types. Based on the aforementioned observations, we illustrate that a common number of genes via successive interconnections achieves a complex structure for each cell type in each cluster, which forms a specific spatial pattern for each cell type that can support its temporal profile. Moreover, the same genes behave differently in each cell type despite their similar temporal profiles, forming different interactions and generating different spatial patterns, especially in clusters 1 and 2. The biological analysis of “intersection genes” provides important information of their major functional roles in each cell type. In general, the matched network interconnections might give important hints about the affected signaling pathways in chondrocytes and MSCs during differentiation. An important side effect of the proposed clustering scheme based on both expression and temporal profiles is its potential in grouping similar gene network structure of MSCs and chondrocytes, as in cluster 3 that summarizes the universal characteristics of cell cycle process [34]. Concluding, our proposed methodology allows monitoring of the temporal trends of a specific number of genes and their spatial projection through their gene network structures. Applying our methodology in Bernstein’s dataset [23], we demonstrate temporal and spatial profiles that are specific for MSCs and chondrocytes during chondrogenic differentiation in 3-D environment. As a final point, we notice that the proposed form of coding the shape of time-varying profiles (in Section II-A) is not unique. Other metrics could be explored within the same framework, which may further improve the clustering performance. IV. CONCLUSION Statistical evaluation of temporal profiles based on gene expression patterns play an important role in biological processes or conditions. We introduce a clustering method for this purpose which is based on the expression patterns but is also influenced by temporal changes. We compared the results of our platform with clustering based on expression profiles, or rank-based clus-

tering based solely on temporal changes. The proposed use of temporal changes to guide the clustering of expression profiles can better support the biological performance and can compare the two cell types on the basis of their biological role. We derived three clusters in each cell type and compared the content of these classes in terms of histograms explaining their individual expression and temporal profiles. For statistical evaluation, we introduce a validity measure that takes under consideration these temporal changes. For biological validation, we performed an enrichment analysis of prominent genes in each cluster. Through this analysis, we verified the relevance of results from the perspective of biological processes and found genetic alterations during chondrogenic differentiation for both cell types. The proposed methodology builds on the results of Bernstein et al. [23] and enhances the validity of derived genes by including yet another characteristic concerning the temporal expression pattern. Furthermore, we propose and implement a criterion for partition evaluation that engages the temporal profile of clusters, which is more appropriate for data with a consistent sequential progression in time. Through this approach, we derive compact clusters with similar temporal trends. To support this point, we have performed comparisons in both the statistical level (similarity of the MSCs and chondrocytes) and the biological significance. The wide range of enrichment analysis that was explored by WebGestalt enhances the validity of biological interpretation of the results, and can be eventually used as to guide wet-lab experimentalists. Furthermore, following a biological network approach, we established that despite the similar temporal and biological behavior of genes in matched clusters of MSC and chondrocyte cells, these genes may be forming different spatial interaction patterns (networks) in each cell type. Thus, even though we may detect certain statistical similarities, these need to be also consider from a deeper biological point of view as they may be due to different biological processes or different molecular interactions. Our proposed platform provides a consistent tool that facilitates both the statistical and biological validation of temporal profiles, using statistical support, enrichment analysis, and validation through spatial gene-network structures. APPENDIX Please notice that color version of the figures are provided on the online version of the paper. With respect to the implementations of algorithms, they are available upon request. Appendixes A (gene lists), B (MSC and chondrocyte networks), and C (enriched GO terms for BP, MF and CC), as well as Appendixes D–H (enriched terms of KEGG, WikiPathways, TransFac, miR, and cytogenetic analyses), can be found on http://www.display.tuc.gr/skret.temporal-spatialpatterns-study/. REFERENCES [1] S. Imoto and S. Miyano, “Statistical analysis o small set of time-ordered gene expression data using linear splines,” Bioinformatics, vol. 18, no. 11, pp. 1477–1485, Apr. 2002. [2] H. Liu, S. Tarima, A. S. Borders, T. V Getchell, M. L. Getchell, and A. J. Stromberg, “Quadratic regression analysis for gene discovery and

SKRETI et al.: TEMPORAL AND SPATIAL PATTERNS OF GENE PROFILES DURING CHONDROGENIC DIFFERENTIATION

[3]

[4] [5] [6]

[7]

[8] [9] [10] [11]

[12]

[13] [14] [15] [16] [17]

[18] [19]

pattern recognition for non-cyclic short time-course microarray experiments,” BMC Bioinform., vol. 6, no. 106, pp. 1–69, Apr. 2005. S. D. Peddada, E. K. Lobenhofer, L. Li, C. A. Afshari, C. R. Weinberg, and D. M. Umbach, “Gene selection and clustering for time-course and dose-response microarray experiments using order-restricted inference,” Bioinformatics, vol. 19, no. 7, pp. 834–841, Nov. 2003. S. G. Yi, Y. J. Joo, and T. Park, “Rank-based analysis for the time-course microarray data,” J. Bioinform. Comput. Biol., vol. 7, no. 1, pp. 75–91, Sep. 2009. S. G. Yi, Y. J. Joo, and T. Park, “Rank-based clustering analysis for the time-course microarray data,” J. Bioinform. Comput. Biol., vol. 7, no. 1, pp. 75–91, Aug. 2009. S. C. Madeira, M. C. Teixeira, I. S´a-Correia, and A. L. Oliveira, “Identification of regulatory modules in time series gene expression data using a linear time biclustering algorithm,” IEEE /ACM Trans. Comput. Biol. Bioinform., vol. 7, no. 1, pp. 153–165, Oct. 2007. A. V. Carreiro, O. Anunciac¸a˜ o, J. A. Carric¸o, and S. C. Madeira, “Prognostic prediction through biclustering-based classification of clinical gene expression time series,” J. Integr. Bioinform., vol. 8, no. 3, no. 175, pp. 1– 17, Jan. 2011. G. Skreti, E. S. Bei, and M. Zervakis, “Shape-influenced clustering of dynamic patterns of gene profiles,” in Proc. IEEE Eng. Med. Biol. Soc Conf., Jan. 2012, vol. 2012, pp. 1238–1241. L. Ji and K. L. Tan, “Mining gene expression data for positive and negative co-regulated gene clusters,” Bioinformatics, vol. 20, no. 16, pp. 2711– 2718, Apr. 2004. L. Ji and K. L. Tan, “Identifying time-lagged gene clusters using gene expression data,” Bioinformatics, vol. 21, no. 4, pp. 509–516, Sep. 2005. K. D. Kalantzaki, E. S. Bei, M. Garofalakis, and M. Zervakis, “Biological interaction networks based on sparse temporal expansion of graphical models,” in Proc. 12th IEEE Int. Conf. BioInform. BioEngineering, 2012, pp. 460–465. W. Jiang, X. Li, S. Rao, L. Wang, L. Du, C. Li, C. Wu, H. Wang, Y. Wang, and B. Yang, “Constructing disease-specific gene networks using pairwise relevance metric: Application to colon cancer identifies interleukin 8, desmin, and enolase 1 as the central elements,” BMC Syst. Biol., vol. 2, no. 72, pp. 1–15, Aug. 2008. N. Friedman, M. Linial, I. Nachman, and D. Pe’er, “Using Bayesian networks to analyze expression data,” J. Comput. Biol., vol. 7, no. 3–4, pp. 601–620, Jan. 2000. J. Hu, J. Wan, L. Hackler, D. J. Zack, and J. Qian, “Computational analysis of tissue-specific gene networks: Application to murine retinal functional studies,” Bioinformatics, vol. 26, no. 18, pp. 2289–2297, Jul. 2010. F. Browne, H. Wang, H. Zheng, and F. Azuaje, “A knowledge-driven probabilistic framework for the prediction of protein-protein interaction networks,” Comput. Biol. Med., vol. 40, no. 3, pp. 306–317, Jan. 2010. S. Bulashevska, A. Bulashevska, and R. Eils, “Bayesian statistical modelling of human protein interaction network incorporating protein disorder information,” BMC Bioinform., vol. 11, no. 46, pp. 1–15, Jan. 2010. C. C. Wu, S. Asgharzadeh, T. J. Triche, and D. Z. D’Argenio, “Prediction of human functional genetic networks from heterogeneous data using RVM-based ensemble learning,” Bioinformatics, vol. 26, no. 6, pp. 807– 813, Feb. 2010. Y.-Q. Qiu, S. Zhang, X.-S. Zhang, and L. Chen, “Detecting disease associated modules and prioritizing active genes based on high throughput data,” BMC Bioinform., vol. 11, no. 26, pp. 1–12, Jan. 2010. H. Wang, D. Mirota, and G. D. Hager, “A generalized Kernel Consensusbased robust estimator,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 178–184, Jan. 2010.

809

[20] R. Boscolo, H. Pan, and V. P. Roychowdhury, “Independent component analysis based on nonparametric density estimation,” IEEE Trans. Neural Netw. /a Publ. IEEE Neural Netw. Counc., vol. 15, no. 1, pp. 55–65, Jan. 2004. [21] E. C. Neto, C. T. Ferrara, A. D. Attie, and B. S. Yandell, “Inferring causal phenotype networks from segregating populations,” Genetics, vol. 179, no. 2, pp. 1089–1100, Apr. 2008. [22] X. W. Chen, G. Anantha, and X. Wang, “An effective structure learning method for constructing gene networks,” Bioinformatics, vol. 22, no. 11, pp. 1367–1374, Mar. 2006. [23] P. Bernstein, C. Sticht, A. Jacobi, C. Liebers, S. Manthey, and M. Stiehler, “Expression pattern differences between osteoarthritic chondrocytes and mesenchymal stem cells during chondrogenic differentiation,” Osteoarthritis Cartilage, vol. 18, no. 12, pp. 1596–1607, Sep. 2010. [24] B. Zhang, S. Kirov, and J. Snoddy, “WebGestalt: an integrated system for exploring gene sets in various biological contexts,” Nucleic Acids Res., vol. 33, pp. W741–W748, Apr. 2005. [25] A. Woods, G. Wang, and F. Beier, “Regulation of chondrocyte differentiation by the actin cytoskeleton and adhesive interactions,” J. Cell. Physiol., vol. 213, no. 1, pp. 1–8, Oct. 2007. [26] B. Rocha, V. Calamia, J. Mateos, P. Fern´andez-Puente, F. J. Blanco, and C. Ruiz-Romero, “Metabolic labeling of human bone marrow mesenchymal stem cells for the quantitative analysis of their chondrogenic differentiation,” J. Proteome Res., vol. 11, no. 11, pp. 5350–5361, Sep. 2012. [27] H. J. Yoo, S. S. Yoon, S. Y. Park, E. Y. Lee, E. B. Lee, J. H. Kim, and Y. W. Song, “Gene expression profile during chondrogenesis in human bone marrow derived mesenchymal stem cells using a cDNA microarray,” J. Korean Med. Sci., vol. 26, no. 7, pp. 851–858, Jul. 2011. [28] M. M. J. Caron, P. J. Emans, M. M. E. Coolsen, L. Voss, D. A. M. Surtel, A. Cremers, L. W. Van Rhijn, and T. J. M. Welting, “Redifferentiation of dedifferentiated human articular chondrocytes: comparison of 2D and 3D cultures,” Osteoarthritis Cartilage, vol. 20, no. 10, pp. 1170–1178, Oct. 2012. [29] A. M. Newman, N. B. Gallo, L. S. Hancox, N. J. Miller, C. M. Radeke, M. A. Maloney, J. B. Cooper, G. S. Hageman, D. H. Anderson, L. V Johnson, and M. J. Radeke, “Systems-level analysis of age-related macular degeneration reveals global biomarkers and phenotype-specific functional networks,” Genome Med., vol. 4, no. 2, p. 16, Feb. 2012. [30] J. J. Linderman, “Modeling of G-protein-coupled receptor signaling pathways,” J. Biol. Chem., vol. 284, no. 9, pp. 5427–5431, Feb. 2009. [31] P. S. Mathieu and E. G. Loboa, “Cytoskeletal and focal adhesion influences on mesenchymal stem cell shape, mechanical properties, and differentiation down osteogenic, adipogenic, and chondrogenic pathways,” Tissue Eng. Part B Rev., vol. 18, no. 6, pp. 436–444, Nov. 2012. [32] J. W. Yang, N. De Isla, C. Huselstein, M. N. Sarda-Kolopp, Y. P. L. N. Li, O. Y. Jing-Ping, J. F. Stoltz, and A. Eljaafari, “Evaluation of human MSCs cell cycle, viability and differentiation in micromass culture,” Biorheology, vol. 43, no. 3–4, pp. 489–496, 2006. [33] J. J. Cai, E. Borenstein, and D. A. Petrov, “Broker genes in human disease,” Genome Biol. Evol., vol. 2, pp. 815–825, Oct. 2010. [34] A. Goldar, M. C. Marsolier-Kergoat, and O. Hyrien, “Universal temporal profile of replication origin activation in eukaryotes,” PLoS One, vol. 4, no. 6, p. e5899, Jan. 2009.

Authors’ photographs and biographies not available at the time of publication.

Temporal and spatial patterns of gene profiles during chondrogenic differentiation.

Clustering analysis based on temporal profile of genes may provide new insights in particular biological processes or conditions. We report such an in...
1MB Sizes 0 Downloads 4 Views