Methods in Molecular Biology DOI 10.1007/7651_2015_240 © Springer Science+Business Media New York 2015

Classification and Clustering on Microarray Data for Gene Functional Prediction Using R

Liliana López Kleine, Rosa Montaño, and Francisco Torres-Avilés

Abstract

Gene expression data (microarrays and RNA-sequencing data) as well as other kinds of genomic data can be extracted from publicly available genomic data. Here, we explain how to apply multivariate cluster and classification methods on gene expression data. These methods have become very popular and are implemented in freely available software in order to predict the participation of gene products in a specific functional category of interest. Taking into account the availability of data and of these methods, every biological study should apply them in order to obtain knowledge on the organism studied and functional category of interest. A special emphasis is made on the nonlinear kernel classification methods.

Keywords: Microarrays, Functional prediction, Multivariate data analysis, Clustering, Classification

1 Introduction

The methods presented here use clustering and classification to determine groups of genes with functional relationships. Although both types of methods are used to construct groups (of genes in this case), clustering methods construct them without using previous knowledge (unsupervised classification), whereas classification methods rely on previous knowledge (supervised learning).

2 Materials

The procedures described here for gene functional prediction based on microarrays can be applied to all kinds of gene expression data organized in an n × p table containing the amount of messenger RNA, as shown in Table 1. This kind of table can also be obtained from RNA sequencing data after reads are mapped on the genes of the organism of interest. With the RNA sequencing technique, the amount of RNA transcripts likewise represents the RNA quantity. Both kinds of raw data tables need to be normalized and transformed in order to make them comparable before further processing and data analysis (1, 2). All procedures are presented in R (3).


Table 1 Typical microarray data table

Gene ID     Microarray condition (1)   ...   Microarray condition (p)
Gene (1)    RNA quantity               ...   RNA quantity
...         ...                        ...   ...
Gene (n)    RNA quantity               ...   RNA quantity

Fig. 1 Prediction of new genes belonging to the virulence category through classification

3 Methods

Below we briefly describe the multivariate clustering and classification methods we have found useful for functional prediction from microarray data. We use the prediction of virulence factors as an example throughout the chapter, as shown in Fig. 1. Nevertheless, these methods are also useful for other functional categories of interest and for other organisms for which genomic data are available. We have concluded in previous studies (4–7) that more than one method should be used and that coincident predictions should be taken into account in order to formulate biological hypotheses and plan further in silico or wet-lab validation experiments.
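The idea of retaining coincident predictions can be sketched in a few lines of R. This is only an illustration: the gene identifiers and the two prediction vectors (pred_svm, pred_lda) are hypothetical and do not come from the chapter's data.

```r
# Hypothetical identifiers of genes predicted as virulence factors by two methods
pred_svm <- c("gene03", "gene07", "gene12", "gene20")
pred_lda <- c("gene07", "gene12", "gene15")

# Coincident predictions: genes proposed by both methods,
# which would be the first candidates for validation
consensus <- intersect(pred_svm, pred_lda)
print(consensus)

# Genes proposed by only one of the two methods
single_method <- setdiff(union(pred_svm, pred_lda), consensus)
print(single_method)
```

With more than two methods, one option would be to count how many methods propose each gene (for example with table()) and keep the genes above a chosen count.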

3.1 Functional Data (Known Gene Categories) and Training Sets

This kind of method is straightforward for classifying genes into two categories (e.g., virulence factors and non-virulence factors, or immunity-related genes and all others). These categories need to be constructed before applying the methods proposed here. They can be constructed based on the literature or extracted from genomic databases. They should be represented as a vector indicating, for each gene, the category to which it belongs. For supervised classification methods such as support vector machine classification (SVM) and linear discriminant analysis (LDA), we used a training set. The genes belonging to the training set are chosen at random from the two known categories and should represent approximately one third of all genes. The ratio between both classes in the overall data set should be maintained in the training set.
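A training set with these properties can be drawn by sampling one third of the genes within each category separately. The following is a minimal sketch under assumed data: the category vector, its size (300 genes, 60 of them in the category of interest), and the object names are hypothetical.

```r
set.seed(1)  # makes the random draw reproducible

# Hypothetical category vector: 1 = virulence factor, 0 = not (one entry per gene)
n_genes  <- 300
category <- c(rep(1, 60), rep(0, 240))
names(category) <- paste0("gene", seq_len(n_genes))

# Sample one third of the genes inside each category,
# so the ratio between classes is the same as in the full data set
train_idx <- unlist(lapply(split(seq_len(n_genes), category), function(idx) {
  idx[sample.int(length(idx), size = round(length(idx) / 3))]
}))
train_idx <- sort(unname(train_idx))

train_set <- category[train_idx]   # genes used to train the classifier
test_set  <- category[-train_idx]  # remaining genes

# Class proportions in the full data set and in the training set
print(table(category) / n_genes)
print(table(train_set) / length(train_set))
```

Sampling within each category, rather than from all genes at once, is what keeps the class ratio fixed; a single overall draw would only preserve it on average.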


3.2 Preprocessing Microarray Data

Several preprocessing methods for gene expression data exist. Any of them can be used to normalize gene expression data and make experiments comparable. We recommend using the method proposed by Huber et al. (8). The code below uses example data contained in Huber's vsn package. This R package is part of Bioconductor ((9), www.bioconductor.org), a collection of packages developed especially for gene expression data.

    # installation of this package from Bioconductor
    # (current Bioconductor releases install through BiocManager instead of biocLite)
    if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
    BiocManager::install("vsn")

    library(vsn)      # this library contains Huber's (2003) method implemented
    citation("vsn")   # here is how you should cite the library if you use it

    data("lymphoma")  # an available microarray data set in R, used to illustrate all methods
    class(lymphoma)   # returns the type of object; this is a special object for microarray data
    dim(exprs(lymphoma))      # returns the dimension of the table containing the gene expression data
    boxplot(exprs(lymphoma))  # constructs a boxplot of gene expression values for each of the 16 samples

    par(mfrow = c(1, 2))      # prepares the graphic window for two plots
    hist(exprs(lymphoma)[, 1], main = "green")  # plots a histogram of the first sample
    hist(exprs(lymphoma)[, 2], main = "red")    # plots a histogram of the second sample

    lym2 <- justvsn(lymphoma)  # applies the normalization method and creates a new object

    # shows the result of normalization by plotting the standard deviation (sd) against the mean;
    # higher mean values should not imply a higher sd.
    # A horizontal red line is expected.
    meanSdPlot(lym2, ranks = TRUE)

    boxplot(exprs(lym2))       # boxplots after normalization
    par(mfrow = c(1, 2))
    hist(exprs(lym2)[, 1])     # histograms after normalization
    hist(exprs(lym2)[, 2])

3.3 Clustering Methods

Clustering methods group observations with the aim of reducing the variability inside each cluster and identifying homogeneous subgroups. The observations inside each group therefore share similar characteristics, which are essentially numerical. This methodology has been widely used in many applications and is classified as a multivariate technique in statistics. A key element of this analysis is the metric used to quantify how similar individuals (here genes) are, based on the data at hand. The most common is the Euclidean metric.
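As a small illustration of the Euclidean metric, the sketch below computes pairwise distances between the rows of a toy expression matrix with the base R function dist(). The matrix values and gene names are made up for the example.

```r
# Toy expression matrix: 3 hypothetical genes (rows) measured in 4 conditions (columns)
expr <- rbind(geneA = c(1.0, 2.0, 3.0, 4.0),
              geneB = c(1.5, 2.5, 2.5, 4.5),
              geneC = c(4.0, 3.0, 2.0, 1.0))

# dist() computes distances between rows; "euclidean" is its default method
d_euc <- dist(expr, method = "euclidean")
print(d_euc)

# The same quantity computed by hand for geneA and geneB:
# square root of the sum of squared differences over all conditions
d_AB <- sqrt(sum((expr["geneA", ] - expr["geneB", ])^2))
print(d_AB)
```

Because genes are the objects to be grouped, they must be in the rows of the matrix passed to dist().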

Nevertheless, other popular distance metrics are worth mentioning: Manhattan, Canberra, and binary. Details about these measures can be found in the books edited by Rencher and Christensen (10) and Izenman (11). For the illustration of these methods, we use the Lymphoma data set presented in the previous subsection.

3.3.1 Hierarchical Methods

Agglomerative hierarchical clustering methods are the most common ones for unsupervised classification. In this method each of the n observations (here genes) starts as its own cluster, and at each iteration the two most similar clusters are merged until, at the final stage, a single cluster containing all observations is formed. The most recommended merging criterion is the "Ward" criterion, which performs well when groups are roughly spherical. At each iteration it merges the pair of clusters that produces the smallest increase in within-cluster variance, leading to clusters with minimal internal variability (10).

Another option for unsupervised clustering is Divisive Analysis. This algorithm starts with all observations in one single cluster and divides the clusters at each step until each cluster contains just one observation (12). It is called DIANA in most references and is one of the few representatives of the divisive hierarchical approach to cluster analysis. The main elements for its implementation are the dissimilarity matrix and the clustering fusion criteria.

A rule to define the number of clusters using this technique was proposed by Mojena (13) and can be represented by a line that cuts the dendrogram at a fixed value on the distance axis. The clusters retained are those formed at distances below this constant, which is computed as the mean plus k times the standard deviation of the distances used to form all groups. Milligan and Cooper (14) suggested k = 1.25 as the most satisfactory criterion.

Although this methodology does not yield the most homogeneous clusters, it can be used to determine an initial number of clusters and their mean vectors (cluster centers or centroids), which serve as starting information for other, better procedures. Given below is the code to apply the methods described here to the example data of the vsn package (8).

lymph_express
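The following minimal sketch illustrates Ward clustering, DIANA, and one reading of the Mojena rule on simulated data rather than on the lymphoma data. It assumes the cluster package is installed (it normally ships with R as a recommended package); the simulated matrix and all object names are hypothetical, and counting the fusion distances above the cutoff is only one way to apply the rule.

```r
library(cluster)  # provides diana(); assumed to be installed

set.seed(42)
# Simulated data: 3 well-separated groups of 20 "genes" measured in 5 conditions
group_means <- c(-5, 0, 5)
expr <- do.call(rbind, lapply(group_means, function(m) {
  matrix(rnorm(20 * 5, mean = m, sd = 0.5), ncol = 5)
}))
rownames(expr) <- paste0("gene", seq_len(nrow(expr)))

d <- dist(expr)  # Euclidean dissimilarity matrix between genes

# Agglomerative clustering; "ward.D2" applies Ward's criterion to unsquared distances
hc_ward <- hclust(d, method = "ward.D2")

# Mojena-type cutoff: mean + k * sd of the fusion distances, with k = 1.25
k <- 1.25
fusion <- hc_ward$height
cutoff <- mean(fusion) + k * sd(fusion)

# Each fusion above the cutoff is undone, which adds one cluster
n_clusters <- sum(fusion > cutoff) + 1
groups_ward <- cutree(hc_ward, k = n_clusters)

# Divisive clustering (DIANA) on the same dissimilarities, cut into the same number of groups
dv <- diana(d)
groups_diana <- cutree(as.hclust(dv), k = n_clusters)

# Cross-tabulation of the two partitions
print(table(groups_ward, groups_diana))

# Cluster centroids (mean vectors), usable as starting values for other procedures
centroids <- apply(expr, 2, function(col) tapply(col, groups_ward, mean))
print(centroids)
```

The cutoff can be drawn on the dendrogram with plot(hc_ward) followed by abline(h = cutoff), which gives the cutting line described above.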
