A fast hierarchical clustering algorithm for large-scale protein sequence data sets.

Computers in Biology and Medicine 48 (2014) 94–101

Contents lists available at ScienceDirect

Computers in Biology and Medicine journal homepage: www.elsevier.com/locate/cbm

A fast hierarchical clustering algorithm for large-scale protein sequence data sets Sándor M. Szilágyi a,1, László Szilágyi b,c,n,2 a

Petru Maior University, Department of Informatics, Str. Nicolae Iorga Nr. 1, 540088 Tîrgu Mureş, Romania Budapest University of Technology and Economics, Department of Control Engineering and Information Technology, Magyar tudósok krt. 2, H-1117 Budapest, Hungary c Sapientia University of Transylvania, Faculty of Technical and Human Sciences, Şoseaua Sighişoarei 1/C, 540485 Tîrgu Mureş, Romania b

art ic l e i nf o

a b s t r a c t

Article history: Received 13 November 2013 Accepted 25 February 2014

TRIBE-MCL is a Markov clustering algorithm that operates on a graph built from pairwise similarity information of the input data. Edge weights stored in the stochastic similarity matrix are alternately fed to the two main operations, inflation and expansion, and are normalized in each main loop to maintain the probabilistic constraint. In this paper we propose an efficient implementation of the TRIBE-MCL clustering algorithm, suitable for fast and accurate grouping of protein sequences. A modified sparse matrix structure is introduced that can efficiently handle most operations of the main loop. Taking advantage of the symmetry of the similarity matrix, a fast matrix squaring formula is also introduced to facilitate the time consuming expansion. The proposed algorithm was tested on protein sequence databases like SCOP95. In terms of efficiency, the proposed solution improves execution speed by two orders of magnitude, compared to recently published efficient solutions, reducing the total runtime well below 1 min in the case of the 11,944 proteins of SCOP95. This improvement in computation time is reached without losing anything from the partition quality. Convergence is generally reached in approximately 50 iterations. The efficient execution enabled us to perform a thorough evaluation of classification results and to formulate recommendations regarding the choice of the algorithm's parameter values. & 2014 Elsevier Ltd. All rights reserved.

Keywords: Protein sequence clustering Markov clustering Markov processes Efficient computing Sparse matrix

1. Introduction Protein families are defined as groups of molecules with relevant sequence similarity [1,2], the members of which are likely to serve similar or identical biological purposes [3]. The identification of protein families is generally performed by clustering algorithms, mostly supported by pairwise similarity or dissimilarity measures [4]. Well established properties of some proteins in a family may be reliably transferred to other members whose functions are not well known [5]. The classification of protein sequences has a wide variety of methods [6], including motif-based classification [7], string kernels [8], data mining techniques [9–11], string weighting [12], word segmentation [13], feature hashing [14], and trees [15].

n Corresponding author at: Sapientia University of Transylvania, Faculty of Technical and Human Sciences, Şoseaua Sighişoarei 1/C, 540485 Tîrgu Mureş, Romania. Tel.: þ40 265 208170; fax: þ 40 265 206211. E-mail addresses: [email protected] (S.M. Szilágyi), [email protected] (L. Szilágyi). 1 Tel./fax: þ 40 265 262275. 2 Tel.: þ36 1 463 4027; fax: þ 36 1 463 2699.

http://dx.doi.org/10.1016/j.compbiomed.2014.02.016 0010-4825 & 2014 Elsevier Ltd. All rights reserved.

One of the greatest obstacle for these methods represents the multi-domain structures of many protein families [16]. Several ultra-fast clustering scheme have been introduced recently [17– 22] and successfully employed in protein sequence or interaction grouping [23], but also in the clustering of large biochemical and biological networks [24], and documents [25]. TRIBE-MCL is an efficient clustering method based on Markov chain theory [26], introduced by Enright et al [4]. TRIBE-MCL assigns a graph structure to the protein set such a way that each protein has a corresponding node. Edge weights are stored in the so-called similarity matrix S, which acts as a (column or row) stochastic matrix. At any moment, edge weight sij reflects the posterior probability that protein i and protein j have a common evolutionary ancestor. Initial edge weights in the graph are obtained via pairwise alignment of the sequences, performed by the BLAST search method [27], preferred because of the sparse nature of the provided similarity values. TRIBE-MCL is an iterative algorithm, performing in each loop two main operations on the similarity matrix: inflation and expansion. Further operations like column or row normalization and matrix symmetrization are included to serve the stability and robustness of the algorithm, and to enforce the probabilistic

S.M. Szilágyi, L. Szilágyi / Computers in Biology and Medicine 48 (2014) 94–101

constraint. Inflation raises each element of the similarity matrix to power r, which is a previously established fixed inflation rate. Due to the constraint r 41, inflation favors higher similarity values in the detriment of lower ones. Expansion, performed by raising matrix S to the second power, is aimed to favor longer walks along the graph. Matrix symmetrization has a double role: it maintains the largest value within a column (or row) on the diagonal of the matrix, and eliminates the non-zero similarity values that fall below a previously defined threshold value ε. Clusters are obtained as connected subgraphs in the graph. Convergence reportedly occurs within a few dozens of iterations. TRIBE-MCL is a divisive hierarchical clustering algorithm: once separated, it never connects isolated subgraphs to each other again, it produces the clusters by dividing existing ones into several smaller groups. This study has the main goal to introduce an efficient implementation of the TRIBE-MCL algorithm using a so-called sparse supermatrix (SSM) data structure, and to investigate the behavior of TRIBE-MCL via thorough tests using the large protein set of the SCOP95 database [28]. The remainder of this paper is structured as follows: Section 2 takes into account the functional details of the TRIBE-MCL algorithm and presents our motivations to introduce modifications. Section 3 presents the details of the proposed efficient TRIBE-MCL algorithm. Section 4 thoroughly evaluates the behavior of the proposed method. Section 5 discusses the achieved results and outlines the role of each parameter, while Section 6 concludes this study.

2. Related works TRIBE-MCL in its conventional, easily implementable form has a theoretical complexity of Oðn3 Þ, and needs up to 6 h to perform a single loop on a data set containing 104 proteins. Although this inefficient solution equally works with any kind of pairwise similarity data, it is prohibitively slow. Accelerating the performance starts with the choice of similarity criterion. For example, BLAST gives us a virtually symmetrical similarity matrix with large amount of zero values. These properties of the matrix are the primary source of acceleration. Since the zero values do not influence any of the TRIBE-MCL operations, it is possible to reduce the computational load by avoiding (skipping) the additions and multiplications with null arguments.

95

2.2. Sparse matrices Besides not having to compute additions and multiplications with null arguments, the large amount of zero values in the similarity matrix need not be stored during the TRIBE-MCL iterations. Sparse matrices are suitable data structures for such efficient implementations: while computing the normalization of a column or row, zero elements are not added to the sum, thus reducing the number of additions. In fact, a zero element can only change to non-zero during the expansion. But also in case of the expansion, zero elements in the input do not affect the outcome of any element of the output matrix. The rest of this section refers to the version of sparse matrix where non-zero elements of columns are stocked in a chained list, ordered by row coordinate. Thus the sparse matrix has an array of list head pointers, each one pointing to the first non-zero element of the corresponding column. Each non-zero element is represented by the structure ðrow; value; nextÞ. The latter variable in the structure is a pointer to the next non-zero element in the column. Inflation requires a single parsing of each column and thus the power computation is only performed for non-zero elements. The normalization needs to parse each column twice: first it computes the sum of each column and then it divides all non-zero elements by the sum of the column. Assuring matrix symmetry is more complicated, because it requires searching for the transposed of each non-zero element. Expansion requires a new sparse matrix for the output. During the computation of the expanded matrix, the elements of each column are determined in such an order, that new non-zero elements are always placed at the end of the list. That is why, it is worth to have a pointer to the tail of the column list as well. Further on, as expansion is computed right after having made the similarity matrix approximately symmetrical, we may find the element sij as n

n

k¼1

k¼1

¼ ∑ sik skj ∑ sik sjk ; sðnewÞ ij

ð1Þ

which is easier to compute as columns are way easier to parse than rows in this data structure. Using such an implementation can reduce the total runtime 100–200 times, but clustering 104 proteins still requires a couple of hours [31].

2.1. Matrix splitting

3. Materials and methods

It may occur at any progress level of the algorithm, that a certain column and the corresponding row of S remains with a single non-zero value, which is situated on the diagonal of the matrix. At this point, the node (and protein) corresponding to that certain row can be declared an item isolated in a separate cluster, having no more influence upon the other nodes in the further iterations. Such rows (and columns) can be removed from S, thus reducing the size of the matrix and the computational load of late iterations. This way TRIBE-MCL may be accelerated up to 2 times, depending on input data and parameter values [29]. This idea has recently evolved to the matrix splitting solution, which started from the idea that isolated subgraphs do not get into further interactions with other nodes. This way, when the graph gets torn into isolated subgraphs, which usually occurs after 5–8 iterations, the initial big matrix can be reorganized (via row or column reordering) into several small diagonally situated blocks, and the further computations may neglect the computations with similarities situated outside the blocks, as they are all zero. This way late iterations of TRIBE-MCL can speed-up 104 times, and the total runtime becomes 20–50 times shorter [30].

In this paper we introduce an efficient implementation of the TRIBE-MCL algorithm, with the aim of significantly reducing its computational load, without harming the outcome of classification. A special set of data structures will be introduced to facilitate the quick execution of the main operations, and the features of the similarity matrix will be fully exploited to achieve an extremely quick performance. The proposed method will be tested on the proteins of the SCOP95 database. 3.1. The SCOP95 database The SCOP (Structural Classification of Proteins) database [28] contains protein sequences in order of tens of thousands, hierarchically classified into classes, folds, superfamilies and families [32,33]. The SCOP95 database involved in this study, is a subset of SCOP (version 1.69), which contains n ¼11,944 proteins, exhibiting a maximum similarity of 95% among each other. Pairwise similarity and distance matrices (BLAST [27], Smith–Waterman [34], Needleman–Wunsch [35], PRIDE [36], etc.) are available at the Protein Classification Benchmark Collection [37,38]. In this study

96


Fig. 1. Schematic representation of the utilized sparse matrix model: (a) one non-zero element of the matrix and (b) array containing concatenated rows of the matrix.

we employed the BLAST similarity measure, because that one suppresses low similarities, thus contributing to the desired computational load reduction. In the initial similarity matrix there are 506,475 non-zero values, indicating that an average node in the graph is connected to 41 other nodes. 3.2. Data structures Let us introduce the data structures employed by the proposed method. All TRIBE-MCL approaches need two instances of the similarity matrix. Inflation, normalization, and symmetrization can be performed in a single matrix, but the expansion needs separate matrix instance for the input and the output data. All approaches presented in Section 2 use two matrices of the same type. The proposed approach employs a sparse supermatrix and a normal two-dimensional array. To assure an optimal execution of the operations, the SSM is implemented in a novel structure as shown in Fig. 1. Each nonzero element of the sparse matrix (sij) is stored together with an approximated value of its symmetrically situated element in the matrix (t ij sji ). Rows are stored concatenated in an array of structures, each such structure describing a non-zero element in the matrix. The whole array and each row is aligned to a memory address divisible by 32 bytes via padding, which supports efficient memory addressing. The starting address of each row is stored in a separate array of row heads. This whole data structure assures easy parsing of the matrix for all operations. With the exception of expansion, none of the operations increases the zero elements in any row. Normalization and inflation keeps the amount of non-zero elements constant, while symmetrization neglects values under the chosen threshold ε. During the expansion, the sparse matrix is completely rewritten, as presented in Section 3.6. 3.3. Initialization The input data is fed to the algorithm in a file that contains the number of proteins, and a list of the non-zero similarity values together with their row and column coordinate, ordered by row and column. In the initialization step, these values are introduced into the proposed sparse matrix structure. Similarity values are fed to a normalization step before reaching the main loop.

Table 1 The proposed efficient implementation of the expansion operation. Set s ij ¼ 0 8 i; j ¼ 1…n For any k ¼ 1…n For each i A rowk For each j A rowk with j Z i s ij ’s ij þ t ki skj s ji ’s ij End For j End For i End For k

3.5. Normalization and symmetrization Normalization requires parsing the proposed sparse matrix structure twice. In a first step, the sum of the similarity values is computed in each row. Let us denote by si the sum of values in row i, computed as

si ¼ ∑ sij j A rowi

8 i ¼ 1…n;

ð3Þ

where j iterates along the non-zero values stored in row i. In the second step, each element in the sparse matrix is divided by the corresponding sum: sðnewÞ ¼ sij =si ij t ðnewÞ ¼ t ij =sj ij

8 i ¼ 1…n; 8 j A rowi :

ð4Þ

New similarity values overwrite the old ones in the same instance of sparse matrix. The approximate symmetry of the similarity matrix S is performed by an iterative process in the main loop, situated between inflation and expansion. In each main loop, the symmetrization step is performed q times, and each symmetrization step is followed by a normalization. One symmetrization step requires a single parsing of the sparse matrix, and computes the following: ( pffiffiffiffiffiffiffiffiffi sij t ij if sij t ij Z ε2 sðnewÞ ¼ 8 i ¼ 1…n; ij 0 otherwise ð5Þ 8 j A rowi : ðnewÞ ðnewÞ ¼ sij t ij

3.4. Inflation

Parameter q determines how accurate is the approximation of matrix symmetry, while ε is responsible for neglecting unimportant low values of similarity.

To perform the inflation operation, the sparse matrix is parsed row by row and for each element present in the rows, the sij similarity value and its approximated transposed correspondent are both raised to the r-th power.

3.6. Expansion

¼ srij sðnewÞ ij t ðnewÞ ¼ t rij ij

8 i ¼ 1…n;

8 j A rowi ;

ð2Þ

where r is the inflation rate. New values are stored in the same sparse matrix, overwriting the old values.

Expansion is the only operation in the whole algorithm, which requires a second instance of similarity matrix to store the output of the operation. In our approach, this second matrix is a normal two-dimensional array initialized with zero values before starting the expansion. Let us denote the elements of this output matrix by s ij , i; j ¼ 1…n.


To achieve an efficient matrix multiplication, we turn to the following approximation: 8 i; j ¼ 1…n n

n

k¼1

k¼1

s ij ¼ ∑ sik skj ∑ t ki skj :

ð6Þ

The above formula inspires the expansion algorithm summarized in Table 1. The efficiency in this approach is assured by the fact, that only one of the nested three loops of matrix multiplication has to iterate over the n proteins, while the other two usually iterate in order of dozens of cycles. Finally, the values obtained in the variables s ij 8 i; j ¼ 1…n are transcribed into a new SSM structure. At this point, right after expansion, the similarity matrix is symmetrical. 3.7. Algorithm Let us summarize the proposed SSM-based approach of the TRIBE-MCL algorithm: 1. Initialize the parameters of the algorithm with the desired values: inflation rate r,

2. 3. 4. 5. 6. 7. 8.

similarity threshold ε, symmetrization steps in each loop q. Load initial similarity and store it in a SSM. Normalize the values in SSM as described in Section 3.5. Perform inflation as described in Section 3.4 Normalize and symmetrize the SSM in q steps as indicated in Section 3.5, beginning and ending with normalization. Perform the efficient squaring of the SSM as presented in Section 3.6 Go back to step 4 unless convergence is achieved. Clusters are obtained as isolated subgraphs in the final similarity graph.

At the convergence point, all isolated subgraphs are complete with approximately equal similarity values within the group.

4. Results 4.1. Efficiency All efficiency tests were carried out on a PC with quad core Intel i7 processor running at 3.4 GHz frequency and 16GB RAM memory, using a single core of the microprocessor. We have employed the proposed algorithm to classify the whole set of 11,944 proteins in the SCOP95 database. The

97

hierarchical structure of the SCOP database was only used to select input data and verify the final partition quality. Partitioning only used the pairwise similarity data. In the following we will concentrate on the evaluation of the processing speed of the proposed accelerated approach, which has a similar behavior to the one of the sparse matrix version [31], but benchmark numbers are significantly improved. Fig. 2(a) presents the duration of each of the first fifty iterations in case of inflation rate fixed at r ¼1.5 and some selected values of the similarity threshold ε. Fig. 2(b) plots the duration of each iteration for threshold value set at ε ¼ 10 3 , and selected values of the inflation rate. As long as most nodes of the graph are connected together, namely in the first 5–10 iterations, the computational load is higher, but it drastically falls thereafter, being virtually constant and low from the 15th loop. Fig. 3 exhibits the details of the computational load of the first 10 iterations in various circumstances. In this order, Fig. 3 (a) shows the comparison of different inflation rates at fixed value of the similarity threshold ε ¼ 10 3 . Further on, in Fig. 3 (b)–(d) the inflation rate is fixed at r ¼1.3, r ¼1.5, and r ¼ 1.7, respectively, and the threshold value varies from 10 4 to 5 10 3 . Lower inflation rates imply heavier computational load in the first iterations, and the number of such slowly preformed loops is also larger. Small non-zero values are kept in the matrix for a greater number of iterations, if the similarity threshold is low. The computational load in steady state is hardly influenced by any of the parameters. Overall runtime represents the most important indicator of efficiency. Fig. 4(a) and (b) exhibits the total runtime of 50 iterations, in various circumstances. In Fig. 4(a), the total runtime is visualized against inflation rate, for various values of the similarity threshold. In Fig. 4(b), the total runtime is plotted against threshold ε, while inflation rate is set at various constant levels. Generally the total duration of the clustering process is higher for low similarity thresholds, because neglectable non-zero similarities get eliminated later. Similarly, a lower inflation rate needs more iterations than a higher one to tear the graph into isolated subgraphs. However, if the similarity threshold grows above a critical level (in case of SCOP95 this happens around ε ¼ 0:005), TRIBE-MCL gets a strange behavior as it eliminates a lot of non-zero similarities at once and may find itself with rows and columns full of zeros. This latter case should be avoided, as the partition is damaged. Besides parameters r and ε and the number of graph nodes n, there is one more thing that highly influences the overall runtime of the TRIBE-MCL algorithm: the density of the input matrix. In order to prove this, we have divided the SCOP95 data set into two subsets of almost equal number of nodes, one of them containing n1 ¼ 5763

Fig. 2. Duration of first 50 iterations: (a) for various threshold values at fixed inflation rate r ¼ 1.5 and (b) for various inflation rates at fixed threshold value ε ¼ 10 3 .

98


Fig. 3. Duration of the first ten iterations: (a) plotted against inflation rate, at ε ¼ 10 3 ; plotted against threshold ε, at inflation rate (b) r ¼1.3; (c) r ¼ 1.5; (d) r ¼ 1.7.

Fig. 4. Total duration of 50 iterations: (a) plotted against inflation rate, at various values of threshold ε ¼ 10 3 and (b) plotted against threshold ε, at various fixed inflation rates r.

proteins, all belonging to families of cardinality Z10, while the other containing the rest of n2 ¼ 6181 proteins. Although the number of nodes is higher in the second subgraph, the number of edges is 385,416 vs. 55,900. We have recorded the overall runtime of the clustering process of both subsets of proteins. Results are reported in Fig. 5 and Table 2. The initially more dense matrix needs up to 2 times more processing time, the ratio slightly depending on the inflation rate. For the more dense matrix, the most time consuming iteration is the second one, while for the other the third loop is the longest. Differences in the processing time reduce in late iterations, but they persist in the steady state loops as well. Having reduced the total runtime to less than 1 min, the proposed approach can be called quicker by two orders of magnitude, compared to previous efficient approaches [30,31].

4.2. Partition quality The main goal of protein clustering is the reveal hidden similarities among proteins. When evaluating the output partition quality, one can count the number of mixed clusters (those which contain proteins from two or more different families) and their cardinality. We have shown in the previous work [29] that the inflation rate is the main factor to influence the amount of mixed clusters. The approach proposed here computes approximately the same partitions as the conventional TRIBE-MCL, in a seriously more efficient way. Having drastically reduced the execution time enabled us to perform a very detailed evaluation of the amount and nature of the mixed clusters in the output partition using the whole set of 11,944 proteins of SCOP95.


99

Fig. 5. Duration of the first 50 iterations for two protein subsets, in case of fixed inflation rates and constant similarity threshold ε ¼ 10 3 . The subsets of large and small families consisted of n1 ¼ 5763 and n2 ¼ 6181 proteins, respectively.

Fig. 7. Relative duration of one SSM iteration with respect to an iteration of TribeMCL, plotted against matrix density observed at the beginning the loop.

Table 2 Benchmark times in milliseconds for the two protein subsets, for various fixed

hidden similarities among proteins of various known or unknown origins. Depending on the level and amount of mixtures we are looking for, the ideal inflation rate for the SCOP95 data set should be chosen in the range 1.4–1.7.

inflation rates and constant similarity threshold ε ¼ 10 3 . Data set

Inflation rate r Total runtime Longest loop Steady state loop

Small families 1.3 1.5 1.7

4629 4256 4244

210 174 163

82 81 80

Large families 1.3 1.5 1.7

8863 6994 6002

1380 1084 723

118 111 102

Fig. 6. The amount of proteins in mixed clusters, plotted against inflation rate, at constant threshold value ε ¼ 10 3 . Mixtures at the level of classes, folds, superfamilies and families are present up to r¼ 1.44, r ¼1.52, r ¼1.72, and r¼ 2.35, respectively.

Proteins in SCOP95 are organized in four-level hierarchy: classes, folds, superfamilies and families. There are 7 classes in the SCOP95 version involved in this study, each of them containing several folds, the majority of which contain several superfamilies that are structured into families of proteins. Clusters established by TRIBE-MCL are called pure if all members belong to the same family. Alternately, there can be mixed clusters, which contain proteins of various families. Further on, mixed clusters may be mixtures of any of the four levels: e.g. if a mixed cluster contains proteins of different folds of a single class, then the mixture is called of the level of folds. The partition quality of a clustering algorithm may be characterized by the number of mixtures it produces, and their distribution among the four levels. We have performed a detailed evaluation of the mixed clusters produced by the proposed method. Results are exhibited in Fig. 6. The amount of mixed proteins highly depends on the inflation rate r, and in a very modest way on the similarity threshold ε. As an effort to extract biologically useful information, it is recommendable to choose such an inflation rate, which produces mixtures and reveals

5. Discussion The proposed algorithm has three main parameters: inflation rate r, similarity threshold ε, and the number of repetition steps q performed during matrix symmetrization. Inflation and expansion are two operations working against each other: while inflation favors high values in the similarity matrix helping the elimination of lower ones, expansion tends to introduce new non-zero values. The only parameter to control the trade-off between these two operations is the inflation rate r. The main property of the final partition controlled by inflation rate is the granularity. The similarity threshold ε has a strong effect upon the processing time and a weak one on the partition granularity. Tiny values of this threshold enable the expansion to fill the similarity matrix with lots of non-zero values, leading to very long processing time. On the other hand, huge values can destroy the partition in a single iteration, eliminating too many relevant non-zero values at once. As we have shown in this study, there is a wide interval of threshold values which make the TRIBE-MCL suitable for efficient data processing. We recommend choosing ε ¼ m 1 , where m is the largest amount of input data we wish to allow to stay together in a single cluster. As in SCOP95, the largest protein family has 559 members, there is absolutely no need to make ε less than 10 3. With such approximations it is possible to keep the total runtime at a very reasonable level. Earlier MCL solutions (e.g. [39]) proposed a so-called pruning process to limit the number of non-zero values within each column. In our interpretation that process needed to sort the values of those columns, which contained more non-zeros after expansion than the predefined limit, in order to find the exact threshold. However, such operations may damage the efficiency of the clustering, without any guaranteed contribution to the partition quality. Matrix symmetrization serves the stability of TRIBE-MCL. Without this step there is no guarantee that the largest similarity values in columns or rows always remain on the matrix diagonal, and after several loops some diagonal values might be eliminated, which one can interpret as a certain protein is not in the same cluster with itself. We have found the algorithm with q ¼2 stable enough to avoid such cases. Even with this setting the symmetry is not guaranteed as it is possible to have sij =sji ratios around 1.3 in early iterations, but the final partition is stable, it would not be different if we used more symmetrization steps.

100


Fig. 8. Relative total runtime of the proposed SSM approach with respect to the duration of conventional Tribe-MCL, in case of 100% initial matrix density. Input similarity values were given by the Smith–Waterman algorithm [34].

All versions of the presented Markov clustering algorithm require the storage of two instances of the similarity matrix. For example, the naive and the matrix splitting approaches use two bi-dimensional arrays, while the sparse matrix representation approach can handle both matrices with memory saving sparse matrices. The proposed SSM approach is situated between these extreme cases, using one instance of normal and another instance of sparse matrix, the latter with a certain amount of padding. Thus the memory used by the proposed approach represents 60–70% of the storage used by the matrix splitting version, and about 5–6 times more than the storage required by the fully sparse version. Any personal computer (PC) having 2GB free memory is suitable to perform the proposed efficient algorithm on the whole SCOP95 database. The upper limit of processable data with currently available PCs is approximately 50k proteins. Fig. 9 compares the necessary storage space of SSM approach and conventional Tribe-MCL, in case of using dense input matrices. As long as the number of proteins in the data set does not exceed 216 ¼ 65; 536, the column index for each element of the sparse matrix can fit in a 2-byte integer; otherwise 4-byte integers are required for optimal memory addressing. Below 65,536 input proteins, the storage space required by SSM is less than the memory used by the conventional approach, as long as the matrix density never exceeds 40%. Above 65,536 input proteins this limit becomes 33%.

6. Conclusion

Fig. 9. The storage space required by the proposed SSM approach compared to the storage space of the conventional Tribe-MCL (horizontal line at 100%), plotted against the maximum density of the similarity matrix achieved during the whole process. These graphs were drawn based on theoretical considerations.

In order to establish the limitations of the proposed algorithm, a series of tests were carried out using a dense similarity matrix. Similarity values were computed using the Smith–Waterman sequence alignment method [34], which gave us an input matrix of 100% density. We have employed the proposed SSM method at different values of the inflation rate and threshold fixed at ε ¼ 10 4 (to make the elimination of low values difficult). Fig. 7 relates on the relative duration of individual iterations compared to the duration of a single loop of the conventional Tribe-MCL approach (the latter being represented by the horizontal dotted line at height 100%), plotted against the density of the similarity matrix observed at the starting point of the loop. The duration of individual loops of the SSM approach apparently grows somewhat quicker than linearly up to 50% matrix density, above which the characteristics become linear. SSM needs less time than the conventional Tribe-MCL as long as matrix density stays below 80%. This surprising score is certainly caused by the more efficient usage of cache memories in case of the SSM approach, as this algorithm does not need to reach distant data during matrix squaring (because at every moment it only needs to access a single row of the sparse matrix) or symmetrization. Fig. 8 indicates the total runtime of the SSM approach for various values of the inflation rate, compared to the total duration of the conventional Tribe-MCL, each performed on a 100%-density similarity matrix. The proposed method performs quicker in every tested case. In order to achieve the right amount of clusters from the Smith–Waterman matrix, we needed to set r A ½1:8; 2:0. The latter tests ran approximately 10 times quicker with SSM than with conventional Tribe-MCL.

In this paper we have proposed an efficient approach to the graph-based TRIBE-MCL clustering method, a useful tool in protein sequence classification. The proposed approach proved extremely efficient, especially in case of using sparse input data. With this novel formulation, the overall processing time of the whole SCOP95 database reduces to approximately 30 s, which represents a 100-times speed-up ratio compared to previous efficient formulations. This speed-up is achieved without relevant damage of the partition quality. The achieved efficiency enabled us to perform a detailed evaluation of the nature of the output partition and establish rules for parameter setting. In case of the SCOP95 database, the most useful biological information is extracted in case of inflation rates in range 1.4–1.7, while the most convenient similarity threshold should be chosen between 0.002 and 0.003. Further enhancement of the algorithm's efficiency may be achieved via parallel implementation in CPUs or GPUs.

Conflict of interest statement The authors declare no conflicts of interest.

Acknowledgments The work of S.M. Szilágyi was funded by the project “Transnational Network for Integrated Management of Postdoctoral Research in Communicating Sciences. Institutional building (postdoctoral school) and fellowships program (CommScie)” – POSDRU/ 89/1.5/S/63663, financed under the Sectoral Operational Program Human Resources Development 2007–2013. The work of L. Szilágyi was funded by the Hungarian National Research Funds (OTKA), Project no. PD103921, and the Hungarian Academy of Sciences under the János Bolyai Fellowship Program.


References [1] W.M. Fitch, Aspects of molecular evolution, Ann. Rev. Genet. 7 (1973) 343–380. [2] M.O. Dayhoff, The origin and evolution of protein superfamilies, Fed. Proc. 35 (1976) 2132–2138. [3] H. Hegyi, M. Gerstein, The relationship between protein structure and function: a comprehensive survey with application to the yeast genome, J. Mol. Biol. 288 (1999) 147–164. [4] A.J. Enright, S. van Dongen, C.A. Ouzounis, An efficient algorithm for largescale detection of protein families, Nucl. Acids Res. 30 (2002) 1575–1584. [5] A. Heger, L. Holm, Towards a covering set of protein family profiles, Prog. Biophys. Mol. Biol. 73 (2000) 321–337. [6] S. Tsoka, C.A. Ouzounis, Recent developments and future directions in computational genomics, FEBS Lett. 480 (2000) 42–48. [7] K. Blekas, D.I. Fotiadis, A. Likas, Motif-based protein sequence classification using neural networks, J. Comput. Biol. 12 (1) (2005) 64–82. [8] N.M. Zaki, S. Deris, R. Illias, Application of string kernels in protein sequence optimization, Appl. Bioinform. 4 (1) (2005) 45–52. [9] R. Saidi, M. Maddouri, E.M. Nguifo, Protein sequences classification means of feature extraction with substitution matrices, Int. J. Database Manag. Syst. 11 (175) (2010) 1–13. [10] S. Saha, R. Chaki, Application of data mining in protein sequence classification, Int. J. Database Management Systems 4 (5) (2012) 103–118. [11] S. Blasiak, H. Rangwala, K.B. Laskey, A family of feed-forward models for protein sequence classification, in: Proceedings of the European Conference on Machine Learning and Principles of Knowledge Discovery and Data Mining, Bristol, UK, 2012, pp. 419–434. [12] N.M. Zaki, S. Deris, R. Illias, Protein sequences classification based on string weighting scheme, Int. J. Comput. Internet Manag. 13 (1) (2005) 50–60. [13] Y. Yang, B.L. Lu, W.Y. Yang, Classification of protein sequences based on word segmentation methods, in: Proceedings of the 6th Asia Pacific Bioinformatics Conference, Kyoto, Japan, 2008, pp. 177–186. [14] C. Caragea, A. Silvescu, P. Mitra, Protein sequences classification using feature hashing, Prot. Sci. 10 (Suppl. 1—S14) (2012) 1–8. [15] R. Busa-Fekete, A. Kocsor, S. Pongor, Tree-based algorithms for protein classification, in: A. Kelemen, A. Abraham, Y. Chen (Eds.), Computational Intelligence in Bioinformatics, Studies in Computational Intelligence, vol. 94, 2008, pp. 165–182. [16] R.F. Doolittle, The multiplicity of domains in proteins, Annu. Rev. Biochem. 64 (1995) 287–314. [17] T. Nepusz, R. Sasidharan, A. Paccanaro, SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale, BMC Bioinform. 11 (120) (2010) 1–13. [18] V. Miele, S. Penel, L. Duret, Ultra-fast sequence clustering from similarity networks with SiLiX, BMC Bioinform. 12 (116) (2011) 1–9. [19] W.Z. Li, L.M. Fu, B.F. Niu, S.T. Wu, J. Wooley, Ultrafast clustering algorithms for metagenomic sequence analysis, Brief. Bioinform. 13 (2012) 656–668. [20] G. Santini, H. Soldano, J. Pothier, Automatic classification of protein structures relying on similarities between alignments, BMC Bioinform. 13 (233) (2012) 1–13.

101

[21] C.J. Wu, A. Kalyanaraman, W.R. Cannon, pGraph: efficient parallel construction of large-scale protein sequence homology graphs, IEEE Trans. Parallel Distrib. Syst. 23 (2012) 1923–1933. [22] M. Hauser, C.E. Mayer, J. Söding, kClust: fast and sensitive clustering of large protein sequence databases, BMC Bioinform. 14 (248) (2013) 1–12. [23] W. Verwoerd, A new computational method to split large biochemical networks into coherent subnets, BMC Syst. Biol. 5 (25) (2011) 1–22. [24] P. Jiang, M. Singh, SPICi: a fast clustering algorithm for large biological networks, Bioinformatics 26 (2010) 1105–1111. [25] T. Theodosiou, N. Darzentas, L. Angelis, C.A. Ouzounis, PuReD-MCL: a graphbased PubMed document clustering methodology, Bioinformatics 24 (2008) 1935–1941. [26] S.R. Eddy, Profile hidden Markov models, Bioinformatics 14 (1998) 755–763. [27] S.F. Altschul, T.L. Madden, A.A. Schaffen, J. Zhang, Z. Zhang, W. Miller, D.J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search program, Nucl. Acids Res. 25 (1997) 3389–3402. [28] Structural Classification of Proteins Database, Available at: 〈http://scop. mrc-lmb.cam.ac.uk/scop〉. [29] L. Szilágyi, L. Medvés, S.M. Szilágyi, A modified Markov clustering approach to unsupervised classification of protein sequences, Neurocomputing 73 (13–15) (2010) 2332–2345. [30] L. Szilágyi, S.M. Szilágyi, Efficient Markov clustering algorithm for protein sequence grouping, in: Proceedings of 35th Annual International Conference of IEEE EMBS, Osaka, Japan, 2013, pp. 639–642. [31] L. Szilágyi, S.M. Szilágyi, Fast implementations of Markov clustering for protein sequence grouping, in: International Conference on Modeling Decisions in Artificial Intelligence, Barcelona, Spain, 2013, Lecture Notes in Artificial Intelligence, vol. 8234, pp. 214–225. [32] L. Lo Conte, B. Ailey, T.J. Hubbard, S.E. Brenner, A.G. Murzin, C. Chothia, SCOP: a structural classification of protein database, Nucl. Acids Res. 28 (2000) 257–259. [33] A. Andreeva, D. Howorth, J.M. Chadonia, S.E. Brenner, T.J.P. Hubbard, C. Chothia, A.G. Murzin, Data growth and its impact on the SCOP database: new developments, Nucl. Acids Res. 36 (2008) D419–D425. [34] T.F. Smith, M.S. Waterman, Identification of common molecular subsequences, J. Mol. Biol. 147 (1981) 195–197. [35] S.B. Needleman, C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol. 48 (1970) 443–453. [36] Z. Gáspári, K. Vlahovicek, S. Pongor, Efficient recognition of folds in protein 3D structures by the improved PRIDE algorithm, Bioinformatics 21 (2005) 3322–3323. [37] Protein Classification Benchmark Collection, Available at: 〈http://net.icgeb.org/ benchmark〉. [38] P. Sonego, M. Pacurar, S. Dhir, A. Kertész-Farkas, A. Kocsor, Z. Gáspári, J.A.M. Leunissen, S. Pongor, A protein classification benchmark collection for machine learning, Nucl. Acids Res. 35 (2007) D232–D236. [39] S. van Dongen, Graph clustering by flow simulation (Ph.D. thesis), Univ. Utrecht, 2000.

Fast Nonparametric Density-Based Clustering of Large Data Sets Using a Stochastic Approximation Mean-Shift Algorithm.

Fast clustering algorithm for large ECG data sets based on CS theory in combination with PCA and K-NN methods.

A clustering algorithm for multivariate longitudinal data.

A fast clustering algorithm for data with a few labeled instances.

Hierarchical link clustering algorithm in networks.

A Fast Cluster Motif Finding Algorithm for ChIP-Seq Data Sets.

FctClus: A Fast Clustering Algorithm for Heterogeneous Information Networks.

Hierarchical trie packet classification algorithm based on expectation-maximization clustering.

HPC-CLUST: distributed hierarchical clustering for large sets of nucleotide sequences.

A geometric clustering algorithm with applications to structural data.

SPARSE INTEGRATIVE CLUSTERING OF MULTIPLE OMICS DATA SETS.

KSC-N: Clustering of Hierarchical Time Profile Data.

Fast algorithm for relaxation processes in big-data systems.

A hierarchical model for clustering m(6)A methylation peaks in MeRIP-seq data.

Hierarchical ordering with partial pairwise hierarchical relationships on the macaque brain data sets.

Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0.

An improved fuzzy c-means clustering algorithm based on shadowed sets and PSO.

ClusTrack: feature extraction and similarity measures for clustering of genome-wide data sets.

Selection of representative protein data sets.

A novel artificial bee colony based clustering algorithm for categorical data.

A hybrid algorithm for clustering of time series data based on affinity search technique.

pyGCluster, a novel hierarchical clustering approach.

A fast T1 algorithm.

PMS6: a fast algorithm for motif discovery.