IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 7, JULY 2014


Scalable Similarity Search With Topology Preserving Hashing

Lei Zhang, Yongdong Zhang, Senior Member, IEEE, Xiaoguang Gu, Member, IEEE, Jinhui Tang, Member, IEEE, and Qi Tian, Senior Member, IEEE

Abstract— Hashing-based similarity search techniques are becoming increasingly popular for large data sets. To capture meaningful neighbors, the topology of a data set, which represents the neighborhood relationships between its subregions and the relative proximities between the neighbors of each subregion, e.g., the relative neighborhood ranking of each subregion, should be exploited. However, most existing hashing methods are developed to preserve neighborhood relationships while ignoring the relative neighborhood proximities. Moreover, most hashing methods fail to provide a good result ranking, since there are often many results sharing the same Hamming distance to a query. In this paper, we propose a novel hashing method to solve these two issues jointly. The proposed method is referred to as topology preserving hashing (TPH). TPH is distinct from prior works in that it also preserves the neighborhood ranking. Based on this framework, we present three different TPH methods: linear unsupervised TPH, semi-supervised TPH, and kernelized TPH. In particular, our unsupervised TPH is capable of mining the semantic relationships between unlabeled data without supervised information. Extensive experiments on four large data sets demonstrate the superior performance of the proposed methods over several state-of-the-art unsupervised and semi-supervised hashing techniques.

Index Terms— Similarity search, approximate nearest neighbor search, binary hashing, topology preserving hashing.

I. INTRODUCTION


Manuscript received October 7, 2013; revised February 24, 2014 and May 6, 2014; accepted May 7, 2014. Date of publication May 21, 2014; date of current version June 9, 2014. This work was supported in part by the National High Technology Research and Development Program of China under Grant 2014AA015202, in part by the National Natural Science Foundation of China under Grant 61303151, Grant 61273247, and Grant 61271428, and in part by the National Key Technology Research and Development Program of China under Grant 2012BAH39B02. The work of Q. Tian was supported in part by the National Science Foundation of China under Grant 61128007, in part by the Army Research Office under Grant W911NF-12-1-0057, in part by the Faculty Research Awards through the NEC Laboratories America Inc., Princeton, NJ, USA, and in part by the 2012 UTSA START-R Research Award, National Nature Science Foundation of China. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Carlo S. Regazzoni. L. Zhang, Y. Zhang, and X. Gu are with the Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China (e-mail: [email protected]; [email protected]; [email protected]). J. Tang is with the Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: [email protected]). Q. Tian is with the University of Texas at San Antonio, Texas, TX 78249 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2014.2326010

Fig. 1. The essence of our proposed TPH. (a) A data point p and its neighbors. (b) In neighborhood-preserving hashing (h1), the neighborhood relationship is preserved, but the ranking of neighbors is lost. (c) In topology-preserving hashing (h2), besides the neighborhood relationship, the ranking is also preserved. The hash functions are depicted as linear projections for the sake of illustration.

WITH the explosive proliferation of various kinds of data, especially the ever-increasing availability of visual data in a variety of interesting domains, fast indexing and content-based search for large datasets have attracted significant attention. The most basic but essential problem is nearest neighbor (NN) search. As traditional linear NN search becomes computationally prohibitive when the dataset is large or the individual similarities are too expensive to compute, a number of efficient search techniques have been developed, such as the KD-Tree [1] and Locality Sensitive Hashing (LSH) [2]. Recently, binary hashing [3]–[11] has become increasingly popular for efficient approximate nearest neighbor (ANN) search due to its good search and storage efficiency. Given a dataset, hashing methods map each data point to a binary code and perform bit-wise operations to search for neighbors. The neighborhood structure should be preserved in order to capture meaningful neighbors. In many cases, a real-world dataset lies on a low-dimensional manifold embedded in a high-dimensional space [5], thus the manifold structure should be exploited when hashing. To characterize a manifold, its topology, which represents the neighborhood relationships between subregions and the relative proximities between the neighbors of subregions, is essential [12]. A lot of hashing methods [4], [5], [8]–[11], [13] have been developed to preserve the neighborhood relationships (we call these neighborhood-preserving) by mapping similar points to close codes. In these methods, the neighborhood relationships are incorporated into the learning process, while the relative proximities between the neighbors, e.g., the rankings, are not. As a result, these methods cannot preserve the data topology well. As illustrated in Fig. 1(b), for a data point $p$ and its neighbors $q_1, q_2, q_3, q_4$, the Hamming embeddings $\{h_1(q_i)\}_{i=1}^{4}$ are still neighbors of $h_1(p)$. However, as the neighborhood ranking is ignored, the ranking between the neighbors of $p$ is not preserved, e.g., the relative ranking between $q_1$ and $q_2$ is changed.


To preserve the topology of a dataset, a straightforward way is to preserve the original distances. Many hashing methods [2], [14]–[17] have been developed for this goal (we call these distance-preserving). In these methods, the Hamming distance between binary codes is used as a reconstruction of the distance between data points in the original data (kernel) space. Since the Hamming distance is discrete and bounded by the code length, the mapping from original distances to Hamming distances is many-to-one, which indicates the non-optimality of the reconstructions. Besides, to some extent, the exact measure of distance depends not only on the manifold itself, but also on a given embedding of the manifold. However, most distance functions make no distinction between the manifold and its surrounding empty space. Therefore, even the original distances may not be effective in revealing the data topology, not to mention the reconstructed ones. As a result, these distance-preserving methods cannot preserve the data topology well either.

Another limitation of most hashing methods is that they fail to provide a good result ranking. As most hashing methods simply adopt the Hamming distance as the similarity metric in the search process, there can be many results sharing the same distance to a query, posing a critical issue for similarity search, e.g., kNN search, where ranking is important. Moreover, most hashing methods ignore ranking information when learning hash codes, which makes this limitation more severe. Weighted Hamming distances [6], [18]–[20] have been developed to alleviate this limitation; however, they are all post-processing algorithms, and too much ranking information is already lost after embedding. Apparently, if the ranking information is encoded in the codes, this limitation can be better alleviated.

In this paper, we propose a Topology Preserving Hashing (TPH) method to solve the above two issues jointly. As stated in [12], a manifold can be entirely characterized by giving the relative proximities between its subregions. Many local geometric properties can be used to reveal the relative proximities, e.g., the local angles used in locally linear embedding (LLE) [21] and local distances. In this work, we use distances to reveal the relative proximities, which turn out to be the ranking of the neighbors of each subregion. Therefore, the essence of our Topology Preserving Hashing is to preserve not only the neighborhood relationships but also the neighborhood rankings. As shown in Fig. 1(c), TPH ensures that $h_2(q_1)$ and $h_2(q_2)$ are neighbors of $h_2(p)$, and that $h_2(q_1)$ is still a nearer neighbor of $h_2(p)$ in Hamming space. Our method can be considered as a tight version of neighborhood-preserving and a loose version of distance-preserving.

The main contributions of this paper are briefly outlined as follows:

1. We propose a novel hashing method to learn the Hamming embeddings of a dataset, such that not only the neighborhood relationship, but also the neighborhood ranking of each data point, is preserved after embedding.

2. The semantic label information can be easily leveraged in the learning process of our method, extending it to a semi-supervised method that captures semantic neighbors.

3. Experimental results demonstrate the superiority of our method as compared with other state-of-the-art unsupervised and semi-supervised methods. Moreover, the limitation of most hashing methods, namely the lack of a good result ranking, is also well alleviated.


This paper is an extension of our previous conference publication [22]. Beyond the conference paper, we include a more sophisticated formulation of the learning process of TPH and two novel extensions: semi-supervised TPH (STPH) and kernelized TPH (KTPH). We also conduct more experiments to better validate the effectiveness of our method and give more extensive discussions on the learning process.

The rest of this paper is organized as follows. Some existing similarity search methods are discussed in Section II. In Section III, we present the mathematical formulation of our TPH and propose an unsupervised linear hash function learning algorithm. Section IV discusses the learning process and Section V extends the unsupervised TPH by leveraging semantic information and kernel embedding. Section VI describes our experiments. Finally, Section VII concludes this paper.

II. RELATED WORK

High-dimensional similarity search is a fundamental problem in many content-based search systems and also widely exists in many related application areas, such as machine learning, computer vision, information retrieval and data mining. One classical family of methods to address this problem is tree-based indexes, such as the KD-Tree [1]. These methods usually partition the data space recursively and perform exact similarity search in a low-dimensional feature space. However, they cannot work well for high-dimensional data, since their performance degrades to that of a linear scan as the dimensionality increases [23]. Therefore, tree-based indexes are not preferable for high-dimensional search problems.

Another kind of ANN search algorithm is based on vector quantization, such as k-means LSH [24] and Product Quantization (PQ) [25], [26]. The key to these methods is compositionality. In PQ, by dividing each data vector into several subspaces and expressing data in terms of recurring parts, the representational capacity grows exponentially with the number of subspaces. In these methods, each data point is represented by a reconstructed cluster center and the search is performed in the original data space. As a result, the search process is time-consuming even with an inverted file index [25].

Recently, hashing-based methods have been widely used for similarity search and related applications [27]–[30], as they allow constant-time search [31]. A lot of hashing methods have been proposed, and in general they can be roughly divided into two main categories: data-independent methods and data-dependent methods.

One representative data-independent method is Locality Sensitive Hashing (LSH) [2], [17], [32]. In [2] and [32], the hash functions are simple random projections, which are independent of the dataset. Anti-Sparse Coding (ASC) [33] improves LSH based on spread representations of a dataset, but it requires the number of bits to be larger than the data dimensionality. Super-bit LSH (SBLSH) [17] orthogonalizes the random projections and guarantees that an unbiased estimate


of the angle similarity is obtained. Theoretically, it is guaranteed that the original distances are asymptotically preserved in the Hamming space with increasing code length; hence LSH-related methods usually require long codes to achieve good precision. However, long codes result in low recall, since the collision probability of similar points being mapped to close codes decreases exponentially as the code length increases. As a result, LSH-related methods usually construct multiple tables to ensure a reasonable probability that a query will collide with its near neighbors in at least one table, which leads to long query times and increases the memory occupation. Another representative method is Shift-Invariant Kernel Hashing (SIKH) [15]. It is a distribution-free method based on random feature mapping [34] for shift-invariant kernels, in which the expected Hamming distance is related to the distance in a kernel space. Similar to LSH, SIKH also needs relatively long codes to ensure good performance [10]. These data-independent methods do not explore the dataset structure for hash function learning and usually need long hash codes or many hash tables to achieve satisfactory performance; as a result, in practice, data-independent methods are often less effective than data-dependent methods.

Recently, many data-dependent methods, which focus on learning hash functions from a dataset, have been developed to learn more compact codes. In PCA-Hashing (PCAH) [35], the eigenvectors corresponding to the largest eigenvalues of the dataset covariance matrix are used to form the projection matrix for hashing. As an extension of PCAH, Iterative Quantization (ITQ) [10] learns an orthogonal rotation matrix to refine the initial PCA projection matrix so as to minimize the quantization error of mapping the data from the original data space to the Hamming space. Isotropic Hashing (IsoH) [36] learns projection functions that produce projected dimensions with isotropic variances. The spectral graph partitioning strategy has been employed to develop new kinds of hashing schemes, such as Spectral Hashing (SPH) [4], [37]. SPH uses the simple analytical eigenfunction solution of 1-D Laplacians as the hash function. Anchor Graph Hashing (AGH) [5] applies a similar formulation to SPH for hash code generation, while its neighborhood graph is constructed in a novel way such that it can be applied to large-scale datasets. To maximize the independence between different hash functions, Random Maximum Margin Hashing (RMMH) [7] constructs hash functions by using large margin classifiers with arbitrarily sampled data points that are randomly separated into two sets. In [16], Kernelized Locality Sensitive Hashing (KLSH) is proposed to address the limitation that the original LSH method cannot be applied to high-dimensional kernelized data when the underlying kernel embedding is unknown. In KLSH, the Hamming distance is an approximation of the distance between data points in a kernel space.

As these unsupervised hashing methods are not capable of capturing semantic neighbors, many supervised methods have been proposed [3], [8], [9], [14]. Semantic Hashing [3] adopts a deep belief network based on restricted Boltzmann machines to learn hash functions that map similar points to close codes. In [14], Binary Reconstructive Embedding (BRE) is proposed to minimize the reconstruction error between the original


feature distances and the Hamming distances. In BRE, the normalized Hamming distance is used as a reconstruction of the original distance. By replacing the original distances with semantic similarities, BRE can be extended to learn the hash functions in a supervised manner. In [9], Semi-Supervised Hashing (SSH) uses both labeled and unlabeled data for hash function learning. As an extension of SSH, Sequential Projection Learning Hashing (SPLH) [38] learns the hash functions in a sequential manner such that each new function tends to minimize the errors made by the previous one. In [39], Semi-Supervised Discriminant Hashing (SSDH) is proposed to learn hash functions where the labeled data are used to maximize the separability between codes associated with different classes while unlabeled data are used for regularization. In Kernel-based Supervised Hashing (KSH) [8], by leveraging the label information of a dataset and using the equivalence between optimizing the code inner products and the Hamming distances, the dataset is mapped to compact binary codes whose Hamming distances are minimized on similar pairs and simultaneously maximized on dissimilar pairs.

The paradigm of hashing generally consists of two steps: dimension reduction and 0/1 quantization. To capture meaningful neighbors, the neighborhood structure of a given dataset should be preserved. Therefore, the dimension reduction step is a key factor and significant research effort is devoted to this topic. Dimension reduction is a kind of manifold learning problem. The neighborhood relationships between subregions and the relative proximities between the neighbors of each subregion are both essential [12] for effective dimension reduction. Many hashing methods have been developed for neighborhood preserving [5], [9], [10] or distance preserving [14], [16], [17]. The former ignore the neighborhood rankings, and the latter turn out to "support and bolt a manifold with rigid steel beams" [12], while in many cases the optimal embedding of a manifold needs some flexibility: some subregions should be locally stretched or shrunk to embed them into a low-dimensional space where the topology can be well preserved. Therefore, these hashing methods cannot preserve the dataset topology well.

In the manifold learning literature, it is known that a manifold can be entirely characterized by giving the relative or comparative proximities [12], e.g., a first region is close to a second one but far from a third one. Comparative information between distances, like inequalities or rankings, suffices to characterize a manifold for any embedding [12]. Moreover, for many similarity search problems, the relative rankings of the results are more important than their actual similarities to a query. Based on the above discussion, we can conclude that more effective Hamming embeddings can be learned by incorporating the dataset topology into the learning process. Furthermore, if the local topology is well preserved in the Hamming space, the ambiguity caused by ranking with the Hamming distance can be better alleviated.

III. TOPOLOGY PRESERVING HASHING

This section details the formulation of our Topology Preserving Hashing (TPH) method. We first present the formulation


of TPH, and then we propose an unsupervised linear hash function learning algorithm for out-of-sample extension.

A. Formulation of Topology Preserving

We begin with the definition of local topology preserving [40]: for data points $x_1, x_2, x_3, x_4$ and their embeddings $\{y_i=\phi(x_i)\}_{i=1}^{4}$, the embedding $\phi: x\to y$ is called local topology preserving if the following condition holds:

$$\text{if } d(x_1,x_2)\le d(x_3,x_4), \text{ then } \tilde{d}(y_1,y_2)\le\tilde{d}(y_3,y_4)$$

where $d$ and $\tilde{d}$ are the distance metrics in the original data space and the embedded space, respectively. In this work, we relax this definition such that only the local neighborhood topology is preserved. This relaxation gives us more flexibility to learn optimal Hamming embeddings. We call this neighborhood topology preserving:

Definition 3.1: For a data point $x$ and its neighbors $x_1, x_2$ in a metric space $(M, d_M)$ ($d_M$ is the distance metric in $M$) and their Hamming embeddings $y, y_1, y_2\in\{-1,1\}^m$ learned by a hash function $H(x)$, $H(x)$ is called neighborhood topology preserving if the following condition holds:

$$\text{if } d_M(x,x_1)\le d_M(x,x_2), \text{ then } d_H(y,y_1)\le d_H(y,y_2)$$

where $d_H(y_1,y_2)$ is the Hamming distance between $y_1$ and $y_2$. This definition says that for a data point $x$ and its neighbors $x_1, x_2$, if $x_1$ is a nearer neighbor of $x$, then $H(x_1)$ is also a nearer neighbor of $H(x)$, which means the local topology of the neighborhood of $x$ in the original data space is well preserved in the Hamming space by $H(x)$.

Given a dataset consisting of $n$ data points $\mathcal{X}=\{x_i\}_{i=1}^{n}\subset\mathbb{R}^d$ in a metric space $(M, d_M)$, our goal is to learn a set of $m$-bit Hamming embeddings $\mathcal{Y}=\{y_i\}_{i=1}^{n}\subset\{-1,1\}^m$ of $\mathcal{X}$ in a neighborhood topology preserving manner, which requires neighborhood ranking preserving as well as neighborhood relationship preserving. In the following, we call the matrix whose columns are the elements of a dataset the matrix form of the dataset. The matrix form of $\mathcal{X}$ is denoted $X\in\mathbb{R}^{d\times n}$ and the matrix form of $\mathcal{Y}$ is denoted $Y\in\{-1,1\}^{m\times n}$. The Hamming code of each $x_i$ is denoted $y_i=(y_1^{(i)},y_2^{(i)},\ldots,y_m^{(i)})^T$.

1) Neighborhood Ranking Preserving: Learning Hamming embeddings of a dataset based on the above definition only considers the data topology locally, which may cause severe overfitting [27]. As a result, the learning process has to exploit the global structure as well. To fulfill this, for each data point $x_i\in\mathcal{X}$, we simply add several data points which are not neighbors into its original neighbor set.1 We call this new "neighbor set" the topo-neighbor set and denote it by $N_i\subset\mathcal{X}$. The union of each $x_i$ and its topo-neighbors is called the topo-training set and denoted $\tilde{\mathcal{X}}=\bigcup_{i=1}^{n}(\{x_i\}\cup N_i)$; its matrix form is denoted $\tilde{X}$ (since $N_i\subset\mathcal{X}$, we have $\tilde{\mathcal{X}}=\mathcal{X}$). We define the following objective function measuring the empirical accuracy of neighborhood ranking preserving:

$$R(Y)=\frac{1}{2}\sum_{x_i}\sum_{x_{j_1},x_{j_2}\in N_i}\mathrm{sgn}\big(d_M(x_i,x_{j_1})-d_M(x_i,x_{j_2})\big)\,\mathrm{sgn}\big(d_H(y_i,y_{j_1})-d_H(y_i,y_{j_2})\big) \qquad (1)$$

1 In this work, the neighbor set of a data point is its kNN set.
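The measure in Eq. (1) can be evaluated empirically for any set of codes. The sketch below is a minimal NumPy illustration of that idea, not the authors' code: it averages the sign agreements between original-space and Hamming-space neighbor rankings, and the helper name ranking_agreement as well as the use of plain Euclidean distances in the original space are our own assumptions.

```python
import numpy as np

def ranking_agreement(X, Y, topo_neighbors):
    """Empirical neighborhood-ranking preservation in the spirit of Eq. (1).

    X : (n, d) original data points.
    Y : (n, m) binary codes in {-1, +1}.
    topo_neighbors : list of index arrays, the topo-neighbor set of each x_i.
    Returns an average sign agreement in [-1, 1]; +1 means every pair of
    topo-neighbors keeps its relative ranking in the Hamming space.
    """
    m = Y.shape[1]
    total, count = 0.0, 0
    for i, nbrs in enumerate(topo_neighbors):
        d = np.linalg.norm(X[nbrs] - X[i], axis=1)   # original distances d_ij
        h = (m - Y[nbrs] @ Y[i]) / 2.0               # Hamming distances h_ij
        dd = np.sign(d[:, None] - d[None, :])        # sgn(d_ij1 - d_ij2)
        hh = np.sign(h[:, None] - h[None, :])        # sgn(h_ij1 - h_ij2)
        total += float((dd * hh).sum())
        count += dd.size
    return total / max(count, 1)
```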

For two neighbors $x_s, x_t$ of $x_i$, the summed term in Eq. (1) is $+1$ if the relative ranking between $y_s$ and $y_t$ in the Hamming space is consistent with that between $x_s$ and $x_t$ in the original space, and $-1$ otherwise. Hence, $R(Y)$ is a reasonable measure. To make $R(Y)$ easy to optimize, we replace the signs of the summed terms with their signed magnitudes. This relaxation is quite intuitive in the sense that, for a data point $x$ and its neighbors $x_1, x_2$, we not only desire $d_M(x,x_1)-d_M(x,x_2)$ and $d_H(y,y_1)-d_H(y,y_2)$ to have the same sign in order to preserve the local topology, but also that the larger $d_M(x,x_1)$ is compared with $d_M(x,x_2)$, the larger $d_H(y,y_1)$ should be compared with $d_H(y,y_2)$. Meanwhile, if $d_M(x,x_1)-d_M(x,x_2)$ and $d_H(y,y_1)-d_H(y,y_2)$ have different signs, we desire $|d_H(y,y_1)-d_H(y,y_2)|$ to be small in order to alleviate this local topology deviation. For brevity, we use $d_{ij}$ to denote $d_M(x_i,x_j)$ and $h_{ij}$ to denote $d_H(y_i,y_j)$. With this relaxation, $R(Y)$ (1) becomes:

$$R(Y)=\frac{1}{2}\sum_{i}\sum_{j_1,j_2}(d_{ij_1}-d_{ij_2})(h_{ij_1}-h_{ij_2}) \qquad (2)$$

For each $x_i$, we define a neighborhood distance difference matrix $D_i\in\mathbb{R}^{n_i\times n_i}$, $n_i=|N_i|$, where $D_i(\tilde{j}_1,\tilde{j}_2)=d_{ij_1}-d_{ij_2}$. Here the subscripts $\tilde{j}_1,\tilde{j}_2\in[1,n_i]$ are the indices of $x_{j_1},x_{j_2}$ in the topo-neighbor set $N_i$ of $x_i$. We can see that $D_i$ is an antisymmetric matrix ($D_i^T=-D_i$). Meanwhile, the Hamming distance $h_{ij}$ can be expressed as:

$$h_{ij}=\frac{m-\sum_{k=1}^{m} y_k^{(i)} y_k^{(j)}}{2} \qquad (3)$$
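Eq. (3) is the standard identity that links the Hamming distance of $\pm 1$ codes to their inner product; the following small check (our own illustration, with made-up codes) makes it explicit.

```python
import numpy as np

yi = np.array([ 1, -1,  1,  1, -1,  1, -1, -1])   # two 8-bit codes in {-1, +1}
yj = np.array([ 1,  1,  1, -1, -1, -1, -1,  1])

hamming = int(np.sum(yi != yj))                    # direct count of differing bits
via_inner_product = (len(yi) - int(yi @ yj)) // 2  # Eq. (3)
assert hamming == via_inner_product == 4
```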

By substituting Eq. (3) into Eq. (2) and omitting the constant coefficient, we can rewrite $R(Y)$ (2) as:

$$R(Y)=\sum_{i}\sum_{j_1 j_2} D_i(\tilde{j}_1,\tilde{j}_2)\Big(\sum_{k} y_k^{(i)}y_k^{(j_2)}-\sum_{k} y_k^{(i)}y_k^{(j_1)}\Big)=\sum_{k}\sum_{i} y_k^{(i)}\sum_{j_2}\Big(\sum_{j_1} D_i(\tilde{j}_1,\tilde{j}_2)\Big)y_k^{(j_2)}-\sum_{k}\sum_{i} y_k^{(i)}\sum_{j_1}\Big(\sum_{j_2} D_i(\tilde{j}_1,\tilde{j}_2)\Big)y_k^{(j_1)} \qquad (4)$$

Since $D_i$ is an antisymmetric matrix, we have $\sum_{\tilde{j}_1} D_i(\tilde{j}_1,\tilde{j}_2)=-\sum_{\tilde{j}_2} D_i(\tilde{j}_1,\tilde{j}_2)$. Denoting $\sum_{\tilde{j}_1} D_i(\tilde{j}_1,\tilde{j})$ as $c_{ij}$, $R(Y)$ (4) can be written more compactly as:

$$R(Y)=\sum_{k}\sum_{i}\Big(y_k^{(i)}\sum_{j_2} c_{ij_2}\,y_k^{(j_2)}-y_k^{(i)}\sum_{j_1}(-c_{ij_1})\,y_k^{(j_1)}\Big)=2\sum_{k}\sum_{i}\sum_{j} y_k^{(i)} c_{ij}\, y_k^{(j)} \qquad (5)$$

Objective function (5) can be expressed in a compact matrix form by defining a weight matrix $\Gamma_r\in\mathbb{R}^{n\times n}$ that incorporates the pairwise weights from $\tilde{\mathcal{X}}$ ($=\mathcal{X}$) as follows: we first calculate $c_{ij}$ for each $x_i$,

$$c_{ij}=\begin{cases}\sum_{x_s\in N_i}(d_{is}-d_{ij}) & : x_j\in N_i\\ 0 & : \text{otherwise}\end{cases} \qquad (6)$$


then we normalize each $c_{ij}$ to $[-1,1]$:

$$c_{ij}=2\cdot\frac{c_{ij}-\min\nolimits_i}{\max\nolimits_i-\min\nolimits_i}-1,\qquad \min\nolimits_i=\min_j c_{ij},\quad \max\nolimits_i=\max_j c_{ij} \qquad (7)$$

$\Gamma_r$ is defined as $\Gamma_r(i,j)=c_{ij}$. The reason for this normalization will be stated in Section IV. After omitting the constant coefficient, the objective function (5) is represented as:

$$R(Y)=\mathrm{Tr}\{Y\Gamma_r Y^T\} \qquad (8)$$
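To make Eqs. (6)–(8) concrete, the sketch below builds the Rank-Weighting matrix $\Gamma_r$ with NumPy. It is a minimal illustration under our own assumptions (plain Euclidean distances, row-wise normalization over each point's topo-neighbors only, and the hypothetical name build_rank_weight_matrix), not the authors' released code.

```python
import numpy as np

def build_rank_weight_matrix(X, topo_neighbors):
    """Rank-weighting matrix Gamma_r of Eqs. (6)-(7).

    X              : (n, d) array of topo-training points.
    topo_neighbors : list of index arrays; topo_neighbors[i] holds the
                     indices of the topo-neighbor set N_i of x_i.
    """
    n = X.shape[0]
    Gamma_r = np.zeros((n, n))
    for i, nbrs in enumerate(topo_neighbors):
        d = np.linalg.norm(X[nbrs] - X[i], axis=1)   # d_ij for x_j in N_i
        c = d.sum() - len(nbrs) * d                  # Eq. (6): sum_s (d_is - d_ij)
        # Eq. (7): normalize row i (over the topo-neighbors) to [-1, 1]
        c = 2.0 * (c - c.min()) / (c.max() - c.min() + 1e-12) - 1.0
        Gamma_r[i, nbrs] = c
    return Gamma_r
```

With $R(Y)=\mathrm{Tr}\{Y\Gamma_r Y^T\}$, a candidate code matrix can then be scored directly from this $\Gamma_r$.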

In the neighborhood ranking preserving objective function $R(Y)$ (5), (8), each $c_{ij}$ weights the summed term $\sum_k y_k^{(i)}y_k^{(j)}$; we call it the rank-weight, as it indicates the relative ranking between the neighbors of data point $x_i$, i.e., whether $x_i$ is closer to one neighbor than to another (a detailed analysis will be given in Section IV). The matrix $\Gamma_r$ is called the Rank-Weighting matrix in the following.

2) Neighborhood Relationship Preserving: Maximizing $R(Y)$ preserves the neighborhood ranking of each data point. To preserve the data topology, we also have to preserve the neighborhood relationships, as done in [4] and [5]. Suppose $w_{ij}$ is the similarity between $x_i$ and $x_j$ in the original data space ($w_{ij}=\exp\{-d_{ij}^2/\sigma_s^2\}$ [4], where $\sigma_s$ is the average pairwise distance). We call $w_{ij}$ a simi-weight and also normalize it to $[-1,1]$ by $w_{ij}=2w_{ij}-1$. After defining a Simi-Weighting matrix $\Gamma_s\in\mathbb{R}^{n\times n}$, $\Gamma_s(i,j)=w_{ij}$, learning $Y$ for neighborhood relationship preserving is formulated as [4]:

$$Y=\arg\min_{Y\in\{-1,1\}^{m\times n}} S(Y),\qquad S(Y)=\sum_{ij} w_{ij}\,h_{ij} \qquad (9)$$

Again, by substituting Eq. (3) into Eq. (9) and omitting the constant coefficients and terms, $S(Y)$ is rewritten as:

$$S(Y)=-\mathrm{Tr}\{Y\Gamma_s Y^T\} \qquad (10)$$

3) Overall Optimization Objective: In summary, the optimal Hamming embedding $Y$ should not only maximize $R(Y)$ (8), but also minimize $S(Y)$ (10). Therefore, given a dataset $\mathcal{X}$, we formulate the learning of the topology preserving embedding $Y$ as an optimization problem:

$$Y=\arg\max_{Y\in\{-1,1\}^{m\times n}} O(Y),\quad O(Y)=\mathrm{Tr}(Y\Gamma_t Y^T),\quad \text{s.t. } Y\mathbf{1}_{n\times 1}=0,\ \ YY^T=nI_m \qquad (11)$$
$$\Gamma_t=(1-\gamma)\Gamma_r+\gamma\Gamma_s \qquad (12)$$

where $\gamma\in[0,1]$ relatively weights the importance of neighborhood ranking preserving and neighborhood relationship preserving in the optimization procedure. The constraint $Y\mathbf{1}_{n\times 1}=0$ is imposed to maximize the information of each bit, which occurs when the dataset is partitioned in a balanced way on each bit. The other constraint $YY^T=nI_m$ forces the $m$ bits to be mutually uncorrelated in order to minimize redundancy among bits. In $O(Y)$ (11), each element $\tau_{ij}$ of $\Gamma_t$ is a linear combination of the normalized rank-weight $c_{ij}$ and simi-weight $w_{ij}$:

$$\tau_{ij}=(1-\gamma)c_{ij}+\gamma w_{ij} \qquad (13)$$

In the following, we call $\tau_{ij}$ the topo-weight and $\Gamma_t$ the Topo-Weighting matrix. In Section IV, we will show some promising potentials of $\Gamma_t$, in particular that it automatically "mines" the semantic relationships between unlabeled data.
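The combination in Eqs. (12)–(13) is then a simple weighted sum of the two matrices. The sketch below is our own minimal illustration (the name build_topo_weight_matrix and the default $\gamma=0.2$, taken from the experimental setup in Section VI, are assumptions), with $\Gamma_r$ passed in from a routine such as the one sketched above.

```python
import numpy as np

def build_topo_weight_matrix(X, Gamma_r, gamma=0.2, sigma_s=None):
    """Topo-weighting matrix Gamma_t of Eqs. (12)-(13).

    X       : (n, d) array of topo-training points.
    Gamma_r : (n, n) rank-weighting matrix, already normalized to [-1, 1].
    gamma   : trade-off between ranking and relationship preserving.
    sigma_s : average pairwise distance; estimated from X if not given.
    """
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise d_ij
    if sigma_s is None:
        sigma_s = D.mean()
    W = np.exp(-(D ** 2) / sigma_s ** 2)   # simi-weights w_ij
    Gamma_s = 2.0 * W - 1.0                # normalize simi-weights to [-1, 1]
    return (1.0 - gamma) * Gamma_r + gamma * Gamma_s   # tau_ij of Eq. (13)
```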

B. Learning Linear Hashing Functions

By solving problem (11), we can learn the optimal Hamming embedding $Y$ that preserves the original neighborhood topology. However, this is only a transductive formulation, i.e., $Y$ cannot be generalized to a new data point directly. We could use the Nyström method [41] to deduce the code of a new data point from the codes of the training set and the similarity matrix between the data point and the training samples, but this operation is as expensive as exhaustive NN search. Therefore, we solve this out-of-sample problem by introducing an affine embedding: $y=H(x)=\mathrm{sgn}(W^T x+b)$, where $W$ is a $d\times m$ projection matrix whose columns are the projection vectors $\omega_k$, and $b$ is an $m$-dimensional vector. Without loss of generality, we can assume the data are zero-centered, i.e., $\sum_{i=1}^{n} x_i=0$. Therefore, $b$ can simply be set to $0$ and the entire encoding process can be expressed as:

$$Y=\mathrm{sgn}(W^T X) \qquad (14)$$

The projection matrix $W$ can be learned by introducing the consistency [42] between the codes learned by problem (11) and the codes generated by hash function (14), i.e., the outputs of hash function (14) should "make an agreement on" the learned codes $Y$ (11). However, incorporating the consistency term into $O(Y)$ (11) would significantly increase the complexity of the optimization procedure, as it brings too many parameters that need to be tuned. Alternatively, we substitute $Y$ in $O(Y)$ (11) with $\mathrm{sgn}(W^T X)$ directly and turn the problem about $Y$ into a problem about $W$. The reason is that $W$ can now be learned efficiently, and, more importantly, the learned $W$ is still very effective, as demonstrated in our experiments. After substituting $Y=\mathrm{sgn}(W^T X)$ into $O(Y)$ (11), we have an optimization problem about $W$:

$$W=\arg\max_{W\in\mathbb{R}^{d\times m}} O(W) \qquad (15)$$

where $O(W)=\mathrm{Tr}\{\mathrm{sgn}(W^T X)\,\Gamma_t\,\mathrm{sgn}(X^T W)\}$. However, maximizing $O(W)$ is not easy, because it is neither convex nor smooth. Motivated by the spectral methods for hashing [4], [5], [8], we adopt the spectral relaxation [8] scheme to approximately maximize $O(W)$. After applying the spectral relaxation trick to drop the sign functions involved in $O(W)$, (15) becomes a quadratic optimization problem:

$$W=\arg\max_{W} O(W)=\arg\max_{W}\,\mathrm{Tr}\{W^T X\Gamma_t X^T W\} \qquad (16)$$

1) Information Theoretic Regularization: The optimal $W$ can now be learned by solving problem (16). However, the learning process cannot scale well with the size of the training set, as the weight matrices $\Gamma_r$, $\Gamma_s$ and $\Gamma_t$ are all $n\times n$. Therefore, in practice, only a subset $\mathcal{X}_t\subset\mathcal{X}$ and its associated topo-neighbor sets are used to form the topo-training set $\tilde{\mathcal{X}}$ and to construct $\Gamma_r$, $\Gamma_s$ and $\Gamma_t$. As a result, the objective $O(W)$ (16) measures only the empirical accuracy on $\mathcal{X}_t$ and is prone to


overfitting, especially when the size of $\mathcal{X}_t$ is small compared to the entire dataset $\mathcal{X}$. To obtain better generalization ability, we add a regularization term by incorporating conditions that lead to desirable properties of hash codes, independent of the performance on the training set. From the information-theoretic point of view, it is desirable that the information provided by each bit is maximized [43]. As proved in [9], maximizing the information entropy of a bit is equivalent to maximizing its variance. Therefore, the variances of the hash bits are used as the regularizer [9]. With the regularization term, the first constraint in (11), $Y\mathbf{1}_{n\times 1}=0$, can be dropped. As for the second constraint, $YY^T=nI_m$, we use the orthogonality of $W$ ($W^TW=I_m$) as a relaxation [9]. Therefore, the overall learning process of our topology preserving hashing is formulated as the following optimization problem:

$$W=\arg\max_{W}\,\mathrm{Tr}\{W^T\tilde{X}\Gamma_t\tilde{X}^TW\}+\alpha\,\mathrm{Tr}\{W^TXX^TW\}=\arg\max_{W}\,\mathrm{Tr}\{W^TAW\}\qquad \text{s.t. } W^TW=I_m \qquad (17)$$

where $\alpha$ is a positive scalar that relatively weights the regularizer $\mathrm{Tr}\{W^TXX^TW\}$, and $A=\tilde{X}\Gamma_t\tilde{X}^T+\alpha XX^T$. Problem (17) can be easily solved, and since $A$ is not a symmetric matrix ($\Gamma_t$ is not symmetric), the optimal $W$ is given by the $m$ eigenvectors corresponding to the $m$ largest eigenvalues of the matrix $A+A^T$.

TABLE I. Topology Preserving Hashing (algorithm outline).

2) Algorithm Outline: After learning the $m$ projection vectors, one straightforward hashing method is to use $\mathrm{sgn}(\omega_k^Tx)$, i.e., to first project $X$ onto $W$ and then binarize $Z=W^TX$ to embed $\mathcal{X}$ into the Hamming space. This is equivalent to 0/1 quantization after dimension reduction. However, the quantization step may distort the learned local topology, hence we have to find a better way to binarize $Z$. It is easy to see that $\mathrm{sgn}(v)$ is the vertex of the hypercube $\{-1,1\}^m$ closest to $v$ in terms of Euclidean distance. To preserve the learned local topology well in the Hamming space, a small quantization loss $\|\mathrm{sgn}(v)-v\|^2$ is desirable. One way to achieve this is to rotate $v$ and map it to the nearest vertex [10]. Note that, for any orthogonal matrix $R\in\mathbb{R}^{m\times m}$, $\mathrm{Tr}\{RW^TAWR^T\}=\mathrm{Tr}\{W^TAW\}$. Therefore, we are free to apply an orthogonal transformation to rotate the projected data $Z$ to minimize the quantization loss [10]:

$$Q(Y,R)=\|Y-RZ\|_F^2=\|Y^T-Z^TR^T\|_F^2 \qquad (18)$$

where $\|\cdot\|_F$ is the Frobenius norm. Here, we use the iterative quantization procedure proposed in [10] to find a local minimum of $Q(Y,R)$. It consists of two steps in each iteration: fix $R$ and update $Y$; fix $Y$ and update $R$. In the first step, $Y$ is simply set to $\mathrm{sgn}(RZ)$. In the second step, the minimization of $Q(Y,R)$ is the classical Orthogonal Procrustes problem; the optimal $R$ can be obtained by computing the singular value decomposition $ZY^T=U\Sigma V^T$ and setting $R^T=UV^T$, i.e., $R=VU^T$. For more details of this procedure, please refer to [10]. The final projection matrix is set to $WR^T$ and the embedding of $X$ is given by $\mathrm{sgn}(RW^TX)$.

To summarize, given a dataset $\mathcal{X}\subset\mathbb{R}^d$, $|\mathcal{X}|=n$ ($\mathcal{X}$ is zero-centered), a subset is sampled as the training set $\mathcal{X}_t$. The topo-neighbor set of each training sample is randomly sampled and forms the topo-training set $\tilde{\mathcal{X}}$, $|\tilde{\mathcal{X}}|=n'$. Denoting the data matrix whose columns are the points of $\mathcal{X}$ as $X\in\mathbb{R}^{d\times n}$ and the matrix whose columns are the points of $\tilde{\mathcal{X}}$ as $\tilde{X}\in\mathbb{R}^{d\times n'}$, the whole procedure of TPH is outlined in Table I.
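The whole training pipeline of Table I can be condensed into a short NumPy sketch. This is a minimal re-implementation under our own assumptions (a random orthogonal initialization of $R$, a fixed number of ITQ iterations, and the hypothetical names train_tph and encode), not the authors' code.

```python
import numpy as np

def train_tph(X, X_topo, Gamma_t, m, alpha=1.0, n_iter=50):
    """Sketch of TPH training (Eqs. (17)-(18)).

    X       : (d, n) zero-centered data matrix.
    X_topo  : (d, n') topo-training matrix.
    Gamma_t : (n', n') topo-weighting matrix.
    m       : code length.
    Returns the rotated projection matrix P (d, m); codes are sgn(P^T x).
    """
    # Eq. (17): top-m eigenvectors of A + A^T (A itself is not symmetric).
    A = X_topo @ Gamma_t @ X_topo.T + alpha * (X @ X.T)
    eigvals, eigvecs = np.linalg.eigh(A + A.T)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:m]]   # d x m projection matrix

    # Eq. (18): iterative quantization to refine the binarization [10].
    Z = W.T @ X                                     # m x n projected data
    R = np.linalg.qr(np.random.randn(m, m))[0]      # random orthogonal init
    for _ in range(n_iter):
        Y = np.sign(R @ Z)                          # fix R, update Y
        U, _, Vt = np.linalg.svd(Z @ Y.T)           # fix Y, update R (Procrustes)
        R = Vt.T @ U.T                              # R = V U^T
    return W @ R.T                                  # final projection W R^T

def encode(P, X):
    """Binary codes in {-1, +1} for zero-centered data X of shape (d, n)."""
    return np.sign(P.T @ X)
```

Given a $\Gamma_t$ built as in the earlier sketches, `P = train_tph(X, X_topo, Gamma_t, m=32)` followed by `encode(P, X)` produces 32-bit codes for the whole dataset.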

IV. DISCUSSION

In our formulation of TPH, we call $\Gamma_r(i,j)=c_{ij}$ in Eq. (5) the rank-weight, as it indicates the relative ranking between the neighbors of data point $x_i$. As the unnormalized rank-weight is $c_{ij}=\sum_{x_s\in N_i} d_{is}-n_i\,d_{ij}$, we can see that $c_{ij}$ depends not only on $x_j$ itself, but also on the other neighbors of $x_i$, i.e., on the local neighborhood of $x_i$. Moreover, the farther a neighbor $x_j$ is from $x_i$, the smaller its rank-weight $c_{ij}$, and, based on the rearrangement inequality, maximizing $R(Y)$ (5), (8) will make $\sum_k y_k^{(i)}y_k^{(j)}$ smaller, and finally make the Hamming distance between $y_i$ and $y_j$ larger. As a result, the rank-weight reflects the local neighborhood structure of $x_i$, thus granting TPH the ability to preserve this local neighborhood structure in the Hamming space.

Furthermore, the value of $c_{ij}$ can be positive or negative. If $x_j$ is very far from $x_i$, $c_{ij}$ is negative. By contrast, if $x_j$ is very close to $x_i$, $c_{ij}$ is positive. To some extent, the sign of $c_{ij}$ can be used as a prediction of the semantic similarity between $x_i$ and $x_j$. In some semi-supervised hashing methods [9], [38], [39], the semantic similarity is $+1$ for a neighbor pair and $-1$ for a non-neighbor pair. This is why we normalize each $c_{ij}$ to $[-1,1]$. Note that, given $x_i$ and its topo-neighbors, the rank-weight $c_{ij}$ will always be negative for some neighbors and positive for the others, which is not always appropriate in practice. For example, if all neighbors are very close to $x_i$, the optimal semantic similarities should all be positive, while the predictions given by $c_{ij}$ are not. To alleviate this problem,


we use the simi-weight $w_{ij}$ as a regularization. Moreover, we also add some points which are far away from $x_i$ into its neighbor set, as introduced in Section III-A1.

In the Topo-Weighting matrix $\Gamma_t$, the topo-weight $\tau_{ij}$ (13) is used to reveal the relationship between two data points. As both $c_{ij}$ and $w_{ij}$ are monotonically decreasing w.r.t. $d_{ij}$, the topo-weight $\tau_{ij}\in[-1,1]$ also reflects the local neighborhood structure of $x_i$: for a far neighbor of $x_i$, $\tau_{ij}$ will be small, while for a near neighbor, $\tau_{ij}$ will be large. Moreover, we can also use $\tau_{ij}$ to predict a "pseudo" semantic similarity between unlabeled data. For a data point $x_j$ which is relatively far from $x_i$, it is less likely that $x_j$ shares the same semantic label with $x_i$. On the other hand, if $x_j$ is very close to $x_i$, $x_j$ is more likely to be a semantic neighbor of $x_i$. In the first case we have $\tau_{ij}<0$, while in the second case we have $\tau_{ij}>0$. As a result, the sign of $\tau_{ij}$ predicts the semantic similarity between $x_i$ and $x_j$, and its absolute value $|\tau_{ij}|$ gives the confidence of this prediction. In particular, if $d_{ij}\to\infty$ we have $\tau_{ij}\approx -1$, while if $d_{ij}\approx 0$ we have $\tau_{ij}\approx 1$. Therefore, the Topo-Weighting matrix $\Gamma_t$ is capable of "mining" the semantic relationships between unlabeled data, which gives TPH the ability to capture semantic neighbors without supervised information. This promising potential is demonstrated in Section VI-B.

V. EXTENSIONS OF TPH

The TPH method proposed in Section III is entirely unsupervised; although it can "mine" pseudo semantic relationships between unlabeled data, it is theoretically more effective for Euclidean neighbor search, as the neighborhood structure it captures is, to a large extent, linear. As more and more similarity search problems require semantically consistent results, in this section we extend our unsupervised TPH, making it more effective at capturing the non-linear neighborhood structure of a dataset and at searching semantic neighbors.

A. Semi-Supervised TPH

In our unsupervised TPH, the automatically "mined" semantic relationship between data points is only probabilistic, as the relationship is not always semantically correct. However, if semantic label information is provided, we can use the deterministic semantic relationships between data points to construct the Topo-Weighting matrix $\Gamma_t$ in a supervised manner, so that the semantic information is better incorporated into the learning process of TPH and the semantic structure of a dataset can be captured more effectively. In this section, we present a simple yet effective way to leverage semantic label information in the learning process of unsupervised TPH, extending it to a semi-supervised method. This method is denoted Semi-supervised TPH (STPH).

In many (semi-)supervised hashing methods [8], [9], given label information, the semantic similarity is $+1$ for a data pair $(x_i,x_j)$ with the same label and $-1$ for a pair with different labels. Accordingly, in our Topo-Weighting matrix $\Gamma_t$, the corresponding weight $\tau_{ij}$ should always be positive for a neighbor pair and negative for a non-neighbor pair. This can be accomplished by simply setting


the simi-weight $w_{ij}$ to $\pm 1$, making $\tau_{ij}=1+c_{ij}$ for a neighbor pair and $\tau_{ij}=-1+c_{ij}$ for a non-neighbor pair. However, since the semantic neighborhood space is often non-linear [8] while $c_{ij}$ (6), (7) is defined based on the linear Euclidean distance, calculating $c_{ij}$ with the Euclidean distances between $x_i$ and all the other topo-neighbors might bring too much distracting information. Therefore, we also modify the rank-weight $c_{ij}$ so that the final $\tau_{ij}$ depends only on $x_i$ and $x_j$. Given the dataset $\mathcal{X}$ along with its label information, the topo-weight of every two data points in the topo-training set $\tilde{\mathcal{X}}$ is:

$$\tau_{ij}=\begin{cases}1-\exp\{-\sigma_s/d_{ij}\} & : x_j \text{ is a neighbor of } x_i\\ -1+\exp\{-d_{ij}/\sigma_s\} & : \text{otherwise}\end{cases}$$

where $\sigma_s$ is the average pairwise distance of a sampled subset of $\mathcal{X}$. With $\tau_{ij}$, we obtain a relative ranking of the neighbors of $x_i$, which is not available in the original semantic neighbor space. Although this ranking is pseudo, it still allows our STPH to capture the semantic neighborhood structure effectively, as demonstrated in Section VI-C. Moreover, by weighting each training pair $(x_i,x_j)$ with $\tau_{ij}$, the learning process of STPH is robust to contaminated training samples. If $x_j$ shares the same label with $x_i$ but lies very far from $x_i$, or $x_j$ has a different label but lies very close to $x_i$, then $(x_i,x_j)$ may not be a good training pair and should be discarded in the learning process. In both cases we have $\tau_{ij}\approx 0$, which means this contaminated training pair is discarded by the learning process automatically.
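As a concrete reading of the supervised topo-weights above, the sketch below fills $\Gamma_t$ from class labels and Euclidean distances. It is our own minimal illustration (NumPy, the hypothetical name build_supervised_topo_weights, and zeroing the diagonal are assumptions), not the authors' implementation.

```python
import numpy as np

def build_supervised_topo_weights(X, labels, sigma_s):
    """Supervised topo-weights for STPH: positive for same-label pairs,
    negative otherwise, with |tau_ij| shrinking toward 0 for dubious pairs
    (far same-label points, or close different-label points)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise d_ij
    same = labels[:, None] == labels[None, :]
    Gamma_t = np.where(same,
                       1.0 - np.exp(-sigma_s / (D + 1e-12)),    # same label
                       -1.0 + np.exp(-D / sigma_s))             # different label
    np.fill_diagonal(Gamma_t, 0.0)   # our choice: self-pairs carry no ranking
    return Gamma_t
```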

B. Kernelized TPH

One limitation of TPH is that the linear projection hash function places too much emphasis on linear neighborhood structure. To capture nonlinearity in the embedding process, we could use hash functions with a kernel plugged in, as in [8] and [14]. Unfortunately, as methods that operate on the kernel (Gram) matrix of a dataset scale poorly with the size of the training set, these operations are prohibitively expensive for large-scale datasets. Therefore, to accelerate the learning process in the kernel space, we propose to first map the original data into a randomized feature space and then apply the training procedure of TPH in this feature space directly. There exist many explicit mappings for different kinds of popular kernels [34], [44], [45]. In this paper, we are interested in the Gaussian kernel $K(x,y)=\exp\{-\|x-y\|^2/(2s^2)\}$, as its scale $s$ can be used to control the neighborhood size for nearest neighbor search. To approximate the Gaussian kernel, we use the explicit random Fourier feature (RFF) mapping [34]. For a data point $x$, the $k$-th dimension of its RFF embedding is given by:

$$\phi_{\omega_k,b_k}(x)=\sqrt{2}\cos(\omega_k^T x+b_k)$$

where $\omega_k$ is a random projection vector whose dimensions are drawn independently from the Gaussian distribution $\mathcal{N}(0,1/s)$, and $b_k$ is a random phase shift drawn from the uniform distribution $\mathcal{U}(0,2\pi)$. A $D$-dimensional RFF embedding is given by:

$$\Phi_D(x)=\big(\phi_{\omega_1,b_1}(x),\ldots,\phi_{\omega_D,b_D}(x)\big)^T \qquad (20)$$

The inner product of the mapped data approximates the Gaussian kernel, $K(x,y)\approx\Phi_D(x)^T\Phi_D(y)$, and the precision increases as $D$ increases.
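A random Fourier feature map of this kind is straightforward to realize; the sketch below is our own illustration (the name rff_embed is hypothetical, and we include a $1/\sqrt{D}$ normalization so that the inner product of the embeddings numerically approximates the kernel).

```python
import numpy as np

def rff_embed(X, D, s, seed=None):
    """Random Fourier features approximating the Gaussian kernel
    K(x, y) = exp(-||x - y||^2 / (2 s^2)).

    X : (n, d) data matrix.  Returns an (n, D) embedding Phi such that
    Phi @ Phi.T approximates the kernel matrix.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Omega = rng.normal(0.0, 1.0 / s, size=(d, D))   # omega_k with std 1/s
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)       # phase b_k ~ U(0, 2*pi)
    return np.sqrt(2.0 / D) * np.cos(X @ Omega + b)
```

The TPH training procedure can then be run on this $(n, D)$ embedding in place of the raw features, which is what KTPH does below.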


After mapping each data point to a $D$-dimensional space, we can apply the same training procedure of TPH in this $D$-dimensional feature space. This method is referred to as Kernelized TPH (KTPH). Another advantage of this method is that it is able to learn codes whose length is larger than the original data dimensionality.

VI. EXPERIMENTS

To validate the effectiveness of the proposed hashing method, we provide results on several benchmark datasets. We first evaluate the unsupervised TPH method. Once the promise of topology preserving for hashing is established, we then move on to evaluate the semi-supervised extension STPH and the kernelized extension KTPH.

A. Experimental Setup

We evaluate our TPH and its extensions on four benchmark datasets: MNIST70K,2 CIFAR10,3 SIFT1M [25] and GIST1M [25]. The MNIST70K dataset consists of 70K 28×28 images, each of which is associated with a digit label from '0' to '9'. It is split into a database set (60K) and a query set (10K). The neighbors of each query image are defined as those images with the same digit label. We use the grayscale intensity values of the images as features, resulting in a 784-dimensional feature space. The CIFAR10 dataset consists of 60K 32×32 images (base set 50K, query set 10K) which are manually labeled into one of the following ten classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. In our experiments, the images are represented with 512-dimensional GIST features [46], and the ground truth of each query is defined as the top 5,000 Euclidean neighbors in the evaluations of unsupervised TPH, and as the images with the same class label in the evaluations of STPH and KTPH, respectively. The SIFT1M dataset consists of almost 1M 128-dimensional SIFT features [47] and 10K queries. In our experiments, we randomly select 1,000 points from the query set as queries. The GIST1M dataset consists of almost 1M 960-dimensional GIST features [46] and 1K queries. For SIFT1M and GIST1M, we use the criterion in [38] to define the groundtruth: a returned point is considered a true neighbor if it lies among the top 1% of points closest to the query in terms of Euclidean distance.

In our experiments, for Euclidean neighbor search, we compute the precision-recall curve and the mean average precision (mAP), which is the area under the precision-recall curve, of each method as the performance metrics. For semantic neighbor search, we evaluate the semantic consistency of the codes learned by different methods using class labels as groundtruth. For this case, we report the averaged precision of the top 500 ranked images for each query, as in [10] and [38].

We select some representative unsupervised and semi-supervised hashing methods for comparison.

2 http://yann.lecun.com/exdb/mnist/
3 http://www.cs.toronto.edu/~kriz/cifar.html

Fig. 2. Precision-recall curves on CIFAR10 with Euclidean groundtruth. All methods use 32-bit codes. Fig. 2(a) gives the results of distance-preserving methods and Fig. 2(b) gives the results of neighborhood-preserving methods. Refer to Table II for the detailed mAP of these curves.

The unsupervised baselines are SPH [4], AGH [5], RMMH [7], ITQ [10], IsoH [36], LSH [2], SIKH [15], BRE [14], KLSH [16] and SBLSH [17]. Among these methods, SPH, AGH, RMMH, ITQ and IsoH are developed to preserve the original neighborhood relationships in the Hamming space, while LSH, SIKH, BRE, KLSH and SBLSH are developed to approximate the original (kernel) distance with the Hamming distance. The semi-supervised baselines are SSH [9], SPLH [38] and SSDH [39]. The source code and the recommended parameter settings generously provided by the authors are used in our experiments.

The training sets are randomly sampled. We randomly sample 2,000 points from the base sets of MNIST70K and CIFAR10, and 3,000 points from the base sets of SIFT1M and GIST1M, as the training sets for our TPH (STPH, KTPH), respectively. For fair comparison, these training sets are also used in the learning processes of BRE, KLSH, SSH, SPLH and SSDH. For each dataset, the average pairwise distance $\sigma_s$ is calculated using a 5% sample set. The parameter $\gamma$ that weights the neighborhood ranking preserving term and the neighborhood relationship preserving term is simply set to 0.2, and the regularization weight $\alpha$ is determined with cross-validation for each dataset and method.

B. Evaluation of Unsupervised TPH

The unsupervised TPH is evaluated on the CIFAR10, SIFT1M and GIST1M datasets with Euclidean neighbor groundtruths. Fig. 2 depicts the precision-recall curves of all methods with 32-bit codes on CIFAR10. For clarity, the results are shown in two figures. Fig. 2(a) presents the results of the distance-preserving methods (LSH, SIKH, BRE, KLSH, SBLSH) and Fig. 2(b) presents the results of the neighborhood-preserving methods (RMMH, SPH, ITQ, AGH, IsoH). From Fig. 2, it is clear that our TPH performs best for similarity search on this dataset. Fig. 3 gives the results on CIFAR10 with different code lengths. To avoid clutter, we only report the results of the top-3 distance-preserving and top-3 neighborhood-preserving methods in this figure (and in all subsequent figures in this paper; the detailed similarity search performance can be found in Table II). The results show that TPH consistently performs better than other state-of-the-art methods as the code length increases.


Fig. 3. Comparisons with state-of-the-art methods on the CIFAR10 dataset with Euclidean groundtruth. Refer to Table II for the detailed mAP values. (a) 48 bits. (b) 64 bits. (c) 96 bits.

TABLE II. mAP values on CIFAR10, SIFT1M and GIST1M under different settings. The best mAP among all methods is shown in boldface.

Fig. 4. Comparisons with state-of-the-art methods on the SIFT1M dataset with Euclidean groundtruth. The mAP value of each method is reported in Table II. (a) 32 bits. (b) 48 bits. (c) 64 bits. (d) 96 bits.

Furthermore, comparing Fig. 2 with Fig. 3(a)-(c), we find that even with a relatively short binary code (32 bits), the search performance of TPH is still better than that of many baseline methods with longer binary codes (96 bits). These comparisons demonstrate that TPH preserves the neighborhood relationships better by learning from the topology of the intrinsic manifold structure of CIFAR10.

Fig. 4 and Fig. 5 give the precision-recall curves on SIFT1M and GIST1M, respectively. The mAP values of all methods with different code lengths on CIFAR10, SIFT1M and GIST1M are reported in Table II. We also report recall@R [25], i.e., the proportion of queries for each of which the true nearest neighbor is ranked within the first R positions, on SIFT1M and GIST1M in Fig. 6.
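The evaluation protocol used here, Hamming ranking followed by recall@R and mAP (the area under the precision-recall curve), can be written down compactly. The sketch below is our own illustration of these metrics, not the evaluation code used in the paper, and the helper names are hypothetical.

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to one query.
    Codes are {-1, +1} vectors; the distance follows Eq. (3)."""
    m = db_codes.shape[1]
    dist = (m - db_codes @ query_code) / 2
    return np.argsort(dist, kind="stable")

def recall_at_r(rankings, true_nn, R):
    """recall@R: fraction of queries whose true nearest neighbor is
    ranked within the first R positions."""
    hits = [int(np.where(rank == nn)[0][0] < R) for rank, nn in zip(rankings, true_nn)]
    return float(np.mean(hits))

def average_precision(rank, relevant):
    """Average precision (area under the precision-recall curve) for one
    query; mAP is the mean over all queries."""
    rel = np.isin(rank, relevant).astype(float)
    if not rel.any():
        return 0.0
    cum = np.cumsum(rel)
    hit_positions = np.flatnonzero(rel)
    return float((cum[hit_positions] / (hit_positions + 1)).mean())
```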

Once again, we can easily see that TPH outperforms the other state-of-the-art binary hashing methods. These comparisons demonstrate the effectiveness of incorporating the neighborhood ranking information into code learning and establish the promise of topology preserving for hashing.

By comparing with the results in [25] and [26], we find that product quantization-based methods outperform all hashing-based methods in terms of recall@R for the same code length (e.g., PQ [25] scores about 0.8 on SIFT1M with 64-bit codes at R = 100). This is mainly because, in quantization-based methods, each data point is represented by a reconstructed cluster center in the original data space, while in hashing-based methods, each data point is simply represented by a Hamming code.


Fig. 5. Comparisons with state-of-the-art methods on the GIST1M dataset with Euclidean groundtruth. The mAP value of each method is reported in Table II. (a) 32 bits. (b) 48 bits. (c) 64 bits. (d) 96 bits.

Fig. 6. Evaluations of Euclidean nearest neighbor recall as a function of the number of retrieved results on SIFT1M and GIST1M.

Therefore, hashing-based methods have a much larger quantization error than quantization-based methods, which degrades their performance for Euclidean neighbor search. Since product quantization is a lookup-based method, different from the binary hashing methods discussed in this paper, we do not fully test this method. As discussed in the Introduction, hashing-based methods have their own advantages in terms of speed and space.

In Section IV, we claim that our unsupervised TPH can automatically "mine" the semantic relationships between unlabeled data. Here we report the performance of semantic neighbor search to demonstrate this promising potential of TPH. The comparison results of semantic neighbor precision for the top 500 retrieved images on MNIST70K and CIFAR10 are given in Fig. 7(a) and (b), respectively. As shown,

our unsupervised TPH significantly outperforms the other state-of-the-art unsupervised methods under different code lengths on both datasets, especially on CIFAR10. Its performance on MNIST70K is still the best, but the margin is not as large as on CIFAR10. To explain this, we give the results of linear search using the Euclidean distance. We can see that this l2 baseline also gives reasonably good semantic neighbor search performance on MNIST70K, which indicates the linearity of the neighborhood structure of MNIST70K. As a result, many unsupervised baselines also perform well on this dataset. However, as the neighborhood structure of CIFAR10 is highly non-linear, these unsupervised baselines fail to capture semantic neighbors. Fig. 11 gives the qualitative retrieval results for several sample query images; we can see that our unsupervised TPH gives reasonably good semantic search performance. Moreover, by comparing Fig. 7(a) and (b) with Fig. 7(c) and (d), we find that the unsupervised TPH gives almost the same, or even better, performance as compared with several state-of-the-art semi-supervised hashing methods. This finding validates that our TPH can automatically "mine" the semantic relationships between unlabeled data, which grants TPH the ability to capture semantic neighbors without supervised information.


Fig. 7. Label precision of the top 500 retrieved images at different code lengths on MNIST70K and CIFAR10, using semantic neighbors as groundtruth. The legends of (a) and (b) are given in (b) and the legends of (c) and (d) are given in (d). Fig. 7(a) and (b) give the comparisons with state-of-the-art unsupervised methods; our unsupervised TPH significantly outperforms the other state-of-the-art unsupervised methods. Fig. 7(c) and (d) give the comparisons with state-of-the-art semi-supervised methods; our semi-supervised TPH again gives the best performance. Moreover, comparing Fig. 7(a) and (b) with Fig. 7(c) and (d), our unsupervised TPH gives almost the same, or even better, performance as compared with other state-of-the-art semi-supervised methods. This finding validates the promising potential of our TPH to automatically "mine" the semantic relationships between unlabeled data.

Fig. 8. Average precision of top 500 retrieved results of Kernelized TPH (KTPH) at different code lengths.



C. Leveraging Label Information: Semi-Supervised TPH


In this section, we show the semantic neighbor search performance of semi-supervised TPH (STPH), which leverages label information. Our experiments are carried out on MNIST70K and CIFAR10 with semantic neighbor groundtruths. The Topo-Weighting matrix $\Gamma_t$ is constructed as introduced in Section V-A, such that for a neighbor pair the topo-weight is $\tau_{ij}=1-\exp\{-\sigma_s/d_{ij}\}$, while for a non-neighbor pair it is $\tau_{ij}=-1+\exp\{-d_{ij}/\sigma_s\}$. For comparison, three state-of-the-art semi-supervised hashing methods, SSH [9], SPLH [38] and SSDH [39], are selected as baselines; these methods use the same training set as our STPH. Fig. 7 shows the averaged precision of the top 500 retrieved images. For reference, we also include the performance of all unsupervised baseline methods. As shown in this figure, STPH achieves the best performance under all settings, especially on CIFAR10, whose neighborhood structure is highly non-linear. Fig. 11 gives the qualitative retrieval results. The retrieval quality of STPH (the first row) is much better: the semantic consistency of the retrieval results is significantly improved. These comparisons demonstrate that with label information incorporated, our TPH can capture the semantic neighborhood structure more effectively.

From the results we can also see that SSH [9] gives a good performance. By comparing the learning processes of STPH and SSH, we find that the main difference between them lies in their label matrices, which are used to indicate the relationship between data points. In SSH, only the semantic relationship is revealed by its label matrix ($-1$ for a non-neighbor pair and $+1$ for a neighbor pair), while the neighborhood ranking information is discarded. By contrast, the Topo-Weighting matrix $\Gamma_t$ (12) in STPH not only reveals the semantic relationship by the sign of each weight, but also reveals the relative neighborhood ranking by its value. Therefore, these comparison results also validate the importance of the neighborhood ranking information and further establish the promise of topology preservation for hashing.

D. Hashing With Kernel Embedding: Kernelized TPH

As the main purpose of kernel embedding is to capture the nonlinearity of the neighborhood structure, we evaluate Kernelized TPH (KTPH) for semantic neighbor search on MNIST70K and CIFAR10. For the RFF mapping, we set the scale $s$ of the Gaussian kernel to the average pairwise distance of a 5% sample set of each dataset, and the dimension $D$ of the random feature space to 2,000. The $\sigma_s$ of $w_{ij}$ in the topo-weight (13) is set to the average pairwise distance of a 5% sample set in the embedded feature space. Moreover, the semantic label information can also be leveraged in the learning process of KTPH, giving us a Semi-supervised KTPH (SKTPH). One advantage of kernel embedding is that it allows us to learn binary codes whose length is larger than the original data dimensionality (784 for MNIST70K and 512 for CIFAR10), thus we report results up to 1,024 bits. Fig. 8 gives the quantitative semantic neighbor search results, and Fig. 11 gives the qualitative retrieval results.


Fig. 9. Topology preserving evaluations on CIFAR10, SIFT1M and GIST1M with Euclidean groundtruths. (a)(b)(c) give the precision of the top retrieved results as a function of the number of retrieved results. (d) gives the nearest neighbor recall at rank 100 on SIFT1M and GIST1M as a function of the number of bits. Our TPH consistently gives better ranking results. These comparisons demonstrate that the result ranking ambiguity is better alleviated in TPH.

Fig. 10. Topology preserving evaluations for semantic neighbor search on MNIST70K and CIFAR10.

We can see that our KTPH and its semi-supervised extension SKTPH both give good performances. On MNIST70K, the performance of KTPH is almost the same as that of TPH; this may be due to the fact that the neighborhood structure of MNIST70K is largely linear, so the kernel embedding cannot bring significant performance gains. By contrast, on CIFAR10, KTPH performs almost the same as STPH, which demonstrates the efficacy of kernel embedding for capturing non-linear neighborhood structure. In [48], the kernel PCA technique is employed for data embedding; after each data point is embedded into a high-dimensional space, PQ [25] is used to facilitate fast ANN search. We select this algorithm as a baseline and denote it by KPCA + PQ in the experiments. The results given in Fig. 8 demonstrate the efficacy of our method.

We also compare KTPH with SIKH [15], since the latter is also based on RFF. To learn an m-bit code, SIKH first computes an m-dimensional RFF embedding of a data point and then binarizes each dimension after adding a random threshold. By contrast, our KTPH starts with a 2,000-dimensional RFF embedding and then applies the topology preserving hashing procedure to obtain the m-bit code. From the results in Fig. 8 we can see that this difference matters considerably for semantic precision, and our KTPH significantly outperforms SIKH.
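To make the contrast concrete, the following is a rough sketch of the SIKH-style coding step as it is described above, not the exact formulation of [15]; in particular, the uniform threshold range and the scaling are our assumptions.

```python
import numpy as np

def sikh_codes(X, m, s, rng=None):
    """Sketch of SIKH-style coding as described in the text: compute an
    m-dimensional RFF embedding and binarize each dimension against a
    random threshold (details of [15] may differ)."""
    rng = np.random.default_rng(rng)
    dim = X.shape[1]
    W = rng.normal(scale=1.0 / s, size=(dim, m))    # one random projection per bit
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)       # random phases
    t = rng.uniform(-1.0, 1.0, size=m)              # one random threshold per bit
    return (np.cos(X @ W + b) + t > 0).astype(np.uint8)

# KTPH, by contrast, first maps the data to a 2,000-dimensional RFF space
# (see rff_embedding above) and then learns the m hash functions there with
# the topology preserving objective, rather than thresholding random
# projections directly.
```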


Fig. 11. Image search results on CIFAR10 (Top-30). Each method uses 32-bit codes. Our unsupervised TPH (the second row) gives a reasonably good semantic neighbor search performance. Moreover, the retrieval quality of STPH (the first row) is much better; both the semantic consistency and the ranking of the retrieval results are improved.

E. Evaluation of Topology-Preserving

To test whether our TPH can alleviate the result ranking ambiguity that widely exists in most hashing methods, we also evaluate the topology preserving performance. We use Precision@N [19], i.e., the precision of the top-N retrieved results ranked by Hamming distance, as the performance metric: for two hashing methods returning the same number of true neighbors for a query, the one that ranks those neighbors better achieves a higher precision, which means it preserves the local data topology better. Fig. 9(a)-(c) gives the results on CIFAR10, SIFT1M and GIST1M with Euclidean ground truths, and Fig. 10 gives the results on MNIST70K and CIFAR10 with semantic ground truths. In Fig. 9(d), we report the nearest neighbor recall at rank 100 on SIFT1M and GIST1M. These comparisons show that our TPH consistently gives better ranking results for both Euclidean neighbor search and semantic neighbor search. Qualitative ranking results can be found in Fig. 11; as shown, the ranking of the retrieval results of our STPH (TPH) is significantly better. These results demonstrate that the result ranking ambiguity is better alleviated in TPH.
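As a concrete reading of the Precision@N metric used here, a minimal sketch follows (the helper name and the 0/1 code representation are our assumptions):

```python
import numpy as np

def precision_at_n(query_code, db_codes, relevant, N=500):
    """Precision@N: the fraction of the top-N database items, ranked by
    Hamming distance to the query code, that are true neighbors.

    query_code : (b,)   binary code of the query
    db_codes   : (n, b) binary codes of the database
    relevant   : (n,)   boolean ground-truth neighbor indicator
    """
    hamming = np.count_nonzero(db_codes != query_code, axis=1)
    top_n = np.argsort(hamming, kind="stable")[:N]  # ties keep database order
    return float(np.mean(relevant[top_n]))
```

When many database items share the same Hamming distance to the query, the ordering within a tie is arbitrary; this is precisely the ranking ambiguity that the metric exposes and that TPH is designed to alleviate.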

VII. CONCLUSION

To capture meaningful neighbors, most hashing methods are developed to preserve the neighborhood relationships while ignoring the neighborhood rankings. In this paper, we show that neighborhood ranking information is as important as the neighborhood relationships for learning to hash, and propose a novel hashing method, Topology Preserving Hashing (TPH), which incorporates neighborhood ranking information into hash function learning. Our approach is distinct from prior works by not only preserving the neighborhood relationships between data points, but also preserving the neighborhood rankings between the neighbors of each data point. After formulating the unsupervised TPH, we develop two novel extensions. The Semi-supervised TPH (STPH) is proposed to leverage semantic label information for better semantic neighbor search, and the Kernelized TPH (KTPH) is proposed to capture the nonlinearity of the neighborhood structure. The experimental results show superior performances of all the proposed methods, especially for semantic neighbor search. In the future, we would like to investigate more sophisticated techniques to solve the topology-preserving optimization problem.

REFERENCES

[1] J. L. Bentley and B. Labo, "K-d trees for semidynamic point sets," in Proc. 6th SOCG, 1990, pp. 187–197.
[2] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," in Proc. 47th Annu. IEEE Symp. FOCS, Oct. 2006, pp. 459–468.
[3] R. Salakhutdinov and G. E. Hinton, "Semantic hashing," Int. J. Approx. Reasoning, vol. 50, no. 7, pp. 969–978, Jul. 2009.
[4] Y. Weiss, A. B. Torralba, and R. Fergus, "Spectral hashing," in Proc. NIPS, 2008, pp. 1753–1760.
[5] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, "Hashing with graphs," in Proc. 28th ICML, 2011, pp. 1–8.
[6] M. Jain, H. Jégou, and P. Gros, "Asymmetric hamming embedding: Taking the best of our bits for large scale image search," in Proc. 19th ACM Int. Conf. Multimedia, 2011, pp. 1441–1444.
[7] A. Joly and O. Buisson, "Random maximum margin hashing," in Proc. IEEE Conf. CVPR, Jun. 2011, pp. 873–880.
[8] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, "Supervised hashing with kernels," in Proc. IEEE Conf. CVPR, Jun. 2012, pp. 2074–2081.
[9] J. Wang, S. Kumar, and S.-F. Chang, "Semi-supervised hashing for large-scale search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 12, pp. 2393–2406, Dec. 2012.
[10] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, "Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2916–2929, Dec. 2012.
[11] P. Li, M. Wang, J. Cheng, C. Xu, and H. Lu, "Spectral hashing with semantically consistent graph for image indexing," IEEE Trans. Multimedia, vol. 15, no. 1, pp. 141–152, Jan. 2013.


[12] J. Lee and M. Verleysen, Nonlinear Dimensionality Reduction. New York, NY, USA: Springer-Verlag, 2007.
[13] A. Joly and O. Buisson, "A posteriori multi-probe locality sensitive hashing," in Proc. 16th ACM Int. Conf. Multimedia, 2008, pp. 209–218.
[14] B. Kulis and T. Darrell, "Learning to hash with binary reconstructive embeddings," in Proc. 23rd Annu. Conf. NIPS, Jul. 2009, pp. 1042–1050.
[15] M. Raginsky and S. Lazebnik, "Locality-sensitive binary codes from shift-invariant kernels," in Proc. NIPS, 2009, pp. 1509–1517.
[16] B. Kulis and K. Grauman, "Kernelized locality-sensitive hashing," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 6, pp. 1092–1104, Jun. 2012.
[17] J. Ji, J. Li, S. Yan, B. Zhang, and Q. Tian, "Super-bit locality-sensitive hashing," in Proc. NIPS, 2012, pp. 108–116.
[18] Y.-G. Jiang, J. Wang, X. Xue, and S.-F. Chang, "Query-adaptive image search with hash codes," IEEE Trans. Multimedia, vol. 15, no. 2, pp. 442–453, Feb. 2013.
[19] L. Zhang, Y. Zhang, J. Tang, K. Lu, and Q. Tian, "Binary code ranking with weighted hamming distance," in Proc. IEEE Conf. CVPR, Jun. 2013, pp. 1586–1593.
[20] Y. Zhang, L. Zhang, and Q. Tian, "A prior-free weighting scheme for binary code ranking," IEEE Trans. Multimedia, vol. 16, no. 4, pp. 1127–1139, Jun. 2014.
[21] S. Roweis and L. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, Dec. 2000.
[22] L. Zhang, Y. Zhang, J. Tang, X. Gu, J. Li, and Q. Tian, "Topology preserving hashing for similarity search," in Proc. 21st ACM Int. Conf. Multimedia, 2013, pp. 123–132.
[23] R. Weber, H.-J. Schek, and S. Blott, "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces," in Proc. 24th Int. Conf. VLDB, 1998, pp. 194–205.
[24] L. Paulevé, H. Jégou, and L. Amsaleg, "Locality sensitive hashing: A comparison of hash function types and querying mechanisms," Pattern Recognit. Lett., vol. 31, no. 11, pp. 1348–1358, Aug. 2010.
[25] H. Jegou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, Jan. 2011.
[26] M. Norouzi and D. J. Fleet, "Cartesian k-means," in Proc. IEEE Conf. CVPR, Jun. 2013, pp. 3017–3024.
[27] Y. Liu, F. Wu, Y. Yang, Y. Zhuang, and A. Hauptmann, "Spline regression hashing for fast image search," IEEE Trans. Image Process., vol. 21, no. 10, pp. 4480–4491, Oct. 2012.
[28] W. Zhou, Y. Lu, H. Li, and Q. Tian, "Scalar quantization for large scale image search," in Proc. 20th ACM Int. Conf. Multimedia, 2012, pp. 169–178.
[29] M. Li and V. Monga, "Robust video hashing via multilinear subspace projections," IEEE Trans. Image Process., vol. 21, no. 10, pp. 4397–4409, Oct. 2012.
[30] L. Zheng, S. Wang, Z. Liu, and Q. Tian, "Packing and padding: Coupled multi-index for accurate image retrieval," in Proc. CVPR, Feb. 2014.
[31] B. Stein, "Principles of hash-based text retrieval," in Proc. 30th Annu. Int. Conf. ACM SIGIR, 2007, pp. 527–534.
[32] L. Qin, W. Josephson, Z. Wang, M. Charikar, and K. Li, "Multi-probe LSH: Efficient indexing for high-dimensional similarity search," in Proc. 33rd Int. Conf. VLDB, 2007, pp. 950–961.
[33] H. Jégou, T. Furon, and J.-J. Fuchs, "Anti-sparse coding for approximate nearest neighbor search," in Proc. IEEE ICASSP, Mar. 2012, pp. 2029–2032.
[34] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," in Proc. NIPS, 2007, pp. 1177–1184.
[35] X.-J. Wang, L. Zhang, F. Jing, and W.-Y. Ma, "AnnoSearch: Image auto-annotation by search," in Proc. IEEE Comput. Soc. Conf. CVPR, Jun. 2006, pp. 1483–1490.
[36] W. Kong and W.-J. Li, "Isotropic hashing," in Proc. NIPS, 2012, pp. 1655–1663.
[37] P. Li, M. Wang, J. Cheng, C. Xu, and H. Lu, "Spectral hashing with semantically consistent graph for image indexing," IEEE Trans. Multimedia, vol. 15, no. 1, pp. 141–152, Jan. 2013.
[38] J. Wang, S. Kumar, and S.-F. Chang, "Sequential projection learning for hashing with compact codes," in Proc. 27th ICML, 2010, pp. 1127–1134.
[39] S. Kim and S. Choi, "Semi-supervised discriminant hashing," in Proc. IEEE 11th ICDM, Dec. 2011, pp. 1122–1127.
[40] D. Luo, C. H. Ding, F. Nie, and H. Huang, "Cauchy graph embedding," in Proc. 28th ICML, 2011, pp. 553–560.
[41] Y. Bengio, O. Delalleau, N. Le Roux, J.-F. Paiement, P. Vincent, and M. Ouimet, "Learning eigenfunctions links spectral embedding and kernel PCA," Neural Comput., vol. 16, no. 10, pp. 2197–2219, 2004.


[42] D. Zhang, F. Wang, and L. Si, "Composite hashing with multiple information sources," in Proc. 34th Int. ACM SIGIR Conf., 2011, pp. 225–234.
[43] S. Baluja and M. Covell, "Learning to hash: Forgiving hash functions and applications," J. Data Mining Knowl. Discovery, vol. 17, no. 3, pp. 402–430, Dec. 2008.
[44] F. Perronnin, J. Sánchez, and Y. Liu, "Large-scale image categorization with explicit data embedding," in Proc. IEEE Conf. CVPR, Jun. 2010, pp. 2297–2304.
[45] A. Vedaldi and A. Zisserman, "Efficient additive kernels via explicit feature maps," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 3, pp. 480–492, Mar. 2012.
[46] A. Oliva and A. B. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Comput. Vis., vol. 42, no. 3, pp. 145–175, May 2001.
[47] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.
[48] A. Bourrier, F. Perronnin, R. Gribonval, P. Pérez, and H. Jégou, "Nearest neighbor search for arbitrary kernels with explicit embeddings," INRIA, Rennes, France, Tech. Rep. RR-8040, 2012.

Lei Zhang received the B.E. degree from the Department of Electronics Science and Technology, University of Science and Technology of China, Hefei, China, in 2009. He is currently pursuing the Ph.D. degree with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His research interests include multimedia retrieval and copy detection.

Yongdong Zhang (M’08–SM’13) received the Ph.D. degree in electronic engineering from Tianjin University, Tianjin, China, in 2002. He is currently a Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His current research interests are in the fields of multimedia content analysis and understanding, multimedia content security, video encoding, and streaming media technology. He has authored over 100 refereed journal and conference papers. He was a recipient of the Best Paper Awards in PCM 2013, ICIMCS 2013, and ICME 2010, and the Best Paper Candidate in ICME 2011. He serves as an Editorial Board Member of Multimedia Systems Journal and Neurocomputing.

Xiaoguang Gu (M’13) received the Ph.D. degree in computer science from the University of Chinese Academy of Sciences, Beijing, China. He is currently an Assistant Professor with the Institute of Computing Technology, Chinese Academy of Sciences. His research interests encompass a number of topics in computer vision and multimedia computing, including large scale vision retrieval, video analysis and tracking, and object recognition. He was a recipient of the Best Paper Award at the IEEE International Conference on Multimedia and Expo in 2010.

Jinhui Tang (M’08) is currently a Professor with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. He received the B.E. and Ph.D. degrees from the University of Science and Technology of China, Hefei, China, in 2003 and 2008, respectively. His research interests include large scale multimedia search, social media mining, and computer vision. He serves as an Editorial Board Member of Pattern Analysis and Applications, Multimedia Tools and Applications, Information Sciences, and Neurocomputing, a Technical Committee Member for about 30 international conferences, and a reviewer for about 30 prestigious international journals. He was a corecipient of the Best Paper Award at ACM Multimedia 2007, PCM 2011, and ICIMCS 2011.


Qi Tian (M’02–SM’03) received the B.E. degree in electronic engineering from Tsinghua University, Beijing, China, in 1992, the M.S. degree in electrical and computer engineering from Drexel University, Philadelphia, PA, USA, in 1996, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign, Champaign, IL, USA, in 2002. He is currently a Professor with the Department of Computer Science, University of Texas at San Antonio, San Antonio, TX, USA. He took a one-year faculty leave at Microsoft Research Asia, Beijing, from 2008 to 2009.


Dr. Tian’s research interests include multimedia information retrieval and computer vision. He has authored over 210 refereed journal and conference papers. He was a recipient of the Best Paper Awards in PCM 2013, MMM 2013, and ICIMCS 2012, the Top 10% Paper Award in MMSP 2011, the Best Student Paper in ICASSP 2006, the Best Paper Candidate in PCM 2007, and the ACM Service Award in 2010. He is the Guest Editor of the IEEE TRANSACTIONS ON MULTIMEDIA, Journal of Computer Vision and Image Understanding, Pattern Recognition Letters, EURASIP Journal on Advances in Signal Processing, and Journal of Visual Communication and Image Representation, and is on the Editorial Board of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, Multimedia Systems Journal, Journal of Multimedia, and Machine Vision and Applications.
