Unsupervised Local Feature Hashing for Image Similarity Search

The potential value of hashing techniques has made hashing one of the most active research areas in computer vision and multimedia. However, most existing hashing methods for image search and retrieval are based on global feature representations, which are susceptible to image variations such as viewpoint changes and background clutter. Traditional global representations aggregate local features directly into a single vector without analyzing the intrinsic geometric properties of the local features. In this paper, we propose a novel unsupervised hashing method called unsupervised bilinear local hashing (UBLH) for projecting local feature descriptors from a high-dimensional feature space to a lower-dimensional Hamming space via compact bilinear projections rather than a single large projection matrix. UBLH takes the matrix expression of local features as input and preserves the feature-to-feature and image-to-image structures of local features simultaneously. Experimental results on challenging data sets including Caltech-256, SUN397, and Flickr 1M demonstrate the superiority of UBLH compared with state-of-the-art hashing methods.


I. INTRODUCTION
LEARNING to hash has received substantial attention due to its potential in various applications such as data mining, pattern recognition, and information retrieval [1]-[7]. Compact hashing enables significant efficiency gains in both storage and retrieval speed for large-scale databases. Generally speaking, exhaustive-search-based retrieval on a data set with N samples is infeasible because linear complexity O(N) does not scale to realistic applications on large-scale data. Meanwhile, most vision tasks also suffer from the curse of dimensionality, because visual descriptors usually have hundreds or even thousands of dimensions. For these reasons, hashing techniques have been proposed to embed data from a high-dimensional feature space into a similarity-preserving low-dimensional Hamming space, where an approximate nearest neighbor (ANN) of a given query can be found with sublinear time complexity. Currently, both conventional unsupervised and supervised hashing algorithms are primarily designed for global representations, e.g., GIST [8]. For realistic visual retrieval tasks, however, these global hashing techniques cannot cope with complications appearing in the images such as clutter, scaling, occlusion, and changes of lighting conditions. In contrast, local feature-based representations such as bag-of-features [9], [10] are more invariant to these factors, since such representations are statistical distributions of image patches and tend to be more robust in challenging and noisy scenarios. Fig. 1 compares a global representation-based hashing method (GIST+ITQ) with our proposed local feature-based hashing method [SIFT+unsupervised bilinear local hashing (UBLH)] for relatively complex scene retrieval. We can observe that the top four images retrieved via GIST+ITQ only contain the desert. They are indeed relevant to the query image, but not exactly what we want to search for.
In contrast, the top four images retrieved via our approach, i.e., SIFT+UBLH, include all the detailed information, i.e., man, camel, and desert, as in the query image. A possible reason for this difference is that global representation-based hashing methods are good at extracting the global intensity, color, texture, and gradient information of images, but ignore the detailed information in the query image because they do not analyze the intrinsic geometric properties of local features. This problem may heavily limit effectiveness in applications that demand more accurate retrieval results for complex scene/object images. Thus, inspired by the advantages of local representations (e.g., invariance to clutter, scaling, and occlusion), in this paper, we intend to develop a local feature-based hashing method for improving the retrieval results. If keypoints are well detected, local hash codes are able to avoid the limitations of global representations, such as sensitivity to background variations, occlusions, and shifts.
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/
In this paper, we propose a UBLH framework for large-scale visual similarity search, in which the feature-to-feature (F2F) and image-to-image (I2I) structures are combined and preserved together. Specifically, the F2F structure considers the pairwise relationship between local features in the original feature space, which is commonly considered in manifold learning techniques [11]-[16]. At a higher level, the I2I structure reflects the connection between images when each of them is represented by a set of local features. In particular, the I2I distance, which is derived from [17], provides a feasible way to measure the connection between two images using the set representation of their local features. Since the resulting UBLH problem is nonconvex and discrete, our objective function is optimized in an alternate way with relaxation.
Furthermore, motivated by [18]-[20], a bilinear projection is employed to make the algorithm more efficient. To be specific, the bilinear projection applies two projection matrices to local features, which have much smaller sizes than the original single projection matrix. Since the computational complexity of eigen decomposition is cubic in the dimension of the matrix, the benefit of using smaller matrices is substantial. Beyond that, most local features are based on histograms and bins, e.g., a SIFT feature is computed from 16 histograms, each of which has eight bins. Therefore, the bilinear scheme can explore two different kinds of data structure from the views of histograms and bins simultaneously. More crucially, when local features are transformed from the vector form to the matrix form, an integer factorization of the dimensionality is needed, and these two views provide a natural one.
The outline of our proposed UBLH is demonstrated in Fig. 2. Considering that our method is specifically designed for local feature-based hashing, the original Hamming ranking and Hamming table cannot be directly applied on local features for visual indexing. Thus, in this paper, we also introduce an image indexing/searching scheme called local hashing voting (LHV), which has been demonstrated to be efficient and accurate for image similarity search in our experiments.
This paper aims at unsupervised linear (bilinear) hashing for local features, which makes UBLH effective and practical for real-world applications without class label information. With bilinear projection learning, the complexity of the eigen decomposition, which is cubic in the dimensionality, is significantly reduced. Once the projections are learned, they can be efficiently applied to the test data. Additionally, UBLH simultaneously preserves the F2F and I2I structures, which can be regarded as the local and global structures, respectively, in the original feature space.

II. RELATED WORK
In terms of bilinear hashing, Gong et al. [18] applied the bilinear scheme to global representations by minimizing the angles between the rotated features and the corresponding binary codes. Although this scheme can effectively solve the high-dimensional hashing problem with less computational complexity, it still has some deficiencies in the optimization process. Specifically, measuring the angle between a vector in a continuous space and a vector in a discrete space introduces quantization errors into the optimization. Moreover, their scheme does not consider the relationship among features.
In the early exploration of hashing, random projections were commonly used to construct randomized hash functions. One of the best-known representatives is locality-sensitive hashing (LSH) [21], [22], which can preserve similarity information and map data points close in a Euclidean space to similar codes. It is theoretically guaranteed that as the code length increases, the Hamming distance between two codes will asymptotically approach the Euclidean distance between their corresponding data points. Furthermore, a kernel trick, which allows the use of a wide class of similarity functions, was combined with LSH to generalize LSH to arbitrary kernel functions [23]. Beyond that, principled linear projections such as PCA hashing (PCAH) [24] and its rotational variant have been introduced for better quantization than random projection hashing. Spectral hashing (SpH) [25] was proposed to preserve the data locality relationship, keeping neighbors in the input space as neighbors in the Hamming space. Anchor graph hashing (AGH) [26] obtains tractable low-rank adjacency matrices for efficient similarity search. Kernel reconstructive hashing [27] was proposed to preserve the similarity defined by an arbitrary kernel using compact binary codes. Compressed hashing (CH) [28] has been effectively applied to large-scale data retrieval tasks as well. All the hashing techniques mentioned above are unsupervised methods, which may lead to worse retrieval precision on data sets with noise. To achieve better results, researchers have developed supervised hashing methods that can attain higher search accuracy, since label information is involved in the learning phase. A simple supervised hashing method is linear discriminant analysis hashing (LDAH) [29], which can handle supervision via easy optimization but still lacks adequate performance due to the use of orthogonal projections in its hash functions.
Beyond that, some more complicated methods have been proposed, such as binary reconstructive embeddings (BRE) [30], minimal loss hashing (MLH) [31], and kernel supervised hashing (KSH) [32]. Although these supervised methods can achieve promising results, they involve difficult optimization with slow training. It is noteworthy that all of the methods mentioned above can only be used with global representations.
An early work of applying local features to image detection and retrieval was proposed in [33]. Based on LSH, Joly and Buisson [34] proposed a multiprobe LSH for ANN search to improve the local feature-based retrieval tasks [35]. Another ANN algorithm was introduced in [36] to speed up the searching algorithm and find the best algorithm configuration for various data sets. Although a hybrid hashing method for SIFT descriptors was proposed in [37], the relationships between local features are not included in the code learning phase.
The main work on embedding local features into the Hamming space was proposed in [38]. In particular, two schemes are introduced to improve the standard bag-of-words (BoW) model: 1) a Hamming embedding (HE), which provides binary signatures to refine visual words and 2) a weak geometric consistency constraint with the geometrical transformation. Both schemes can significantly improve the final performance for retrieval tasks. Furthermore, a coupled multi-index framework was proposed for accurate image retrieval [39]. Beyond that, a selective match kernel approach [40] has also been developed to incorporate matching kernels sharing the best properties of HE and the vector of locally aggregated descriptors (VLAD). Another related work based on [38] can be found in [41], which introduces a color binary descriptor calculated in either a global or a local form.
However, all the above embedding methods mainly focus on retrieval techniques rather than on the learning procedure of binary coding for large-scale hashing. Besides, these methods are not fully linear, which limits their efficiency and applicability for large-scale data sets. In fact, one of the most closely related works using bilinear projection for local feature hashing can be found in [42], which is a supervised learning method for image similarity search.

III. UNSUPERVISED BILINEAR LOCAL HASHING
In this section, we first introduce the bilinear scheme [18] to present our algorithm. Then we illustrate how the F2F and I2I structures are preserved in UBLH. An alternate optimization is used for learning the bilinear projections for hash codes.

A. Notations and Problem Statement
We are given N local features $x_1, \ldots, x_N \in \mathbb{R}^{D}$ extracted from n images. For image i, we use $\mathcal{X}_i = \{x_{i1}, \ldots, x_{im_i}\}$ to represent its local feature set. A bilinear projection multiplies the data by projection matrices on both sides. It can explore the matrix structure of features to enhance the effectiveness of the projection. First, we factor the integer D as $D = D_1 \times D_2$. Then we reorganize the vector $x_i$ into a matrix $X_i \in \mathbb{R}^{D_1 \times D_2}$ such that $\mathrm{vec}(X_i) = x_i$, where $\mathrm{vec}(\cdot)$ represents the vectorization of a matrix. We also have the inverse map of vectorization, $\mathrm{vec}^{-1}(x_i) = X_i$, since vectorization is a one-to-one correspondence once $D_1$ and $D_2$ are given. To make the transformation more efficient, in this matrix form of local features, we define our hash function using two matrices $\Theta_1 \in \mathbb{R}^{D_1 \times d_1}$ and $\Theta_2 \in \mathbb{R}^{D_2 \times d_2}$:

$$H(X_i) = \mathrm{sgn}\big(\mathrm{vec}(\Theta_1^T X_i \Theta_2)\big). \qquad (1)$$

In fact, we notice that $\mathrm{vec}(\Theta_1^T X_i \Theta_2) = (\Theta_2^T \otimes \Theta_1^T)\,\mathrm{vec}(X_i) = (\Theta_2 \otimes \Theta_1)^T \mathrm{vec}(X_i)$, where $\otimes$ is the Kronecker product; thus a bilinear projection is simply a special case of the single matrix projection $\Theta$, which can be decomposed as $\Theta = \Theta_2 \otimes \Theta_1$. Besides, it is easy to show that if $\Theta_1$ and $\Theta_2$ are orthogonal, i.e., $\Theta_1^T \Theta_1 = I_{d_1 \times d_1}$ and $\Theta_2^T \Theta_2 = I_{d_2 \times d_2}$, then $\Theta$ is orthogonal as well. The bilinear projection leads to a more efficient eigen decomposition on matrices with much smaller sizes, $D_1 \times D_1$ and $D_2 \times D_2$, rather than $D_1 D_2 \times D_1 D_2$ for a single projection. Additionally, the space complexity for bilinear projections is $O(D_1^2 + D_2^2)$, while the single projection needs $O((D_1 \times D_2)^2)$. Besides, since most local features are represented as concatenated histogram vectors, they can be intrinsically decomposed into two data structures. For instance, 128-dim SIFT is computed on a 4 × 4 grid, and for each grid cell an 8-bin histogram is calculated. In this way, a 128-dim SIFT descriptor is formed by concatenating 16 eight-bin histograms. Thus, a SIFT feature can be naturally decomposed as 16 × 8 in our bilinear code learning.
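The Kronecker identity behind the bilinear hash function can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the names `vec`, `bilinear_hash`, `theta1`, and `theta2` are illustrative, the data are random, and the column-major reshape is an assumption chosen so that vec(A X B) = (B^T ⊗ A) vec(X) holds (the memory layout of a real SIFT descriptor may call for a different reshape).

```python
import numpy as np

def vec(M):
    """Column-major vectorization, for which vec(A X B) = (B^T kron A) vec(X)."""
    return M.reshape(-1, order="F")

def bilinear_hash(X, theta1, theta2):
    """Return the {-1, +1} code sgn(vec(theta1^T X theta2))."""
    projected = vec(theta1.T @ X @ theta2)
    return np.where(projected >= 0, 1, -1)

rng = np.random.default_rng(0)
D1, D2, d1, d2 = 16, 8, 4, 2            # a 128-D feature viewed as 16 x 8, 8-bit code
x = rng.standard_normal(D1 * D2)        # random stand-in for a local descriptor
X = x.reshape(D1, D2, order="F")        # vec^{-1}(x) under the column-major convention

# Random matrices with orthonormal columns (illustrative, not learned).
theta1 = np.linalg.qr(rng.standard_normal((D1, d1)))[0]
theta2 = np.linalg.qr(rng.standard_normal((D2, d2)))[0]

theta = np.kron(theta2, theta1)         # equivalent single (D1*D2) x (d1*d2) projection
bilinear = vec(theta1.T @ X @ theta2)   # bilinear projection
single = theta.T @ x                    # single-matrix projection
```

In this toy setting the two small matrices hold 16·4 + 8·2 = 80 entries, whereas the equivalent single projection holds 128·8 = 1024.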
Note that during the learning stage, we use {−1, +1} to represent the output of hash functions and employ centralized (zero-mean) data. In the indexing phase, we use {0, 1} to represent codes for hash lookup.
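The two conventions above can be illustrated with a small sketch. The helper names `center`, `to_binary`, and `to_bucket_key` are hypothetical, and packing a code into an integer bucket key is an assumption about how a lookup table might be addressed, not something the paper specifies.

```python
import numpy as np

def center(features, mean=None):
    """Subtract the mean feature; pass the training mean when centering queries."""
    if mean is None:
        mean = features.mean(axis=0)
    return features - mean, mean

def to_binary(code_pm1):
    """Map a {-1, +1} code used during learning to a {0, 1} code used for lookup."""
    return ((np.asarray(code_pm1) + 1) // 2).astype(np.uint8)

def to_bucket_key(code_01):
    """Pack a {0, 1} code into an integer key (assumed bucket addressing scheme)."""
    key = 0
    for bit in code_01:
        key = (key << 1) | int(bit)
    return key
```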

B. Feature-to-Feature Structure Preserving
To obtain meaningful hash codes for local features, let us first consider the geometric structure of the entire local feature set $F = \{X_1, \ldots, X_N\}$. We are concerned with the individual relationship between local features in the high-dimensional space, which should also be retained in the lower-dimensional space. Specifically, for similar (dissimilar) pairs, their distance is expected to be minimized (maximized) in the Hamming space. Since class labels are unavailable for an unsupervised method, we first employ K-means clustering on F to obtain some weak label information. Then the pairwise label of $(X_i, X_j)$ is defined as

$$\ell_{ij} = \begin{cases} +1, & X_i \text{ and } X_j \text{ are in the same cluster} \\ -1, & \text{otherwise.} \end{cases} \qquad (2)$$
Since different pairs have different importance in the embedding, for each pair $(X_i, X_j)$ we assign a weight $w^F_{ij}$ that depends on the pairwise distance $\|X_i - X_j\|$ and a parameter σ, where $\|\cdot\|$ is the Frobenius norm. We have $w^F_{ij} \in (0, 1)$; for a positive pairwise label, $w^F_{ij}$ decreases as the distance $\|X_i - X_j\|$ increases, and vice versa for a negative label. In other words, a positive pair is more important when its features are close to each other, and a negative pair is more important when they are far away from each other. We denote $P = \{(i, j) \mid X_i, X_j \in F\}$. Therefore, preserving the F2F structure amounts to maximizing

$$\sum_{(i,j) \in P} w^F_{ij}\, \ell_{ij}\, \langle H(X_i), H(X_j) \rangle. \qquad (4)$$

The above function reaches its maximum value when $w^F_{ij} \ell_{ij} H(X_i)$ and $H(X_j)$ are similarly sorted, due to the rearrangement inequality [43].
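The F2F term can be sketched as follows. This is an illustration under stated assumptions: the cluster assignments are taken as given (any K-means implementation could supply them), and the weight in `assumed_weights` is a hypothetical Gaussian form chosen only because it decreases with distance for positive pairs and increases for negative pairs; the exact weight used by the authors is not recoverable from the text.

```python
import numpy as np

def pairwise_labels(cluster_ids):
    """+1 if two features fall in the same cluster, -1 otherwise."""
    ids = np.asarray(cluster_ids)
    return np.where(ids[:, None] == ids[None, :], 1, -1)

def assumed_weights(features, labels, sigma):
    """ASSUMED weight: Gaussian of the Frobenius distance for positive pairs,
    one minus that Gaussian for negative pairs (not taken from the paper)."""
    flat = features.reshape(len(features), -1)
    sq_dist = ((flat[:, None, :] - flat[None, :, :]) ** 2).sum(-1)
    gauss = np.exp(-sq_dist / sigma**2)
    return np.where(labels == 1, gauss, 1.0 - gauss)

def f2f_objective(codes, labels, weights):
    """Sum over all pairs of w_ij * l_ij * <H(X_i), H(X_j)> for {-1, +1} codes."""
    gram = codes @ codes.T
    return float((weights * labels * gram).sum())
```

Codes that agree within a cluster and disagree across clusters make every summand nonnegative, which is why the objective rewards them.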

C. Image-to-Image Structure Preserving
Now we take a higher-level connection, i.e., the relationship between images, into account, since source information is also crucial to local features. For image i, we still use $\mathcal{X}_i$ to represent the local feature set $\{X_{i1}, \ldots, X_{im_i}\}$ in matrix form. Derived from [17], the I2I distance $d_{ij}$ from image i to image j (5) is defined by accumulating, over all local features $X \in \mathcal{X}_i$, the distance between X and $\mathrm{NN}_j(X)$, where $\mathrm{NN}_j(X)$ is the nearest neighbor of the local feature X in image j. Although the number of local features in one image is much smaller than N, the nearest neighbor search (NN-search) over all images is still time-consuming. We therefore use the cluster information from the above F2F section to reduce the complexity. We denote the clusters of the K-means clustering on F by $C_1, \ldots, C_K$. Without loss of generality, suppose the local features of image j lie in $C_1, \ldots, C_{K_1}$ and these clusters are ordered from nearest to farthest by the distance from their centroids to $X \in \mathcal{X}_i$. Then the range of the NN-search in $\mathcal{X}_j$ is reduced to $(C_1 \cup \cdots \cup C_{\lceil (K_1)^{\delta} \rceil}) \cap \mathcal{X}_j$, where $0 < \delta < 1$ and $\lceil \cdot \rceil$ is the ceiling function. This reduction of range is based on the assumption that the centroid of the cluster in which the true nearest neighbor is located is also close to X. In fact, it holds when $K \to N$. After the reduction of the search range, the average complexity is reduced from $O(N^2)$ to $O(N K^{1+\delta})$, and we only need to compute the distances from X to the cluster centroids, which has already been done in the K-means.
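A brute-force version of the image-to-image distance and the cluster-pruned nearest-neighbor search might look like the sketch below. It assumes features are stored as flattened row vectors and that the distance is an unnormalized sum of squared nearest-neighbor distances; the paper's exact normalization is not recoverable here, and the function names are illustrative.

```python
import numpy as np

def i2i_distance(feats_i, feats_j):
    """ASSUMED form of d_ij: sum over features of image i of the squared
    distance to the nearest feature of image j (no normalization)."""
    sq = ((feats_i[:, None, :] - feats_j[None, :, :]) ** 2).sum(-1)
    return float(sq.min(axis=1).sum())

def pruned_nn_sq_dist(x, feats_j, cluster_of_j, centroids, delta=0.5):
    """Squared NN distance from x to image j, searching only the
    ceil(K1**delta) clusters of image j whose centroids are nearest to x."""
    present = np.unique(cluster_of_j)                 # the K1 clusters occurring in image j
    centroid_sq = ((centroids[present] - x) ** 2).sum(-1)
    order = present[np.argsort(centroid_sq)]          # nearest centroid first
    keep = order[: int(np.ceil(len(present) ** delta))]
    mask = np.isin(cluster_of_j, keep)
    return float(((feats_j[mask] - x) ** 2).sum(-1).min())
```

Note that `i2i_distance(a, b)` and `i2i_distance(b, a)` generally differ, which is why a symmetrized distance is needed.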
In general, $d_{ij} \neq d_{ji}$. Thus, to ensure symmetry, we update the I2I distance to a symmetric version $D_{ij}$ that combines $d_{ij}$ and $d_{ji}$. Via a Gaussian function of $D_{ij}$, we obtain the I2I similarity $w^I_{ij}$, where $\sigma_I$ is the smoothing parameter. After applying UBLH, we have the corresponding I2I distance $\hat{D}_{ij}$ in the Hamming space. To preserve the I2I structure of the original space, a reasonable objective is to minimize

$$\sum_{i,j=1}^{n} w^I_{ij}\, \hat{D}_{ij}.$$

The above function reaches its minimum value when $\{\hat{D}_{ij}\}$ and $\{w^I_{ij}\}$ are oppositely sorted, due to the rearrangement inequality [43]. With the F2F part in (4) and orthogonality constraints on $\Theta_1$ and $\Theta_2$, i.e., $\Theta_1^T \Theta_1 = I$ and $\Theta_2^T \Theta_2 = I$, we have the final optimization problem (10), which combines the F2F objective with the I2I objective above, where λ is the balance parameter.

D. Alternate Optimization Via Relaxation
In this section, we derive the projections for the optimization problem (10). Motivated by [25] and [32], to obtain a tractable solution, we first relax the discrete sign function to a real-valued continuous function by using its signed magnitude, i.e., $\mathrm{sgn}(x) \approx x$. In this case, the objective function of the F2F part, i.e., (4), becomes

$$\sum_{(i,j) \in P} w^F_{ij}\, \ell_{ij}\, \mathrm{tr}\big(\Theta_2^T X_i^T \Theta_1 \Theta_1^T X_j \Theta_2\big). \qquad (11)$$

Besides, we also make a statistical approximation in the computation of the projected I2I distances due to the large number of local features. In other words, we exchange the order of the NN-search and $H(\cdot)$ for all $X \in \mathcal{X}_i$ during the optimization, i.e., the nearest neighbor of $H(X)$ among the codes of image j is taken to be $H(\mathrm{NN}_j(X))$. In fact, the pairwise structure has been preserved in the F2F objective function (4), which supports the validity of this exchange. Hence, the projected $d_{ij}$ in (5) is expressed through the projected differences $\Theta_1^T X^{j}_{ik} \Theta_2$, where $X^{j}_{ik} := X_{ik} - \mathrm{NN}_j(X_{ik})$, $k = 1, \ldots, m_i$, $i, j = 1, \ldots, n$. A similar derivation holds for $d_{ji}$. The projected I2I distance $\hat{D}_{ij}$ is then obtained from these two projected distances (13). Since the problem is nonconvex, to the best of our knowledge, there is no direct way to obtain the projections $\Theta_1$ and $\Theta_2$ simultaneously. We therefore derive an alternate iteration algorithm that updates one projection while the other is given, i.e., we optimize $\Theta_1$ when $\Theta_2$ is fixed and then fix $\Theta_1$ to update $\Theta_2$, iteratively. Combining (11) and (13), let us denote the objective function by $L(\Theta_1, \Theta_2)$. By simple algebraic derivation, we have the following form:

$$L(\Theta_1, \Theta_2) = \mathrm{tr}\big(\Theta_1^T M_2(\Theta_2)\, \Theta_1\big) = \mathrm{tr}\big(\Theta_2^T M_1(\Theta_1)\, \Theta_2\big),$$

where $M_2(\Theta_2)$ and $M_1(\Theta_1)$ are two symmetric matrix-valued functions with codomains $\mathbb{R}^{D_1 \times D_1}$ and $\mathbb{R}^{D_2 \times D_2}$, respectively.

Algorithm 1 Unsupervised Bilinear Local Hashing
Input: The local feature set F of training images, the number of centroids K in the K-means, the parameter δ for the NN-search in the I2I distance, and the balance parameter λ.
Output: The bilinear projection matrices $\Theta_1$ and $\Theta_2$.
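Consequently, for fixed $\Theta_1$, the optimal $\Theta_2$ is given by the eigenvectors corresponding to the largest $d_2$ eigenvalues of $M_1(\Theta_1)$, and likewise, for fixed $\Theta_2$, the optimal $\Theta_1$ is given by the eigenvectors corresponding to the largest $d_1$ eigenvalues of $M_2(\Theta_2)$. Although the number of local features is usually huge, the final matrices $M_1$ and $M_2$ used for decomposition are small ($D_1$ and $D_2$ are always less than 100), which guarantees efficiency and feasibility. Therefore, for t = 0, we randomly initialize $\Theta_2^{(0)}$; for the tth step, we have the update rules

$$\Theta_1^{(t)} \leftarrow \text{eigenvectors of the largest } d_1 \text{ eigenvalues of } M_2\big(\Theta_2^{(t-1)}\big), \qquad \Theta_2^{(t)} \leftarrow \text{eigenvectors of the largest } d_2 \text{ eigenvalues of } M_1\big(\Theta_1^{(t)}\big).$$

For the tth step (t ≥ 1), we have the following inequality:

$$L\big(\Theta_1^{(t)}, \Theta_2^{(t-1)}\big) \le L\big(\Theta_1^{(t)}, \Theta_2^{(t)}\big) \le L\big(\Theta_1^{(t+1)}, \Theta_2^{(t)}\big).$$

Since L is bounded above under the orthogonality constraints, the above alternate iteration converges. In practice, we stop the iteration when the difference between the objective values of two successive iterations is less than a small threshold or the number of iterations reaches a maximum. We summarize UBLH in Algorithm 1.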
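The alternating eigenvector updates can be sketched on a generic objective of the same trace form. In the sketch below, `pairs` is a hypothetical list of `(c, A, B)` triples standing in for the weighted feature pairs and difference matrices, so that L = Σ c · tr(Θ2ᵀ Aᵀ Θ1 Θ1ᵀ B Θ2); the construction of the actual coefficients from the F2F and I2I terms is not reproduced, and all function names are illustrative.

```python
import numpy as np

def top_eigvecs(M, d):
    """Eigenvectors (as columns) for the d largest eigenvalues of symmetric M."""
    _, vecs = np.linalg.eigh(M)          # eigh returns eigenvalues in ascending order
    return vecs[:, -d:]

def objective(pairs, theta1, theta2):
    """L = sum_k c_k * tr(theta2^T A_k^T theta1 theta1^T B_k theta2)."""
    return sum(c * np.trace(theta2.T @ A.T @ theta1 @ theta1.T @ B @ theta2)
               for c, A, B in pairs)

def m1(pairs, theta1):
    """Symmetric D2 x D2 matrix such that L = tr(theta2^T M1 theta2)."""
    P = theta1 @ theta1.T
    return sum(0.5 * c * (A.T @ P @ B + B.T @ P @ A) for c, A, B in pairs)

def m2(pairs, theta2):
    """Symmetric D1 x D1 matrix such that L = tr(theta1^T M2 theta1)."""
    P = theta2 @ theta2.T
    return sum(0.5 * c * (A @ P @ B.T + B @ P @ A.T) for c, A, B in pairs)

def alternate_optimize(pairs, D2, d1, d2, max_iter=10, tol=1e-8, seed=0):
    """Alternate between the two eigenvector updates until L stops changing."""
    rng = np.random.default_rng(seed)
    theta2 = np.linalg.qr(rng.standard_normal((D2, d2)))[0]   # random orthonormal start
    theta1 = top_eigvecs(m2(pairs, theta2), d1)
    history = [objective(pairs, theta1, theta2)]
    for _ in range(max_iter):
        theta2 = top_eigvecs(m1(pairs, theta1), d2)
        theta1 = top_eigvecs(m2(pairs, theta2), d1)
        history.append(objective(pairs, theta1, theta2))
        if abs(history[-1] - history[-2]) < tol:
            break
    return theta1, theta2, history
```

Each update solves its subproblem exactly, since the top eigenvectors maximize the trace under orthonormality, so the recorded objective values never decrease.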

IV. INDEXING VIA LOCAL HASHING VOTING
Once the bilinear projection matrices $\{\Theta_1, \Theta_2\}$ are obtained, we can easily embed the training data into binary hash codes by (1). For a query local feature x, its hash code is obtained by $H(X) = \mathrm{sgn}(\mathrm{vec}(\Theta_1^T X \Theta_2))$ with centralized data as input, where X is the matrix form of x. For an incoming query, a common way to find similar samples in the training set is Hamming distance ranking. However, in our local feature hashing scenario, traditional linear search (e.g., Hamming distance ranking) with complexity O(N) is no longer fast, since N denotes the total number of local features (at least 3M for a large-scale database). To accomplish local feature-based visual retrieval, in this paper, we introduce a fast indexing scheme via LHV, as shown in Fig. 3. We first build the Hamming lookup table (also known as the hashing table) into our LHV scheme. Given a query, we can find the bucket of the corresponding hash code in near-constant time O(1) and return all the data in the bucket as the retrieval results.

Algorithm 2 Local Hashing Voting
...
5: For the query hash code $H(\mathrm{vec}^{-1}(q_i))$, store all the possible image indices falling into the Hamming lookup table within Hamming radius r;
6: end for
7: Vote and accumulate the number of times each image's index appears and rank the images in decreasing order;
8: return All the relevant images as the retrieved results.
After constructing the Hamming lookup table over the training set, we store the corresponding image indices for the hash codes of all local features. For instance, given a bucket with hash code "1100101001," we store the indices of the images that contain a local feature whose hash code equals that of the bucket. In this way, we search the hash code $H(\mathrm{vec}^{-1}(q_i))$ for each local feature $q_i \in Q$ in the query image $Q = \{q_1, \ldots, q_m\}$ over the Hamming lookup table within Hamming radius r and return the possible images' indices. It is noteworthy that the same bucket in the Hamming lookup table may store indices from different images. Finally, we vote and accumulate the number of times each image's index appears in the relevant buckets and then rank the images in decreasing order. The final retrieved samples are returned according to the ranking generated by LHV, which is depicted in Algorithm 2.
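The voting scheme described above can be sketched with a dictionary-based lookup table. This is an illustrative sketch, not the authors' implementation: hash codes are assumed to be packed into Python integers, every local feature contributes one entry to its bucket (so an image can receive several votes from one bucket), and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_lookup_table(codes, image_ids):
    """Map each integer hash code (bucket) to the image indices stored in it."""
    table = defaultdict(list)
    for code, image_id in zip(codes, image_ids):
        table[code].append(image_id)
    return table

def codes_within_radius(code, n_bits, radius):
    """Yield every n_bits-long code at Hamming distance <= radius from `code`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p            # flip one bit
            yield flipped

def lhv_rank(query_codes, table, n_bits, radius=2):
    """Rank images by how often their indices appear in the probed buckets."""
    votes = Counter()
    for code in query_codes:
        for probe in codes_within_radius(code, n_bits, radius):
            votes.update(table.get(probe, ()))
    return [image_id for image_id, _ in votes.most_common()]
```

Probing enumerates all codes inside the Hamming ball, so the number of probes grows combinatorially with r and the code length; this is one reason a small radius such as r = 2 is practical.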

V. COMPLEXITY ANALYSIS
The time complexity of UBLH mainly consists of three parts. The first part is computing the F2F weight $w^F_{ij}$, which costs $O(|P|D + NKTD)$ time, where T is the number of iterations in the K-means. The second part is constructing the I2I similarity $w^I_{ij}$. Using the reduction strategy in the NN-search, the average time complexity of this part is $O(N K^{1+\delta} D)$. The last part is the eigen decomposition for the bilinear projection matrices via alternate optimization. The updates of $\Theta_1$ and $\Theta_2$ require eigen decompositions of $D_1 \times D_1$ and $D_2 \times D_2$ matrices, each cubic in the matrix size, and are repeated $N_T$ times, where $N_T$ is the number of iterations of the alternate optimization. In the experiments, $N_T$ is always less than 10.

VI. EXPERIMENTS
In this section, the proposed UBLH algorithm is evaluated on the image similarity search problem. Three realistic image data sets are used in our experiments: Caltech-256 [44], SUN397 [45], and Flickr 1M.² The Caltech-256 data set consists of 30 607 images associated with 256 object categories. We randomly choose 1000 images as the query set, and the rest of the data set is regarded as the training set. The SUN397 data set contains 108 754 scene images in total from 397 well-sampled categories with at least 100 images per category. Seventy samples are randomly selected from each category to construct the training set, and the rest of the samples form the query set. Thus, the training set and the query set contain 27 790 and 80 964 images, respectively. The Flickr 1M data set contains one million Web images collected from Flickr. We randomly select 1K images as the queries and use the remaining images to form the gallery database. Considering the huge cost of computation, in this experiment, only 150 000 randomly selected samples from the gallery database form the training set. Furthermore, for image searching tasks, given an image, we describe it with a set of local features extracted from it. In our experiments, we extract 128-D SIFT as the local feature to describe the images and then learn to hash these local descriptors with all compared methods. In particular, considering the computational cost, we limit the maximum number of local features extracted from one image to 700.
In the querying phase, using LHV as the retrieval tool, a returned point is regarded as a neighbor if it lies in the top ranked 200, 200, and 1000 points for Caltech-256, SUN397, and Flickr 1M, respectively. Specifically, in LHV, we only consider the local hash codes lying in the buckets that fall within a small Hamming radius r = 2 (following [25]) in the Hamming lookup table, which is constructed using the training codes. We evaluate the retrieval results in terms of the mean average precision (MAP) and the precision-recall curve by changing the number of top ranked points in LHV. Additionally, we also report the training time and the test time (the average searching time used for each query) for all methods. Our experiments are completed using MATLAB 2013a on a server configured with a 12-core processor and 128 GB of RAM running the Linux OS.
² http://www.multimedia-computing.de/wiki/Flickr1M
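For reference, a common way to compute MAP over ranked lists is sketched below. The normalization used here (averaging the precision at each relevant hit within the top-k list) is an assumption; the paper does not state which MAP variant is used, and the function names are illustrative.

```python
def average_precision(ranked_ids, relevant_ids, top_k):
    """Average of precision@rank over the relevant items found in the top_k list
    (ASSUMED normalization: number of relevant items actually retrieved)."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, item in enumerate(ranked_ids[:top_k], start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings, relevants, top_k):
    """Mean of the per-query average precision."""
    aps = [average_precision(r, rel, top_k) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps)
```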
For our UBLH, to obtain the weak label information, the parameter K of the K-means in the proposed method for each data set is selected from one of {300, 400, . . . , 1200} with the step of 100 by 10-fold cross-validation on the training sets. The parameter δ is always fixed to 0.5. Besides, we set D 1 = 16 and D 2 = 8 for the transformation of 128-D SIFT local features (see Section III). Additionally, the optimal balance parameter λ is chosen from one of {0.05, 0.1, . . . , 0.5} with the step of 0.05 via the training sets, as well.

B. Results and Analysis
We show MAP curves on the Caltech-256, SUN397, and Flickr 1M data sets for the different methods in Fig. 4. All the results are calculated via the proposed LHV ranking algorithm under the same setting. As a general tendency, accuracies on the SUN397 data set are lower than those on the other two data sets, since SUN397 has more categories and larger intraclass variations. Our UBLH algorithm consistently outperforms all the compared methods at every code length. Especially on the Caltech-256 data set, the improvement of UBLH over the top supervised method, KSH, is close to 5% at each code length. Beyond that, we can observe that, due to the label information available in the learning phase, the supervised methods, such as KSH, BRE, and MLH, always achieve better performance than the compared unsupervised methods on all three data sets. Interestingly, the results of LDAH always climb and then fall as the code length increases. The same tendency also appears with BRE, KSH, and PCAH. The two unsupervised hashing methods, ITQ and SpherH, generally outperform the other compared unsupervised methods, while achieving worse accuracies than the supervised methods. Our UBLH achieves dramatically better performance than all other unsupervised methods and also reaches higher accuracies than the supervised ones on the three data sets. This is because we consider the geometric structure of local features and the global relationship between images simultaneously. Besides, in Table II ...
To make the comparison more convincing, some hashing schemes based on global representations are also included in our comparison. For all three data sets, we first use the K-means scheme to construct codebooks of size 500 and 1000, respectively, and then encode SIFT features into global representations via VLAD [49], which has been shown to be more discriminative than the original BoW representation.
After that, the two best-performing hashing methods, i.e., KSH and BRE, are used to learn the hash codes on these global representations. Additionally, we also list the search performance obtained by directly using the global feature GIST with KSH and BRE. Fig. 5 shows that our local hashing method UBLH with LHV achieves better results than the global representation-based hashing schemes with the ordinary hash table (r = 2) for retrieval. Moreover, precision-recall curves of all the compared methods on these data sets with a code length of 96 bits are presented in Fig. 6 as well. From all these figures, we can further observe that, for all three data sets, UBLH achieves significantly better performance than the other unsupervised methods and still slightly outperforms the supervised ones in terms of both the MAP and the area under the curve. Fig. 8 illustrates the convergence of the proposed UBLH on Caltech-256 with a code length of 32. We can clearly observe that the change in the objective between successive iterations, where t is the number of iterations, drops dramatically. Additionally, in Fig. 9, we also compare the performance of UBLH with respect to the parameter K in the K-means and the balance parameter λ via cross-validation on the training sets.
In addition, to illustrate the effectiveness of the F2F and I2I terms in our method, we compare variants of the algorithm using only the F2F term or only the I2I term in Fig. 7. The results indicate that preserving the I2I similarity is more effective than preserving the similarity between features. Meanwhile, combining them always yields better performance. Finally, the training time and test time for the different algorithms on the three data sets are illustrated in Table I. In terms of training time, supervised methods always need more time for hash learning, except for LDAH. In particular, BRE and MLH spend the most time training hash functions. The random projection-based algorithms are relatively more efficient, especially LSH. Our UBLH costs significantly less training time than KSH, BRE, MLH, and CH but more than the other methods. In the test phase, LSH, LDAH, and PCAH are the most efficient methods due to the simple matrix multiplication or thresholding in binary code learning, while AGH has a searching time comparable with that of SpH. Our UBLH costs a similar time to WTA. More details can be seen in Table I.

VII. SCALABILITY FOR VERY LARGE DATA SETS
For more realistic retrieval on very large-scale data sets (e.g., the Google database), the proposed approach would become time-consuming to train due to the huge number of local features generated and involved in our computation. To avoid this heavy computational burden and make our method practical in such cases, we can use anchor point quantization (APQ) to reduce the computational complexity of UBLH. Inspired by [50], we can first extract anchor points from all local features via clustering techniques (e.g., K-means), as stated in our algorithm. Then, each local feature in the training set can be quantized to one anchor point. In this way, we replace all local features with their corresponding anchor points in the training phase.
In particular, F2F preserving can be effectively transferred to anchor point to anchor point preserving. Similarly, we can also use anchor points for I2I preserving. Thus, APQ can be applied on very large-scale data collections to enable more efficient training than directly using a huge number of original local features.
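The quantization step of APQ can be sketched as a nearest-anchor assignment. This is a minimal illustration assuming the anchor points (e.g., K-means centroids) are already available as rows of `anchors`; the function name is hypothetical and the brute-force distance matrix is only suitable for small examples.

```python
import numpy as np

def quantize_to_anchors(features, anchors):
    """Replace each feature by its nearest anchor point.

    Returns the quantized features and the index of the chosen anchor per feature.
    """
    sq = ((features[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    idx = sq.argmin(axis=1)
    return anchors[idx], idx
```

After quantization, training only needs the distinct anchors (and, if desired, how many features map to each one, e.g., via `np.bincount(idx, minlength=len(anchors))`), rather than every original local feature.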

VIII. CONCLUSION
In this paper, we have presented a novel unsupervised hashing framework, namely UBLH, to learn highly discriminative binary codes on local feature descriptors for large-scale image similarity search. The bilinear property of UBLH lets it explore the matrix representation of local features. By considering both the pairwise and the source information of local features, UBLH simultaneously preserves the F2F and I2I structures. We have systematically evaluated our method on the Caltech-256, SUN397, and Flickr 1M data sets and shown promising results compared to state-of-the-art hashing methods.