Visible to the public Biblio

Filters: Keyword is approximate nearest neighbor search  [Clear All Filters]
2020-05-22
Yang, Jiacheng, Chen, Bin, Xia, Shu-Tao.  2019.  Mean-Removed Product Quantization for Approximate Nearest Neighbor Search. 2019 International Conference on Data Mining Workshops (ICDMW). :711—718.
Product quantization (PQ) and its variations are popular and attractive in approximate nearest neighbor search (ANN) due to their lower memory usage and faster retrieval speed. PQ decomposes the high-dimensional vector space into several low-dimensional subspaces, and quantizes each sub-vector in their subspaces, separately. Thus, PQ can generate a codebook containing an exponential number of codewords or indices by a Cartesian product of the sub-codebooks from different subspaces. However, when there is large variance in the average amplitude of the components of the data points, directly utilizing the PQ on the data points would result in poor performance. In this paper, we propose a new approach, namely, mean-removed product quantization (MRPQ) to address this issue. In fact, the average amplitude of a data point or the mean of a date point can be regarded as statistically independent of the variation of the vector, that is, of the way the components vary about this average. Then we can learn a separate scalar quantizer of the means of the data points and apply the PQ to their residual vectors. As shown in our comprehensive experiments on four large-scale public datasets, our approach can achieve substantial improvements in terms of Recall and MAP over some known methods. Moreover, our approach is general which can be combined with PQ and its variations.
Rattaphun, Munlika, Prayoonwong, Amorntip, Chiu, Chih- Yi.  2019.  Indexing in k-Nearest Neighbor Graph by Hash-Based Hill-Climbing. 2019 16th International Conference on Machine Vision Applications (MVA). :1—4.
A main issue in approximate nearest neighbor search is to achieve an excellent tradeoff between search accuracy and computation cost. In this paper, we address this issue by leveraging k-nearest neighbor graph and hill-climbing to accelerate vector quantization in the query assignment process. A modified hill-climbing algorithm is proposed to traverse k-nearest neighbor graph to find closest centroids for a query, rather than calculating the query distances to all centroids. Instead of using random seeds in the original hill-climbing algorithm, we generate high-quality seeds based on the hashing technique. It can boost the query assignment efficiency due to a better start-up in hill-climbing. We evaluate the experiment on the benchmarks of SIFT1M and GIST1M datasets, and show the proposed hashing-based seed generation effectively improves the search performance.
2018-06-11
Yang, C., Li, Z., Qu, W., Liu, Z., Qi, H..  2017.  Grid-Based Indexing and Search Algorithms for Large-Scale and High-Dimensional Data. 2017 14th International Symposium on Pervasive Systems, Algorithms and Networks 2017 11th International Conference on Frontier of Computer Science and Technology 2017 Third International Symposium of Creative Computing (ISPAN-FCST-ISCC). :46–51.

The rapid development of Internet has resulted in massive information overloading recently. These information is usually represented by high-dimensional feature vectors in many related applications such as recognition, classification and retrieval. These applications usually need efficient indexing and search methods for such large-scale and high-dimensional database, which typically is a challenging task. Some efforts have been made and solved this problem to some extent. However, most of them are implemented in a single machine, which is not suitable to handle large-scale database.In this paper, we present a novel data index structure and nearest neighbor search algorithm implemented on Apache Spark. We impose a grid on the database and index data by non-empty grid cells. This grid-based index structure is simple and easy to be implemented in parallel. Moreover, we propose to build a scalable KNN graph on the grids, which increase the efficiency of this index structure by a low cost in parallel implementation. Finally, experiments are conducted in both public databases and synthetic databases, showing that the proposed methods achieve overall high performance in both efficiency and accuracy.

Liu, Yingfan, Cheng, Hong, Cui, Jiangtao.  2017.  PQBF: I/O-Efficient Approximate Nearest Neighbor Search by Product Quantization. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. :667–676.
Approximate nearest neighbor (ANN) search in high-dimensional space plays an essential role in many multimedia applications. Recently, product quantization (PQ) based methods for ANN search have attracted enormous attention in the community of computer vision, due to its good balance between accuracy and space requirement. PQ based methods embed a high-dimensional vector into a short binary code (called PQ code), and the squared Euclidean distance is estimated by asymmetric quantizer distance (AQD) with pretty high precision. Thus, ANN search in the original space can be converted to similarity search on AQD using the PQ approach. All existing PQ methods are in-memory solutions, which may not handle massive data if they cannot fit entirely in memory. In this paper, we propose an I/O-efficient PQ based solution for ANN search. We design an index called PQB+-forest to support efficient similarity search on AQD. PQB+-forest first creates a number of partitions of the PQ codes by a coarse quantizer and then builds a B+-tree, called PQB+-tree, for each partition. The search process is greatly expedited by focusing on a few selected partitions that are closest to the query, as well as by the pruning power of PQB+-trees. According to the experiments conducted on two large-scale data sets containing up to 1 billion vectors, our method outperforms its competitors, including the state-of-the-art PQ method and the state-of-the-art LSH methods for ANN search.
Zhang, Peng-Fei, Li, Chuan-Xiang, Liu, Meng-Yuan, Nie, Liqiang, Xu, Xin-Shun.  2017.  Semi-Relaxation Supervised Hashing for Cross-Modal Retrieval. Proceedings of the 2017 ACM on Multimedia Conference. :1762–1770.
Recently, some cross-modal hashing methods have been devised for cross-modal search task. Essentially, given a similarity matrix, most of these methods tackle a discrete optimization problem by separating it into two stages, i.e., first relaxing the binary constraints and finding a solution of the relaxed optimization problem, then quantizing the solution to obtain the binary codes. This scheme will generate large quantization error. Some discrete optimization methods have been proposed to tackle this; however, the generation of the binary codes is independent of the features in the original space, which makes it not robust to noise. To consider these problems, in this paper, we propose a novel supervised cross-modal hashing method—Semi-Relaxation Supervised Hashing (SRSH). It can learn the hash functions and the binary codes simultaneously. At the same time, to tackle the optimization problem, it relaxes a part of binary constraints, instead of all of them, by introducing an intermediate representation variable. By doing this, the quantization error can be reduced and the optimization problem can also be easily solved by an iterative algorithm proposed in this paper. Extensive experimental results on three benchmark datasets demonstrate that SRSH can obtain competitive results and outperform state-of-the-art unsupervised and supervised cross-modal hashing methods.
2017-08-02
Wang, Min, Zhou, Wengang, Tian, Qi, Zha, Zhengjun, Li, Houqiang.  2016.  Linear Distance Preserving Pseudo-Supervised and Unsupervised Hashing. Proceedings of the 2016 ACM on Multimedia Conference. :1257–1266.

With the advantage in compact representation and efficient comparison, binary hashing has been extensively investigated for approximate nearest neighbor search. In this paper, we propose a novel and general hashing framework, which simultaneously considers a new linear pair-wise distance preserving objective and point-wise constraint. The direct distance preserving objective aims to keep the linear relationships between the Euclidean distance and the Hamming distance of data points. Based on different point-wise constraints, we propose two methods to instantiate this framework. The first one is a pseudo-supervised hashing method, which uses existing unsupervised hashing methods to generate binary codes as pseudo-supervised information. The second one is an unsupervised hashing method, in which quantization loss is considered. We validate our framework on two large-scale datasets. The experiments demonstrate that our pseudo-supervised method achieves consistent improvement for the state-of-the-art unsupervised hashing methods, while our unsupervised method outperforms the state-of-the-art methods.

2017-05-16
Yan, Ting-Kun, Xu, Xin-Shun, Guo, Shanqing, Huang, Zi, Wang, Xiao-Lin.  2016.  Supervised Robust Discrete Multimodal Hashing for Cross-Media Retrieval. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. :1271–1280.

Recently, multimodal hashing techniques have received considerable attention due to their low storage cost and fast query speed for multimodal data retrieval. Many methods have been proposed; however, there are still some problems that need to be further considered. For example, some of these methods just use a similarity matrix for learning hash functions which will discard some useful information contained in original data; some of them relax binary constraints or separate the process of learning hash functions and binary codes into two independent stages to bypass the obstacle of handling the discrete constraints on binary codes for optimization, which may generate large quantization error; some of them are not robust to noise. All these problems may degrade the performance of a model. To consider these problems, in this paper, we propose a novel supervised hashing framework for cross-modal retrieval, i.e., Supervised Robust Discrete Multimodal Hashing (SRDMH). Specifically, SRDMH tries to make final binary codes preserve label information as same as that in original data so that it can leverage more label information to supervise the binary codes learning. In addition, it learns hashing functions and binary codes directly instead of relaxing the binary constraints so as to avoid large quantization error problem. Moreover, to make it robust and easy to solve, we further integrate a flexible l2,p loss with nonlinear kernel embedding and an intermediate presentation of each instance. Finally, an alternating algorithm is proposed to solve the optimization problem in SRDMH. Extensive experiments are conducted on three benchmark data sets. The results demonstrate that the proposed method (SRDMH) outperforms or is comparable to several state-of-the-art methods for cross-modal retrieval task.