Mean-Removed Product Quantization for Approximate Nearest Neighbor Search

Title: Mean-Removed Product Quantization for Approximate Nearest Neighbor Search
Publication Type: Conference Paper
Year of Publication: 2019
Authors: Yang, Jiacheng; Chen, Bin; Xia, Shu-Tao
Conference Name: 2019 International Conference on Data Mining Workshops (ICDMW)
Keywords: approximate nearest neighbor search, Cartesian product, codewords, data point, high-dimensional space, high-dimensional vector space, information retrieval, low-dimensional subspaces, mean-removed product quantization, mean-removed vector quantization, Measurement, memory usage, Metrics, nearest neighbor search, nearest neighbour methods, neural nets, PQ, product quantization, pubcrawl, quantisation (signal), residual vectors, retrieval speed, scalar quantizer, search problems, subspaces, vector quantisation, Vectors
Abstract: Product quantization (PQ) and its variations are popular and attractive in approximate nearest neighbor search (ANN) due to their lower memory usage and faster retrieval speed. PQ decomposes the high-dimensional vector space into several low-dimensional subspaces and quantizes each sub-vector in its subspace separately. Thus, PQ can generate a codebook containing an exponential number of codewords or indices by a Cartesian product of the sub-codebooks from different subspaces. However, when there is large variance in the average amplitude of the components of the data points, directly applying PQ to the data points results in poor performance. In this paper, we propose a new approach, namely mean-removed product quantization (MRPQ), to address this issue. In fact, the average amplitude of a data point, i.e., the mean of a data point, can be regarded as statistically independent of the variation of the vector, that is, of the way the components vary about this average. We can therefore learn a separate scalar quantizer for the means of the data points and apply PQ to their residual vectors. As shown in our comprehensive experiments on four large-scale public datasets, our approach achieves substantial improvements in terms of Recall and MAP over some known methods. Moreover, our approach is general and can be combined with PQ and its variations.
DOI: 10.1109/ICDMW.2019.00107
Citation Key: yang_mean-removed_2019
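The encoding scheme described in the abstract — scalar-quantize each vector's mean, then run ordinary product quantization on the mean-removed residuals — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the use of a tiny k-means in place of a tuned quantizer, and the parameter choices (`m` subspaces, codebook sizes `k_sub` and `k_mean`) are all assumptions for the example.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Minimal k-means returning (centroids, assignments); stands in for
    any codebook learner here."""
    rng = np.random.default_rng(seed)
    cent = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        # Squared distances from every point to every centroid.
        d = ((x[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
        a = d.argmin(1)
        for j in range(k):
            if (a == j).any():
                cent[j] = x[a == j].mean(0)
    return cent, a

def mrpq_encode(X, m=4, k_sub=16, k_mean=16):
    """Sketch of mean-removed PQ encoding (illustrative, not the paper's code).

    1) Compute each vector's mean amplitude and learn a scalar quantizer
       (here: 1-D k-means) over those means.
    2) Subtract the mean from each vector and apply plain PQ (k-means per
       subspace) to the residuals.
    """
    n, d = X.shape
    means = X.mean(axis=1, keepdims=True)        # per-vector mean amplitude, shape (n, 1)
    mean_cb, mean_codes = kmeans(means, k_mean)  # scalar quantizer of the means
    R = X - means                                # mean-removed residual vectors
    subspaces = np.split(R, m, axis=1)           # m low-dimensional subspaces
    codebooks, codes = [], []
    for s in subspaces:
        cb, c = kmeans(s, k_sub)                 # sub-codebook per subspace
        codebooks.append(cb)
        codes.append(c)
    return mean_cb, mean_codes, codebooks, np.stack(codes, axis=1)
```

A vector is then approximated by adding its quantized mean back to the concatenation of its selected sub-codewords; because the mean is handled by a separate cheap scalar quantizer, the PQ codebooks only have to model the variation about the mean, which is the situation the abstract identifies as favorable.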