Mean-Removed Product Quantization for Approximate Nearest Neighbor Search

Title: Mean-Removed Product Quantization for Approximate Nearest Neighbor Search
Publication Type: Conference Paper
Year of Publication: 2019
Authors: Yang, Jiacheng; Chen, Bin; Xia, Shu-Tao
Conference Name: 2019 International Conference on Data Mining Workshops (ICDMW)
Keywords: approximate nearest neighbor search, Cartesian product, codewords, data point, high-dimensional space, high-dimensional vector space, information retrieval, low-dimensional subspaces, mean-removed product quantization, mean-removed vector quantization, Measurement, memory usage, Metrics, nearest neighbor search, nearest neighbour methods, neural nets, PQ, product quantization, pubcrawl, quantisation (signal), residual vectors, retrieval speed, scalar quantizer, search problems, subspaces, vector quantisation, Vectors
Abstract: Product quantization (PQ) and its variations are popular and attractive in approximate nearest neighbor search (ANN) due to their lower memory usage and faster retrieval speed. PQ decomposes the high-dimensional vector space into several low-dimensional subspaces and quantizes each sub-vector in its subspace separately. Thus, PQ can generate a codebook containing an exponential number of codewords or indices by a Cartesian product of the sub-codebooks from different subspaces. However, when there is large variance in the average amplitude of the components of the data points, directly applying PQ to the data points results in poor performance. In this paper, we propose a new approach, namely mean-removed product quantization (MRPQ), to address this issue. In fact, the average amplitude of a data point, i.e., the mean of a data point, can be regarded as statistically independent of the variation of the vector, that is, of the way the components vary about this average. We can therefore learn a separate scalar quantizer for the means of the data points and apply PQ to their residual vectors. As shown in our comprehensive experiments on four large-scale public datasets, our approach achieves substantial improvements in terms of Recall and MAP over some known methods. Moreover, our approach is general and can be combined with PQ and its variations.
DOI: 10.1109/ICDMW.2019.00107
Citation Key: yang_mean-removed_2019
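The encoding scheme described in the abstract — scalar-quantize each vector's mean, then run ordinary product quantization on the mean-removed residuals — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the use of a tiny k-means in place of a tuned quantizer, and the parameter choices (`m` subspaces, codebook sizes `k_sub` and `k_mean`) are all assumptions for the example.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Minimal k-means returning (centroids, assignments); stands in for
    any codebook learner here."""
    rng = np.random.default_rng(seed)
    cent = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        # Squared distances from every point to every centroid.
        d = ((x[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
        a = d.argmin(1)
        for j in range(k):
            if (a == j).any():
                cent[j] = x[a == j].mean(0)
    return cent, a

def mrpq_encode(X, m=4, k_sub=16, k_mean=16):
    """Sketch of mean-removed PQ encoding (illustrative, not the paper's code).

    1) Compute each vector's mean amplitude and learn a scalar quantizer
       (here: 1-D k-means) over those means.
    2) Subtract the mean from each vector and apply plain PQ (k-means per
       subspace) to the residuals.
    """
    n, d = X.shape
    means = X.mean(axis=1, keepdims=True)        # per-vector mean amplitude, shape (n, 1)
    mean_cb, mean_codes = kmeans(means, k_mean)  # scalar quantizer of the means
    R = X - means                                # mean-removed residual vectors
    subspaces = np.split(R, m, axis=1)           # m low-dimensional subspaces
    codebooks, codes = [], []
    for s in subspaces:
        cb, c = kmeans(s, k_sub)                 # sub-codebook per subspace
        codebooks.append(cb)
        codes.append(c)
    return mean_cb, mean_codes, codebooks, np.stack(codes, axis=1)
```

A vector is then approximated by adding its quantized mean back to the concatenation of its selected sub-codewords; because the mean is handled by a separate cheap scalar quantizer, the PQ codebooks only have to model the variation about the mean, which is the situation the abstract identifies as favorable.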