Fast Scalable Approximate Nearest Neighbor Search for High-dimensional Data

Submitted by aekwall on Mon, 02/22/2021 - 1:45pm

Title	Fast Scalable Approximate Nearest Neighbor Search for High-dimensional Data
Publication Type	Conference Paper
Year of Publication	2020
Authors	Bashyam, K. G. Renga, Vadhiyar, S.
Conference Name	2020 IEEE International Conference on Cluster Computing (CLUSTER)
Keywords	application program interfaces, Approximation algorithms, approximation theory, Big Data, data mining, graph theory, graph-based sequential approximate k-NN search algorithm, hierarchical navigable small world, high-dimensional data, HNSW, k-d tree-based solution, k-nearest neighbor search, K-NN search, learning (artificial intelligence), load balancing, Load management, machine learning, machine learning algorithms, Measurement, message passing, MPI one-sided communication, MPI-OpenMP solution, Nearest neighbor methods, nearest neighbor search, parallel algorithms, Partitioning algorithms, Predictive Metrics, pubcrawl, query processing, search problems, similarity search, trees (mathematics), Vantage Point Tree, vantage point trees
Abstract	K-Nearest Neighbor (k-NN) search is one of the most commonly used approaches for similarity search. It finds extensive applications in machine learning and data mining. This era of big data warrants efficiently scaling k-NN search algorithms for billion-scale datasets with high dimensionality. In this paper, we propose a solution towards this end where we use vantage point trees for partitioning the dataset across multiple processes and exploit an existing graph-based sequential approximate k-NN search algorithm called HNSW (Hierarchical Navigable Small World) for searching locally within a process. Our hybrid MPI-OpenMP solution employs techniques including exploiting MPI one-sided communication for reducing communication times and partition replication for better load balancing across processes. We demonstrate computation of k-NN for 10,000 queries in the order of seconds using our approach on 8000 cores on a dataset with billion points in an 128-dimensional space. We also show 10X speedup over a completely k-d tree-based solution for the same dataset, thus demonstrating better suitability of our solution for high dimensional datasets. Our solution shows almost linear strong scaling.
DOI	10.1109/CLUSTER49012.2020.00040
Citation Key	bashyam_fast_2020

Groups:

Science of Security VO