Visible to the public Fast Scalable Approximate Nearest Neighbor Search for High-dimensional Data

TitleFast Scalable Approximate Nearest Neighbor Search for High-dimensional Data
Publication TypeConference Paper
Year of Publication2020
AuthorsBashyam, K. G. Renga, Vadhiyar, S.
Conference Name2020 IEEE International Conference on Cluster Computing (CLUSTER)
Keywordsapplication program interfaces, Approximation algorithms, approximation theory, Big Data, data mining, graph theory, graph-based sequential approximate k-NN search algorithm, hierarchical navigable small world, high-dimensional data, HNSW, k-d tree-based solution, k-nearest neighbor search, K-NN search, learning (artificial intelligence), load balancing, Load management, machine learning, machine learning algorithms, Measurement, message passing, MPI one-sided communication, MPI-OpenMP solution, Nearest neighbor methods, nearest neighbor search, parallel algorithms, Partitioning algorithms, Predictive Metrics, pubcrawl, query processing, search problems, similarity search, trees (mathematics), Vantage Point Tree, vantage point trees
AbstractK-Nearest Neighbor (k-NN) search is one of the most commonly used approaches for similarity search. It finds extensive applications in machine learning and data mining. This era of big data warrants efficiently scaling k-NN search algorithms for billion-scale datasets with high dimensionality. In this paper, we propose a solution towards this end where we use vantage point trees for partitioning the dataset across multiple processes and exploit an existing graph-based sequential approximate k-NN search algorithm called HNSW (Hierarchical Navigable Small World) for searching locally within a process. Our hybrid MPI-OpenMP solution employs techniques including exploiting MPI one-sided communication for reducing communication times and partition replication for better load balancing across processes. We demonstrate computation of k-NN for 10,000 queries in the order of seconds using our approach on 8000 cores on a dataset with billion points in an 128-dimensional space. We also show 10X speedup over a completely k-d tree-based solution for the same dataset, thus demonstrating better suitability of our solution for high dimensional datasets. Our solution shows almost linear strong scaling.
DOI10.1109/CLUSTER49012.2020.00040
Citation Keybashyam_fast_2020