Title | Fast Scalable Approximate Nearest Neighbor Search for High-dimensional Data |
Publication Type | Conference Paper |
Year of Publication | 2020 |
Authors | Bashyam, K. G. Renga, Vadhiyar, S. |
Conference Name | 2020 IEEE International Conference on Cluster Computing (CLUSTER) |
Keywords | application program interfaces, Approximation algorithms, approximation theory, Big Data, data mining, graph theory, graph-based sequential approximate k-NN search algorithm, hierarchical navigable small world, high-dimensional data, HNSW, k-d tree-based solution, k-nearest neighbor search, K-NN search, learning (artificial intelligence), load balancing, Load management, machine learning, machine learning algorithms, Measurement, message passing, MPI one-sided communication, MPI-OpenMP solution, Nearest neighbor methods, nearest neighbor search, parallel algorithms, Partitioning algorithms, Predictive Metrics, pubcrawl, query processing, search problems, similarity search, trees (mathematics), Vantage Point Tree, vantage point trees |
Abstract | K-Nearest Neighbor (k-NN) search is one of the most commonly used approaches for similarity search. It finds extensive applications in machine learning and data mining. This era of big data warrants efficiently scaling k-NN search algorithms for billion-scale datasets with high dimensionality. In this paper, we propose a solution towards this end where we use vantage point trees for partitioning the dataset across multiple processes and exploit an existing graph-based sequential approximate k-NN search algorithm called HNSW (Hierarchical Navigable Small World) for searching locally within a process. Our hybrid MPI-OpenMP solution employs techniques including exploiting MPI one-sided communication for reducing communication times and partition replication for better load balancing across processes. We demonstrate computation of k-NN for 10,000 queries in the order of seconds using our approach on 8000 cores on a dataset with billion points in an 128-dimensional space. We also show 10X speedup over a completely k-d tree-based solution for the same dataset, thus demonstrating better suitability of our solution for high dimensional datasets. Our solution shows almost linear strong scaling. |
DOI | 10.1109/CLUSTER49012.2020.00040 |
Citation Key | bashyam_fast_2020 |