Biblio

List
Filter

Found 64 results

Filters: Keyword is nearest neighbor search [Clear All Filters]

2022-03-08

Wu, Chao, Ren, Lihong, Hao, Kuangrong. 2021. Modeling of Aggregation Process Based on Feature Selection Extreme Learning Machine of Atomic Search Algorithm. 2021 IEEE 10th Data Driven Control and Learning Systems Conference (DDCLS). :1453—1458.

Polymerization process is a process in the production of polyester fiber, and its reaction parameter intrinsic viscosity has an important influence on the properties of the final polyester fiber. In this paper, a feature selection extreme learning machine model based on binary encoding Atom Search Optimization algorithm is proposed and applied to the polymerization process of polyester fiber production. Firstly, the distance measure of K-NearestNeighbor algorithm, combined with binary coding, and Atom Search Optimization algorithm are used to select features of industrial data to obtain the optimal data set. According to the data set, atom search optimization algorithm is used to optimize the weight and threshold of extreme learning machine and the activation function of the improved extreme learning machine. A prediction model with root mean square error as fitness function was established and applied to polyester production process. The simulation results show that the model has good prediction accuracy, which can be used for reference in the follow-up industrial production.

Yang, Cuicui, Liu, Pinjie. 2021. Big Data Nearest Neighbor Similar Data Retrieval Algorithm based on Improved Random Forest. 2021 International Conference on Big Data Analysis and Computer Science (BDACS). :175—178.

In the process of big data nearest neighbor similar data retrieval, affected by the way of data feature extraction, the retrieval accuracy is low. Therefore, this paper proposes the design of big data nearest neighbor similar data retrieval algorithm based on improved random forest. Through the improvement of random forest model and the construction of random decision tree, the characteristics of current nearest neighbor big data are clarified. Based on the improved random forest, the hash code is generated. Finally, combined with the Hamming distance calculation method, the nearest neighbor similar data retrieval of big data is realized. The experimental results show that: in the multi label environment, the retrieval accuracy is improved by 9% and 10%. In the single label environment, the similar data retrieval accuracy of the algorithm is improved by 12% and 28% respectively.

Razeghi, Behrooz, Ferdowsi, Sohrab, Kostadinov, Dimche, Calmon, Flavio P., Voloshynovskiy, Slava. 2021. Privacy-Preserving near Neighbor Search via Sparse Coding with Ambiguation. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :2635—2639.

In this paper, we propose a framework for privacy-preserving approximate near neighbor search via stochastic sparsifying encoding. The core of the framework relies on sparse coding with ambiguation (SCA) mechanism that introduces the notion of inherent shared secrecy based on the support intersection of sparse codes. This approach is ‘fairness-aware’, in the sense that any point in the neighborhood has an equiprobable chance to be chosen. Our approach can be applied to raw data, latent representation of autoencoders, and aggregated local descriptors. The proposed method is tested on both synthetic i.i.d data and real image databases.

Markchit, Sarawut. 2021. K-mean Index Learning for Multimedia Datasets. 2021 13th International Conference on Knowledge and Smart Technology (KST). :6—11.

Currently, one method to deal with the storage and computation of multimedia retrieval applications is an approximate nearest neighbor (ANN) search. Hashing algorithms and Vector quantization (VQ) are widely used in ANN search. So, K-mean clustering is a method of VQ that can solve those problems. With the increasing growth of multimedia data such as text view, image view, video view, audio view, and 3D view. Thus, it is a reason that why multimedia retrieval is very important. We can retrieve the results of each media type by inputting a query of that type. Even though many hashing algorithms and VQ techniques are proposed to produce a compact or short binary codes. In the real-time purposes the exhaustive search is impractical, and Hamming distance computation in the Hamming space suffers inaccurate results. The challenge of this paper is focusing on how to learn multimedia raw data or features representation to search on each media type for multimedia retrieval. So we propose a new search method that utilizes K-mean hash codes by computing the probability of a cluster in the index code. The proposed employs the index code from the K-mean cluster number that is converted to hash code. The inverted index table is constructed basing on the K-mean hash code. Then we can improve the original K-mean index accuracy and efficiency by learning a deep neural network (DNN). We performed the experiments on four benchmark multimedia datasets to retrieve each view such as 3D, image, video, text, and audio, where hash codes are produced by K-mean clustering methods. Our results show the effectiveness boost the performance on the baseline (exhaustive search).

Ma, Xiaoyu, Yang, Tao, Chen, Jiangchuan, Liu, Ziyu. 2021. k-Nearest Neighbor algorithm based on feature subspace. 2021 International Conference on Big Data Analysis and Computer Science (BDACS). :225—228.

The traditional KNN algorithm takes insufficient consideration of the spatial distribution of training samples, which leads to low accuracy in processing high-dimensional data sets. Moreover, the generation of k nearest neighbors requires all known samples to participate in the distance calculation, resulting in high time overhead. To solve these problems, a feature subspace based KNN algorithm (Feature Subspace KNN, FSS-KNN) is proposed in this paper. First, the FSS-KNN algorithm solves all the feature subspaces according to the distribution of the training samples in the feature space, so as to ensure that the samples in the same subspace have higher similarity. Second, the corresponding feature subspace is matched for the test set samples. On this basis, the search of k nearest neighbors is carried out in the corresponding subspace first, thus improving the accuracy and efficiency of the algorithm. Experimental results show that compared with the traditional KNN algorithm, FSS-KNN algorithm improves the accuracy and efficiency on Kaggle data set and UCI data set. Compared with the other four classical machine learning algorithms, FSS-KNN algorithm can significantly improve the accuracy.

Kazemi, Arman, Sharifi, Mohammad Mehdi, Laguna, Ann Franchesca, Müller, Franz, Rajaei, Ramin, Olivo, Ricardo, Kämpfe, Thomas, Niemier, Michael, Hu, X. Sharon. 2021. In-Memory Nearest Neighbor Search with FeFET Multi-Bit Content-Addressable Memories. 2021 Design, Automation Test in Europe Conference Exhibition (DATE). :1084—1089.

Nearest neighbor (NN) search is an essential operation in many applications, such as one/few-shot learning and image classification. As such, fast and low-energy hardware support for accurate NN search is highly desirable. Ternary content-addressable memories (TCAMs) have been proposed to accelerate NN search for few-shot learning tasks by implementing \$L\$∞ and Hamming distance metrics, but they cannot achieve software-comparable accuracies. This paper proposes a novel distance function that can be natively evaluated with multi-bit content-addressable memories (MCAMs) based on ferroelectric FETs (Fe-FETs) to perform a single-step, in-memory NN search. Moreover, this approach achieves accuracies comparable to floating-point precision implementations in software for NN classification and one/few-shot learning tasks. As an example, the proposed method achieves a 98.34% accuracy for a 5-way, 5-shot classification task for the Omniglot dataset (only 0.8% lower than software-based implementations) with a 3-bit MCAM. This represents a 13% accuracy improvement over state-of-the-art TCAM-based implementations at iso-energy and iso-delay. The presented distance function is resilient to the effects of FeFET device-to-device variations. Furthermore, this work experimentally demonstrates a 2-bit implementation of FeFET MCAM using AND arrays from GLOBALFOUNDRIES to further validate proof of concept.

Ramadhan, Hani, Kwon, Joonho. 2021. Enhancing Learned Index for A Higher Recall Trajectory K-Nearest Neighbor Search. 2021 IEEE International Conference on Big Data (Big Data). :6006—6007.

Learned indices can significantly shorten the query response time of k-Nearest Neighbor search of points data. However, extending the learned index for k-Nearest Neighbor search of trajectory data may return incorrect results (low recall) and require longer pruning time. Thus, we introduce an enhancement for trajectory learned index which is a pruning step for a learned index to retrieve the k-Nearest Neighbors correctly by learning the query workload. The pruning utilizes a predicted range query that covers the correct neighbors. We show that that our approach has the potential to work effectively in a large real-world trajectory dataset.

Myasnikov, Evgeny. 2021. Nearest Neighbor Search In Hyperspectral Data Using Binary Space Partitioning Trees. 2021 11th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS). :1—4.

Fast search of hyperspectral data is crucial in many practical applications ranging from classification to finding duplicate fragments in images. In this paper, we evaluate two space partitioning data structures in the task of searching hyperspectral data. In particular, we consider vp-trees and ball-trees, study several tree construction algorithms, and compare these structures with the brute force approach. In addition, we evaluate vp-trees and ball-trees with four similarity measures, namely, Euclidean Distance, Spectral Angle Mapper Bhattacharyya Angle, and Hellinger distance.

Jia, Yunsong. 2021. Design of nearest neighbor search for dynamic interaction points. 2021 2nd International Conference on Big Data and Informatization Education (ICBDIE). :389—393.

This article describes the definition, theoretical derivation, design ideas, and specific implementation of the nearest query algorithm for the acceleration of probabilistic optimization at first, and secondly gives an optimization conclusion that is generally applicable to high-dimensional Minkowski spaces with even-numbered feature parameters. Thirdly the operating efficiency and space sensitivity of this algorithm and the commonly used algorithms are compared from both theoretical and experimental aspects. Finally, the optimization direction is analyzed based on the results.

Kim, Ji-Hoon, Park, Yeo-Reum, Do, Jaeyoung, Ji, Soo-Young, Kim, Joo-Young. 2021. Accelerating Large-Scale Nearest Neighbor Search with Computational Storage Device. 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). :254—254.

K-nearest neighbor algorithm that searches the K closest samples in a high dimensional feature space is one of the most fundamental tasks in machine learning and image retrieval applications. Computational storage device that combines computing unit and storage module on a single board becomes popular to address the data bandwidth bottleneck of the conventional computing system. In this paper, we propose a nearest neighbor search acceleration platform based on computational storage device, which can process a large-scale image dataset efficiently in terms of speed, energy, and cost. We believe that the proposed acceleration platform is promising to be deployed in cloud datacenters for data-intensive applications.

2021-02-22

Li, M., Zhang, Y., Sun, Y., Wang, W., Tsang, I. W., Lin, X.. 2020. I/O Efficient Approximate Nearest Neighbour Search based on Learned Functions. 2020 IEEE 36th International Conference on Data Engineering (ICDE). :289–300.

Approximate nearest neighbour search (ANNS) in high dimensional space is a fundamental problem in many applications, such as multimedia database, computer vision and information retrieval. Among many solutions, data-sensitive hashing-based methods are effective to this problem, yet few of them are designed for external storage scenarios and hence do not optimized for I/O efficiency during the query processing. In this paper, we introduce a novel data-sensitive indexing and query processing framework for ANNS with an emphasis on optimizing the I/O efficiency, especially, the sequential I/Os. The proposed index consists of several lists of point IDs, ordered by values that are obtained by learned hashing (i.e., mapping) functions on each corresponding data point. The functions are learned from the data and approximately preserve the order in the high-dimensional space. We consider two instantiations of the functions (linear and non-linear), both learned from the data with novel objective functions. We also develop an I/O efficient ANNS framework based on the index. Comprehensive experiments on six benchmark datasets show that our proposed methods with learned index structure perform much better than the state-of-the-art external memory-based ANNS methods in terms of I/O efficiency and accuracy.

Chen, T., Lin, T., Hong, Y.- P.. 2020. Gait Phase Segmentation Using Weighted Dynamic Time Warping and K-Nearest Neighbors Graph Embedding. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :1180–1184.

Gait phase segmentation is the process of identifying the start and end of different phases within a gait cycle. It is essential to many medical applications, such as disease diagnosis or rehabilitation. This work utilizes inertial measurement units (IMUs) mounted on the individual's foot to gather gait information and develops a gait phase segmentation method based on the collected signals. The proposed method utilizes a weighted dynamic time warping (DTW) algorithm to measure the distance between two different gait signals, and a k-nearest neighbors (kNN) algorithm to obtain the gait phase estimates. To reduce the complexity of the DTW-based kNN search, we propose a neural network-based graph embedding scheme that is able to map the IMU signals associated with each gait cycle into a distance-preserving low-dimensional representation while also producing a prediction on the k nearest neighbors of the test signal. Experiments are conducted on self-collected IMU gait signals to demonstrate the effectiveness of the proposed scheme.

Bashyam, K. G. Renga, Vadhiyar, S.. 2020. Fast Scalable Approximate Nearest Neighbor Search for High-dimensional Data. 2020 IEEE International Conference on Cluster Computing (CLUSTER). :294–302.

K-Nearest Neighbor (k-NN) search is one of the most commonly used approaches for similarity search. It finds extensive applications in machine learning and data mining. This era of big data warrants efficiently scaling k-NN search algorithms for billion-scale datasets with high dimensionality. In this paper, we propose a solution towards this end where we use vantage point trees for partitioning the dataset across multiple processes and exploit an existing graph-based sequential approximate k-NN search algorithm called HNSW (Hierarchical Navigable Small World) for searching locally within a process. Our hybrid MPI-OpenMP solution employs techniques including exploiting MPI one-sided communication for reducing communication times and partition replication for better load balancing across processes. We demonstrate computation of k-NN for 10,000 queries in the order of seconds using our approach on 8000 cores on a dataset with billion points in an 128-dimensional space. We also show 10X speedup over a completely k-d tree-based solution for the same dataset, thus demonstrating better suitability of our solution for high dimensional datasets. Our solution shows almost linear strong scaling.

Fang, S., Kennedy, S., Wang, C., Wang, B., Pei, Q., Liu, X.. 2020. Sparser: Secure Nearest Neighbor Search with Space-filling Curves. IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). :370–375.

Nearest neighbor search, a classic way of identifying similar data, can be applied to various areas, including database, machine learning, natural language processing, software engineering, etc. Secure nearest neighbor search aims to find nearest neighbors to a given query point over encrypted data without accessing data in plaintext. It provides privacy protection to datasets when nearest neighbor queries need to be operated by an untrusted party (e.g., a public server). While different solutions have been proposed to support nearest neighbor queries on encrypted data, these existing solutions still encounter critical drawbacks either in efficiency or privacy. In light of the limitations in the current literature, we propose a novel approximate nearest neighbor search solution, referred to as Sparser, by leveraging a combination of space-filling curves, perturbation, and Order-Preserving Encryption. The advantages of Sparser are twofold, strengthening privacy and improving efficiency. Specifically, Sparser pre-processes plaintext data with space-filling curves and perturbation, such that data is sparse, which mitigates leakage abuse attacks and renders stronger privacy. In addition to privacy enhancement, Sparser can efficiently find approximate nearest neighbors over encrypted data with logarithmic time. Through extensive experiments over real-world datasets, we demonstrate that Sparser can achieve strong privacy protection under leakage abuse attacks and minimize search time.

Han, Z., Wang, F., Li, Z.. 2020. Research on Nearest Neighbor Data Association Algorithm Based on Target “Dynamic” Monitoring Model. 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC). 1:665–668.

In order to solve the problem that the Nearest Neighbor Data Association (NNDA) algorithm cannot detect the “dynamic” change of the target, this paper proposes the nearest neighbor data association algorithm based on the Targets “Dynamic” Monitoring Model (TDMM). Firstly, the gate searching and updating of targets are completed based on TDMM, then the NNDA algorithm is utilized to achieve the data association of targets to realize track updating. Finally, the NNDA algorithm based on TDMM is realized by simulation. The experimental results show that the algorithm proposed can achieve “dynamic” monitoring in multi-target data association, and have more obvious advantages than Multiple Hypothesis Tracking (MHT) in timeliness and association performance.

Haile, J., Havens, S.. 2020. Identifying Ubiquitious Third-Party Libraries in Compiled Executables Using Annotated and Translated Disassembled Code with Supervised Machine Learning. 2020 IEEE Security and Privacy Workshops (SPW). :157–162.

The size and complexity of the software ecosystem is a major challenge for vendors, asset owners and cybersecurity professionals who need to understand the security posture of these systems. Annotated and Translated Disassembled Code is a graph based datastore designed to organize firmware and software analysis data across builds, packages and systems, providing a highly scalable platform enabling automated binary software analysis tasks including corpora construction and storage for machine learning. This paper describes an approach for the identification of ubiquitous third-party libraries in firmware and software using Annotated and Translated Disassembled Code and supervised machine learning. Annotated and Translated Disassembled Code provide matched libraries, function names and addresses of previously unidentified code in software as it is being automatically analyzed. This data can be ingested by other software analysis tools to improve accuracy and save time. Defenders can add the identified libraries to their vulnerability searches and add effective detection and mitigation into their operating environment.

Lei, X., Tu, G.-H., Liu, A. X., Xie, T.. 2020. Fast and Secure kNN Query Processing in Cloud Computing. 2020 IEEE Conference on Communications and Network Security (CNS). :1–9.

Advances in sensing and tracking technology lead to the proliferation of location-based services. Location service providers (LSPs) often resort to commercial public clouds to store the tremendous geospatial data and process location-based queries from data users. To protect the privacy of LSP's geospatial data and data user's query location against the untrusted cloud, they are required to be encrypted before sending to the cloud. Nevertheless, it is not easy to design a fast and secure location-based query processing scheme over the encrypted data. In this paper, we propose a Fast and Secure kNN (FSkNN) scheme to support secure k nearest neighbor (k NN) search in cloud computing. We reveal the inherent connection between an Sk NN protocol and a secure range query protocol and further describe how to construct FSkNN based on a secure range query protocol. FSkNN leverages a customized accuracy-assured strategy to ensure the result accuracy and adopts a data structure named random Bloom filter (RBF) to build a secure index for efficiently searching. We formally prove the security of FSkNN under the random oracle model. Our evaluation results show that FSkNN is highly practical.

Kornaropoulos, E. M., Papamanthou, C., Tamassia, R.. 2020. The State of the Uniform: Attacks on Encrypted Databases Beyond the Uniform Query Distribution. 2020 IEEE Symposium on Security and Privacy (SP). :1223–1240.

Recent foundational work on leakage-abuse attacks on encrypted databases has broadened our understanding of what an adversary can accomplish with a standard leakage profile. Nevertheless, all known value reconstruction attacks succeed under strong assumptions that may not hold in the real world. The most prevalent assumption is that queries are issued uniformly at random by the client. We present the first value reconstruction attacks that succeed without any knowledge about the query or data distribution. Our approach uses the search-pattern leakage, which exists in all known structured encryption schemes but has not been fully exploited so far. At the core of our method lies a support size estimator, a technique that utilizes the repetition of search tokens with the same response to estimate distances between encrypted values without any assumptions about the underlying distribution. We develop distribution-agnostic reconstruction attacks for both range queries and k-nearest-neighbor (k-NN) queries based on information extracted from the search-pattern leakage. Our new range attack follows a different algorithmic approach than state-of-the-art attacks, which are fine-tuned to succeed under the uniformly distributed queries. Instead, we reconstruct plaintext values under a variety of skewed query distributions and even outperform the accuracy of previous approaches under the uniform query distribution. Our new k-NN attack succeeds with far fewer samples than previous attacks and scales to much larger values of k. We demonstrate the effectiveness of our attacks by experimentally testing them on a wide range of query distributions and database densities, both unknown to the adversary.

Oliver, J., Ali, M., Hagen, J.. 2020. HAC-T and Fast Search for Similarity in Security. 2020 International Conference on Omni-layer Intelligent Systems (COINS). :1–7.

Similarity digests have gained popularity for many security applications like blacklisting/whitelisting, and finding similar variants of malware. TLSH has been shown to be particularly good at hunting similar malware, and is resistant to evasion as compared to other similarity digests like ssdeep and sdhash. Searching and clustering are fundamental tools which help the security analysts and security operations center (SOC) operators in hunting and analyzing malware. Current approaches which aim to cluster malware are not scalable enough to keep up with the vast amount of malware and goodware available in the wild. In this paper, we present techniques which allow for fast search and clustering of TLSH hash digests which can aid analysts to inspect large amounts of malware/goodware. Our approach builds on fast nearest neighbor search techniques to build a tree-based index which performs fast search based on TLSH hash digests. The tree-based index is used in our threshold based Hierarchical Agglomerative Clustering (HAC-T) algorithm which is able to cluster digests in a scalable manner. Our clustering technique can cluster digests in O (n logn) time on average. We performed an empirical evaluation by comparing our approach with many standard and recent clustering techniques. We demonstrate that our approach is much more scalable and still is able to produce good cluster quality. We measured cluster quality using purity on 10 million samples obtained from VirusTotal. We obtained a high purity score in the range from 0.97 to 0.98 using labels from five major anti-virus vendors (Kaspersky, Microsoft, Symantec, Sophos, and McAfee) which demonstrates the effectiveness of the proposed method.

2021-01-25

Arthy, R., Daniel, E., Maran, T. G., Praveen, M.. 2020. A Hybrid Secure Keyword Search Scheme in Encrypted Graph for Social Media Database. 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC). :1000–1004.

Privacy preservation is a challenging task with the huge amount of data that are available in social media. The data those are stored in the distributed environment or in cloud environment need to ensure confidentiality to data. In addition, representing the voluminous data is graph will be convenient to perform keyword search. The proposed work initially reads the data corresponding to social media and converts that into a graph. In order to prevent the data from the active attacks Advanced Encryption Standard algorithm is used to perform graph encryption. Later, search operation is done using two algorithms: kNK keyword search algorithm and top k nearest keyword search algorithm. The first scheme is used to fetch all the data corresponding to the keyword. The second scheme is used to fetch the nearest neighbor. This scheme increases the efficiency of the search process. Here shortest path algorithm is used to find the minimum distance. Now, based on the minimum value the results are produced. The proposed algorithm shows high performance for graph generation and searching and moderate performance for graph encryption.

2020-06-12

[Anonymous]. 2018. Discrete Locally-Linear Preserving Hashing. {2018 25th IEEE International Conference on Image Processing (ICIP). :490—494.

Recently, hashing has attracted considerable attention for nearest neighbor search due to its fast query speed and low storage cost. However, existing unsupervised hashing algorithms have two problems in common. Firstly, the widely utilized anchor graph construction algorithm has inherent limitations in local weight estimation. Secondly, the locally linear structure in the original feature space is seldom taken into account for binary encoding. Therefore, in this paper, we propose a novel unsupervised hashing method, dubbed “discrete locally-linear preserving hashing”, which effectively calculates the adjacent matrix while preserving the locally linear structure in the obtained hash space. Specifically, a novel local anchor embedding algorithm is adopted to construct the approximate adjacent matrix. After that, we directly minimize the reconstruction error with the discrete constrain to learn the binary codes. Experimental results on two typical image datasets indicate that the proposed method significantly outperforms the state-of-the-art unsupervised methods.

Al Kobaisi, Ali, Wocjan, Pawel. 2018. Supervised Max Hashing for Similarity Image Retrieval. 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). :359—365.

The storage efficiency of hash codes and their application in the fast approximate nearest neighbor search, along with the explosion in the size of available labeled image datasets caused an intensive interest in developing learning based hash algorithms recently. In this paper, we present a learning based hash algorithm that utilize ordinal information of feature vectors. We have proposed a novel mathematically differentiable approximation of argmax function for this hash algorithm. It has enabled seamless integration of hash function with deep neural network architecture which can exploit the rich feature vectors generated by convolutional neural networks. We have also proposed a loss function for the case that the hash code is not binary and its entries are digits of arbitrary k-ary base. The resultant model comprised of feature vector generation and hashing layer is amenable to end-to-end training using gradient descent methods. In contrast to the majority of current hashing algorithms that are either not learning based or use hand-crafted feature vectors as input, simultaneous training of the components of our system results in better optimization. Extensive evaluations on NUS-WIDE, CIFAR-10 and MIRFlickr benchmarks show that the proposed algorithm outperforms state-of-art and classical data agnostic, unsupervised and supervised hashing methods by 2.6% to 19.8% mean average precision under various settings.

2020-06-08

Fang, Bo, Hua, Zhongyun, Huang, Hejiao. 2019. Locality-Sensitive Hashing Scheme Based on Heap Sort of Hash Bucket. 2019 14th International Conference on Computer Science Education (ICCSE). :5–10.

Nearest neighbor search (NNS) is one of the current popular research directions, which widely used in machine learning, pattern recognition, image detection and so on. In the low dimension data, based on tree search method can get good results. But when the data dimension goes up, that will produce a curse of dimensional. The proposed Locality-Sensitive Hashing algorithm (LSH) greatly improves the efficiency of nearest neighbor query for high dimensional data. But the algorithm relies on the building a large number of hash table, which makes the space complexity very high. C2LSH based on dynamic collision improves the disadvantage of LSH, but its disadvantage is that it needs to detect the collision times of a large number of data points which Increased query time. Therefore, Based on LSH algorithm, later researchers put forward many improved algorithms, but still not ideal.In this paper, we put forward Locality-Sensitive Hashing Scheme Based on Heap Sort of Hash Bucket (HSLSH) algorithm aiming at the shortcomings of LSH and C2LSH. Its main idea is to take advantage of the efficiency of heapsort in massive data sorting to improve the efficiency of nearest neighbor query. It only needs to rely on a small number of hash functions can not only overcome the shortcoming of LSH need to build a large number of hash table, and avoids defects of C2LSH. Experiments show that our algorithm is more than 20% better than C2LSH in query accuracy and 40% percent lower in query time.

2020-05-22

Sheth, Utsav, Dutta, Sanghamitra, Chaudhari, Malhar, Jeong, Haewon, Yang, Yaoqing, Kohonen, Jukka, Roos, Teemu, Grover, Pulkit. 2018. An Application of Storage-Optimal MatDot Codes for Coded Matrix Multiplication: Fast k-Nearest Neighbors Estimation. 2018 IEEE International Conference on Big Data (Big Data). :1113—1120.

We propose a novel application of coded computing to the problem of the nearest neighbor estimation using MatDot Codes (Fahim et al., Allerton'17) that are known to be optimal for matrix multiplication in terms of recovery threshold under storage constraints. In approximate nearest neighbor algorithms, it is common to construct efficient in-memory indexes to improve query response time. One such strategy is Multiple Random Projection Trees (MRPT), which reduces the set of candidate points over which Euclidean distance calculations are performed. However, this may result in a high memory footprint and possibly paging penalties for large or high-dimensional data. Here we propose two techniques to parallelize MRPT that exploit data and model parallelism respectively by dividing both the data storage and the computation efforts among different nodes in a distributed computing cluster. This is especially critical when a single compute node cannot hold the complete dataset in memory. We also propose a novel coded computation strategy based on MatDot codes for the model-parallel architecture that, in a straggler-prone environment, achieves the storage-optimal recovery threshold, i.e., the number of nodes that are required to serve a query. We experimentally demonstrate that, in the absence of straggling, our distributed approaches require less query time than execution on a single processing node, providing near-linear speedups with respect to the number of worker nodes. Our experiments on real systems with simulated straggling, we also show that in a straggler-prone environment, our strategy achieves a faster query execution than the uncoded strategy.

Despotovski, Filip, Gusev, Marjan, Zdraveski, Vladimir. 2018. Parallel Implementation of K-Nearest-Neighbors for Face Recognition. 2018 26th Telecommunications Forum (℡FOR). :1—4.

Face recognition is a fast-expanding field of research. Countless classification algorithms have found use in face recognition, with more still being developed, searching for better performance and accuracy. For high-dimensional data such as images, the K-Nearest-Neighbours classifier is a tempting choice. However, it is very computationally-intensive, as it has to perform calculations on all items in the stored dataset for each classification it makes. Fortunately, there is a way to speed up the process by performing some of the calculations in parallel. We propose a parallel CUDA implementation of the KNN classifier and then compare it to a serial implementation to demonstrate its performance superiority.