Li, M., Zhang, Y., Sun, Y., Wang, W., Tsang, I. W., Lin, X..
2020.
I/O Efficient Approximate Nearest Neighbour Search based on Learned Functions. 2020 IEEE 36th International Conference on Data Engineering (ICDE). :289–300.
Approximate nearest neighbour search (ANNS) in high-dimensional space is a fundamental problem in many applications, such as multimedia databases, computer vision and information retrieval. Among many solutions, data-sensitive hashing-based methods are effective for this problem, yet few of them are designed for external storage scenarios and hence they are not optimized for I/O efficiency during query processing. In this paper, we introduce a novel data-sensitive indexing and query processing framework for ANNS with an emphasis on optimizing I/O efficiency, especially sequential I/Os. The proposed index consists of several lists of point IDs, ordered by values that are obtained by applying learned hashing (i.e., mapping) functions to each corresponding data point. The functions are learned from the data and approximately preserve the order in the high-dimensional space. We consider two instantiations of the functions (linear and non-linear), both learned from the data with novel objective functions. We also develop an I/O efficient ANNS framework based on the index. Comprehensive experiments on six benchmark datasets show that our proposed methods with the learned index structure perform much better than the state-of-the-art external memory-based ANNS methods in terms of I/O efficiency and accuracy.
Chen, T., Lin, T., Hong, Y.-P..
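The index layout described in this abstract (several lists of point IDs, each ordered by a mapping value) can be illustrated with a minimal sketch. It assumes random linear projections in place of the learned functions, keeps everything in memory rather than on external storage, and uses hypothetical names (`build_index`, `query`, `window`) that do not come from the paper.

```python
import bisect
import random


def build_index(points, projections):
    """For each projection, store (value, point_id) pairs sorted by value."""
    index = []
    for w in projections:
        entries = sorted(
            (sum(wi * xi for wi, xi in zip(w, p)), pid)
            for pid, p in enumerate(points)
        )
        index.append(entries)
    return index


def query(points, projections, index, q, k, window):
    """Read one contiguous window per sorted list, then rank candidates exactly."""
    candidates = set()
    for w, entries in zip(projections, index):
        qv = sum(wi * xi for wi, xi in zip(w, q))
        # Binary search for the query's position in this ordered list.
        pos = bisect.bisect_left(entries, (qv, -1))
        lo, hi = max(0, pos - window), min(len(entries), pos + window)
        candidates.update(pid for _, pid in entries[lo:hi])

    def dist2(pid):
        return sum((a - b) ** 2 for a, b in zip(points[pid], q))

    return sorted(candidates, key=dist2)[:k]


rng = random.Random(0)
dim = 8
points = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
# Assumption: random projections stand in for the learned mapping functions.
projections = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(4)]
index = build_index(points, projections)
result = query(points, projections, index, points[17], k=5, window=10)
```

Each list is read as one contiguous slice around the query's position, which is the access pattern that would translate into sequential reads if the lists were stored on disk.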
2020.
Gait Phase Segmentation Using Weighted Dynamic Time Warping and K-Nearest Neighbors Graph Embedding. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). :1180–1184.
Gait phase segmentation is the process of identifying the start and end of different phases within a gait cycle. It is essential to many medical applications, such as disease diagnosis or rehabilitation. This work utilizes inertial measurement units (IMUs) mounted on the individual's foot to gather gait information and develops a gait phase segmentation method based on the collected signals. The proposed method utilizes a weighted dynamic time warping (DTW) algorithm to measure the distance between two different gait signals, and a k-nearest neighbors (kNN) algorithm to obtain the gait phase estimates. To reduce the complexity of the DTW-based kNN search, we propose a neural network-based graph embedding scheme that is able to map the IMU signals associated with each gait cycle into a distance-preserving low-dimensional representation while also producing a prediction on the k nearest neighbors of the test signal. Experiments are conducted on self-collected IMU gait signals to demonstrate the effectiveness of the proposed scheme.
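The DTW-based kNN step in this abstract can be sketched as follows. The weighting used here (a linear penalty `alpha` on the warping offset) is an assumption chosen for illustration, not the paper's weighting scheme, and the toy signals and the labels "stance" and "swing" are made up. The neural graph embedding that the paper uses to reduce search cost is not shown.

```python
from collections import Counter


def weighted_dtw(a, b, alpha=0.05):
    """DTW distance where each local cost is scaled by 1 + alpha * |i - j|."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            weight = 1.0 + alpha * abs(i - j)  # hypothetical weighting
            cost = weight * (a[i - 1] - b[j - 1]) ** 2
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def knn_label(train, query, k=3):
    """Majority label among the k training signals closest to the query."""
    nearest = sorted(train, key=lambda item: weighted_dtw(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy one-dimensional signals standing in for IMU gait segments.
train = [
    ([0, 0, 1, 2, 1, 0], "stance"),
    ([0, 1, 2, 1, 0, 0], "stance"),
    ([0, 0, 1, 2, 2, 1], "stance"),
    ([3, 3, 2, 3, 3, 3], "swing"),
    ([3, 2, 3, 3, 2, 3], "swing"),
    ([2, 3, 3, 3, 3, 2], "swing"),
]
label = knn_label(train, [0, 0, 0, 1, 2, 1], k=3)
```

Every query costs one quadratic-time DTW computation per training signal, which is the complexity the abstract's embedding scheme is meant to reduce.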
Bashyam, K. G. Renga, Vadhiyar, S..
2020.
Fast Scalable Approximate Nearest Neighbor Search for High-dimensional Data. 2020 IEEE International Conference on Cluster Computing (CLUSTER). :294–302.
K-Nearest Neighbor (k-NN) search is one of the most commonly used approaches for similarity search. It finds extensive applications in machine learning and data mining. This era of big data warrants efficiently scaling k-NN search algorithms for billion-scale datasets with high dimensionality. In this paper, we propose a solution towards this end where we use vantage point trees for partitioning the dataset across multiple processes and exploit an existing graph-based sequential approximate k-NN search algorithm called HNSW (Hierarchical Navigable Small World) for searching locally within a process. Our hybrid MPI-OpenMP solution employs techniques including exploiting MPI one-sided communication for reducing communication times and partition replication for better load balancing across processes. We demonstrate computation of k-NN for 10,000 queries on the order of seconds using our approach on 8000 cores on a dataset with a billion points in a 128-dimensional space. We also show a 10X speedup over a completely k-d tree-based solution for the same dataset, thus demonstrating better suitability of our solution for high-dimensional datasets. Our solution shows almost linear strong scaling.
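The partitioning idea in this abstract can be sketched on a single machine. The example below splits a dataset into balanced partitions with vantage points and routes a query to one partition. It assumes a brute-force local search as a stand-in for HNSW, omits MPI and partition replication entirely, and all function names are hypothetical.

```python
import math
import random


def vp_partition(points, ids, depth, rng):
    """Recursively split ids into 2**depth partitions using vantage points."""
    if depth == 0 or len(ids) < 2:
        return {"ids": ids}
    vp = points[rng.choice(ids)]
    ordered = sorted(ids, key=lambda i: math.dist(points[i], vp))
    mid = len(ordered) // 2
    radius = math.dist(points[ordered[mid]], vp)  # median distance to vp
    return {
        "vp": vp,
        "radius": radius,
        "inner": vp_partition(points, ordered[:mid], depth - 1, rng),
        "outer": vp_partition(points, ordered[mid:], depth - 1, rng),
    }


def leaves(node):
    """List the id sets of all partitions, left to right."""
    if "ids" in node:
        return [node["ids"]]
    return leaves(node["inner"]) + leaves(node["outer"])


def route(node, q):
    """Follow the inner/outer tests down to the partition that owns q."""
    while "ids" not in node:
        inside = math.dist(q, node["vp"]) < node["radius"]
        node = node["inner"] if inside else node["outer"]
    return node["ids"]


def local_knn(points, ids, q, k):
    """Brute-force search within one partition (stand-in for HNSW)."""
    return sorted(ids, key=lambda i: math.dist(points[i], q))[:k]


rng = random.Random(1)
points = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
tree = vp_partition(points, list(range(1000)), depth=3, rng=rng)
parts = leaves(tree)
neighbors = local_knn(points, route(tree, points[42]), points[42], k=5)
```

Searching only the routed partition is approximate, since true neighbors may sit just across a partition boundary.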
Fang, S., Kennedy, S., Wang, C., Wang, B., Pei, Q., Liu, X..
2020.
Sparser: Secure Nearest Neighbor Search with Space-filling Curves. IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). :370–375.
Nearest neighbor search, a classic way of identifying similar data, can be applied to various areas, including databases, machine learning, natural language processing, and software engineering. Secure nearest neighbor search aims to find nearest neighbors to a given query point over encrypted data without accessing data in plaintext. It provides privacy protection to datasets when nearest neighbor queries need to be processed by an untrusted party (e.g., a public server). While different solutions have been proposed to support nearest neighbor queries on encrypted data, these existing solutions still suffer from critical drawbacks in either efficiency or privacy. In light of the limitations in the current literature, we propose a novel approximate nearest neighbor search solution, referred to as Sparser, by leveraging a combination of space-filling curves, perturbation, and Order-Preserving Encryption. The advantages of Sparser are twofold: strengthening privacy and improving efficiency. Specifically, Sparser pre-processes plaintext data with space-filling curves and perturbation, such that the data is sparse, which mitigates leakage abuse attacks and yields stronger privacy. In addition to privacy enhancement, Sparser can efficiently find approximate nearest neighbors over encrypted data in logarithmic time. Through extensive experiments over real-world datasets, we demonstrate that Sparser can achieve strong privacy protection under leakage abuse attacks and minimize search time.
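The role of the space-filling curve in this abstract can be shown with a small plaintext sketch. It assumes a Z-order (Morton) curve over 2-D integer points, and it leaves out the perturbation and Order-Preserving Encryption steps. The curve choice and the names `morton2d`, `build` and `approx_nearest` are illustrative assumptions, not details from the paper.

```python
import bisect


def morton2d(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)
        key |= ((y >> b) & 1) << (2 * b + 1)
    return key


def build(points):
    """Sort points by their position along the curve."""
    return sorted((morton2d(x, y), (x, y)) for x, y in points)


def approx_nearest(index, q, window=4):
    """Binary-search the query key, then check a few neighbors on the curve."""
    pos = bisect.bisect_left(index, (morton2d(*q),))
    lo, hi = max(0, pos - window), min(len(index), pos + window)
    return min(
        (p for _, p in index[lo:hi]),
        key=lambda p: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2,
    )


index = build([(1, 1), (2, 3), (10, 10), (11, 9), (40, 40), (41, 42)])
hit = approx_nearest(index, (10, 10))
```

Because the curve maps points to one-dimensional keys, the lookup is a binary search over sorted values, which is what makes a logarithmic-time search compatible with an order-preserving encoding of the keys.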
Han, Z., Wang, F., Li, Z..
2020.
Research on Nearest Neighbor Data Association Algorithm Based on Target “Dynamic” Monitoring Model. 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC). 1:665–668.
To address the problem that the Nearest Neighbor Data Association (NNDA) algorithm cannot detect the “dynamic” change of the target, this paper proposes a nearest neighbor data association algorithm based on the Target “Dynamic” Monitoring Model (TDMM). First, the gate searching and updating of targets are completed based on TDMM; then the NNDA algorithm is used to associate data with targets and update their tracks. Finally, the NNDA algorithm based on TDMM is implemented in simulation. The experimental results show that the proposed algorithm can achieve “dynamic” monitoring in multi-target data association, and has clearer advantages than Multiple Hypothesis Tracking (MHT) in timeliness and association performance.
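For context, the baseline gate-then-associate step named in this abstract can be sketched generically. The example assumes a circular Euclidean gate and a greedy closest-pair-first assignment. It does not implement TDMM, and the function name and the numbers are hypothetical.

```python
import math


def nearest_neighbor_association(predicted, measurements, gate):
    """Greedily pair tracks and measurements, closest pairs first, within the gate."""
    pairs = sorted(
        (math.dist(p, z), t, m)
        for t, p in enumerate(predicted)
        for m, z in enumerate(measurements)
    )
    assigned, used = {}, set()
    for d, t, m in pairs:
        if d > gate:
            break  # every remaining pair lies outside the gate
        if t not in assigned and m not in used:
            assigned[t] = m
            used.add(m)
    return assigned  # maps track index to measurement index


predicted = [(0.0, 0.0), (10.0, 10.0)]
measurements = [(9.5, 10.2), (0.3, -0.1), (50.0, 50.0)]
assignment = nearest_neighbor_association(predicted, measurements, gate=2.0)
```

The third measurement falls outside every gate and stays unassociated, so it would be treated as clutter or as a candidate for a new track.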
Haile, J., Havens, S..
2020.
Identifying Ubiquitious Third-Party Libraries in Compiled Executables Using Annotated and Translated Disassembled Code with Supervised Machine Learning. 2020 IEEE Security and Privacy Workshops (SPW). :157–162.
The size and complexity of the software ecosystem are a major challenge for vendors, asset owners and cybersecurity professionals who need to understand the security posture of these systems. Annotated and Translated Disassembled Code is a graph-based datastore designed to organize firmware and software analysis data across builds, packages and systems, providing a highly scalable platform enabling automated binary software analysis tasks including corpora construction and storage for machine learning. This paper describes an approach for the identification of ubiquitous third-party libraries in firmware and software using Annotated and Translated Disassembled Code and supervised machine learning. Annotated and Translated Disassembled Code provides matched libraries, function names and addresses of previously unidentified code in software as it is being automatically analyzed. This data can be ingested by other software analysis tools to improve accuracy and save time. Defenders can add the identified libraries to their vulnerability searches and add effective detection and mitigation into their operating environment.
Lei, X., Tu, G.-H., Liu, A. X., Xie, T..
2020.
Fast and Secure kNN Query Processing in Cloud Computing. 2020 IEEE Conference on Communications and Network Security (CNS). :1–9.
Advances in sensing and tracking technology have led to the proliferation of location-based services. Location service providers (LSPs) often resort to commercial public clouds to store the tremendous amount of geospatial data and process location-based queries from data users. To protect the privacy of the LSP's geospatial data and the data user's query location against the untrusted cloud, both must be encrypted before being sent to the cloud. Nevertheless, it is not easy to design a fast and secure location-based query processing scheme over the encrypted data. In this paper, we propose a Fast and Secure kNN (FSkNN) scheme to support secure k nearest neighbor (kNN) search in cloud computing. We reveal the inherent connection between an SkNN protocol and a secure range query protocol and further describe how to construct FSkNN based on a secure range query protocol. FSkNN leverages a customized accuracy-assured strategy to ensure the result accuracy and adopts a data structure named random Bloom filter (RBF) to build a secure index for efficient searching. We formally prove the security of FSkNN under the random oracle model. Our evaluation results show that FSkNN is highly practical.
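As background for the index structure named in this abstract, the sketch below shows a standard Bloom filter. It is not the paper's random Bloom filter (RBF), whose construction is not given here. The class name, the parameters and the SHA-256-based hashing are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for cell in ["cell-12", "cell-13", "cell-47"]:  # hypothetical grid-cell labels
    bf.add(cell)
```

A membership test only reads a fixed number of bit positions, which is why Bloom-filter-style indexes are a common building block for fast lookups over encoded data.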
Kornaropoulos, E. M., Papamanthou, C., Tamassia, R..
2020.
The State of the Uniform: Attacks on Encrypted Databases Beyond the Uniform Query Distribution. 2020 IEEE Symposium on Security and Privacy (SP). :1223–1240.
Recent foundational work on leakage-abuse attacks on encrypted databases has broadened our understanding of what an adversary can accomplish with a standard leakage profile. Nevertheless, all known value reconstruction attacks succeed under strong assumptions that may not hold in the real world. The most prevalent assumption is that queries are issued uniformly at random by the client. We present the first value reconstruction attacks that succeed without any knowledge about the query or data distribution. Our approach uses the search-pattern leakage, which exists in all known structured encryption schemes but has not been fully exploited so far. At the core of our method lies a support size estimator, a technique that utilizes the repetition of search tokens with the same response to estimate distances between encrypted values without any assumptions about the underlying distribution. We develop distribution-agnostic reconstruction attacks for both range queries and k-nearest-neighbor (k-NN) queries based on information extracted from the search-pattern leakage. Our new range attack follows a different algorithmic approach than state-of-the-art attacks, which are fine-tuned to succeed under uniformly distributed queries. Instead, we reconstruct plaintext values under a variety of skewed query distributions and even outperform the accuracy of previous approaches under the uniform query distribution. Our new k-NN attack succeeds with far fewer samples than previous attacks and scales to much larger values of k. We demonstrate the effectiveness of our attacks by experimentally testing them on a wide range of query distributions and database densities, both unknown to the adversary.
Oliver, J., Ali, M., Hagen, J..
2020.
HAC-T and Fast Search for Similarity in Security. 2020 International Conference on Omni-layer Intelligent Systems (COINS). :1–7.
Similarity digests have gained popularity for many security applications like blacklisting/whitelisting, and finding similar variants of malware. TLSH has been shown to be particularly good at hunting similar malware, and is resistant to evasion as compared to other similarity digests like ssdeep and sdhash. Searching and clustering are fundamental tools which help security analysts and security operations center (SOC) operators in hunting and analyzing malware. Current approaches which aim to cluster malware are not scalable enough to keep up with the vast amount of malware and goodware available in the wild. In this paper, we present techniques which allow for fast search and clustering of TLSH hash digests and which can help analysts inspect large amounts of malware/goodware. Our approach builds on fast nearest neighbor search techniques to build a tree-based index which performs fast search based on TLSH hash digests. The tree-based index is used in our threshold-based Hierarchical Agglomerative Clustering (HAC-T) algorithm, which is able to cluster digests in a scalable manner. Our clustering technique can cluster digests in O(n log n) time on average. We performed an empirical evaluation by comparing our approach with many standard and recent clustering techniques. We demonstrate that our approach is much more scalable and is still able to produce good cluster quality. We measured cluster quality using purity on 10 million samples obtained from VirusTotal. We obtained a high purity score in the range from 0.97 to 0.98 using labels from five major anti-virus vendors (Kaspersky, Microsoft, Symantec, Sophos, and McAfee), which demonstrates the effectiveness of the proposed method.