Biblio

List
Filter

Found 90 results

Filters: Keyword is search problems [Clear All Filters]

2021-02-22

Li, M., Zhang, Y., Sun, Y., Wang, W., Tsang, I. W., Lin, X.. 2020. I/O Efficient Approximate Nearest Neighbour Search based on Learned Functions. 2020 IEEE 36th International Conference on Data Engineering (ICDE). :289–300.

Approximate nearest neighbour search (ANNS) in high dimensional space is a fundamental problem in many applications, such as multimedia database, computer vision and information retrieval. Among many solutions, data-sensitive hashing-based methods are effective to this problem, yet few of them are designed for external storage scenarios and hence do not optimized for I/O efficiency during the query processing. In this paper, we introduce a novel data-sensitive indexing and query processing framework for ANNS with an emphasis on optimizing the I/O efficiency, especially, the sequential I/Os. The proposed index consists of several lists of point IDs, ordered by values that are obtained by learned hashing (i.e., mapping) functions on each corresponding data point. The functions are learned from the data and approximately preserve the order in the high-dimensional space. We consider two instantiations of the functions (linear and non-linear), both learned from the data with novel objective functions. We also develop an I/O efficient ANNS framework based on the index. Comprehensive experiments on six benchmark datasets show that our proposed methods with learned index structure perform much better than the state-of-the-art external memory-based ANNS methods in terms of I/O efficiency and accuracy.

Bashyam, K. G. Renga, Vadhiyar, S.. 2020. Fast Scalable Approximate Nearest Neighbor Search for High-dimensional Data. 2020 IEEE International Conference on Cluster Computing (CLUSTER). :294–302.

K-Nearest Neighbor (k-NN) search is one of the most commonly used approaches for similarity search. It finds extensive applications in machine learning and data mining. This era of big data warrants efficiently scaling k-NN search algorithms for billion-scale datasets with high dimensionality. In this paper, we propose a solution towards this end where we use vantage point trees for partitioning the dataset across multiple processes and exploit an existing graph-based sequential approximate k-NN search algorithm called HNSW (Hierarchical Navigable Small World) for searching locally within a process. Our hybrid MPI-OpenMP solution employs techniques including exploiting MPI one-sided communication for reducing communication times and partition replication for better load balancing across processes. We demonstrate computation of k-NN for 10,000 queries in the order of seconds using our approach on 8000 cores on a dataset with billion points in an 128-dimensional space. We also show 10X speedup over a completely k-d tree-based solution for the same dataset, thus demonstrating better suitability of our solution for high dimensional datasets. Our solution shows almost linear strong scaling.

Han, Z., Wang, F., Li, Z.. 2020. Research on Nearest Neighbor Data Association Algorithm Based on Target “Dynamic” Monitoring Model. 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC). 1:665–668.

In order to solve the problem that the Nearest Neighbor Data Association (NNDA) algorithm cannot detect the “dynamic” change of the target, this paper proposes the nearest neighbor data association algorithm based on the Targets “Dynamic” Monitoring Model (TDMM). Firstly, the gate searching and updating of targets are completed based on TDMM, then the NNDA algorithm is utilized to achieve the data association of targets to realize track updating. Finally, the NNDA algorithm based on TDMM is realized by simulation. The experimental results show that the algorithm proposed can achieve “dynamic” monitoring in multi-target data association, and have more obvious advantages than Multiple Hypothesis Tracking (MHT) in timeliness and association performance.

Oliver, J., Ali, M., Hagen, J.. 2020. HAC-T and Fast Search for Similarity in Security. 2020 International Conference on Omni-layer Intelligent Systems (COINS). :1–7.

Similarity digests have gained popularity for many security applications like blacklisting/whitelisting, and finding similar variants of malware. TLSH has been shown to be particularly good at hunting similar malware, and is resistant to evasion as compared to other similarity digests like ssdeep and sdhash. Searching and clustering are fundamental tools which help the security analysts and security operations center (SOC) operators in hunting and analyzing malware. Current approaches which aim to cluster malware are not scalable enough to keep up with the vast amount of malware and goodware available in the wild. In this paper, we present techniques which allow for fast search and clustering of TLSH hash digests which can aid analysts to inspect large amounts of malware/goodware. Our approach builds on fast nearest neighbor search techniques to build a tree-based index which performs fast search based on TLSH hash digests. The tree-based index is used in our threshold based Hierarchical Agglomerative Clustering (HAC-T) algorithm which is able to cluster digests in a scalable manner. Our clustering technique can cluster digests in O (n logn) time on average. We performed an empirical evaluation by comparing our approach with many standard and recent clustering techniques. We demonstrate that our approach is much more scalable and still is able to produce good cluster quality. We measured cluster quality using purity on 10 million samples obtained from VirusTotal. We obtained a high purity score in the range from 0.97 to 0.98 using labels from five major anti-virus vendors (Kaspersky, Microsoft, Symantec, Sophos, and McAfee) which demonstrates the effectiveness of the proposed method.

2020-12-11

Zhang, W., Byna, S., Niu, C., Chen, Y.. 2019. Exploring Metadata Search Essentials for Scientific Data Management. 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC). :83—92.

Scientific experiments and observations store massive amounts of data in various scientific file formats. Metadata, which describes the characteristics of the data, is commonly used to sift through massive datasets in order to locate data of interest to scientists. Several indexing data structures (such as hash tables, trie, self-balancing search trees, sparse array, etc.) have been developed as part of efforts to provide an efficient method for locating target data. However, efficient determination of an indexing data structure remains unclear in the context of scientific data management, due to the lack of investigation on metadata, metadata queries, and corresponding data structures. In this study, we perform a systematic study of the metadata search essentials in the context of scientific data management. We study a real-world astronomy observation dataset and explore the characteristics of the metadata in the dataset. We also study possible metadata queries based on the discovery of the metadata characteristics and evaluate different data structures for various types of metadata attributes. Our evaluation on real-world dataset suggests that trie is a suitable data structure when prefix/suffix query is required, otherwise hash table should be used. We conclude our study with a summary of our findings. These findings provide a guideline and offers insights in developing metadata indexing methodologies for scientific applications.

2020-12-01

Wang, S., Mei, Y., Park, J., Zhang, M.. 2019. A Two-Stage Genetic Programming Hyper-Heuristic for Uncertain Capacitated Arc Routing Problem. 2019 IEEE Symposium Series on Computational Intelligence (SSCI). :1606—1613.

Genetic Programming Hyper-heuristic (GPHH) has been successfully applied to automatically evolve effective routing policies to solve the complex Uncertain Capacitated Arc Routing Problem (UCARP). However, GPHH typically ignores the interpretability of the evolved routing policies. As a result, GP-evolved routing policies are often very complex and hard to be understood and trusted by human users. In this paper, we aim to improve the interpretability of the GP-evolved routing policies. To this end, we propose a new Multi-Objective GP (MOGP) to optimise the performance and size simultaneously. A major issue here is that the size is much easier to be optimised than the performance, and the search tends to be biased to the small but poor routing policies. To address this issue, we propose a simple yet effective Two-Stage GPHH (TS-GPHH). In the first stage, only the performance is to be optimised. Then, in the second stage, both objectives are considered (using our new MOGP). The experimental results showed that TS-GPHH could obtain much smaller and more interpretable routing policies than the state-of-the-art single-objective GPHH, without deteriorating the performance. Compared with traditional MOGP, TS-GPHH can obtain a much better and more widespread Pareto front.

Nam, C., Li, H., Li, S., Lewis, M., Sycara, K.. 2018. Trust of Humans in Supervisory Control of Swarm Robots with Varied Levels of Autonomy. 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC). :825—830.

In this paper, we study trust-related human factors in supervisory control of swarm robots with varied levels of autonomy (LOA) in a target foraging task. We compare three LOAs: manual, mixed-initiative (MI), and fully autonomous LOA. In the manual LOA, the human operator chooses headings for a flocking swarm, issuing new headings as needed. In the fully autonomous LOA, the swarm is redirected automatically by changing headings using a search algorithm. In the mixed-initiative LOA, if performance declines, control is switched from human to swarm or swarm to human. The result of this work extends the current knowledge on human factors in swarm supervisory control. Specifically, the finding that the relationship between trust and performance improved for passively monitoring operators (i.e., improved situation awareness in higher LOAs) is particularly novel in its contradiction of earlier work. We also discover that operators switch the degree of autonomy when their trust in the swarm system is low. Last, our analysis shows that operator's preference for a lower LOA is confirmed for a new domain of swarm control.

2020-11-23

Ma, S.. 2018. Towards Effective Genetic Trust Evaluation in Open Network. 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS). :563–569.

In open network environments, since there is no centralized authority to monitor misbehaving entities, malicious entities can easily cause the degradation of the service quality. Trust has become an important factor to ensure network security, which can help entities to distinguish good partners from bad ones. In this paper, trust in open network environment is regarded as a self-organizing system, using self-organization principle of human social trust propagation, a genetic trust evaluation method with self-optimization and family attributes is proposed. In this method, factors of trust evaluation include time, IP, behavior feedback and intuitive trust. Data structure of access record table and trust record table are designed to store the relationship between ancestor nodes and descendant nodes. A genetic trust search algorithm is designed by simulating the biological evolution process. Based on trust information of the current node's ancestors, heuristics generate randomly chromosome populations, whose structure includes time, IP address, behavior feedback and intuitive trust. Then crossover and mutation strategy is used to make the population evolutionary searching. According to the genetic searching termination condition, the optimal trust chromosome in the population is selected, and trust value of the chromosome is computed, which is the node's genetic trust evaluation result. The simulation result shows that the genetic trust evaluation method is effective, and trust evaluation process of the current node can be regarded as the process of searching for optimal trust results from the ancestor nodes' information. With increasing of ancestor nodes' genetic trust information, the trust evaluation result from genetic algorithm searching is more accurate, which can effectively solve the joint fraud problem.

2020-11-20

Paul, S., Padhy, N. P., Mishra, S. K., Srivastava, A. K.. 2019. UUCA: Utility-User Cooperative Algorithm for Flexible Load Scheduling in Distribution System. 2019 8th International Conference on Power Systems (ICPS). :1—6.

Demand response analysis in smart grid deployment substantiated itself as an important research area in recent few years. Two-way communication between utility and users makes peak load reduction feasible by delaying the operation of deferrable appliances. Flexible appliance rescheduling is preferred to the users compared to traditional load curtailment. Again, if users' preferences are accounted into appliance transferring process, then customers concede a little discomfort to help the utility in peak reduction. This paper presents a novel Utility-User Cooperative Algorithm (UUCA) to lower total electricity cost and gross peak demand while preserving users' privacy and preferences. Main driving force in UUCA to motivate the consumers is a new cost function for their flexible appliances. As a result, utility will experience low peak and due to electricity cost decrement, users will get reduced bill. However, to maintain privacy, the behaviors of one customer have not be revealed either to other customers or to the central utility. To justify the effectiveness, UUCA is executed separately on residential, commercial and industrial customers of a distribution grid. Harmony search optimization technique has proved itself superior compared to other heuristic search techniques to prove efficacy of UUCA.

2020-09-04

Jing, Huiyun, Meng, Chengrui, He, Xin, Wei, Wei. 2019. Black Box Explanation Guided Decision-Based Adversarial Attacks. 2019 IEEE 5th International Conference on Computer and Communications (ICCC). :1592—1596.

Adversarial attacks have been the hot research field in artificial intelligence security. Decision-based black-box adversarial attacks are much more appropriate in the real-world scenarios, where only the final decisions of the targeted deep neural networks are accessible. However, since there is no available guidance for searching the imperceptive adversarial perturbation, boundary attack, one of the best performing decision-based black-box attacks, carries out computationally expensive search. For improving attack efficiency, we propose a novel black box explanation guided decision-based black-box adversarial attack. Firstly, the problem of decision-based adversarial attacks is modeled as a derivative-free and constraint optimization problem. To solve this optimization problem, the black box explanation guided constrained random search method is proposed to more quickly find the imperceptible adversarial example. The insights into the targeted deep neural networks explored by the black box explanation are fully used to accelerate the computationally expensive random search. Experimental results demonstrate that our proposed attack improves the attack efficiency by 64% compared with boundary attack.

2020-07-20

Komargodski, Ilan, Naor, Moni, Yogev, Eylon. 2017. White-Box vs. Black-Box Complexity of Search Problems: Ramsey and Graph Property Testing. 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). :622–632.

Ramsey theory assures us that in any graph there is a clique or independent set of a certain size, roughly logarithmic in the graph size. But how difficult is it to find the clique or independent set? If the graph is given explicitly, then it is possible to do so while examining a linear number of edges. If the graph is given by a black-box, where to figure out whether a certain edge exists the box should be queried, then a large number of queries must be issued. But what if one is given a program or circuit for computing the existence of an edge? This problem was raised by Buss and Goldberg and Papadimitriou in the context of TFNP, search problems with a guaranteed solution. We examine the relationship between black-box complexity and white-box complexity for search problems with guaranteed solution such as the above Ramsey problem. We show that under the assumption that collision resistant hash function exist (which follows from the hardness of problems such as factoring, discrete-log and learning with errors) the white-box Ramsey problem is hard and this is true even if one is looking for a much smaller clique or independent set than the theorem guarantees. In general, one cannot hope to translate all black-box hardness for TFNP into white-box hardness: we show this by adapting results concerning the random oracle methodology and the impossibility of instantiating it. Another model we consider is the succinct black-box, where there is a known upper bound on the size of the black-box (but no limit on the computation time). In this case we show that for all TFNP problems there is an upper bound on the number of queries proportional to the description size of the box times the solution size. On the other hand, for promise problems this is not the case. Finally, we consider the complexity of graph property testing in the white-box model. We show a property which is hard to test even when one is given the program for computing the graph. The hard property is whether the graph is a two-source extractor.

2020-06-12

Domniţa, Dan, Oprişa, Ciprian. 2018. A genetic algorithm for obtaining memory constrained near-perfect hashing. 2018 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR). :1—6.

The problem of fast items retrieval from a fixed collection is often encountered in most computer science areas, from operating system components to databases and user interfaces. We present an approach based on hash tables that focuses on both minimizing the number of comparisons performed during the search and minimizing the total collection size. The standard open-addressing double-hashing approach is improved with a non-linear transformation that can be parametrized in order to ensure a uniform distribution of the data in the hash table. The optimal parameter is determined using a genetic algorithm. The paper results show that near-perfect hashing is faster than binary search, yet uses less memory than perfect hashing, being a good choice for memory-constrained applications where search time is also critical.

[Anonymous]. 2018. Discrete Locally-Linear Preserving Hashing. {2018 25th IEEE International Conference on Image Processing (ICIP). :490—494.

Recently, hashing has attracted considerable attention for nearest neighbor search due to its fast query speed and low storage cost. However, existing unsupervised hashing algorithms have two problems in common. Firstly, the widely utilized anchor graph construction algorithm has inherent limitations in local weight estimation. Secondly, the locally linear structure in the original feature space is seldom taken into account for binary encoding. Therefore, in this paper, we propose a novel unsupervised hashing method, dubbed “discrete locally-linear preserving hashing”, which effectively calculates the adjacent matrix while preserving the locally linear structure in the obtained hash space. Specifically, a novel local anchor embedding algorithm is adopted to construct the approximate adjacent matrix. After that, we directly minimize the reconstruction error with the discrete constrain to learn the binary codes. Experimental results on two typical image datasets indicate that the proposed method significantly outperforms the state-of-the-art unsupervised methods.

2020-05-22

Varricchio, Valerio, Frazzoli, Emilio. 2018. Asymptotically Optimal Pruning for Nonholonomic Nearest-Neighbor Search. 2018 IEEE Conference on Decision and Control (CDC). :4459—4466.

Nearest-Neighbor Search (NNS) arises as a key component of sampling-based motion planning algorithms and it is known as their asymptotic computational bottleneck. Algorithms for exact Nearest-Neighbor Search rely on explicit distance comparisons to different extents. However, in motion planning, evaluating distances is generally a computationally demanding task, since the metric is induced by the minimum cost of steering a dynamical system between states. In the presence of driftless nonholonomic constraints, we propose efficient pruning techniques for the k-d tree algorithm that drastically reduce the number of distance evaluations performed during a query. These techniques exploit computationally convenient lower and upper bounds to the geodesic distance of the corresponding sub-Riemannian geometry. Based on asymptotic properties of the reachable sets, we show that the proposed pruning techniques are optimal, modulo a constant factor, and we provide experimental results with the Reeds-Shepp vehicle model.

Yan, Donghui, Wang, Yingjie, Wang, Jin, Wang, Honggang, Li, Zhenpeng. 2018. K-nearest Neighbor Search by Random Projection Forests. 2018 IEEE International Conference on Big Data (Big Data). :4775—4781.

K-nearest neighbor (kNN) search has wide applications in many areas, including data mining, machine learning, statistics and many applied domains. Inspired by the success of ensemble methods and the flexibility of tree-based methodology, we propose random projection forests, rpForests, for kNN search. rpForests finds kNNs by aggregating results from an ensemble of random projection trees with each constructed recursively through a series of carefully chosen random projections. rpForests achieves a remarkable accuracy in terms of fast decay in the missing rate of kNNs and that of discrepancy in the kNN distances. rpForests has a very low computational complexity. The ensemble nature of rpForests makes it easily run in parallel on multicore or clustered computers; the running time is expected to be nearly inversely proportional to the number of cores or machines. We give theoretical insights by showing the exponential decay of the probability that neighboring points would be separated by ensemble random projection trees when the ensemble size increases. Our theory can be used to refine the choice of random projections in the growth of trees, and experiments show that the effect is remarkable.

Kang, Hyunjoong, Hong, Sanghyun, Lee, Kookjin, Park, Noseong, Kwon, Soonhyun. 2018. On Integrating Knowledge Graph Embedding into SPARQL Query Processing. 2018 IEEE International Conference on Web Services (ICWS). :371—374.

SPARQL is a standard query language for knowledge graphs (KGs). However, it is hard to find correct answer if KGs are incomplete or incorrect. Knowledge graph embedding (KGE) enables answering queries on such KGs by inferring unknown knowledge and removing incorrect knowledge. Hence, our long-term goal in this line of research is to propose a new framework that integrates KGE and SPARQL, which opens various research problems to be addressed. In this paper, we solve one of the most critical problems, that is, optimizing the performance of nearest neighbor (NN) search. In our evaluations, we demonstrate that the search time of state-of-the-art NN search algorithms is improved by 40% without sacrificing answer accuracy.

Chen, Yalin, Li, Zhiyang, Shi, Jia, Liu, Zhaobin, Qu, Wenyu. 2018. Stacked K-Means Hashing Quantization for Nearest Neighbor Search. 2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM). :1—4.

Nowadays, with such a huge amount of information available online, one key challenge is how to retrieve target data efficiently. A recent state-of-art solution, k-means hashing (KMH), codes data via a string of binary code obtained by iterative k-means clustering and binary code optimizing. To deal with high dimensional data, KMH divides the space into low-dimensional subspaces, places a hypercube in each subspace and finds its proper location by the mentioned optimizing process. However, the complexity of the optimization increases rapidly when the dimension of the hypercube increases. To address this issue, we propose an improved hashing method stacked k-means hashing (SKMH). The main idea is to increase the approximation by a coarse-to-fine multi-layer lower-dimensional cubes. With these kinds of lower-dimensional cubes, SKMH can achieve a similar approximation ability via a less optimizing time, compared with KMH method using higher-dimensional cubes. Extensive experiments have been conducted on two public databases, demonstrating the performance of our method by some common metrics in fast nearest neighbor search.

Ito, Toshitaka, Itotani, Yuri, Wakabayashi, Shin'ichi, Nagayama, Shinobu, Inagi, Masato. 2018. A Nearest Neighbor Search Engine Using Distance-Based Hashing. 2018 International Conference on Field-Programmable Technology (FPT). :150—157.

This paper proposes an FPGA-based nearest neighbor search engine for high-dimensional data, in which nearest neighbor search is performed based on distance-based hashing. The proposed hardware search engine implements a nearest neighbor search algorithm based on an extension of flexible distance-based hashing (FDH, for short), which finds an exact solution with high probability. The proposed engine is a parallel processing and pipelined circuit so that search results can be obtained in a short execution time. Experimental results show the effectiveness and efficiency of the proposed engine.

Wang, Xi, Yao, Jun, Ji, Hongxia, Zhang, Ze, Li, Chen, Ma, Beizhi. 2018. A Local Integral Hash Nearest Neighbor Algorithm. 2018 3rd International Conference on Mechanical, Control and Computer Engineering (ICMCCE). :544—548.

Nearest neighbor search algorithm plays a very important role in computer image algorithm. When the search data is large, we need to use fast search algorithm. The current fast retrieval algorithms are tree based algorithms. The efficiency of the tree algorithm decreases sharply with the increase of the data dimension. In this paper, a local integral hash nearest neighbor algorithm of the spatial space is proposed to construct the tree structure by changing the way of the node of the access tree. It is able to express data distribution characteristics. After experimental testing, this paper achieves more efficient performance in high dimensional data.

Vijay, Savinu T., Pournami, P. N.. 2018. Feature Based Image Registration using Heuristic Nearest Neighbour Search. 2018 22nd International Computer Science and Engineering Conference (ICSEC). :1—3.

Image registration is the process of aligning images of the same scene taken at different instances, from different viewpoints or by heterogeneous sensors. This can be achieved either by area based or by feature based image matching techniques. Feature based image registration focuses on detecting relevant features from the input images and attaching descriptors to these features. Matching visual descriptions of two images is a major task in image registration. This feature matching is currently done using Exhaustive Search (or Brute-Force) and Nearest Neighbour Search. The traditional method used for nearest neighbour search is by representing the data as k-d trees. This nearest neighbour search can also be performed using combinatorial optimization algorithms such as Simulated Annealing. This work proposes a method to perform image feature matching by nearest neighbour search done based on Threshold Accepting, a faster version of Simulated Annealing.The experiments performed suggest that the proposed algorithm can produce better results within a minimum number of iterations than many existing algorithms.

Song, Fuyuan, Qin, Zheng, Liu, Qin, Liang, Jinwen, Ou, Lu. 2019. Efficient and Secure k-Nearest Neighbor Search Over Encrypted Data in Public Cloud. ICC 2019 - 2019 IEEE International Conference on Communications (ICC). :1—6.

Cloud computing has become an important and popular infrastructure for data storage and sharing. Typically, data owners outsource their massive data to a public cloud that will provide search services to authorized data users. With privacy concerns, the valuable outsourced data cannot be exposed directly, and should be encrypted before outsourcing to the public cloud. In this paper, we focus on k-Nearest Neighbor (k-NN) search over encrypted data. We propose efficient and secure k-NN search schemes based on matrix similarity to achieve efficient and secure query services in public cloud. In our basic scheme, we construct the traces of two diagonal multiplication matrices to denote the Euclidean distance of two data points, and perform secure k-NN search by comparing traces of corresponding similar matrices. In our enhanced scheme, we strengthen the security property by decomposing matrices based on our basic scheme. Security analysis shows that our schemes protect the data privacy and query privacy under attacking with different levels of background knowledge. Experimental evaluations show that both schemes are efficient in terms of computation complexity as well as computational cost.

Markchit, Sarawut, Chiu, Chih-Yi. 2019. Hash Code Indexing in Cross-Modal Retrieval. 2019 International Conference on Content-Based Multimedia Indexing (CBMI). :1—4.

Cross-modal hashing, which searches nearest neighbors across different modalities in the Hamming space, has become a popular technique to overcome the storage and computation barrier in multimedia retrieval recently. Although dozens of cross-modal hashing algorithms are proposed to yield compact binary code representation, applying exhaustive search in a large-scale dataset is impractical for the real-time purpose, and the Hamming distance computation suffers inaccurate results. In this paper, we propose a novel index scheme over binary hash codes in cross-modal retrieval. The proposed indexing scheme exploits a few binary bits of the hash code as the index code. Based on the index code representation, we construct an inverted index structure to accelerate the retrieval efficiency and train a neural network to improve the indexing accuracy. Experiments are performed on two benchmark datasets for retrieval across image and text modalities, where hash codes are generated by three cross-modal hashing methods. Results show the proposed method effectively boosts the performance over the benchmark datasets and hash methods.

Yang, Jiacheng, Chen, Bin, Xia, Shu-Tao. 2019. Mean-Removed Product Quantization for Approximate Nearest Neighbor Search. 2019 International Conference on Data Mining Workshops (ICDMW). :711—718.

Product quantization (PQ) and its variations are popular and attractive in approximate nearest neighbor search (ANN) due to their lower memory usage and faster retrieval speed. PQ decomposes the high-dimensional vector space into several low-dimensional subspaces, and quantizes each sub-vector in their subspaces, separately. Thus, PQ can generate a codebook containing an exponential number of codewords or indices by a Cartesian product of the sub-codebooks from different subspaces. However, when there is large variance in the average amplitude of the components of the data points, directly utilizing the PQ on the data points would result in poor performance. In this paper, we propose a new approach, namely, mean-removed product quantization (MRPQ) to address this issue. In fact, the average amplitude of a data point or the mean of a date point can be regarded as statistically independent of the variation of the vector, that is, of the way the components vary about this average. Then we can learn a separate scalar quantizer of the means of the data points and apply the PQ to their residual vectors. As shown in our comprehensive experiments on four large-scale public datasets, our approach can achieve substantial improvements in terms of Recall and MAP over some known methods. Moreover, our approach is general which can be combined with PQ and its variations.

Horzyk, Adrian, Starzyk, Janusz A.. 2019. Associative Data Model in Search for Nearest Neighbors and Similar Patterns. 2019 IEEE Symposium Series on Computational Intelligence (SSCI). :933—940.

This paper introduces a biologically inspired associative data model and structure for finding nearest neighbors and similar patterns. The method can be used as an alternative to the classical approaches to accelerate the search for such patterns using various priorities for attributes according to the Sebestyen measure. The presented structure, together with algorithms developed in this paper can be useful in various computational intelligence tasks like pattern matching, recognition, clustering, classification, multi-criterion search etc. This approach is particularly useful for the on-line operation of associative neural network graphs. Graphs that dynamically develop their structure during learning on training data. The results of experiments show that the associative approach can substantially accelerate the nearest neighbor search and that associative structures can also be used as a model for KNN tasks. Finally, this paper presents how the associative structures can be used to self-organize data and represent knowledge about them in the associative way, which yields new search approaches described in this paper.

Ahsan, Ramoza, Bashir, Muzammil, Neamtu, Rodica, Rundensteiner, Elke A., Sarkozy, Gabor. 2019. Nearest Neighbor Subsequence Search in Time Series Data. 2019 IEEE International Conference on Big Data (Big Data). :2057—2066.

Continuous growth in sensor data and other temporal sequence data necessitates efficient retrieval and similarity search support on these big time series datasets. However, finding exact similarity results, especially at the granularity of subsequences, is known to be prohibitively costly for large data sets. In this paper, we thus propose an efficient framework for solving this exact subsequence similarity match problem, called TINN (TIme series Nearest Neighbor search). Exploiting the range interval diversity properties of time series datasets, TINN captures similarity at two levels of abstraction, namely, relationships among subsequences within each long time series and relationships across distinct time series in the data set. These relationships are compactly organized in an augmented relationship graph model, with the former relationships encoded in similarity vectors at TINN nodes and the later captured by augmented edge types in the TINN Graph. Query processing strategy deploy novel pruning techniques on the TINN Graph, including node skipping, vertical and horizontal pruning, to significantly reduce the number of time series as well as subsequences to be explored. Comprehensive experiments on synthetic and real world time series data demonstrate that our TINN model consistently outperforms state-of-the-art approaches while still guaranteeing to retrieve exact matches.