Biblio
Scientific experiments and observations store massive amounts of data in various scientific file formats. Metadata, which describes the characteristics of the data, is commonly used to sift through massive datasets in order to locate data of interest to scientists. Several indexing data structures (such as hash tables, trie, self-balancing search trees, sparse array, etc.) have been developed as part of efforts to provide an efficient method for locating target data. However, efficient determination of an indexing data structure remains unclear in the context of scientific data management, due to the lack of investigation on metadata, metadata queries, and corresponding data structures. In this study, we perform a systematic study of the metadata search essentials in the context of scientific data management. We study a real-world astronomy observation dataset and explore the characteristics of the metadata in the dataset. We also study possible metadata queries based on the discovery of the metadata characteristics and evaluate different data structures for various types of metadata attributes. Our evaluation on real-world dataset suggests that trie is a suitable data structure when prefix/suffix query is required, otherwise hash table should be used. We conclude our study with a summary of our findings. These findings provide a guideline and offers insights in developing metadata indexing methodologies for scientific applications.
Genetic Programming Hyper-heuristic (GPHH) has been successfully applied to automatically evolve effective routing policies to solve the complex Uncertain Capacitated Arc Routing Problem (UCARP). However, GPHH typically ignores the interpretability of the evolved routing policies. As a result, GP-evolved routing policies are often very complex and hard to be understood and trusted by human users. In this paper, we aim to improve the interpretability of the GP-evolved routing policies. To this end, we propose a new Multi-Objective GP (MOGP) to optimise the performance and size simultaneously. A major issue here is that the size is much easier to be optimised than the performance, and the search tends to be biased to the small but poor routing policies. To address this issue, we propose a simple yet effective Two-Stage GPHH (TS-GPHH). In the first stage, only the performance is to be optimised. Then, in the second stage, both objectives are considered (using our new MOGP). The experimental results showed that TS-GPHH could obtain much smaller and more interpretable routing policies than the state-of-the-art single-objective GPHH, without deteriorating the performance. Compared with traditional MOGP, TS-GPHH can obtain a much better and more widespread Pareto front.
In this paper, we study trust-related human factors in supervisory control of swarm robots with varied levels of autonomy (LOA) in a target foraging task. We compare three LOAs: manual, mixed-initiative (MI), and fully autonomous LOA. In the manual LOA, the human operator chooses headings for a flocking swarm, issuing new headings as needed. In the fully autonomous LOA, the swarm is redirected automatically by changing headings using a search algorithm. In the mixed-initiative LOA, if performance declines, control is switched from human to swarm or swarm to human. The result of this work extends the current knowledge on human factors in swarm supervisory control. Specifically, the finding that the relationship between trust and performance improved for passively monitoring operators (i.e., improved situation awareness in higher LOAs) is particularly novel in its contradiction of earlier work. We also discover that operators switch the degree of autonomy when their trust in the swarm system is low. Last, our analysis shows that operator's preference for a lower LOA is confirmed for a new domain of swarm control.
The problem of fast items retrieval from a fixed collection is often encountered in most computer science areas, from operating system components to databases and user interfaces. We present an approach based on hash tables that focuses on both minimizing the number of comparisons performed during the search and minimizing the total collection size. The standard open-addressing double-hashing approach is improved with a non-linear transformation that can be parametrized in order to ensure a uniform distribution of the data in the hash table. The optimal parameter is determined using a genetic algorithm. The paper results show that near-perfect hashing is faster than binary search, yet uses less memory than perfect hashing, being a good choice for memory-constrained applications where search time is also critical.
Recently, hashing has attracted considerable attention for nearest neighbor search due to its fast query speed and low storage cost. However, existing unsupervised hashing algorithms have two problems in common. Firstly, the widely utilized anchor graph construction algorithm has inherent limitations in local weight estimation. Secondly, the locally linear structure in the original feature space is seldom taken into account for binary encoding. Therefore, in this paper, we propose a novel unsupervised hashing method, dubbed “discrete locally-linear preserving hashing”, which effectively calculates the adjacent matrix while preserving the locally linear structure in the obtained hash space. Specifically, a novel local anchor embedding algorithm is adopted to construct the approximate adjacent matrix. After that, we directly minimize the reconstruction error with the discrete constrain to learn the binary codes. Experimental results on two typical image datasets indicate that the proposed method significantly outperforms the state-of-the-art unsupervised methods.
Nearest neighbor search algorithm plays a very important role in computer image algorithm. When the search data is large, we need to use fast search algorithm. The current fast retrieval algorithms are tree based algorithms. The efficiency of the tree algorithm decreases sharply with the increase of the data dimension. In this paper, a local integral hash nearest neighbor algorithm of the spatial space is proposed to construct the tree structure by changing the way of the node of the access tree. It is able to express data distribution characteristics. After experimental testing, this paper achieves more efficient performance in high dimensional data.
Cross-modal hashing, which searches nearest neighbors across different modalities in the Hamming space, has become a popular technique to overcome the storage and computation barrier in multimedia retrieval recently. Although dozens of cross-modal hashing algorithms are proposed to yield compact binary code representation, applying exhaustive search in a large-scale dataset is impractical for the real-time purpose, and the Hamming distance computation suffers inaccurate results. In this paper, we propose a novel index scheme over binary hash codes in cross-modal retrieval. The proposed indexing scheme exploits a few binary bits of the hash code as the index code. Based on the index code representation, we construct an inverted index structure to accelerate the retrieval efficiency and train a neural network to improve the indexing accuracy. Experiments are performed on two benchmark datasets for retrieval across image and text modalities, where hash codes are generated by three cross-modal hashing methods. Results show the proposed method effectively boosts the performance over the benchmark datasets and hash methods.