Biblio
Concurrency programs often induce buggy results due to the unexpected interaction among threads. The detection of these concurrency bugs costs a lot because they usually appear under a specific execution trace. How to virtually explore different thread schedules to detect concurrency bugs efficiently is an important research topic. Many techniques have been proposed, including lightweight techniques like adaptive randomized scheduling (ARS) and heavyweight techniques like maximal causality reduction (MCR). Compared to heavyweight techniques, ARS is efficient in exploring different schedulings and achieves state-of-the-art performance. However, it will lead to explore large numbers of redundant thread schedulings, which will reduce the efficiency. Moreover, it suffers from the “cold start” issue, when little information is available to guide the distance calculation at the beginning of the exploration. In this work, we propose a Heuristic-Enhanced Adaptive Randomized Scheduling (HARS) algorithm, which improves ARS to detect concurrency bugs guided with novel distance metrics and heuristics obtained from existing research findings. Compared with the adaptive randomized scheduling method, it can more effectively distinguish the traces that may contain concurrency bugs and avoid redundant schedules, thus exploring diverse thread schedules effectively. We conduct an evaluation on 45 concurrency Java programs. The evaluation results show that our algorithm performs more stably in terms of effectiveness and efficiency in detecting concurrency bugs. Notably, HARS detects hard-to-expose bugs more effectively, where the buggy traces are rare or the bug triggering conditions are tricky.
The rapid development of Internet has resulted in massive information overloading recently. These information is usually represented by high-dimensional feature vectors in many related applications such as recognition, classification and retrieval. These applications usually need efficient indexing and search methods for such large-scale and high-dimensional database, which typically is a challenging task. Some efforts have been made and solved this problem to some extent. However, most of them are implemented in a single machine, which is not suitable to handle large-scale database.In this paper, we present a novel data index structure and nearest neighbor search algorithm implemented on Apache Spark. We impose a grid on the database and index data by non-empty grid cells. This grid-based index structure is simple and easy to be implemented in parallel. Moreover, we propose to build a scalable KNN graph on the grids, which increase the efficiency of this index structure by a low cost in parallel implementation. Finally, experiments are conducted in both public databases and synthetic databases, showing that the proposed methods achieve overall high performance in both efficiency and accuracy.
The huge amount of user log data collected by search engine providers creates new opportunities to understand user loyalty and defection behavior at an unprecedented scale. However, this also poses a great challenge to analyze the behavior and glean insights into the complex, large data. In this paper, we introduce LoyalTracker, a visual analytics system to track user loyalty and switching behavior towards multiple search engines from the vast amount of user log data. We propose a new interactive visualization technique (flow view) based on a flow metaphor, which conveys a proper visual summary of the dynamics of user loyalty of thousands of users over time. Two other visualization techniques, a density map and a word cloud, are integrated to enable analysts to gain further insights into the patterns identified by the flow view. Case studies and the interview with domain experts are conducted to demonstrate the usefulness of our technique in understanding user loyalty and switching behavior in search engines.