Biblio
Program analyses necessarily make approximations that often lead them to report true alarms interspersed with many false alarms. We propose a new approach to leverage user feedback to guide program analyses towards true alarms and away from false alarms. Our approach associates each alarm with a confidence value by performing Bayesian inference on a probabilistic model derived from the analysis rules. In each iteration, the user inspects the alarm with the highest confidence and labels its ground truth, and the approach recomputes the confidences of the remaining alarms given this feedback. It thereby maximizes the return on the effort by the user in inspecting each alarm. We have implemented our approach in a tool named Bingo for program analyses expressed in Datalog. Experiments with real users and two sophisticated analyses–-a static datarace analysis for Java programs and a static taint analysis for Android apps–-show significant improvements on a range of metrics, including false alarm rates and number of bugs found.
Existing probabilistic privacy enforcement approaches permit the execution of a program that processes sensitive data only if the information it leaks is within the bounds specified by a given policy. Thus, to extract any information, users must manually design a program that satisfies the policy. In this work, we present a novel synthesis approach that automatically transforms a program into one that complies with a given policy. Our approach consists of two ingredients. First, we phrase the problem of determining the amount of leaked information as Bayesian inference, which enables us to leverage existing probabilistic programming engines. Second, we present two synthesis procedures that add uncertainty to the program's outputs as a way of reducing the amount of leaked information: an optimal one based on SMT solving and a greedy one with quadratic running time. We implemented and evaluated our approach on 10 representative programs from multiple application domains. We show that our system can successfully synthesize a permissive enforcement mechanism for all examples.
Two known shortcomings of standard techniques in formal verification are the limited capability to provide system-level assertions, and the scalability to large, complex models, such as those needed in Cyber-Physical Systems (CPS) applications. Leveraging data, which nowadays is becoming ever more accessible, has the potential to mitigate such limitations. However, this leads to a lack of formal proofs that are needed for modern safety-critical systems. This contribution presents a research initiative that addresses these shortcomings by bringing model-based techniques and data-driven methods together, which can help pushing the envelope of existing algorithms and tools in formal verification and thus expanding their applicability to complex engineering systems, such as CPS. In the first part of the contribution, we discuss a new, formal, measurement-driven and model-based automated technique, for the quantitative verification of physical systems with partly unknown dynamics. We formulate this setup as a data-driven Bayesian inference problem, formally embedded within a quantitative, model-based verification procedure. We argue that the approach can be applied to complex physical systems that are key for CPS applications, dealing with spatially continuous variables, evolving under complex dynamics, driven by external inputs, and accessed under noisy measurements. In the second part of the contribution, we concentrate on systems represented by models that evolve under probabilistic and heterogeneous (continuous/discrete - that is "hybrid" - as well as nonlinear) dynamics. Such stochastic hybrid models (also known as SHS) are a natural mathematical framework for CPS. With focus on model-based verification procedures, we provide algorithms for quantitative model checking of temporal specifications on SHS with formal guarantees. This is attained via the development of formal abstraction techniques that are based on quantitative approximations. Theory is complemented by algorithms, all packaged in software tools that are available to users, and which are applied here in the domain of Smart Energy.
Malware detection has been widely studied by analysing either file dropping relationships or characteristics of the file distribution network. This paper, for the first time, studies a global heterogeneous malware delivery graph fusing file dropping relationship and the topology of the file distribution network. The integration offers a unique ability of structuring the end-to-end distribution relationship. However, it brings large heterogeneous graphs to analysis. In our study, an average daily generated graph has more than 4 million edges and 2.7 million nodes that differ in type, such as IPs, URLs, and files. We propose a novel Bayesian label propagation model to unify the multi-source information, including content-agnostic features of different node types and topological information of the heterogeneous network. Our approach does not need to examine the source codes nor inspect the dynamic behaviours of a binary. Instead, it estimates the maliciousness of a given file through a semi-supervised label propagation procedure, which has a linear time complexity w.r.t. the number of nodes and edges. The evaluation on 567 million real-world download events validates that our proposed approach efficiently detects malware with a high accuracy.
The distinctive features of mobile ad hoc networks (MANETs), including dynamic topology and open wireless medium, may lead to MANETs suffering from many security vulnerabilities. In this paper, using recent advances in uncertain reasoning that originated from the artificial intelligence community, we propose a unified trust management scheme that enhances the security in MANETs. In the proposed trust management scheme, the trust model has two components: trust from direct observation and trust from indirect observation. With direct observation from an observer node, the trust value is derived using Bayesian inference, which is a type of uncertain reasoning when the full probability model can be defined. On the other hand, with indirect observation, which is also called secondhand information that is obtained from neighbor nodes of the observer node, the trust value is derived using the Dempster-Shafer theory (DST), which is another type of uncertain reasoning when the proposition of interest can be derived by an indirect method. By combining these two components in the trust model, we can obtain more accurate trust values of the observed nodes in MANETs. We then evaluate our scheme under the scenario of MANET routing. Extensive simulation results show the effectiveness of the proposed scheme. Specifically, throughput and packet delivery ratio (PDR) can be improved significantly with slightly increased average end-to-end delay and overhead of messages.
By representing large corpora with concise and meaningful elements, topic-based generative models aim to reduce the dimension and understand the content of documents. Those techniques originally analyze on words in the documents, but their extensions currently accommodate meta-data such as authorship information, which has been proved useful for textual modeling. The importance of learning authorship is to extract author interests and assign authors to anonymous texts. Author-Topic (AT) model, an unsupervised learning technique, successfully exploits authorship information to model both documents and author interests using topic representations. However, the AT model simplifies that each author has equal contribution on multiple-author documents. To overcome this limitation, we assumes that authors give different degrees of contributions on a document by using a Dirichlet distribution. This automatically transforms the unsupervised AT model to Supervised Author-Topic (SAT) model, which brings a novelty of authorship prediction on anonymous texts. The SAT model outperforms the AT model for identifying authors of documents written by either single authors or multiple authors with a better Receiver Operating Characteristic (ROC) curve and a significantly higher Area Under Curve (AUC). The SAT model not only achieves competitive performance to state-of-the-art techniques e.g. Random forests but also maintains the characteristics of the unsupervised models for information discovery i.e. Word distributions of topics, author interests, and author contributions.
Practical intrusion detection in Wireless Multihop Networks (WMNs) is a hard challenge. The distributed nature of the network makes centralized intrusion detection difficult, while resource constraints of the nodes and the characteristics of the wireless medium often render decentralized, node-based approaches impractical. We demonstrate that an active-probing-based network intrusion detection system (AP-NIDS) is practical for WMNs. The key contribution of this paper is to optimize the active probing process: we introduce a general Bayesian model and design a probe selection algorithm that reduces the number of probes while maximizing the insights gathered by the AP-NIDS. We validate our model by means of testbed experimentation. We integrate it to our open source AP-NIDS DogoIDS and run it in an indoor wireless mesh testbed utilizing the IEEE 802.11s protocol. For the example of a selective packet dropping attack, we develop the detection states for our Bayes model, and show its feasibility. We demonstrate that our approach does not need to execute the complete set of probes, yet we obtain good detection rates.