Biblio
We examine the problem of aggregating the results of multiple anti-virus (AV) vendors' detectors into a single authoritative ground-truth label for every binary. To do so, we adapt a well-known generative Bayesian model that postulates the existence of a hidden ground truth upon which the AV labels depend. We use training based on Expectation Maximization for this fully unsupervised technique. We evaluate our method using 279,327 distinct binaries from VirusTotal, each of which appeared for the rst time between January 2012 and June 2014.
Our evaluation shows that our statistical model is consistently more accurate at predicting the future-derived ground truth than all unweighted rules of the form \k out of n" AV detections. In addition, we evaluate the scenario where partial ground truth is available for model building. We train a logistic regression predictor on the partial label information. Our results show that as few as a 100 randomly selected training instances with ground truth are enough to achieve 80% true positive rate for 0.1% false positive rate. In comparison, the best unweighted threshold rule provides only 60% true positive rate at the same false positive rate.
While computer programs and logical theories begin by declaring the concepts of interest, be it as data types or as predicates, network computation does not allow such global declarations, and requires concept mining and concept analysis to extract shared semantics for different network nodes. Powerful semantic analysis systems have been the drivers of nearly all paradigm shifts on the web. In categorical terms, most of them can be described as bicompletions of enriched matrices, generalizing the Dedekind-MacNeille-style completions from posets to suitably enriched categories. Yet it has been well known for more than 40 years that ordinary categories themselves in general do not permit such completions. Armed with this new semantical view of Dedekind-MacNeille completions, and of matrix bicompletions, we take another look at this ancient mystery. It turns out that simple categorical versions of the limit superior and limit inferior operations characterize a general notion of Dedekind-MacNeille completion, that seems to be appropriate for ordinary categories, and boils down to the more familiar enriched versions when the limits inferior and superior coincide. This explains away the apparent gap among the completions of ordinary categories, and broadens the path towards categorical concept mining and analysis, opened in previous work.
The success of machine learning, particularly in supervised settings, has led to numerous attempts to apply it in adversarial settings such as spam and malware detection. The core challenge in this class of applications is that adversaries are not static data generators, but make a deliberate effort to evade the classifiers deployed to detect them. We investigate both the problem of modeling the objectives of such adversaries, as well as the algorithmic problem of accounting for rational, objective-driven adversaries. In particular, we demonstrate severe shortcomings of feature reduction in adversarial settings using several natural adversarial objective functions, an observation that is particularly pronounced when the adversary is able to substitute across similar features (for example, replace words with synonyms or replace letters in words). We offer a simple heuristic method for making learning more robust to feature cross-substitution attacks. We then present a more general approach based on mixed-integer linear programming with constraint generation, which implicitly trades off overfitting and feature selection in an adversarial setting using a sparse regularizer along with an evasion model. Our approach is the first method for combining an adversarial classification algorithm with a very general class of models of adversarial classifier evasion. We show that our algorithmic approach significantly outperforms state-of-the-art alternatives.
Stackelberg security game models and associated computational tools have seen deployment in a number of high- consequence security settings, such as LAX canine patrols and Federal Air Marshal Service. This deployment across essentially independent agencies raises a natural question: what global impact does the resulting strategic interaction among the defenders, each using a similar model, have? We address this question in two ways. First, we demonstrate that the most common solution concept of Strong Stackelberg equilibrium (SSE) can result in significant under-investment in security entirely because SSE presupposes a single defender. Second, we propose a framework based on a different solution concept which incorporates a model of interdependencies among targets, and show that in this framework defenders tend to over-defend, even under significant positive externalities of increased defense.
We investigate the coverage efficiency of a sensor network consisting of sensors with circular sensing footprints of different radii. The objective is to completely cover a region in an efficient manner through a controlled (or deterministic) deployment of such sensors. In particular, it is shown that when sensing nodes of two different radii are used for complete coverage, the coverage density is increased, and the sensing cost is significantly reduced as compared to the homogeneous case, in which all nodes have the same sensing radius. Configurations of heterogeneous disks of multiple radii to achieve efficient circle coverings are presented and analyzed.