Biblio
Onion sites on the darkweb operate using the Tor Hidden Service (HS) protocol to shield their locations on the Internet, which (among other features) enables these sites to host malicious and illegal content while being resistant to legal action and seizure. Identifying and monitoring such illicit sites in the darkweb is of high relevance to the Computer Security and Law Enforcement communities. We have developed an automated infrastructure that crawls and indexes content from onion sites into a large-scale data repository, called LIGHTS, with over 100M pages. In this paper we describe Automated Tool for Onion Labeling (ATOL), a novel scalable analysis service developed to conduct a thematic assessment of the content of onion sites in the LIGHTS repository. ATOL has three core components – (a) a novel keyword discovery mechanism (ATOLKeyword) which extends analyst-provided keywords for different categories by suggesting new descriptive and discriminative keywords that are relevant for the categories; (b) a classification framework (ATOLClassify) that uses the discovered keywords to map onion site content to a set of categories when sufficient labeled data is available; (c) a clustering framework (ATOLCluster) that can leverage information from multiple external heterogeneous knowledge sources, ranging from domain expertise to Bitcoin transaction data, to categorize onion content in the absence of sufficient supervised data. The paper presents empirical results of ATOL on onion datasets derived from the LIGHTS repository, and additionally benchmarks ATOL's algorithms on the publicly available 20 Newsgroups dataset to demonstrate the reproducibility of its results. On the LIGHTS dataset, ATOLClassify gives a 12% performance gain over an analyst-provided baseline, while ATOLCluster gives a 7% improvement over state-of-the-art semi-supervised clustering algorithms. We also discuss how ATOL has been deployed and externally evaluated, as part of the LIGHTS system.
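The keyword-expansion step lends itself to a small illustration. The sketch below is not ATOL's actual algorithm; it simply ranks terms by chi-squared association with an analyst-labeled category and keeps those occurring more often inside the category, using scikit-learn. `documents`, `labels`, and `category` are placeholder inputs.

```python
# Illustrative keyword expansion (not ATOL's algorithm): suggest terms that are
# both discriminative (high chi-squared score) and descriptive (more frequent
# inside the category than outside) for an analyst-provided category.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

def suggest_keywords(documents, labels, category, top_k=10):
    vectorizer = CountVectorizer(stop_words="english", min_df=2)
    X = vectorizer.fit_transform(documents)              # docs x terms
    y = np.array([1 if lab == category else 0 for lab in labels])
    scores, _ = chi2(X, y)                                # term-category association
    terms = vectorizer.get_feature_names_out()
    in_cat = np.asarray(X[np.flatnonzero(y == 1)].sum(axis=0)).ravel()
    out_cat = np.asarray(X[np.flatnonzero(y == 0)].sum(axis=0)).ravel()
    ranked = np.argsort(scores)[::-1]
    picks = [terms[i] for i in ranked if in_cat[i] > out_cat[i]]
    return picks[:top_k]
```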
Modern software should satisfy multiple goals simultaneously: it should provide predictable performance, be robust to failures, handle peak loads and deal seamlessly with unexpected conditions and changes in the execution environment. For this to happen, software designs should account for the possibility of runtime changes and provide formal guarantees of the software's behavior. Control theory is one of the possible design drivers for runtime adaptation, but adopting control theoretic principles often requires additional, specialized knowledge. To overcome this limitation, automated methodologies have been proposed to extract the necessary information from experimental data and design a control system for runtime adaptation. These proposals, however, only process one goal at a time, creating a chain of controllers. In this paper, we propose and evaluate the first automated strategy that takes into account multiple goals without separating them into multiple control strategies. Avoiding the separation allows us to tackle a larger class of problems and provide stronger guarantees. We test our methodology's generality with three case studies that demonstrate its broad applicability in meeting performance, reliability, quality, security, and energy goals despite environmental or requirements changes.
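As a rough, much simplified illustration of handling several goals in one controller rather than a chain, the numpy sketch below runs a single discrete-time integral controller over an assumed linear gain matrix G (in practice such a model would be identified from experimental data); it is not the paper's synthesis procedure.

```python
# Minimal sketch: one integral controller regulating two goals jointly.
# G is an assumed gain matrix relating knob settings to measured goals.
import numpy as np

G = np.array([[0.8, 0.1],        # effect of knob 1 and knob 2 on goal 1 (e.g. latency)
              [0.2, 0.9]])       # ... and on goal 2 (e.g. energy use)
setpoints = np.array([100.0, 50.0])
knobs = np.zeros(2)              # actuator settings (e.g. thread count, DVFS level)
alpha = 0.5                      # integral gain, < 1 for a stability margin

def measure(knobs):
    """Stand-in for the running system: goals respond linearly plus noise."""
    return G @ knobs + np.random.normal(0, 0.5, size=2)

for step in range(50):
    error = setpoints - measure(knobs)
    # allocate the correction across all knobs jointly, not one chain per goal
    knobs += alpha * np.linalg.pinv(G) @ error

print("final goal values:", measure(knobs))
```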
Malicious crowdsourcing forums are gaining traction as sources for spreading misinformation online, but are limited by the costs of hiring and managing human workers. In this paper, we identify a new class of attacks that leverage deep learning language models (Recurrent Neural Networks, or RNNs) to automate the generation of fake online reviews for products and services. Not only are these attacks cheap and therefore more scalable, but they can control the rate of content output to eliminate the signature burstiness that makes crowdsourced campaigns easy to detect. Using Yelp reviews as an example platform, we show how a two-phase review generation and customization attack can produce reviews that are indistinguishable by state-of-the-art statistical detectors. We conduct a survey-based user study to show that these reviews not only evade human detection, but also score high on "usefulness" metrics as rated by users. Finally, we develop novel automated defenses against these attacks by leveraging the lossy transformation introduced by the RNN training and generation cycle. We consider countermeasures against our mechanisms, show that they produce unattractive cost-benefit tradeoffs for attackers, and that they can be further curtailed by simple constraints imposed by online service providers.
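For context, the generation phase of such an attack boils down to temperature-controlled sampling from a character-level language model. The PyTorch sketch below shows a generic character-level LSTM and sampler; it is untrained here, is not the authors' model, and all names are placeholders (in practice the model would first be fit on a large review corpus).

```python
# Generic character-level LSTM sampler, illustrating the kind of generation
# step such an attack relies on.  The model below is untrained.
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state

def sample(model, stoi, itos, seed="the food was", length=300, temperature=0.8):
    model.eval()
    out = list(seed)
    x = torch.tensor([[stoi[c] for c in seed]])
    state = None
    with torch.no_grad():
        logits, state = model(x, state)        # warm up on the seed text
        x = x[:, -1:]
        for _ in range(length):
            logits, state = model(x, state)
            probs = torch.softmax(logits[0, -1] / temperature, dim=0)
            nxt = torch.multinomial(probs, 1)  # temperature-controlled sampling
            out.append(itos[int(nxt)])
            x = nxt.view(1, 1)
    return "".join(out)
```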
Developing Big Data Analytics workloads often involves trial-and-error debugging, due to the unclean nature of datasets or wrong assumptions made about data. When errors (e.g., program crashes, outlier results, etc.) arise, developers are often interested in identifying a subset of the input data that is able to reproduce the problem. BigSift is a new faulty-data localization approach that combines insights from automated fault isolation in software engineering and data provenance in database systems to find a minimum set of failure-inducing inputs. BigSift redefines data provenance for the purpose of debugging using a test oracle function and implements several unique optimizations, specifically geared towards the iterative nature of automated debugging workloads. BigSift improves the accuracy of fault localizability by several orders of magnitude ($\sim 10^3$ to $10^7\times$) compared to Titian data provenance, and improves performance by up to 66× compared to Delta Debugging, an automated fault-isolation technique. For each faulty output, BigSift is able to localize fault-inducing data within 62% of the original job running time.
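The Delta Debugging baseline mentioned above is easy to sketch. The following is the classic ddmin procedure driven by a test-oracle function; BigSift's provenance-based optimizations are not reproduced here.

```python
# Classic Delta Debugging (ddmin), the baseline BigSift is compared against:
# shrink a failing input to a smaller failure-inducing subset, driven by a
# test-oracle function that returns True when the failure still occurs.
def ddmin(records, failing):
    """Return a 1-minimal subset of `records` for which `failing` is True."""
    assert failing(records)
    n = 2
    while len(records) >= 2:
        chunk = len(records) // n
        subsets = [records[i:i + chunk] for i in range(0, len(records), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            complement = [r for j, s in enumerate(subsets) if j != i for r in s]
            if failing(subset):                            # one chunk reproduces it
                records, n, reduced = subset, 2, True
                break
            if len(subsets) > 2 and failing(complement):   # ... or its complement does
                records, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(records):
                break
            n = min(n * 2, len(records))                   # refine the partition
    return records

# Example oracle: the job "fails" whenever a negative record is present.
faulty = ddmin(list(range(-3, 100)), lambda rs: any(r < 0 for r in rs))
print(faulty)   # -> [-3], a minimal failure-inducing record
```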
Monkey, a random testing tool from Google, has been widely used in industrial practice for automatic test-input generation for Android, thanks to its applicability to a variety of application settings, its ease of use, and its compatibility with different Android platforms. Recently, Monkey has been under the spotlight of the research community: recent studies found that none of the studied tools from academia was actually better than Monkey when applied to a set of open-source Android apps. Our recent efforts performed the first case study of applying Monkey to WeChat, a popular messenger app with over 800 million monthly active users, revealed many limitations of Monkey, and developed an improved approach to alleviate some of these limitations. In this paper, we explore two optimization techniques to improve the effectiveness and efficiency of our previous approach. We also conduct a manual categorization of not-covered activities and apply two automatic coverage-analysis techniques to provide insightful information about the not-covered code entities. Lastly, we present findings from our empirical studies of automatic random testing on WeChat with the preceding techniques.
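For reference, Monkey itself is driven through adb; a minimal harness looks roughly like the sketch below. The package name, seed, and event budget are placeholders, and the coverage-analysis steps described above are not shown.

```python
# Minimal harness sketch: drive Google's Monkey against an installed app via adb.
import subprocess

def run_monkey(package="com.example.app", seed=42, events=5000, throttle_ms=200):
    cmd = [
        "adb", "shell", "monkey",
        "-p", package,             # restrict generated events to this package
        "-s", str(seed),           # fixed seed -> reproducible event stream
        "--throttle", str(throttle_ms),
        "--pct-touch", "60",       # bias the event mix toward touch events
        "-v", "-v",
        str(events),
    ]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

if __name__ == "__main__":
    print(run_monkey())
```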
Many popular web applications incorporate end-to-end secure messaging protocols, which seek to ensure that messages sent between users are kept confidential and authenticated, even if the web application's servers are broken into or otherwise compelled into releasing all their data. Protocols that promise such strong security guarantees should be held up to rigorous analysis, since protocol flaws and implementation bugs can easily lead to real-world attacks. We propose a novel methodology that allows protocol designers, implementers, and security analysts to collaboratively verify a protocol using automated tools. The protocol is implemented in ProScript, a new domain-specific language that is designed for writing cryptographic protocol code that can both be executed within JavaScript programs and automatically translated to a readable model in the applied pi calculus. This model can then be analyzed symbolically using ProVerif to find attacks in a variety of threat models. The model can also be used as the basis of a computational proof using CryptoVerif, which reduces the security of the protocol to standard cryptographic assumptions. If ProVerif finds an attack, or if the CryptoVerif proof reveals a weakness, the protocol designer modifies the ProScript protocol code and regenerates the model to enable a new analysis. We demonstrate our methodology by implementing and analyzing a variant of the popular Signal Protocol with only minor differences. We use ProVerif and CryptoVerif to find new and previously-known weaknesses in the protocol and suggest practical countermeasures. Our ProScript protocol code is incorporated within the current release of Cryptocat, a desktop secure messenger application written in JavaScript. Our results indicate that, with disciplined programming and some verification expertise, the systematic analysis of complex cryptographic web applications is now becoming practical.
Software-defined networks provide new facilities for deploying security mechanisms dynamically. In particular, it is possible to build and adjust security chains to protect the infrastructures, by combining different security functions, such as firewalls, intrusion detection systems and services for preventing data leakage. It is important to ensure that these security chains, in view of their complexity and dynamics, are consistent and do not include security violations. We propose in this paper an automated strategy for supporting the verification of security chains in software-defined networks. It relies on an architecture integrating formal verification methods for checking both the control and data planes of these chains, before their deployment. We describe algorithms for translating specifications of security chains into formal models that can then be verified by SMT solving or model checking. Our solution is prototyped as a package, named Synaptic, built as an extension of the Frenetic family of SDN programming languages. The performance of our approach is evaluated through extensive experiments based on the CVC4, veriT, and nuXmv checkers.
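As a toy illustration of the kind of SMT check involved, the snippet below uses the Z3 Python bindings (rather than the CVC4/veriT/nuXmv back-ends used by Synaptic) to ask whether a packet flagged by a data-leakage filter could ever be forwarded by a simple, assumed chain.

```python
# Toy SMT check in the spirit of verifying a security chain: can a packet that
# the data-leakage filter flags ever be forwarded to the outside?
from z3 import Bools, Solver, And, Not, sat

src_internal, dst_external, is_http, flagged_leak = Bools(
    "src_internal dst_external is_http flagged_leak")

# Assumed chain behaviour: the firewall forwards outbound HTTP from internal
# hosts, and the DLP function drops anything it has flagged.
forwarded = And(src_internal, dst_external, is_http, Not(flagged_leak))

# Safety property: no flagged packet is forwarded.  Ask the solver for a
# counterexample packet; unsat means the property holds for all packets.
s = Solver()
s.add(And(forwarded, flagged_leak))
print("violation found" if s.check() == sat else "property holds for all packets")
```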
A smart grid is a fully automated electric power network that operates, protects, and controls the physical components of the power infrastructure in order to supply energy efficiently and reliably. As the importance of cyber-physical system (CPS) security grows, various vulnerability analysis methodologies for general systems have been suggested, whereas there has been little practical research targeting the smart grid infrastructure. In this paper, we highlight the significance of security vulnerability analysis in the smart grid environment. We then introduce various automated techniques for analyzing vulnerabilities in executable files. In our approach, we propose a novel binary-based vulnerability discovery method for Advanced Metering Infrastructure (AMI) and electric vehicle (EV) charging systems that automatically extracts security-related features from the embedded software. Finally, we present the results of applying this vulnerability discovery method to AMI and EV charging systems in the Korean smart grid environment.
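One simple instance of extracting security-related features from an executable is string scanning. The sketch below, which is not the paper's feature set, counts references to historically risky libc functions in a firmware image; the file name is a placeholder.

```python
# Simple sketch of binary feature extraction: scan an embedded firmware image
# for printable strings and count references to risky libc functions, yielding
# a coarse feature vector for later vulnerability analysis.
import re

RISKY = [b"strcpy", b"strcat", b"sprintf", b"gets", b"system", b"memcpy"]

def extract_features(path, min_len=4):
    data = open(path, "rb").read()
    strings = re.findall(rb"[ -~]{%d,}" % min_len, data)   # printable ASCII runs
    features = {name.decode(): 0 for name in RISKY}
    for s in strings:
        for name in RISKY:
            if name in s:
                features[name.decode()] += 1
    features["n_strings"] = len(strings)
    features["size_bytes"] = len(data)
    return features

# e.g. extract_features("ami_firmware.bin") -> {'strcpy': 12, 'system': 3, ...}
```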
Understanding how to group a set of binary files into the piece of software they belong to is highly desirable for software profiling, malware detection, or enterprise audits, among many other applications. Unfortunately, it is also extremely challenging: there is absolutely no uniformity in the ways different applications rely on different files, in how binaries are signed, or in the versioning schemes used across different pieces of software. In this paper, we show that, by combining information gleaned from a large number of endpoints (millions of computers), we can accomplish large-scale application identification automatically and reliably. Our approach relies on collecting metadata on billions of files every day, summarizing it into much smaller "sketches", and performing approximate k-nearest neighbor clustering on non-metric space representations derived from these sketches. We design and implement our proposed system using Apache Spark, show that it can process billions of files in a matter of hours, and thus could be used for daily processing. We further show our system manages to successfully identify which files belong to which application with very high precision, and adequate recall.
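The core building block, summarizing a machine's file set into a small sketch and comparing sketches by approximate Jaccard similarity, can be illustrated with a self-contained MinHash example; the production Spark pipeline and sketch format are not reproduced, and the file names below are placeholders.

```python
# Self-contained MinHash sketching: summarize each machine's set of file names
# (or file hashes) into a small signature and compare signatures by estimated
# Jaccard similarity, the building block for approximate k-NN clustering.
import hashlib

NUM_PERM = 128

def _h(item, seed):
    digest = hashlib.blake2b(item.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def minhash(items):
    return [min(_h(x, seed) for x in items) for seed in range(NUM_PERM)]

def est_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM

machine_a = {"chrome.exe", "libcef.dll", "chrome_100_percent.pak"}
machine_b = {"chrome.exe", "libcef.dll", "widevinecdm.dll"}
print(est_jaccard(minhash(machine_a), minhash(machine_b)))   # roughly 0.5 expected
```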
Traffic of Industrial Control System (ICS) between the Human Machine Interface (HMI) and the Programmable Logic Controller (PLC) is known to be highly periodic. However, it is sometimes multiplexed, due to asynchronous scheduling. Modeling the network traffic patterns of multiplexed ICS streams using Deterministic Finite Automata (DFA) for anomaly detection typically produces a very large DFA and a high false-alarm rate. In this article, we introduce a new modeling approach that addresses this gap. Our Statechart DFA modeling includes multiple DFAs, one per cyclic pattern, together with a DFA-selector that de-multiplexes the incoming traffic into sub-channels and sends them to their respective DFAs. We demonstrate how to automatically construct the statechart from a captured traffic stream. Our unsupervised learning algorithms first build a Discrete-Time Markov Chain (DTMC) from the stream. Next, we split the symbols into sets, one per multiplexed cycle, based on symbol frequencies and node degrees in the DTMC graph. Then, we create a sub-graph for each cycle and extract Euler cycles for each sub-graph. The final statechart is comprised of one DFA per Euler cycle. The algorithms allow for non-unique symbols, which appear in more than one cycle, and also for symbols that appear more than once in a cycle. We evaluated our solution on traces from a production ICS using the Siemens S7-0x72 protocol. We also stress-tested our algorithms on a collection of synthetically-generated traces that simulated multiplexed ICS traces with varying levels of symbol uniqueness and time overlap. The algorithms were able to split the symbols into sets with 99.6% accuracy. The resulting statechart modeled the traces with a median false-alarm rate as low as 0.483%. In all but the most extreme scenarios, the Statechart model drastically reduced both the false-alarm rate and the learned model size in comparison with the naive single-DFA model.
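The first step of the described pipeline, building the DTMC from a captured symbol stream, is straightforward to sketch; the cycle splitting and Euler-cycle extraction are omitted here, and the symbol stream below is illustrative.

```python
# Sketch of the first pipeline step: build a Discrete-Time Markov Chain from a
# captured symbol stream, with the per-symbol frequencies and node degrees that
# the later cycle-splitting step relies on.
from collections import Counter, defaultdict

def build_dtmc(symbols):
    counts = defaultdict(Counter)
    for cur, nxt in zip(symbols, symbols[1:]):
        counts[cur][nxt] += 1
    probs = {s: {t: c / sum(row.values()) for t, c in row.items()}
             for s, row in counts.items()}
    freq = Counter(symbols)
    out_degree = {s: len(row) for s, row in counts.items()}
    return probs, freq, out_degree

# Two multiplexed cycles, A-B-C and X-Y, interleaved by asynchronous scheduling:
stream = list("ABXCYABXYCABCXY")
probs, freq, deg = build_dtmc(stream)
print(deg)   # symbols shared between cycles show a higher out-degree
```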
The paper presents a fully automatic, end-to-end trainable system to colorize grayscale images. Colorization is a highly under-constrained problem. In order to produce realistic outputs, the proposed approach takes advantage of recent advances in deep learning and generative networks. To achieve plausible colorization, the paper investigates conditional Wasserstein Generative Adversarial Networks (WGANs) [3] as a solution to this problem. Additionally, a loss function is proposed that combines the adversarial loss learned by the WGAN with two classification loss components. The first classification loss provides a measure of how much the predicted colored images differ from the ground truth. The second classification loss component makes use of ground-truth semantic classification labels in order to learn meaningful intermediate features. Finally, the WGAN training procedure pushes the predictions toward the manifold of natural images. The system is validated using a user study and a semantic interpretability test, and achieves results comparable to [1] on the ImageNet dataset [10].
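Illustratively, such a composite objective has the general form below; the exact losses and weights are defined in the paper, and the λ terms here are simply placeholders for its hyperparameters.

```latex
% Illustrative composite generator objective (lambda_1, lambda_2 are placeholder weights):
\mathcal{L}_{G} \;=\;
  \underbrace{-\,\mathbb{E}\big[\, D\big(G(x_{\mathrm{gray}})\big) \,\big]}_{\text{WGAN adversarial term}}
  \;+\; \lambda_{1}\, \mathcal{L}_{\mathrm{color}}\big(G(x_{\mathrm{gray}}),\, x_{\mathrm{color}}\big)
  \;+\; \lambda_{2}\, \mathcal{L}_{\mathrm{sem}}\big(\hat{y},\, y_{\mathrm{class}}\big)
```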
Empirical research on the Internet is fraught with challenges. Among these is the possibility that local environmental conditions (e.g., CPU load or network load) introduce unexpected bias or artifacts in measurements that lead to erroneous conclusions. In this paper, we describe a framework for local environment monitoring that is designed to be used during Internet measurement experiments. The goals of our work are to provide a critical, expanded perspective on measurement results and to improve the opportunity for reproducibility of results. We instantiate our framework in a tool we call SoMeta, which monitors the local environment during active probe-based measurement experiments. We evaluate the runtime costs of SoMeta and conduct a series of experiments in which we intentionally perturb different aspects of the local environment during active probe-based measurements. Our experiments show how simple local monitoring can readily expose conditions that bias active probe-based measurement results. We conclude with a discussion of how our framework can be expanded to provide metadata for a broad range of Internet measurement experiments.
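A minimal version of such local-environment monitoring, not SoMeta's implementation or metadata format, can be put together with psutil: sample CPU, load, and network counters in a background thread while the measurement runs.

```python
# Minimal sketch of local-environment monitoring around an active measurement.
import json
import threading
import time
import psutil

def snapshot():
    net = psutil.net_io_counters()
    return {"ts": time.time(),
            "cpu_percent": psutil.cpu_percent(interval=None),
            "load_avg_1m": psutil.getloadavg()[0],
            "bytes_sent": net.bytes_sent,
            "bytes_recv": net.bytes_recv}

def monitored(measurement_fn, period=1.0):
    """Run measurement_fn while periodically sampling the local environment."""
    samples, stop = [], threading.Event()

    def sampler():
        while not stop.is_set():
            samples.append(snapshot())
            stop.wait(period)

    t = threading.Thread(target=sampler)
    t.start()
    try:
        result = measurement_fn()        # e.g. a ping/traceroute wrapper
    finally:
        stop.set()
        t.join()
    return result, samples

_, metadata = monitored(lambda: time.sleep(3))   # stand-in for a probe run
print(json.dumps(metadata, indent=2))
```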
In recent cyber incidents, ransom software (ransomware) has posed a major threat to the security of computer systems. Consequently, ransomware detection has become a hot topic in computer security. Unfortunately, current signature-based and static detection models are often easily evaded through obfuscation, polymorphism, compression, and encryption. To overcome the limitations of signature-based and static ransomware detection approaches, we propose a dynamic ransomware detection system that uses data mining techniques, such as Random Forest (RF), Support Vector Machine (SVM), Simple Logistic (SL), and Naive Bayes (NB) algorithms, for detecting known and unknown ransomware. We monitor the actual (dynamic) behavior of software to generate API call flow graphs (CFGs) and map them into a feature space. Thereafter, data normalization and feature selection are applied to select informative features that best discriminate between the various categories of malicious software and benign software. Finally, the data mining algorithms are used to build a detection model that judges whether a given program is benign or ransomware. Our experimental results show that our proposed system is effective at improving ransomware detection performance. In particular, the accuracy and detection rate of our system with the Simple Logistic (SL) algorithm reach 98.2% and 97.6%, respectively, while the false positive rate is reduced to 1.2%.
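The described pipeline (normalization, feature selection, then one of the named classifiers) maps directly onto a standard scikit-learn pipeline. The sketch below uses synthetic placeholder features in place of the API-call-flow-graph features and Random Forest as the classifier.

```python
# Sketch of the described pipeline (normalization -> feature selection ->
# classifier) on placeholder data; real inputs would be features derived from
# the API call flow graphs of monitored samples.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((400, 50))                  # placeholder behavioral features
y = rng.integers(0, 2, 400)                # 1 = ransomware, 0 = benign

model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=20)),   # keep the most informative features
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
print(cross_val_score(model, X, y, cv=5, scoring="accuracy").mean())
```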
Testing and fixing Web Application Firewalls (WAFs) are two relevant and complementary challenges for security analysts. Automated testing helps to cost-effectively detect vulnerabilities in a WAF by generating effective test cases, i.e., attacks. Once vulnerabilities have been identified, the WAF needs to be fixed by augmenting its rule set to filter attacks without blocking legitimate requests. However, existing research suggests that rule sets are very difficult to understand and too complex to be manually fixed. In this paper, we formalise the problem of fixing vulnerable WAFs as a combinatorial optimisation problem. To solve it, we propose an automated approach that combines machine learning with multi-objective genetic algorithms. Given a set of legitimate requests and bypassing SQL injection attacks, our approach automatically infers regular expressions that, when added to the WAF's rule set, prevent many attacks while letting legitimate requests go through. Our empirical evaluation based on both open-source and proprietary WAFs shows that the generated filter rules are effective at blocking previously identified and successful SQL injection attacks (recall between 54.6% and 98.3%), while triggering in most cases no or few false positives (false positive rate between 0% and 2%).
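The evaluation of a single candidate rule, recall over known bypassing attacks versus false-positive rate over legitimate requests, is the fitness signal such a search optimizes. The sketch below shows only that evaluation, with illustrative sample strings and an illustrative pattern; the machine learning and genetic-search components are not reproduced.

```python
# Sketch of evaluating one candidate filter rule: recall over known bypassing
# attacks and false-positive rate over legitimate requests.
import re

def evaluate_rule(pattern, attacks, legitimate):
    rule = re.compile(pattern, re.IGNORECASE)
    recall = sum(bool(rule.search(a)) for a in attacks) / len(attacks)
    fpr = sum(bool(rule.search(l)) for l in legitimate) / len(legitimate)
    return recall, fpr            # the multi-objective GA trades these off

attacks = ["' OR 1=1 --", "admin'--", "1; DROP TABLE users"]
legitimate = ["id=42", "name=O'Brien", "q=drop off laundry"]
print(evaluate_rule(r"('|;)\s*(or\s+1=1|--|drop\s+table)", attacks, legitimate))
# -> (1.0, 0.0) for this toy rule and sample set
```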
Autonomous systems are on everyone's lips, driven by current discussions in the automotive sector. In fact, automated systems of varying degrees of autonomy are part of current roadmaps and projections in many industries. In this article, the various industry-specific taxonomies and standards are summarized and characterized in terms of their functional capabilities and requirements for methods, processes and tools from the perspective of software engineering.
Cyber attacks (e.g., DDoS) on computers connected to the Internet occur every day. A DDoS attack in 2016 that used the "Mirai botnet" generated over 600 Gbit/s of traffic, roughly twice that of the previous year. In view of this situation, we can no longer adequately protect our computers using current end-point security solutions and must therefore introduce a new method of protection that uses distributed nodes, e.g., routers. We propose an Autonomous and Distributed Internet Security (AIS) infrastructure that provides two key functions: first, filtering packets with spoofed source addresses (proactive filter), and second, filtering malicious packets that are observed at the end point (reactive filter) as close as possible to their origins. We also propose three types of Multi-Layer Binding Routers (MLBRs) to realize these functions. We implemented the MLBRs and constructed experimental systems to simulate DDoS attacks. The results showed that all malicious packets could be filtered by using the AIS infrastructure.
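The proactive filter is essentially ingress (uRPF-style) filtering: drop packets whose source address does not belong to the prefixes expected on the arrival interface. The toy sketch below illustrates only that check; interface names and prefixes are placeholders, and the MLBR design itself is not reproduced.

```python
# Toy sketch of the proactive-filter idea: accept a packet only if its source
# address belongs to the prefixes expected on the interface it arrived on.
import ipaddress

EXPECTED_SOURCES = {
    "eth0": [ipaddress.ip_network("192.0.2.0/24")],       # customer LAN
    "eth1": [ipaddress.ip_network("198.51.100.0/24")],    # peering link
}

def accept(interface, src_ip):
    src = ipaddress.ip_address(src_ip)
    return any(src in net for net in EXPECTED_SOURCES.get(interface, []))

print(accept("eth0", "192.0.2.7"))     # True  - legitimate source
print(accept("eth0", "203.0.113.9"))   # False - spoofed source, filtered
```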
We present AVAMAT: AntiVirus and Malware Analysis Tool, a tool for analysing the malware detection capabilities of AntiVirus (AV) products running on different operating system (OS) platforms. Even though similar tools are available, such as VirusTotal and MetaDefender, they have several limitations, which motivated the creation of our own tool. With AVAMAT we are able to analyse not only whether an AV detects a piece of malware, but also at what stage of inspection it detects it and on which OS. AVAMAT enables experimental campaigns to answer various research questions, ranging from the detection capabilities of AVs on different OSs to optimal ways in which AVs could be combined to improve malware detection capabilities.
The panic among medical control, information, and device administrators stems from the mounting number of high-profile attacks on healthcare facilities. This hostile situation is going to lead the health informatics industry to cloud-hoarding of medical data, control flows, and site governance. As different healthcare enterprises opt for cloud-based solutions, it is only a matter of time before fog computing environments are formed. Because of major gaps in reported techniques for fog security administration for health data, i.e., the absence of an overarching certification authority (CA), security provisioning is one of the issues we address in this paper. We propose a security provisioning model (AZSPM) for medical devices in fog environments. We propose that AZSPM can be built from atomic security components that are dynamically composed. The authenticity of the atomic components is verified, for the sake of trust, by counting the processor clock cycles consumed by service execution on the resident hardware platform. This verification is performed in a fully sandboxed environment. The measured execution cycles are matched against the manufacturer's service specifications before the mobile services are forwarded to the healthcare cloudlets. The proposed model is completely novel in fog computing environments. We aim to build a prototype based on this model in a healthcare information system environment.
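The verification step can be caricatured in software. The sketch below substitutes a wall-clock nanosecond timer for the hardware cycle counts the paper relies on, times a sandboxed run of a placeholder component, and compares it against an assumed manufacturer specification within a tolerance; the component and the specification value are illustrative only.

```python
# Rough software proxy for the described check: time the atomic component and
# compare against an assumed manufacturer specification within a tolerance.
# (The paper counts processor clock cycles; time.perf_counter_ns stands in here.)
import time

def measure_ns(component, runs=50):
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter_ns()
        component()
        best = min(best, time.perf_counter_ns() - start)
    return best                      # best-of-N reduces scheduling noise

def verify(component, spec_ns, tolerance=0.20):
    measured = measure_ns(component)
    ok = abs(measured - spec_ns) <= tolerance * spec_ns
    return ok, measured

# Placeholder atomic security component and specification value:
ok, measured = verify(lambda: sum(i * i for i in range(10_000)), spec_ns=400_000)
print(ok, measured)
```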
Sites for online classified ads selling sex are widely used by human traffickers to support their pernicious business. The sheer quantity of ads makes manual exploration and analysis unscalable. In addition, discerning whether an ad is advertising a trafficked victim or an independent sex worker is a very difficult task. Very little concrete ground truth (i.e., ads definitively known to be posted by a trafficker) exists in this space. In this work, we develop tools and techniques that can be used separately and in conjunction to group sex ads by their true owner (and not the claimed author in the ad). Specifically, we develop a machine learning classifier that uses stylometry to distinguish between ads posted by the same vs. different authors with 90% TPR and 1% FPR. We also design a linking technique that takes advantage of leakages from the Bitcoin mempool, the blockchain, and the sex ad site to link a subset of sex ads to Bitcoin public wallets and transactions. Finally, we demonstrate, via a 4-week proof of concept using Backpage as the sex ad site, how an analyst can use these automated approaches to potentially find human traffickers.
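The stylometry component can be sketched with off-the-shelf tools: represent each ad by character n-gram TF-IDF, represent a pair of ads by the absolute difference of their vectors, and train a classifier to predict same-author versus different-author. The snippet below uses placeholder ads and labels and is not the authors' feature set.

```python
# Minimal stylometry sketch for same-author vs different-author pairs.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

ads = ["sweet & discreet, call me anytime 555...", "new in town!! upscale only",
       "sweet and discreet -- call anytime", "Upscale gentlemen ONLY, new in town"]
pairs = [(0, 2), (1, 3), (0, 1), (2, 3)]           # placeholder labeled pairs
labels = [1, 1, 0, 0]                               # 1 = same author, 0 = different

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vec.fit_transform(ads).toarray()
pair_feats = np.array([np.abs(X[i] - X[j]) for i, j in pairs])   # pair representation

clf = LogisticRegression(max_iter=1000).fit(pair_feats, labels)
print(clf.predict(pair_feats))       # toy check on the training pairs themselves
```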
Summary form only given. Strong light-matter coupling has recently been successfully explored in the GHz and THz range [1] with on-chip platforms. New and intriguing quantum optical phenomena have been predicted in the ultrastrong coupling regime [2], when the coupling strength Ω becomes comparable to the unperturbed frequency of the system ω. We recently proposed a new experimental platform where we couple the inter-Landau-level transition of a high-mobility 2DEG to the highly subwavelength photonic mode of an LC meta-atom [3], showing a very large Ω/ω_c = 0.87. Our system benefits from the collective enhancement of the light-matter coupling, which comes from the scaling of the coupling Ω ∝ √n, where n is the number of optically active electrons. In our previous experiments [3] and in the literature [4] this number varies from 10^4 to 10^3 electrons per meta-atom. We now engineer a new cavity, resonant at 290 GHz, with an extremely reduced effective mode surface S_eff = 4 × 10^-14 m^2 (FE simulations, CST), yielding large field enhancements above 1500 and allowing us to enter the few-electron (<100) regime. It consists of a complementary metasurface with two very sharp metallic tips separated by a 60 nm gap (Fig. 1(a, b)) on top of a single triangular quantum well. THz-TDS transmission experiments as a function of the applied magnetic field reveal strong anticrossing of the cavity mode with the linear cyclotron dispersion. Measurements for arrays of only 12 cavities are reported in Fig. 1(c). On the top horizontal axis we report the number of electrons occupying the topmost Landau level as a function of the magnetic field. At the anticrossing field of B = 0.73 T we measure approximately 60 electrons ultrastrongly coupled (Ω/ω –
Runtime hardware Trojan detection techniques are required in third-party IP-based SoCs as a last line of defense. Traditional techniques rely on a golden data model or on exotic signal processing techniques, such as chaos theory or machine learning. Because such techniques are cumbersome to implement, it is highly impractical to embed them in hardware, which is a requirement in some mission-critical applications. In this paper, we propose a methodology that generates a digital power profile during the manufacturing test phase of the circuit under test. A simple processing mechanism, which requires minimal computation on the measured power signals, is proposed. As a proof of concept, we have applied the proposed methodology to a classical Advanced Encryption Standard circuit with 21 available Trojans. The experimental results show that the proposed methodology is able to detect 75% of the intrusions, with the potential of implementing the detection mechanism on-chip with minimal overhead compared to state-of-the-art techniques.
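The general idea, reducing a measured power trace to a compact digital profile and flagging a device whose profile drifts from the one recorded at manufacturing test, can be sketched as follows; this is not the paper's exact processing, and the traces below are synthetic.

```python
# Sketch of the general idea: binarize a power trace per window into a digital
# profile and flag the device when the profile's Hamming distance from the
# manufacturing-test profile exceeds a threshold.
import numpy as np

def digital_profile(trace, window=50):
    trace = np.asarray(trace, dtype=float)
    n = len(trace) // window
    means = trace[:n * window].reshape(n, window).mean(axis=1)
    return (means > np.median(means)).astype(int)      # one bit per window

def trojan_suspected(reference, measured, max_hamming=3):
    distance = int(np.sum(reference != measured))
    return distance > max_hamming, distance

rng = np.random.default_rng(1)
golden = digital_profile(rng.normal(1.0, 0.05, 5000))          # manufacturing-test trace
field = digital_profile(rng.normal(1.0, 0.05, 5000) +
                        0.2 * (rng.random(5000) < 0.01))       # occasional extra switching
print(trojan_suspected(golden, field))
```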