Biblio
Hackers create different types of Malware such as Trojans which they use to steal user-confidential information (e.g. credit card details) with a few simple commands, recent malware however has been created intelligently and in an uncontrolled size, which puts malware analysis as one of the top important subjects of information security. This paper proposes an efficient dynamic malware-detection method based on API similarity. This proposed method outperform the traditional signature-based detection method. The experiment evaluated 197 malware samples and the proposed method showed promising results of correctly identified malware.
With the development of cyber threats on the Internet, the number of malware, especially unknown malware, is also dramatically increasing. Since all of malware cannot be analyzed by analysts, it is very important to find out new malware that should be analyzed by them. In order to cope with this issue, the existing approaches focused on malware classification using static or dynamic analysis results of malware. However, the static and the dynamic analyses themselves are also too costly and not easy to build the isolated, secure and Internet-like analysis environments such as sandbox. In this paper, we propose a lightweight malware classification method based on detection results of anti-virus software. Since the proposed method can reduce the volume of malware that should be analyzed by analysts, it can be used as a preprocess for in-depth analysis of malware. The experimental showed that the proposed method succeeded in classification of 1,000 malware samples into 187 unique groups. This means that 81% of the original malware samples do not need to analyze by analysts.
Volumetric DDoS attacks continue to inflict serious damage. Many proposed defenses for mitigating such attacks assume that a monitoring system has already detected the attack. However, many proposed DDoS monitoring systems do not focus on efficiently analyzing high volume network traffic to provide important characterizations of the attack in real-time to downstream traffic filtering systems. We propose a scalable real-time framework for an effective volumetric DDoS monitoring system that leverages modern big data technologies for streaming analytics of high volume network traffic to accurately detect and characterize attacks.
Ubiquitous deployment of low-cost mobile positioning devices and the widespread use of high-speed wireless networks enable massive collection of large-scale trajectory data of individuals moving on road networks. Trajectory data mining finds numerous applications including understanding users' historical travel preferences and recommending places of interest to new visitors. Privacy-preserving trajectory mining is an important and challenging problem as exposure of sensitive location information in the trajectories can directly invade the location privacy of the users associated with the trajectories. In this paper, we propose a differentially private trajectory analysis algorithm for points-of-interest recommendation to users that aims at maximizing the accuracy of the recommendation results while protecting the privacy of the exposed trajectories with differential privacy guarantees. Our algorithm first transforms the raw trajectory dataset into a bipartite graph with nodes representing the users and the points-of-interest and the edges representing the visits made by the users to the locations, and then extracts the association matrix representing the bipartite graph to inject carefully calibrated noise to meet έ-differential privacy guarantees. A post-processing of the perturbed association matrix is performed to suppress noise prior to performing a Hyperlink-Induced Topic Search (HITS) on the transformed data that generates an ordered list of recommended points-of-interest. Extensive experiments on a real trajectory dataset show that our algorithm is efficient, scalable and demonstrates high recommendation accuracy while meeting the required differential privacy guarantees.
The blockchain technology has emerged as an attractive solution to address performance and security issues in distributed systems. Blockchain's public and distributed peer-to-peer ledger capability benefits cloud computing services which require functions such as, assured data provenance, auditing, management of digital assets, and distributed consensus. Blockchain's underlying consensus mechanism allows to build a tamper-proof environment, where transactions on any digital assets are verified by set of authentic participants or miners. With use of strong cryptographic methods, blocks of transactions are chained together to enable immutability on the records. However, achieving consensus demands computational power from the miners in exchange of handsome reward. Therefore, greedy miners always try to exploit the system by augmenting their mining power. In this paper, we first discuss blockchain's capability in providing assured data provenance in cloud and present vulnerabilities in blockchain cloud. We model the block withholding (BWH) attack in a blockchain cloud considering distinct pool reward mechanisms. BWH attack provides rogue miner ample resources in the blockchain cloud for disrupting honest miners' mining efforts, which was verified through simulations.
Performing large-scale malware classification is increasingly becoming a critical step in malware analytics as the number and variety of malware samples is rapidly growing. Statistical machine learning constitutes an appealing method to cope with this increase as it can use mathematical tools to extract information out of large-scale datasets and produce interpretable models. This has motivated a surge of scientific work in developing machine learning methods for detection and classification of malicious executables. However, an optimal method for extracting the most informative features for different malware families, with the final goal of malware classification, is yet to be found. Fortunately, neural networks have evolved to the state that they can surpass the limitations of other methods in terms of hierarchical feature extraction. Consequently, neural networks can now offer superior classification accuracy in many domains such as computer vision and natural language processing. In this paper, we transfer the performance improvements achieved in the area of neural networks to model the execution sequences of disassembled malicious binaries. We implement a neural network that consists of convolutional and feedforward neural constructs. This architecture embodies a hierarchical feature extraction approach that combines convolution of n-grams of instructions with plain vectorization of features derived from the headers of the Portable Executable (PE) files. Our evaluation results demonstrate that our approach outperforms baseline methods, such as simple Feedforward Neural Networks and Support Vector Machines, as we achieve 93% on precision and recall, even in case of obfuscations in the data.
Machine learning and data mining algorithms typically assume that the training and testing data are sampled from the same fixed probability distribution; however, this violation is often violated in practice. The field of domain adaptation addresses the situation where this assumption of a fixed probability between the two domains is violated; however, the difference between the two domains (training/source and testing/target) may not be known a priori. There has been a recent thrust in addressing the problem of learning in the presence of an adversary, which we formulate as a problem of domain adaption to build a more robust classifier. This is because the overall security of classifiers and their preprocessing stages have been called into question with the recent findings of adversaries in a learning setting. Adversarial training (and testing) data pose a serious threat to scenarios where an attacker has the opportunity to ``poison'' the training or ``evade'' on the testing data set(s) in order to achieve something that is not in the best interest of the classifier. Recent work has begun to show the impact of adversarial data on several classifiers; however, the impact of the adversary on aspects related to preprocessing of data (i.e., dimensionality reduction or feature selection) has widely been ignored in the revamp of adversarial learning research. Furthermore, variable selection, which is a vital component to any data analysis, has been shown to be particularly susceptible under an attacker that has knowledge of the task. In this work, we explore avenues for learning resilient classification models in the adversarial learning setting by considering the effects of adversarial data and how to mitigate its effects through optimization. Our model forms a single convex optimization problem that uses the labeled training data from the source domain and known- weaknesses of the model for an adversarial component. We benchmark the proposed approach on synthetic data and show the trade-off between classification accuracy and skew-insensitive statistics.
As more and more technologies to store and analyze massive amount of data become available, it is extremely important to make privacy-sensitive data de-identified so that further analysis can be conducted by different parties. For example, data needs to go through data de-identification process before being transferred to institutes for further value added analysis. As such, privacy protection issues associated with the release of data and data mining have become a popular field of study in the domain of big data. As a strict and verifiable definition of privacy, differential privacy has attracted noteworthy attention and widespread research in recent years. Nevertheless, differential privacy is not practical for most applications due to its performance of synthetic dataset generation for data query. Moreover, the definition of data protection by randomized noise in native differential privacy is abstract to users. Therefore, we design a pragmatic DP-based data de-identification protection and risk of data disclosure estimation system, in which a DP-based noise addition mechanism is applied to generate synthetic datasets. Furthermore, the risk of data disclosure to these synthetic datasets can be evaluated before releasing to buyers/consumers.
As the malware threat landscape is constantly evolving and over one million new malware strains are being generated every day [1], early automatic detection of threats constitutes a top priority of cybersecurity research, and amplifies the need for more advanced detection and classification methods that are effective and efficient. In this paper, we present the application of machine learning algorithms to predict the length of time malware should be executed in a sandbox to reveal its malicious intent. We also introduce a novel hybrid approach to malware classification based on static binary analysis and dynamic analysis of malware. Static analysis extracts information from a binary file without executing it, and dynamic analysis captures the behavior of malware in a sandbox environment. Our experimental results show that by turning the aforementioned problems into machine learning problems, it is possible to get an accuracy of up to 90% on the prediction of the malware analysis run time and up to 92% on the classification of malware families.
The smart grid is an electrical grid that has a duplex communication. This communication is between the utility and the consumer. Digital system, automation system, computers and control are the various systems of Smart Grid. It finds applications in a wide variety of systems. Some of its applications have been designed to reduce the risk of power system blackout. Dynamic vulnerability assessment is done to identify, quantify, and prioritize the vulnerabilities in a system. This paper presents a novel approach for classifying the data into one of the two classes called vulnerable or non-vulnerable by carrying out Dynamic Vulnerability Assessment (DVA) based on some data mining techniques such as Multichannel Singular Spectrum Analysis (MSSA), and Principal Component Analysis (PCA), and a machine learning tool such as Support Vector Machine Classifier (SVM-C) with learning algorithms that can analyze data. The developed methodology is tested in the IEEE 57 bus, where the cause of vulnerability is transient instability. The results show that data mining tools can effectively analyze the patterns of the electric signals, and SVM-C can use those patterns for analyzing the system data as vulnerable or non-vulnerable and determines System Vulnerability Status.
Graph reordering is a powerful technique to increase the locality of the representations of graphs, which can be helpful in several applications. We study how the technique can be used to improve compression of graphs and inverted indexes. We extend the recent theoretical model of Chierichetti et al. (KDD 2009) for graph compression, and show how it can be employed for compression-friendly reordering of social networks and web graphs and for assigning document identifiers in inverted indexes. We design and implement a novel theoretically sound reordering algorithm that is based on recursive graph bisection. Our experiments show a significant improvement of the compression rate of graph and indexes over existing heuristics. The new method is relatively simple and allows efficient parallel and distributed implementations, which is demonstrated on graphs with billions of vertices and hundreds of billions of edges.
Enormous amount of educational data has been accumulated through Massive Open Online Courses (MOOCs), as well as commercial and non-commercial learning platforms. This is in addition to the educational data released by US government since 2012 to facilitate disruption in education by making data freely available. The high volume, variety and velocity of collected data necessitate use of big data tools and storage systems such as distributed databases for storage and Apache Spark for analysis. This tutorial will introduce researchers and faculty to real-world applications involving data mining and predictive analytics in learning sciences. In addition, the tutorial will introduce statistics required to validate and accurately report results. Topics will cover how big data is being used to transform education. Specifically, we will demonstrate how exploratory data analysis, data mining, predictive analytics, machine learning, and visualization techniques are being applied to educational big data to improve learning and scale insights driven from millions of student's records. The tutorial will be held over a half day and will be hands on with pre-posted material. Due to the interdisciplinary nature of work, the tutorial appeals to researchers from a wide range of backgrounds including big data, predictive analytics, learning sciences, educational data mining, and in general, those interested in how big data analytics can transform learning. As a prerequisite, attendees are required to have familiarity with at least one programming language.
In Data mining is the method of extracting the knowledge from huge amount of data and interesting patterns. With the rapid increase of data storage, cloud and service-based computing, the risk of misuse of data has become a major concern. Protecting sensitive information present in the data is crucial and critical. Data perturbation plays an important role in privacy preserving data mining. The major challenge of privacy preserving is to concentrate on factors to achieve privacy guarantee and data utility. We propose a data perturbation method that perturbs the data using fuzzy logic and random rotation. It also describes aspects of comparable level of quality over perturbed data and original data. The comparisons are illustrated on different multivariate datasets. Experimental study has proved the model is better in achieving privacy guarantee of data, as well as data utility.
Learning classifier systems (LCSs) are rule-based evolutionary algorithms uniquely suited to classification and data mining in complex, multi-factorial, and heterogeneous problems. LCS rule fitness is commonly based on accuracy, but this metric alone is not ideal for assessing global rule `value' in noisy problem domains, and thus impedes effective knowledge extraction. Multi-objective fitness functions are promising but rely on knowledge of how to weigh objective importance. Prior knowledge would be unavailable in most real-world problems. The Pareto-front concept offers a strategy for multi-objective machine learning that is agnostic to objective importance. We propose a Pareto-inspired multi-objective rule fitness (PIMORF) for LCS, and combine it with a complimentary rule-compaction approach (SRC). We implemented these strategies in ExSTraCS, a successful supervised LCS and evaluated performance over an array of complex simulated noisy and clean problems (i.e. genetic and multiplexer) that each concurrently model pure interaction effects and heterogeneity. While evaluation over multiple performance metrics yielded mixed results, this work represents an important first step towards efficiently learning complex problem spaces without the advantage of prior problem knowledge. Overall the results suggest that PIMORF paired with SRC improved rule set interpretability, particularly with regard to heterogeneous patterns.
Signed social networks have become increasingly important in recent years because of the ability to model trust-based relationships in review sites like Slashdot, Epinions, and Wikipedia. As a result, many traditional network mining problems have been re-visited in the context of networks in which signs are associated with the links. Examples of such problems include community detection, link prediction, and low rank approximation. In this paper, we will examine the problem of ranking nodes in signed networks. In particular, we will design a ranking model, which has a clear physical interpretation in terms of the sign of the edges in the network. Specifically, we propose the Troll-Trust model that models the probability of trustworthiness of individual data sources as an interpretation for the underlying ranking values. We will show the advantages of this approach over a variety of baselines.
Conventional security mechanisms at network, host, and source code levels are no longer sufficient in detecting and responding to increasingly dynamic and sophisticated cyber threats today. Detecting anomalous behavior at the architectural level can help better explain the intent of the threat and strengthen overall system security posture. To that end, we present a framework that mines software component interactions from system execution history and applies a detection algorithm to identify anomalous behavior. The framework uses unsupervised learning at runtime, can perform fast anomaly detection “on the fly”, and can quickly adapt to system load fluctuations and user behavior shifts. Our evaluation of the approach against a real Emergency Deployment System has demonstrated very promising results, showing the framework can effectively detect covert attacks, including insider threats, that may be easily missed by traditional intrusion detection methods.
Malware detection has been widely studied by analysing either file dropping relationships or characteristics of the file distribution network. This paper, for the first time, studies a global heterogeneous malware delivery graph fusing file dropping relationship and the topology of the file distribution network. The integration offers a unique ability of structuring the end-to-end distribution relationship. However, it brings large heterogeneous graphs to analysis. In our study, an average daily generated graph has more than 4 million edges and 2.7 million nodes that differ in type, such as IPs, URLs, and files. We propose a novel Bayesian label propagation model to unify the multi-source information, including content-agnostic features of different node types and topological information of the heterogeneous network. Our approach does not need to examine the source codes nor inspect the dynamic behaviours of a binary. Instead, it estimates the maliciousness of a given file through a semi-supervised label propagation procedure, which has a linear time complexity w.r.t. the number of nodes and edges. The evaluation on 567 million real-world download events validates that our proposed approach efficiently detects malware with a high accuracy.