Biblio
In this paper, we propose a deep learning framework for malware classification. There has been a huge increase in the volume of malware in recent years which poses a serious security threat to financial institutions, businesses and individuals. In order to combat the proliferation of malware, new strategies are essential to quickly identify and classify malware samples so that their behavior can be analyzed. Machine learning approaches are becoming popular for classifying malware, however, most of the existing machine learning methods for malware classification use shallow learning algorithms (e.g. SVM). Recently, Convolutional Neural Networks (CNN), a deep learning approach, have shown superior performance compared to traditional learning algorithms, especially in tasks such as image classification. Motivated by this success, we propose a CNN-based architecture to classify malware samples. We convert malware binaries to grayscale images and subsequently train a CNN for classification. Experiments on two challenging malware classification datasets, Malimg and Microsoft malware, demonstrate that our method achieves better than the state-of-the-art performance. The proposed method achieves 98.52% and 99.97% accuracy on the Malimg and Microsoft datasets respectively.
In recent years, deep convolution neural networks (DCNNs) have won many contests in machine learning, object detection, and pattern recognition. Furthermore, deep learning techniques achieved exceptional performance in image classification, reaching accuracy levels beyond human capability. Malware variants from similar categories often contain similarities due to code reuse. Converting malware samples into images can cause these patterns to manifest as image features, which can be exploited for DCNN classification. Techniques for converting malware binaries into images for visualization and classification have been reported in the literature, and while these methods do reach a high level of classification accuracy on training datasets, they tend to be vulnerable to overfitting and perform poorly on previously unseen samples. In this paper, we explore and document a variety of techniques for representing malware binaries as images with the goal of discovering a format best suited for deep learning. We implement a database for malware binaries from several families, stored in hexadecimal format. These malware samples are converted into images using various approaches and are used to train a neural network to recognize visual patterns in the input and classify malware based on the feature vectors. Each image type is assessed using a variety of learning models, such as transfer learning with existing DCNN architectures and feature extraction for support vector machine classifier training. Each technique is evaluated in terms of classification accuracy, result consistency, and time per trial. Our preliminary results indicate that improved image representation has the potential to enable more effective classification of new malware.
Behavioral malware detection aims to improve on the performance of static signature-based techniques used by anti-virus systems, which are less effective against modern polymorphic and metamorphic malware. Behavioral malware classification aims to go beyond the detection of malware by also identifying a malware's family according to a naming scheme such as the ones used by anti-virus vendors. Behavioral malware classification techniques use run-time features, such as file system or network activities, to capture the behavioral characteristic of running processes. The increasing volume of malware samples, diversity of malware families, and the variety of naming schemes given to malware samples by anti-virus vendors present challenges to behavioral malware classifiers. We describe a behavioral classifier that uses a Convolutional Recurrent Neural Network and data from Microsoft Windows Prefetch files. We demonstrate the model's improvement on the state-of-the-art using a large dataset of malware families and four major anti-virus vendor naming schemes. The model is effective in classifying malware samples that belong to common and rare malware families and can incrementally accommodate the introduction of new malware samples and families.
While the rapid adaptation of mobile devices changes our daily life more conveniently, the threat derived from malware is also increased. There are lots of research to detect malware to protect mobile devices, but most of them adopt only signature-based malware detection method that can be easily bypassed by polymorphic and metamorphic malware. To detect malware and its variants, it is essential to adopt behavior-based detection for efficient malware classification. This paper presents a system that classifies malware by using common behavioral characteristics along with malware families. We measure the similarity between malware families with carefully chosen features commonly appeared in the same family. With the proposed similarity measure, we can classify malware by malware's attack behavior pattern and tactical characteristics. Also, we apply community detection algorithm to increase the modularity within each malware family network aggregation. To maintain high classification accuracy, we propose a process to derive the optimal weights of the selected features in the proposed similarity measure. During this process, we find out which features are significant for representing the similarity between malware samples. Finally, we provide an intuitive graph visualization of malware samples which is helpful to understand the distribution and likeness of the malware networks. In the experiment, the proposed system achieved 97% accuracy for malware classification and 95% accuracy for prediction by K-fold cross-validation using the real malware dataset.
In this paper we propose a new algorithm to detect Advanced Persistent Threats (APT's) that relies on a graph model of HTTP traffic. We also implement a complete detection system with a web interface that allows to interactively analyze the data. We perform a complete parameter study and experimental evaluation using data collected on a real network. The results show that the performance of our system is comparable to currently available antiviruses, although antiviruses use signatures to detect known malwares while our algorithm solely uses behavior analysis to detect new undocumented attacks.
Better understanding of mobile applications' behaviors would lead to better malware detection/classification and better app recommendation for users. In this work, we design a framework AppDNA to automatically generate a compact representation for each app to comprehensively profile its behaviors. The behavior difference between two apps can be measured by the distance between their representations. As a result, the versatile representation can be generated once for each app, and then be used for a wide variety of objectives, including malware detection, app categorizing, plagiarism detection, etc. Based on a systematic and deep understanding of an app's behavior, we propose to perform a function-call-graph-based app profiling. We carefully design a graph-encoding method to convert a typically extremely large call-graph to a 64-dimension fix-size vector to achieve robust app profiling. Our extensive evaluations based on 86,332 benign and malicious apps demonstrate that our system performs app profiling (thus malware detection, classification, and app recommendation) to a high accuracy with extremely low computation cost: it classifies 4024 (benign/malware) apps using around 5.06 second with accuracy about 93.07%; it classifies 570 malware's family (total 21 families) using around 0.83 second with accuracy 82.3%; it classifies 9,730 apps' functionality with accuracy 33.3% for a total of 7 categories and accuracy of 88.1 % for 2 categories.
In this paper we present a new approach, named DLGraph, for malware detection using deep learning and graph embedding. DLGraph employs two stacked denoising autoencoders (SDAs) for representation learning, taking into consideration computer programs' function-call graphs and Windows application programming interface (API) calls. Given a program, we first use a graph embedding technique that maps the program's function-call graph to a vector in a low-dimensional feature space. One SDA in our deep learning model is used to learn a latent representation of the embedded vector of the function-call graph. The other SDA in our model is used to learn a latent representation of the given program's Windows API calls. The two learned latent representations are then merged to form a combined feature vector. Finally, we use softmax regression to classify the combined feature vector for predicting whether the given program is malware or not. Experimental results based on different datasets demonstrate the effectiveness of the proposed approach and its superiority over a related method.
Smartphones have evolved over the years from simple devices to communicate with each other to fully functional portable computers although with comparatively less computational power but inholding multiple applications within. With the smartphone revolution, the value of personal data has increased. As technological complexities increase, so do the vulnerabilities in the system. Smartphones are the latest target for attacks. Android being an open source platform and also the most widely used smartphone OS draws the attention of many malware writers to exploit the vulnerabilities of it. Attackers try to take advantage of these vulnerabilities and fool the user and misuse their data. Malwares have come a long way from simple worms to sophisticated DDOS using Botnets, the latest trends in computer malware tend to go in the distributed direction, to evade the multiple anti-virus apps developed to counter generic viruses and Trojans. However, the recent trend in android system is to have a combination of applications which acts as malware. The applications are benign individually but when grouped, these may result into a malicious activity. This paper proposes a new category of distributed malware in android system, how it can be used to evade the current security, and how it can be detected with the help of graph matching algorithm.
The latent behavior of an information system that can exhibit extreme events, such as system faults or cyber-attacks, is complex. Recently, the invariant network has shown to be a powerful way of characterizing complex system behaviors. Structures and evolutions of the invariance network, in particular, the vanishing correlations, can shed light on identifying causal anomalies and performing system diagnosis. However, due to the dynamic and complex nature of real-world information systems, learning a reliable invariant network in a new environment often requires continuous collecting and analyzing the system surveillance data for several weeks or even months. Although the invariant networks learned from old environments have some common entities and entity relationships, these networks cannot be directly borrowed for the new environment due to the domain variety problem. To avoid the prohibitive time and resource consuming network building process, we propose TINET, a knowledge transfer based model for accelerating invariant network construction. In particular, we first propose an entity estimation model to estimate the probability of each source domain entity that can be included in the final invariant network of the target domain. Then, we propose a dependency construction model for constructing the unbiased dependency relationships by solving a two-constraint optimization problem. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness and efficiency of TINET. We also apply TINET to a real enterprise security system for intrusion detection. TINET achieves superior detection performance at least 20 days lead-lag time in advance with more than 75% accuracy.
Community detection in complex networks is a fundamental problem that attracts much attention across various disciplines. Previous studies have been mostly focusing on external connections between nodes (i.e., topology structure) in the network whereas largely ignoring internal intricacies (i.e., local behavior) of each node. A pair of nodes without any interaction can still share similar internal behaviors. For example, in an enterprise information network, compromised computers controlled by the same intruder often demonstrate similar abnormal behaviors even if they do not connect with each other. In this paper, we study the problem of community detection in enterprise information networks, where large-scale internal events and external events coexist on each host. The discovered host communities, capturing behavioral affinity, can benefit many comparative analysis tasks such as host anomaly assessment. In particular, we propose a novel community detection framework to identify behavior-based host communities in enterprise information networks, purely based on large-scale heterogeneous event data. We continue proposing an efficient method for assessing host's anomaly level by leveraging the detected host communities. Experimental results on enterprise networks demonstrate the effectiveness of our model.
Android applications are usually obfuscated before release, making it difficult to analyze them for malware presence or intellectual property violations. Obfuscators might hide the true intent of code by renaming variables and/or modifying program structures. It is challenging to search for executables relevant to an obfuscated application for developers to analyze efficiently. Prior approaches toward obfuscation resilient search have relied on certain structural parts of apps remaining as landmarks, un-touched by obfuscation. For instance, some prior approaches have assumed that the structural relationships between identifiers are not broken by obfuscators; others have assumed that control flow graphs maintain their structures. Both approaches can be easily defeated by a motivated obfuscator. We present a new approach, MACNETO, to search for programs relevant to obfuscated executables leveraging deep learning and principal components on instructions. MACNETO makes few assumptions about the kinds of modifications that an obfuscator might perform. We show that it has high search precision for executables obfuscated by a state-of-the-art obfuscator that changes control flow. Further, we also demonstrate the potential of MACNETO to help developers understand executables, where MACNETO infers keywords (which are from relevant un-obfuscated programs) for obfuscated executables.
Given graphs with millions or billions of vertices and edges, how can we efficiently make inferences based on partial knowledge? Loopy Belief Propagation(LBP) is a graph inference algorithm widely used in various applications including social network analysis, malware detection, recommendation, and image restoration. The algorithm calculates approximate marginal probabilities of vertices in a graph within a linear running time proportional to the number of edges. However, when it comes to real-world graphs with millions or billions of vertices and edges, this cost overwhelms the computing power of a single machine. Moreover, this kind of large-scale graphs does not fit into the memory of a single machine. Although several distributed LBP methods have been proposed, previous works do not consider the properties of real-world graphs, especially the effect of power-law degree distribution on LBP. Therefore, our work focuses on developing a fast and scalable LBP for such large real-world graphs on distributed environment. In this paper, we propose DLBP, a Distributed Loopy Belief Propagation algorithm which efficiently computes LBP in a distributed manner across multiple machines. By setting the correct convergence criterion and carefully scheduling the computations, DLBP provides up to 10.7x speed up compared to standard distributed LBP. We show that DLBP demonstrates near-linear scalability with respect to the number of machines as well as the number of edges.
In this paper, we propose a graph-based algorithmic technique for malware detection, utilizing the System-call Dependency Graphs (ScDG) obtained through taint analysis traces. We leverage the grouping of system-calls into system-call groups with respect to their functionality to merge disjoint vertices of ScDG graphs, transforming them to Group Relation Graphs (GrG); note that, the GrG graphs represent malware's behavior being hence more resilient to probable mutations of its structure. More precisely, we extend the use of GrG graphs by mapping their vertices on the plane utilizing the degrees and the vertex-weights of a specific underlying graph of the GrG graph as to compute domination relations. Furthermore, we investigate how the activity of each system-call group could be utilized in order to distinguish graph-representations of malware and benign software. The domination relations among the vertices of GrG graphs result to a new graph representation that we call Coverage Graph of the GrG graph. Finally, we evaluate the potentials of our detection model using graph similarity between Coverage Graphs of known malicious and benign software samples of various types.
The popularity of Android, not only in handsets but also in IoT devices, makes it a very attractive target for malware threats, which are actually expanding at a significant rate. The state-of-the-art in malware mitigation solutions mainly focuses on the detection of malicious Android apps using dynamic and static analysis features to segregate malicious apps from benign ones. Nevertheless, there is a small coverage for the Internet/network dimension of Android malicious apps. In this paper, we present ToGather, an automatic investigation framework that takes Android malware samples as input and produces insights about the underlying malicious cyber infrastructures. ToGather leverages state-of-the-art graph theory techniques to generate actionable, relevant and granular intelligence to mitigate the threat effects induced by the malicious Internet activity of Android malware apps. We evaluate ToGather on a large dataset of real malware samples from various Android families, and the obtained results are both interesting and promising.
Security has always been a major issue in cloud. Data sources are the most valuable and vulnerable information which is aimed by attackers to steal. If data is lost, then the privacy and security of every cloud user are compromised. Even though a cloud network is secured externally, the threat of an internal attacker exists. Internal attackers compromise a vulnerable user node and get access to a system. They are connected to the cloud network internally and launch attacks pretending to be trusted users. Machine learning approaches are widely used for cloud security issues. The existing machine learning based security approaches classify a node as a misbehaving node based on short-term behavioral data. These systems do not differentiate whether a misbehaving node is a malicious node or a broken node. To address this problem, this paper proposes an Improvised Long Short-Term Memory (ILSTM) model which learns the behavior of a user and automatically trains itself and stores the behavioral data. The model can easily classify the user behavior as normal or abnormal. The proposed ILSTM not only identifies an anomaly node but also finds whether a misbehaving node is a broken node or a new user node or a compromised node using the calculated trust factor. The proposed model not only detects the attack accurately but also reduces the false alarm in the cloud network.
Modern industrial control systems (ICS) act as victims of cyber attacks more often in last years. These attacks are hard to detect and their consequences can be catastrophic. Cyber attacks can cause anomalies in the work of the ICS and its technological equipment. The presence of mutual interference and noises in this equipment significantly complicates anomaly detection. Moreover, the traditional means of protection, which used in corporate solutions, require updating with each change in the structure of the industrial process. An approach based on the machine learning for anomaly detection was used to overcome these problems. It complements traditional methods and allows one to detect signal correlations and use them for anomaly detection. Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation dataset was analyzed as example of industrial process. In the course of the research, correlations between the signals of the sensors were detected and preliminary data processing was carried out. Algorithms from the most common techniques of machine learning (decision trees, linear algorithms, support vector machines) and deep learning models (neural networks) were investigated for industrial process anomaly detection task. It's shown that linear algorithms are least demanding on computational resources, but they don't achieve an acceptable result and allow a significant number of errors. Decision tree-based algorithms provided an acceptable accuracy, but the amount of RAM, required for their operations, relates polynomially with the training sample volume. The deep neural networks provided the greatest accuracy, but they require considerable computing power for internal calculations.
With the exponential hike in cyber threats, organizations are now striving for better data mining techniques in order to analyze security logs received from their IT infrastructures to ensure effective and automated cyber threat detection. Machine Learning (ML) based analytics for security machine data is the next emerging trend in cyber security, aimed at mining security data to uncover advanced targeted cyber threats actors and minimizing the operational overheads of maintaining static correlation rules. However, selection of optimal machine learning algorithm for security log analytics still remains an impeding factor against the success of data science in cyber security due to the risk of large number of false-positive detections, especially in the case of large-scale or global Security Operations Center (SOC) environments. This fact brings a dire need for an efficient machine learning based cyber threat detection model, capable of minimizing the false detection rates. In this paper, we are proposing optimal machine learning algorithms with their implementation framework based on analytical and empirical evaluations of gathered results, while using various prediction, classification and forecasting algorithms.
The Machine Type Communication Devices (MTCDs) are usually based on Internet Protocol (IP), which can cause billions of connected objects to be part of the Internet. The enormous amount of data coming from these devices are quite heterogeneous in nature, which can lead to security issues, such as injection attacks, ballot stuffing, and bad mouthing. Consequently, this work considers machine learning trust evaluation as an effective and accurate option for solving the issues associate with security threats. In this paper, a comparative analysis is carried out with five different machine learning approaches: Naive Bayes (NB), Decision Tree (DT), Linear and Radial Support Vector Machine (SVM), KNearest Neighbor (KNN), and Random Forest (RF). As a critical element of the research, the recommendations consider different Machine-to-Machine (M2M) communication nodes with regard to their ability to identify malicious and honest information. To validate the performances of these models, two trust computation measures were used: Receiver Operating Characteristics (ROCs), Precision and Recall. The malicious data was formulated in Matlab. A scenario was created where 50% of the information were modified to be malicious. The malicious nodes were varied in the ranges of 10%, 20%, 30%, 40%, and the results were carefully analyzed.
Physical Unclonable Functions (PUFs) have been designed for many security applications such as identification, authentication of devices and key generation, especially for lightweight electronics. Traditional approaches to enhancing security, such as hash functions, may be expensive and resource dependent. However, modelling attacks using machine learning (ML) show the vulnerability of most PUFs. In this paper, a combination of a 32-bit current mirror and 16-bit arbiter PUFs in 65nm CMOS technology is proposed to improve resilience against modelling attacks. Both PUFs are vulnerable to machine learning attacks and we reduce the output prediction rate from 99.2% and 98.8% individually, to 60%.
Nowadays, most vendors apply the same open source code to their products, which is dangerous. In addition, when manufacturers release patches, they generally hide the exact location of the vulnerabilities. So, identifying vulnerabilities in binaries is crucial. However, just searching source program has a lower identifying accuracy of vulnerability, which requires operators further to differentiate searched results. Under this context, we propose VMPBL to enhance identifying the accuracy of vulnerability with the help of patch files. VMPBL, compared with other proposed schemes, uses patched functions according to its vulnerable functions in patch file to further distinguish results. We establish a prototype of VMPBL, which can effectively identify vulnerable function types and get rid of safe functions from results. Firstly, we get the potential vulnerable-patched functions by binary comparison technique based on K-Trace algorithm. Then we combine the functions with vulnerability and patch knowledge database to classify these function pairs and identify the possible vulnerable functions and the vulnerability types. Finally, we test some programs containing real-world CWE vulnerabilities, and one of the experimental results about CWE415 shows that the results returned from only searching source program are about twice as much as the results from VMPBL. We can see that using VMPBL can significantly reduce the false positive rate of discovering vulnerabilities compared with analyzing source files alone.
This computer era leads human to interact with computers and networks but there is no such solution to get rid of security problems. Securities threats misleads internet, we are sometimes losing our hope and reliability with many server based access. Even though many more crypto algorithms are coming for integrity and authentic data in computer access still there is a non reliable threat penetrates inconsistent vulnerabilities in networks. These vulnerable sites are taking control over the user's computer and doing harmful actions without user's privileges. Though Firewalls and protocols may support our browsers via setting certain rules, still our system couldn't support for data reliability and confidentiality. Since these problems are based on network access, lets we consider TCP/IP parameters as a dataset for analysis. By doing preprocess of TCP/IP packets we can build sovereign model on data set and clump cluster. Further the data set gets classified into regular traffic pattern and anonymous pattern using KNN classification algorithm. Based on obtained pattern for normal and threats data sets, security devices and system will set rules and guidelines to learn by it to take needed stroke. This paper analysis the computer to learn security actions from the given data sets which already exist in the previous happens.
With the growth of smartphone sales and app usage, fingerprinting and identification of smartphone apps have become a considerable threat to user security and privacy. Traffic analysis is one of the most common methods for identifying apps. Traditional countermeasures towards traffic analysis includes traffic morphing and multipath routing. The basic idea of multipath routing is to increase the difficulty for adversary to eavesdrop all traffic by splitting traffic into several subflows and transmitting them through different routes. Previous works in multipath routing mainly focus on Wireless Sensor Networks (WSNs) or Mobile Ad Hoc Networks (MANETs). In this paper, we propose a multipath routing scheme for smartphones with edge network assistance to mitigate traffic analysis attack. We consider an adversary with limited capability, that is, he can only intercept the traffic of one node following certain attack probability, and try to minimize the traffic an adversary can intercept. We formulate our design as a flow routing optimization problem. Then a heuristic algorithm is proposed to solve the problem. Finally, we present the simulation results for our scheme and justify that our scheme can effectively protect smartphones from traffic analysis attack.
Routing security plays an important role in Mobile Ad hoc Networks (MANETs). Despite many attempts to improve its security, the routing procedure of MANETs remains vulnerable to attacks. Existing approaches offer support for detecting attacks or debugging in different routing phases, but many of them have not considered the privacy of the nodes during the anomalies detection, which depend on the central control program or a third party to supervise the whole network. In this paper, we present an approach called LAD which uses the raw logs of routers to construct control a flow graph and find the existing communication rules in MANETs. With the reasoning rules, LAD can detect both active and passive attacks launched during the routing phase. LAD can also protect the privacy of the nodes in the verification phase with the specific Merkle hash tree. Without deploying any special nodes to assist the verification, LAD can detect multiple malicious nodes by itself. To show that our approach can be used to guarantee the security of the MANETs, we deploy our experiment in NS3 as well as the practical router environment. LAD can improve the accuracy rate from 2.28% to 29.22%. The results show that LAD performs limited time and memory usages, high detection and low false positives.
{We study secret sharing schemes for general (non-threshold) access structures. A general secret sharing scheme for n parties is associated to a monotone function F:\0,1\n$\rightarrowłbrace$0,1\}. In such a scheme, a dealer distributes shares of a secret s among n parties. Any subset of parties T {$\subseteq$} [n] should be able to put together their shares and reconstruct the secret s if F(T)=1, and should have no information about s if F(T)=0. One of the major long-standing questions in information-theoretic cryptography is to minimize the (total) size of the shares in a secret-sharing scheme for arbitrary monotone functions F. There is a large gap between lower and upper bounds for secret sharing. The best known scheme for general F has shares of size 2n-o(n), but the best lower bound is {$Ømega$}(n2/logn). Indeed, the exponential share size is a direct result of the fact that in all known secret-sharing schemes, the share size grows with the size of a circuit (or formula, or monotone span program) for F. Indeed, several researchers have suggested the existence of a representation size barrier which implies that the right answer is closer to the upper bound, namely, 2n-o(n). In this work, we overcome this barrier by constructing a secret sharing scheme for any access structure with shares of size 20.994n and a linear secret sharing scheme for any access structure with shares of size 20.999n. As a contribution of independent interest, we also construct a secret sharing scheme with shares of size 2Õ({$\surd$}n) for 2n n/2 monotone access structures, out of a total of 2n n/2{$\cdot$} (1+O(logn/n)) of them. Our construction builds on recent works that construct better protocols for the conditional disclosure of secrets (CDS) problem.