Biblio
In this study, we have used the Image Similarity technique to detect the unknown or new type of malware using CNN ap- proach. CNN was investigated and tested with three types of datasets i.e. one from Vision Research Lab, which contains 9458 gray-scale images that have been extracted from the same number of malware samples that come from 25 differ- ent malware families, and second was benign dataset which contained 3000 different kinds of benign software. Benign dataset and dataset vision research lab were initially exe- cutable files which were converted in to binary code and then converted in to image files. We obtained a testing ac- curacy of 98% on Vision Research dataset.
In Android malware detection, recent work has shown that using contextual information of sensitive API invocation in the modeling of applications is able to improve the classification accuracy. However, the improvement brought by this context-awareness varies depending on how this information is used in the modeling. In this paper, we perform a comprehensive study on the effectiveness of using the contextual information in prior state-of-the-art detection systems. We find that this information has been "over-used" such that a large amount of non-essential metadata built into the models weakens the generalizability and longevity of the model, thus finally affects the detection accuracy. On the other hand, we find that the entrypoint of API invocation has the strongest impact on the classification correctness, which can further improve the accuracy if being properly captured. Based on this finding, we design and implement a lightweight, circumstance-aware detection system, named "PIKADROID" that only uses the API invocation and its entrypoint in the modeling. For extracting the meaningful entrypoints, PIKADROID applies a set of static analysis techniques to extract and sanitize the reachable entrypoints of a sensitive API, then constructs a frequency model for classification decision. In the evaluation, we show that this slim model significantly improves the detection accuracy on a data set of 23,631 applications by achieving an f-score of 97.41%, while maintaining a false positive rating of 0.96%.
This is very true for the Windows operating system (OS) used by government and private organizations. With Windows, the closed source nature of the operating system has unfortunately meant that hidden security issues are discovered very late and the fixes are not found in real time. There needs to be a reexamination of current static methods of malware detection. This paper presents an integrated system for automated and real-time monitoring and prediction of rootkit and malware threats for the Windows OS. We propose to host the target Windows machines on the widely used Xen hypervisor, and collect process behavior using virtual memory introspection (VMI). The collected data will be analyzed using state of the art machine learning techniques to quickly isolate malicious process behavior and alert system administrators about potential cyber breaches. This research has two focus areas: identifying memory data structures and developing prediction tools to detect malware. The first part of research focuses on identifying memory data structures affected by malware. This includes extracting the kernel data structures with VMI that are frequently targeted by rootkits/malware. The second part of the research will involve development of a prediction tool using machine learning techniques.
Microsoft's PowerShell is a command-line shell and scripting language that is installed by default on Windows machines. Based on Microsoft's .NET framework, it includes an interface that allows programmers to access operating system services. While PowerShell can be configured by administrators for restricting access and reducing vulnerabilities, these restrictions can be bypassed. Moreover, PowerShell commands can be easily generated dynamically, executed from memory, encoded and obfuscated, thus making the logging and forensic analysis of code executed by PowerShell challenging. For all these reasons, PowerShell is increasingly used by cybercriminals as part of their attacks' tool chain, mainly for downloading malicious contents and for lateral movement. Indeed, a recent comprehensive technical report by Symantec dedicated to PowerShell's abuse by cybercrimials [52] reported on a sharp increase in the number of malicious PowerShell samples they received and in the number of penetration tools and frameworks that use PowerShell. This highlights the urgent need of developing effective methods for detecting malicious PowerShell commands. In this work, we address this challenge by implementing several novel detectors of malicious PowerShell commands and evaluating their performance. We implemented both "traditional" natural language processing (NLP) based detectors and detectors based on character-level convolutional neural networks (CNNs). Detectors' performance was evaluated using a large real-world dataset. Our evaluation results show that, although our detectors (and especially the traditional NLP-based ones) individually yield high performance, an ensemble detector that combines an NLP-based classifier with a CNN-based classifier provides the best performance, since the latter classifier is able to detect malicious commands that succeed in evading the former. Our analysis of these evasive commands reveals that some obfuscation patterns automatically detected by the CNN classifier are intrinsically difficult to detect using the NLP techniques we applied. Our detectors provide high recall values while maintaining a very low false positive rate, making us cautiously optimistic that they can be of practical value.
Internet of Things (IoT) in military setting generally consists of a diverse range of Internet-connected devices and nodes (e.g. medical devices to wearable combat uniforms), which are a valuable target for cyber criminals, particularly state-sponsored or nation state actors. A common attack vector is the use of malware. In this paper, we present a deep learning based method to detect Internet Of Battlefield Things (IoBT) malware via the device's Operational Code (OpCode) sequence. We transmute OpCodes into a vector space and apply a deep Eigenspace learning approach to classify malicious and bening application. We also demonstrate the robustness of our proposed approach in malware detection and its sustainability against junk code insertion attacks. Lastly, we make available our malware sample on Github, which hopefully will benefit future research efforts (e.g. for evaluation of proposed malware detection approaches).
Detecting malicious code with exact match on collected datasets is becoming a large-scale identification problem due to the existence of new malware variants. Being able to promptly and accurately identify new attacks enables security experts to respond effectively. My proposal is to develop an automated framework for identification of unknown vulnerabilities by leveraging current neural network techniques. This has a significant and immediate value for the security field, as current anti-virus software is typically able to recognize the malware type only after its infection, and preventive measures are limited. Artificial Intelligence plays a major role in automatic malware classification: numerous machine-learning methods, both supervised and unsupervised, have been researched to try classifying malware into families based on features acquired by static and dynamic analysis. The value of automated identification is clear, as feature engineering is both a time-consuming and time-sensitive task, with new malware studied while being observed in the wild.
Nowadays, Malware has become a serious threat to the digitization of the world due to the emergence of various new and complex malware every day. Due to this, the traditional signature-based methods for detection of malware effectively becomes an obsolete method. The efficiency of the machine learning model in context to the detection of malware files has been proved by different researches and studies. In this paper, a framework has been developed to detect and classify different files (e.g exe, pdf, php, etc.) as benign and malicious using two level classifier namely, Macro (for detection of malware) and Micro (for classification of malware files as a Trojan, Spyware, Adware, etc.). Cuckoo Sandbox is used for generating static and dynamic analysis report by executing files in the virtual environment. In addition, a novel model is developed for extracting features based on static, behavioral and network analysis using analysis report generated by the Cuckoo Sandbox. Weka Framework is used to develop machine learning models by using training datasets. The experimental results using proposed framework shows high detection rate with an accuracy of 100% using J48 Decision tree model, 99% using SMO (Sequential Minimal Optimization) and 97% using Random Forest tree. It also shows effective classification rate with accuracy 100% using J48 Decision tree, 91% using SMO and 66% using Random Forest tree. These results are used for detecting and classifying unknown files as benign or malicious.
Malicious applications have become increasingly numerous. This demands adaptive, learning-based techniques for constructing malware detection engines, instead of the traditional manual-based strategies. Prior work in learning-based malware detection engines primarily focuses on dynamic trace analysis and byte-level n-grams. Our approach in this paper differs in that we use compiler intermediate representations, i.e., the callgraph representation of binaries. Using graph-based program representations for learning provides structure of the program, which can be used to learn more advanced patterns. We use the Shortest Path Graph Kernel (SPGK) to identify similarities between call graphs extracted from binaries. The output similarity matrix is fed into a Support Vector Machine (SVM) algorithm to construct highly-accurate models to predict whether a binary is malicious or not. However, SPGK is computationally expensive due to the size of the input graphs. Therefore, we evaluate different parallelization methods for CPUs and GPUs to speed up this kernel, allowing us to continuously construct up-to-date models in a timely manner. Our hybrid implementation, which leverages both CPU and GPU, yields the best performance, achieving up to a 14.2x improvement over our already optimized OpenMP version. We compared our generated graph-based models to previously state-of-the-art feature vector 2-gram and 3-gram models on a dataset consisting of over 22,000 binaries. We show that our classification accuracy using graphs is over 19% higher than either n-gram model and gives a false positive rate (FPR) of less than 0.1%. We are also able to consider large call graphs and dataset sizes because of the reduced execution time of our parallelized SPGK implementation.
As a vital component of variety cyber attacks, malicious domain detection becomes a hot topic for cyber security. Several recent techniques are proposed to identify malicious domains through analysis of DNS data because much of global information in DNS data which cannot be affected by the attackers. The attackers always recycle resources, so they frequently change the domain - IP resolutions and create new domains to avoid detection. Therefore, multiple malicious domains are hosted by the same IPs and multiple IPs also host same malicious domains in simultaneously, which create intrinsic association among them. Hence, using the labeled domains which can be traced back from queries history of all domains to verify and figure out the association of them all. Graphs seem the best candidate to represent for this relationship and there are many algorithms developed on graph with high performance. A graph-based interface can be developed and transformed to the graph mining task of inferring graph node's reputation scores using improvements of the belief propagation algorithm. Then higher reputation scores the nodes reveal, the more malicious probabilities they infer. For demonstration, this paper proposes a malicious domain detection technique and evaluates on a real-world dataset. The dataset is collected from DNS data servers which will be used for building a DNS graph. The proposed technique achieves high performance in accuracy rates over 98.3%, precision and recall rates as: 99.1%, 98.6%. Especially, with a small set of labeled domains (legitimate and malicious domains), the technique can discover a large set of potential malicious domains. The results indicate that the method is strongly effective in detecting malicious domains.
The application of machine learning for the detection of malicious network traffic has been well researched over the past several decades; it is particularly appealing when the traffic is encrypted because traditional pattern-matching approaches cannot be used. Unfortunately, the promise of machine learning has been slow to materialize in the network security domain. In this paper, we highlight two primary reasons why this is the case: inaccurate ground truth and a highly non-stationary data distribution. To demonstrate and understand the effect that these pitfalls have on popular machine learning algorithms, we design and carry out experiments that show how six common algorithms perform when confronted with real network data. With our experimental results, we identify the situations in which certain classes of algorithms underperform on the task of encrypted malware traffic classification. We offer concrete recommendations for practitioners given the real-world constraints outlined. From an algorithmic perspective, we find that the random forest ensemble method outperformed competing methods. More importantly, feature engineering was decisive; we found that iterating on the initial feature set, and including features suggested by domain experts, had a much greater impact on the performance of the classification system. For example, linear regression using the more expressive feature set easily outperformed the random forest method using a standard network traffic representation on all criteria considered. Our analysis is based on millions of TLS encrypted sessions collected over 12 months from a commercial malware sandbox and two geographically distinct, large enterprise networks.
The large number of malicious files that are produced daily outpaces the current capacity of malware analysis and detection. For example, Intel Security Labs reported that during the second quarter of 2016, their system found more than 40M of new malware [1]. The damage of malware attacks is also increasingly devastating, as witnessed by the recent Cryptowall malware that has reportedly generated more than \$325M in ransom payments to its perpetrators [2]. In terms of defense, it has been widely accepted that the traditional approach based on byte-string signatures is increasingly ineffective, especially for new malware samples and sophisticated variants of existing ones. New techniques are therefore needed for effective defense against malware. Motivated by this problem, the paper investigates a new defense technique against malware. The technique presented in this paper is utilized for automatic identification of malware packers that are used to obfuscate malware programs. Signatures of malware packers and obfuscators are extracted from the CFGs of malware samples. Unlike conventional byte signatures that can be evaded by simply modifying one or multiple bytes in malware samples, these signatures are more difficult to evade. For example, CFG-based signatures are shown to be resilient against instruction modifications and shuffling, as a single signature is sufficient for detecting mildly different versions of the same malware. Last but not least, the process for extracting CFG-based signatures is also made automatic.
Hardware Malware Detectors (HMDs) have recently been proposed as a defense against the proliferation of malware. These detectors use low-level features, that can be collected by the hardware performance monitoring units on modern CPUs to detect malware as a computational anomaly. Several aspects of the detector construction have been explored, leading to detectors with high accuracy. In this paper, we explore the question of how well evasive malware can avoid detection by HMDs. We show that existing HMDs can be effectively reverse-engineered and subsequently evaded, allowing malware to hide from detection without substantially slowing it down (which is important for certain types of malware). This result demonstrates that the current generation of HMDs can be easily defeated by evasive malware. Next, we explore how well a detector can evolve if it is exposed to this evasive malware during training. We show that simple detectors, such as logistic regression, cannot detect the evasive malware even with retraining. More sophisticated detectors can be retrained to detect evasive malware, but the retrained detectors can be reverse-engineered and evaded again. To address these limitations, we propose a new type of Resilient HMDs (RHMDs) that stochastically switch between different detectors. These detectors can be shown to be provably more difficult to reverse engineer based on resent results in probably approximately correct (PAC) learnability theory. We show that indeed such detectors are resilient to both reverse engineering and evasion, and that the resilience increases with the number and diversity of the individual detectors. Our results demonstrate that these HMDs offer effective defense against evasive malware at low additional complexity.
Information and communication technologies are extensively used to monitor and control electric microgrids. Although, such innovation enhance self healing, resilience, and efficiency of the energy infrastructure, it brings emerging security threats to be a critical challenge. In the context of microgrid, the cyber vulnerabilities may be exploited by malicious users for manipulate system parameters, meter measurements and price information. In particular, malware may be used to acquire direct access to monitor and control devices in order to destabilize the microgrid ecosystem. In this paper, we exploit a sandbox to analyze security vulnerability to malware of involved embedded smart-devices, by monitoring at different abstraction levels potential malicious behaviors. In this direction, the CoSSMic project represents a relevant case study.
The state-of-the-art Android malware often encrypts or encodes malicious code snippets to evade malware detection. In this paper, such undetectable codes are called Mysterious Codes. To make such codes detectable, we design a system called Droidrevealer to automatically identify Mysterious Codes and then decode or decrypt them. The prototype of Droidrevealer is implemented and evaluated with 5,600 malwares. The results show that 257 samples contain the Mysterious Codes and 11,367 items are exposed. Furthermore, several sensitive behaviors hidden in the Mysterious Codes are disclosed by Droidrevealer.
Modern cyber attacks are often conducted by distributing digital documents that contain malware. The approach detailed herein, which consists of a classifier that uses features derived from dynamic analysis of a document viewer as it renders the document in question, is capable of classifying the disposition of digital documents with greater than 98% accuracy even when its model is trained on just small amounts of data. To keep the classification model itself small and thereby to provide scalability, we employ an entity resolution strategy that merges syntactically disparate features that are thought to be semantically equivalent but vary due to programmatic randomness. Entity resolution enables construction of a comprehensive model of benign functionality using relatively few training documents, and the model does not improve significantly with additional training data.
As the malware threat landscape is constantly evolving and over one million new malware strains are being generated every day [1], early automatic detection of threats constitutes a top priority of cybersecurity research, and amplifies the need for more advanced detection and classification methods that are effective and efficient. In this paper, we present the application of machine learning algorithms to predict the length of time malware should be executed in a sandbox to reveal its malicious intent. We also introduce a novel hybrid approach to malware classification based on static binary analysis and dynamic analysis of malware. Static analysis extracts information from a binary file without executing it, and dynamic analysis captures the behavior of malware in a sandbox environment. Our experimental results show that by turning the aforementioned problems into machine learning problems, it is possible to get an accuracy of up to 90% on the prediction of the malware analysis run time and up to 92% on the classification of malware families.
In this paper, a novel method to do feature selection to detect botnets at their phase of Command and Control (C&C) is presented. A major problem is that researchers have proposed features based on their expertise, but there is no a method to evaluate these features since some of these features could get a lower detection rate than other. To this aim, we find the feature set based on connections of botnets at their phase of C&C, that maximizes the detection rate of these botnets. A Genetic Algorithm (GA) was used to select the set of features that gives the highest detection rate. We used the machine learning algorithm C4.5, this algorithm did the classification between connections belonging or not to a botnet. The datasets used in this paper were extracted from the repositories ISOT and ISCX. Some tests were done to get the best parameters in a GA and the algorithm C4.5. We also performed experiments in order to obtain the best set of features for each botnet analyzed (specific), and for each type of botnet (general) too. The results are shown at the end of the paper, in which a considerable reduction of features and a higher detection rate than the related work presented were obtained.
Ransomwares have become a growing threat since 2012, and the situation continues to worsen until now. The lack of security mechanisms and security awareness are pushing the systems into mire of ransomware attacks. In this paper, a new framework called 2entFOX' is proposed in order to detect high survivable ransomwares (HSR). To our knowledge this framework can be considered as one of the first frameworks in ransomware detection because of little publicly-available research in this field. We analyzed Windows ransomwares' behaviour and we tried to find appropriate features which are particular useful in detecting this type of malwares with high detection accuracy and low false positive rate. After hard experimental analysis we extracted 20 effective features which due to two highly efficient ones we could achieve an appropriate set for HSRs detection. After proposing architecture based on Bayesian belief network, the final evaluation is done on some known ransomware samples and unknown ones based on six different scenarios. The result of this evaluations shows the high accuracy of 2entFox in detection of HSRs.
In this paper we propose Mastino, a novel defense system to detect malware download events. A download event is a 3-tuple that identifies the action of downloading a file from a URL that was triggered by a client (machine). Mastino utilizes global situation awareness and continuously monitors various network- and system-level events of the clients' machines across the Internet and provides real time classification of both files and URLs to the clients upon submission of a new, unknown file or URL to the system. To enable detection of the download events, Mastino builds a large download graph that captures the subtle relationships among the entities of download events, i.e. files, URLs, and machines. We implemented a prototype version of Mastino and evaluated it in a large-scale real-world deployment. Our experimental evaluation shows that Mastino can accurately classify malware download events with an average of 95.5% true positive (TP), while incurring less than 0.5% false positives (FP). In addition, we show the Mastino can classify a new download event as either benign or malware in just a fraction of a second, and is therefore suitable as a real time defense system.
Mobile malware has recently become an acute problem. Existing solutions either base static reasoning on syntactic properties, such as exception handlers or configuration fields, or compute data-flow reachability over the program, which leads to scalability challenges. We explore a new and complementary category of features, which strikes a middleground between the above two categories. This new category focuses on security-relevant operations (communcation, lifecycle, etc) –- and in particular, their multiplicity and happens-before order –- as a means to distinguish between malicious and benign applications. Computing these features requires semantic, yet lightweight, modeling of the program's behavior. We have created a malware detection system for Android, MassDroid, that collects traces of security-relevant operations from the call graph via a scalable form of data-flow analysis. These are reduced to happens-before and multiplicity features, then fed into a supervised learning engine to obtain a malicious/benign classification. MassDroid also embodies a novel reporting interface, containing pointers into the code that serve as evidence supporting the determination. We have applied MassDroid to 35,000 Android apps from the wild. The results are highly encouraging with an F-score of 95% in standard testing, and textgreater90% when applied to previously unseen malware signatures. MassDroid is also efficient, requiring about two minutes per app. MassDroid is publicly available as a cloud service for malware detection.
Malware detection has been widely studied by analysing either file dropping relationships or characteristics of the file distribution network. This paper, for the first time, studies a global heterogeneous malware delivery graph fusing file dropping relationship and the topology of the file distribution network. The integration offers a unique ability of structuring the end-to-end distribution relationship. However, it brings large heterogeneous graphs to analysis. In our study, an average daily generated graph has more than 4 million edges and 2.7 million nodes that differ in type, such as IPs, URLs, and files. We propose a novel Bayesian label propagation model to unify the multi-source information, including content-agnostic features of different node types and topological information of the heterogeneous network. Our approach does not need to examine the source codes nor inspect the dynamic behaviours of a binary. Instead, it estimates the maliciousness of a given file through a semi-supervised label propagation procedure, which has a linear time complexity w.r.t. the number of nodes and edges. The evaluation on 567 million real-world download events validates that our proposed approach efficiently detects malware with a high accuracy.
A normal computer infected with malware is difficult to detect. There have been several approaches in the last years which analyze the behavior of malware and obtain good results. The malware traffic may be detected, but it is very common to miss-detect normal traffic as malicious and generate false positives. This is specially the case when the methods are tested in real and large networks. The detection errors are generated due to the malware changing and rapidly adapting its domains and patterns to mimic normal connections. To better detect malware infections and separate them from normal traffic we propose to detect the behavior of the group of connections generated by the malware. It is known that malware usually generates various related connections simultaneously and therefore it shows a group pattern. Based on previous experiments, this paper suggests that the behavior of a group of connections can be modelled as a directed cyclic graph with special properties, such as its internal patterns, relationships, frequencies and sequences of connections. By training the group models on known traffic it may be possible to better distinguish between a malware connection and a normal connection.
Kernel rootkits often hide associated malicious processes by altering reported task struct information to upper layers and applications such as ps and top. Virtualized settings offer a unique opportunity to mitigate this behavior using dynamic virtual machine introspection (VMI). For known kernels, VMI can be deployed to search for kernel objects and identify them by using unique data structure "signatures". In existing work, VMI-detected data structure signatures are based on values and structural features which must be (often exactly) present in memory snapshots taken, for accurate detection. This features a certain brittleness and rootkits can escape detection by simply temporarily "un-tangling" the corresponding structures when not running. Here we introduce a new paradigm, that defeats such behavior by training for and observing signatures of timing access patterns to any and all kernel-mapped data regions, including objects that are not directly linked in the "official" list of tasks. The use of timing information in training detection signatures renders the defenses resistant to attacks that try to evade detection by removing their corresponding malicious processes before scans. KXRay successfully detected processes hidden by four traditional rootkits.
Malware evolves perpetually and relies on increasingly so- phisticated attacks to supersede defense strategies. Data-driven approaches to malware detection run the risk of becoming rapidly antiquated. Keeping pace with malware requires models that are periodically enriched with fresh knowledge, commonly known as retraining. In this work, we propose the use of Venn-Abers predictors for assessing the quality of binary classification tasks as a first step towards identifying antiquated models. One of the key benefits behind the use of Venn-Abers predictors is that they are automatically well calibrated and offer probabilistic guidance on the identification of nonstationary populations of malware. Our framework is agnostic to the underlying classification algorithm and can then be used for building better retraining strategies in the presence of concept drift. Results obtained over a timeline-based evaluation with about 90K samples show that our framework can identify when models tend to become obsolete.