Biblio
We regularly use communication apps like Facebook and WhatsApp on our smartphones, and the exchange of media, particularly images, has grown at an exponential rate. There are over 3 billion images shared every day on WhatsApp alone. In such a scenario, the management of images on a mobile device has become highly inefficient, and this leads to problems like low storage, manual deletion of images, and disorganization. In this paper, we present a solution to tackle these issues by automatically classifying every image on a smartphone into a set of predefined categories, thereby segregating spam images from the rest and allowing the user to delete them seamlessly.
Despite widespread use of commercial anti-virus products, the number of malicious files detected on home and corporate computers continues to increase at a significant rate. Recently, anti-virus companies have started investing in machine learning solutions to augment signatures manually designed by analysts. A malicious file's determination is often represented as a hierarchical structure consisting of a type (e.g. Worm, Backdoor), a platform (e.g. Win32, Win64), a family (e.g. Rbot, Rugrat) and a family variant (e.g. A, B). While there has been substantial research in automated malware classification, the aforementioned hierarchical structure, which can provide additional information to the classification models, has been ignored. In this paper, we propose the novel idea and study the performance of employing hierarchical learning algorithms for automated classification of malicious files. To the best of our knowledge, this is the first research effort which incorporates the hierarchical structure of the malware label in its automated classification and in the security domain, in general. It is important to note that our method does not require any additional effort by analysts because they typically assign these hierarchical labels today. Our empirical results on a real world, industrial-scale malware dataset of 3.6 million files demonstrate that incorporation of the label hierarchy achieves a significant reduction of 33.1% in the binary error rate as compared to a non-hierarchical classifier which is traditionally used in such problems.
Malware classification is a critical part of cyber-security. Traditional methodologies for malware classification typically use static analysis and dynamic analysis to identify malware. In this paper, a malware classification methodology based on the binary image of malware and the extraction of local binary pattern (LBP) features is proposed. First, malware images are reorganized into 3 by 3 grids, which are mainly used to extract the LBP features. Second, LBP is applied to the malware images to extract features, because it is useful in pattern or texture classification. Finally, TensorFlow, a library for machine learning, is applied to classify malware images with the LBP features. Performance comparison results among different classifiers with different image descriptors such as GIST, a spatial envelope, and LBP demonstrate that our proposed approach outperforms others.
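As an illustration of the LBP step described in the abstract above, the following is a minimal sketch of the basic 3 by 3 LBP operator applied to a byte image. It is not the authors' implementation: the function names (`bytes_to_image`, `lbp_code`, `lbp_histogram`), the clockwise neighbour order, the "greater than or equal" threshold rule and the 256-bin histogram are all assumptions chosen for illustration, and the classifier stage is omitted.

```python
def bytes_to_image(data, width):
    """Reshape a byte string into rows of `width` pixels, dropping any partial last row."""
    rows = len(data) // width
    return [list(data[i * width:(i + 1) * width]) for i in range(rows)]


def lbp_code(img, r, c):
    """8-bit LBP code for the interior pixel at (r, c) from its 3x3 neighbourhood."""
    center = img[r][c]
    # Neighbours visited clockwise, starting at the top-left pixel (assumed order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the center contributes a 1 bit.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels of the image."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist
```

The resulting histogram is the kind of fixed-length texture feature vector that could then be passed to any classifier; border pixels are skipped here because they lack a full 3 by 3 neighbourhood.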
Anti-virus vendors receive hundreds of thousands of malware samples to be analyzed each day. Some are new malware while others are variations or evolutions of existing malware. Because analyzing each malware sample by hand is impossible, automated techniques to analyze and categorize incoming samples are needed. In this work, we explore various machine learning features extracted from malware samples through static analysis for classification of malware binaries into already known malware families. We present a new feature based on control statement shingling that has a comparable accuracy to ordinary opcode n-gram based features while requiring smaller dimensions. This, in turn, results in a shorter training time.
Malware writers often develop malware with automated measures, so the number of malware has increased dramatically. Automated measures tend to repeatedly use significant modules, which form the basis for identifying malware variants and discriminating malware families. Thus, we propose a novel visualization analysis method for researching malware similarity. This method converts malicious Windows Portable Executable (PE) files into local entropy images for observing internal features of malware, and then normalizes local entropy images into entropy pixel images for malware classification. We take advantage of the Jaccard index to measure similarities between entropy pixel images and the k-Nearest Neighbor (kNN) classification algorithm to assign entropy pixel images to different malware families. Preliminary experimental results show that our visualization method can discriminate malware families effectively.
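The pipeline in the abstract above (local entropy, normalization into entropy pixels, Jaccard similarity, kNN) can be sketched roughly as follows. This is only a sketch under stated assumptions, not the paper's method: the block size of 256 bytes, the 16 quantization levels, and the choice to compare images as sets of (position, value) pairs are all hypothetical, as are the function names.

```python
import math
from collections import Counter


def block_entropy(block):
    """Shannon entropy in bits per byte of a bytes block; 0.0 for an empty block."""
    if not block:
        return 0.0
    n = len(block)
    return -sum((k / n) * math.log2(k / n) for k in Counter(block).values())


def entropy_pixels(data, block_size=256, levels=16):
    """Quantize each block's entropy (0 to 8 bits) into an integer pixel in [0, levels)."""
    pixels = []
    for i in range(0, len(data), block_size):
        h = block_entropy(data[i:i + block_size])
        pixels.append(min(levels - 1, int(h / 8.0 * levels)))
    return pixels


def jaccard(a, b):
    """Jaccard index of two pixel sequences, viewed as sets of (position, value) pairs."""
    sa, sb = set(enumerate(a)), set(enumerate(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def knn_predict(query, labelled, k=3):
    """Majority family label among the k labelled images most similar to `query`."""
    ranked = sorted(labelled, key=lambda item: jaccard(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Here `labelled` is assumed to be a list of `(pixels, family)` pairs; for example, `knn_predict(entropy_pixels(sample), labelled)` would assign a family to a new sample's raw bytes.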
Malicious applications have become increasingly numerous. This demands adaptive, learning-based techniques for constructing malware detection engines, instead of the traditional manual-based strategies. Prior work in learning-based malware detection engines primarily focuses on dynamic trace analysis and byte-level n-grams. Our approach in this paper differs in that we use compiler intermediate representations, i.e., the callgraph representation of binaries. Using graph-based program representations for learning provides structure of the program, which can be used to learn more advanced patterns. We use the Shortest Path Graph Kernel (SPGK) to identify similarities between call graphs extracted from binaries. The output similarity matrix is fed into a Support Vector Machine (SVM) algorithm to construct highly-accurate models to predict whether a binary is malicious or not. However, SPGK is computationally expensive due to the size of the input graphs. Therefore, we evaluate different parallelization methods for CPUs and GPUs to speed up this kernel, allowing us to continuously construct up-to-date models in a timely manner. Our hybrid implementation, which leverages both CPU and GPU, yields the best performance, achieving up to a 14.2x improvement over our already optimized OpenMP version. We compared our generated graph-based models to previously state-of-the-art feature vector 2-gram and 3-gram models on a dataset consisting of over 22,000 binaries. We show that our classification accuracy using graphs is over 19% higher than either n-gram model and gives a false positive rate (FPR) of less than 0.1%. We are also able to consider large call graphs and dataset sizes because of the reduced execution time of our parallelized SPGK implementation.
Due to the unavailability of signatures for previously unknown malware, non-signature malware detection schemes typically rely on analyzing program behavior. Prior behavior-based non-signature malware detection schemes are either easily evadable by obfuscation or are very inefficient in terms of storage space and detection time. In this paper, we propose GZero, a graph theoretic approach for fast and accurate non-signature malware detection at end hosts. GZero is effective while being efficient in terms of both storage space and detection time. We conducted experiments on a large set of both benign software and malware. Our results show that GZero achieves more than 99% detection rate and a false positive rate of less than 1%, with less than 1 second of average scan time per program, and is relatively robust to obfuscation attacks. Due to its low overheads, GZero can complement existing malware detection solutions at end hosts.
With the growing popularity of Android, its attack surface has also increased. The prevalence of third-party Android marketplaces gives attackers an opportunity to plant their malicious apps in the mobile ecosystem. To evade signature-based detection, attackers often transform their malware, for instance, by introducing code-level changes. In this paper we propose a lightweight static Permission Flow Graph (PFG) based approach to detect malware even when it has been transformed (obfuscated). A number of techniques based on behavioral analysis have also been proposed in the past; however, our interest lies in leveraging the permission framework alone to detect malware variants and transformations without considering behavioral aspects of the malware. Our proposed approach constructs a Permission Flow Graph (PFG) for an Android app. Transformations performed at code level often change the control flow; however, most of the time, the permission flow remains invariant. As a consequence, the PFGs of transformed and non-transformed malware remain structurally similar, as shown in this paper using a state-of-the-art graph similarity algorithm. Furthermore, we propose graph-based similarity metrics at both edge level and vertex level in order to bring forth the structural similarity of the two PFGs being compared. We validate our proposed methodology through machine learning algorithms. Results show that our approach successfully groups Android malware and its variants (transformations) in the same cluster. Further, we demonstrate that our proposed approach is able to detect transformed malware with a detection accuracy of 98.26%, thereby ensuring that malicious apps can be detected even after transformations.
In recent years, wireless ad hoc networks have seen a growing number of applications. A large part of the research has focused on Mobile Ad Hoc Networks (MANETs), due to their use in vehicular networks and battlefield communications, among others. Work on these peer-to-peer networks usually tests novel communication protocols but leaves out network security. A wide range of attacks can happen, as in wired networks, some of them being more damaging in MANETs. Because of the characteristics of these networks, conventional methods for detecting attack traffic are ineffective. Intrusion Detection Systems (IDSs) are built on various detection techniques, one of the most important being anomaly detection. IDSs based only on past attack signatures are less effective, even more so if these IDSs are centralized. Our work focuses on adding a novel machine learning technique to the detection engine, which recognizes attack traffic online (rather than storing it and analyzing it afterwards), rewriting IDS rules on the fly. Experiments were done using the Dockemu emulation tool with Linux Containers, IPv6, and OLSR as the routing protocol, leading to promising results.
The rapid development of the Internet has recently resulted in massive information overload. This information is usually represented by high-dimensional feature vectors in many related applications such as recognition, classification and retrieval. These applications usually need efficient indexing and search methods for such large-scale, high-dimensional databases, which is typically a challenging task. Some efforts have been made that solve this problem to some extent. However, most of them are implemented on a single machine, which is not suitable for handling large-scale databases. In this paper, we present a novel data index structure and nearest neighbor search algorithm implemented on Apache Spark. We impose a grid on the database and index data by non-empty grid cells. This grid-based index structure is simple and easy to implement in parallel. Moreover, we propose to build a scalable KNN graph on the grids, which increases the efficiency of this index structure at a low cost in the parallel implementation. Finally, experiments are conducted on both public and synthetic databases, showing that the proposed methods achieve overall high performance in both efficiency and accuracy.
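The idea of indexing data by non-empty grid cells, as described in the abstract above, can be sketched on a single machine as follows. This is an illustrative assumption-laden sketch rather than the paper's Spark implementation: the uniform `cell_size`, the function names, and the search over only the query's cell and its adjacent cells are hypothetical choices, and the KNN graph over grids is not shown.

```python
import math
from collections import defaultdict
from itertools import product


def build_grid_index(points, cell_size):
    """Map each non-empty grid cell (tuple of integer coordinates) to the points inside it."""
    index = defaultdict(list)
    for p in points:
        cell = tuple(math.floor(x / cell_size) for x in p)
        index[cell].append(p)
    return dict(index)  # Empty cells are never stored.


def nearest_neighbor(index, query, cell_size):
    """Approximate nearest neighbour: scan only the query's cell and its adjacent cells."""
    home = tuple(math.floor(x / cell_size) for x in query)
    best, best_dist = None, math.inf
    for delta in product((-1, 0, 1), repeat=len(home)):
        cell = tuple(h + d for h, d in zip(home, delta))
        for p in index.get(cell, ()):
            dist = math.dist(p, query)
            if dist < best_dist:
                best, best_dist = p, dist
    return best  # None if no point lies in the scanned cells.
```

Scanning all adjacent cells grows as 3 to the power of the dimension, so this naive neighbourhood scan is only practical in low dimensions; cells are independent of one another, which is what makes a grid index of this kind straightforward to partition across workers.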
The following topics are dealt with: feature extraction; data mining; support vector machines; mobile computing; photovoltaic power systems; mean square error methods; fault diagnosis; natural language processing; control system synthesis; and Internet of Things.
Here we explore the applicability of the traditional sliding window based convolutional neural network (CNN) detection pipeline and region based object detection techniques such as Faster Region-based CNN (R-CNN) and Region-based Fully Convolutional Networks (R-FCN) to the problem of object detection in X-ray security imagery. Within this context, with limited dataset availability, we employ a transfer learning paradigm for network training, tackling both single and multiple object detection problems over a number of R-CNN/R-FCN variants. The use of first-stage region proposal within Faster R-CNN and R-FCN provides results superior to the traditional sliding window driven CNN (SWCNN) approach. With the use of Faster R-CNN with VGG16, pretrained on the ImageNet dataset, we achieve 88.3 mAP for a six object class X-ray detection problem. The use of R-FCN with ResNet-101 yields 96.3 mAP for the two class firearm detection problem, requiring 0.1 second computation per image. Overall we illustrate the comparative performance of these techniques as object localization strategies within cluttered X-ray security imagery.
PHP is one of the most popular web development tools in use today. A major concern though is the improper and insecure uses of the language by application developers, motivating the development of various static analyses that detect security vulnerabilities in PHP programs. However, many of these approaches do not handle recent, important PHP features such as object orientation, which greatly limits the use of such approaches in practice. In this paper, we present OOPIXY, a security analysis tool that extends the PHP security analyzer PIXY to support reasoning about object-oriented features in PHP applications. Our empirical evaluation shows that OOPIXY detects 88% of security vulnerabilities found in micro benchmarks. When used on real-world PHP applications, OOPIXY detects security vulnerabilities that could not be detected using state-of-the-art tools, retaining a high level of precision. We have contacted the maintainers of those applications, and two applications' development teams verified the correctness of our findings. They are currently working on fixing the bugs that lead to those vulnerabilities.
The large number of malicious files that are produced daily outpaces the current capacity of malware analysis and detection. For example, Intel Security Labs reported that during the second quarter of 2016, their system found more than 40M new malware samples [1]. The damage of malware attacks is also increasingly devastating, as witnessed by the recent Cryptowall malware that has reportedly generated more than $325M in ransom payments to its perpetrators [2]. In terms of defense, it has been widely accepted that the traditional approach based on byte-string signatures is increasingly ineffective, especially for new malware samples and sophisticated variants of existing ones. New techniques are therefore needed for effective defense against malware. Motivated by this problem, the paper investigates a new defense technique against malware. The technique presented in this paper is utilized for automatic identification of malware packers that are used to obfuscate malware programs. Signatures of malware packers and obfuscators are extracted from the CFGs of malware samples. Unlike conventional byte signatures that can be evaded by simply modifying one or multiple bytes in malware samples, these signatures are more difficult to evade. For example, CFG-based signatures are shown to be resilient against instruction modifications and shuffling, as a single signature is sufficient for detecting mildly different versions of the same malware. Last but not least, the process for extracting CFG-based signatures is also made automatic.
The mitigation of insider threats against databases is a challenging problem as insiders often have legitimate access privileges to sensitive data. Therefore, conventional security mechanisms, such as authentication and access control, may be insufficient for the protection of databases against insider threats and need to be complemented with techniques that support real-time detection of access anomalies. The existing real-time anomaly detection techniques consider anomalies in references to the database entities and the amounts of accessed data. However, they are unable to track the access frequencies. According to recent security reports, an increase in the access frequency by an insider is an indicator of a potential data misuse and may be the result of malicious intents for stealing or corrupting the data. In this paper, we propose techniques for tracking users' access frequencies and detecting anomalous related activities in real-time. We present detailed algorithms for constructing accurate profiles that describe the access patterns of the database users and for matching subsequent accesses by these users to the profiles. Our methods report and log mismatches as anomalies that may need further investigation. We evaluated our techniques on the OLTP-Benchmark. The results of the evaluation indicate that our techniques are very effective in the detection of anomalies.
Malicious emails pose substantial threats to businesses. Whether it is a malware attachment or a URL leading to malware, exploitation or phishing, attackers have been employing emails as an effective way to gain a foothold inside organizations of all kinds. To combat email threats, especially targeted attacks, traditional signature- and rule-based email filtering as well as advanced sandboxing technology both have their own weaknesses. In this paper, we propose a predictive analysis approach that learns the differences between legit and malicious emails through static analysis, creates a machine learning model and makes detection and prediction on unseen emails effectively and efficiently. By comparing three different machine learning algorithms, our preliminary evaluation reveals that a Random Forests model performs the best.
Ransomware is one of the fastest-growing types of malware used by cyber-criminals in recent days. This type of malware uses cryptographic technology to encrypt a user's important files and folders, makes the computer system unusable, holds the decryption key, and demands a ransom from the victims for recovery. Recent ransomware families are very sophisticated and difficult to analyze and detect using static features only. On the other hand, the latest crypto-ransomware has sandbox- and IDS-evading capabilities. Therefore, static or dynamic analysis of the ransomware alone cannot provide a better solution. In this paper, we present a machine learning based approach that uses an integrated method, a combination of static and dynamic analysis, to detect ransomware. The experimental test samples were taken from almost all ransomware families, including the most recent "WannaCry". The results also suggest that combined analysis can detect ransomware with better accuracy than either individual analysis approach. Since ransomware samples show some "run-time" and "static code" features, it also helps with the early detection of new and similar ransomware variants.
The continuous advance in recent cloud-based computer networks has generated a number of security challenges associated with intrusions in network systems. With the exponential increase in the volume of network traffic data, involving humans in such detection systems is time-consuming and a non-trivial problem. Secondly, network traffic data tends to be highly dimensional, comprising numerous features and attributes, making classification challenging and thus susceptible to the curse of dimensionality. Given such scenarios, the need arises for dimensionality reduction and feature selection, combined with machine learning techniques, in the classification of such data. Therefore, as a contribution, this paper seeks to employ data mining techniques in a cloud-based environment by selecting the attributes and features with the least importance in terms of weight for the classification. Often the standard is to select features with better weights while ignoring those with the least weights. In this study, we seek to find out whether we can make predictions using those features with the least weights. The motivation is that adversaries use stealth to hide their activities from the obvious. The question then is, can we predict any stealth activity of an adversary using the least observed attributes? In this particular study, we employ information gain to select attributes with the lowest weights and then apply machine learning to classify whether a combination, in this case, of both source and destination ports is attacked or not. The aim of this investigation is to determine whether attributes that are of least importance can be used to predict if an attack could occur. Our preliminary results show that even when the source and destination port attributes are used in combination with features with the least weights, it is possible to classify such network traffic data and predict if an attack will occur or not.
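The attribute selection step in the abstract above, ranking features by information gain and keeping the lowest-weighted ones, might look roughly like the sketch below. It assumes discrete feature values and records stored as dictionaries; the function names and the data layout are hypothetical and are not taken from the paper, which does not describe its implementation here.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy in bits of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """H(labels) minus H(labels | feature) for one discrete feature."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def lowest_gain_features(rows, labels, k):
    """Names of the k features with the smallest information gain."""
    gains = {
        name: information_gain([r[name] for r in rows], labels)
        for name in rows[0]
    }
    return sorted(gains, key=gains.get)[:k]
```

With records such as `{"src_port": 80, "flag": "S0"}` and labels such as `"attack"` or `"normal"`, `lowest_gain_features(rows, labels, k)` returns the k least informative attribute names, which is the opposite of the usual practice of keeping the highest-ranked ones.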
Nowadays, an increasing number of IoT vendors have compiled and deployed third-party code bases across different architectures. Therefore, to prevent firmware from being affected by the same known vulnerabilities, searching for known vulnerabilities in binary firmware across different architectures is more crucial than ever. However, most existing vulnerability search methods are limited to a single architecture; there are only a few studies on cross-architecture cases, and their accuracy is not high. In this paper, to improve the accuracy of existing cross-architecture vulnerability search methods, we propose a new approach based on Support Vector Machine (SVM) and Attributed Control Flow Graph (ACFG) to search for known vulnerabilities in firmware across different architectures at function level. We employ a known vulnerable function to recognize suspicious functions in other binary firmware. First, considering the internal and external characteristics of the functions, we extract the function-level features and basic-block-level features of the functions to be inspected. Second, we employ SVM to identify a small set of suspicious functions based on function-level features. After this preliminary screening, we compute the graph similarity between the vulnerable function and the suspicious functions based on their ACFGs. We have implemented our approach as CVSSA, and used the training samples to train the model with prior knowledge to improve the accuracy. We also search for several vulnerabilities in real-world firmware images; the experimental results show that CVSSA can be applied to realistic scenarios.
Network traffic identification has been a hot topic in the network security area. The identification of abnormal traffic can detect attack traffic and helps network managers enforce corresponding security policies to prevent attacks. Support Vector Machines (SVMs) are one of the most promising supervised machine learning (ML) algorithms that can be applied to the identification of traffic in IP networks as well as the detection of abnormal traffic. SVM shows better performance because it can avoid the local optimization problems that exist in many supervised learning algorithms. However, as a binary classification approach, SVM needs more research in multiclass classification. In this paper, we propose an abnormal traffic identification system (ATIS) that can classify and identify multiple attack traffic applications. Each component of ATIS is introduced in detail and experiments are carried out based on ATIS. In tests on the KDD CUP dataset, SVM shows good performance. Furthermore, a comparison of experiments reveals that scaling and parameters have a vital impact on SVM training results.
Based on the feature analysis of image content, this paper proposes a novel steganalytic method for grayscale images in the spatial domain. In this work, we first investigate the directional lifting wavelet transform (DLWT) as a sparse representation in the compressive sensing (CS) domain. Then a block CS (BCS) measurement matrix is designed using the generalized Gaussian distribution (GGD) model, in which the measurement matrix can be used to sense the DLWT coefficients of images to reflect the feature residual introduced by steganography. Extensive experiments show that the proposed CS-based scheme is feasible and universal for detecting steganography in the spatial domain.
Recently, due to the increase of outsourcing in IC design, it has been reported that malicious third-party vendors often insert hardware Trojans into their ICs. How to detect them is a strong concern in IC design process. The features of hardware-Trojan infected nets (or Trojan nets) in ICs often differ from those of normal nets. To classify all the nets in netlists designed by third-party vendors into Trojan ones and normal ones, we have to extract effective Trojan features from Trojan nets. In this paper, we first propose 51 Trojan features which describe Trojan nets from netlists. Based on the importance values obtained from the random forest classifier, we extract the best set of 11 Trojan features out of the 51 features which can effectively detect Trojan nets, maximizing the F-measures. By using the 11 Trojan features extracted, the machine-learning based hardware Trojan classifier has achieved at most 100% true positive rate as well as 100% true negative rate in several TrustHUB benchmarks and obtained the average F-measure of 74.6%, which realizes the best values among existing machine-learning-based hardware-Trojan detection methods.
With the amount of user-contributed image data increasing, users face a potential threat in that anyone may gain access to their private information. To reduce the possibility of losing real information, this paper combines a homomorphic encryption scheme and image feature extraction to provide a guarantee for users' privacy. In this paper, the whole system model mainly consists of three parts: the social network service provider (SP), the interested party (IP) and the applications. Except for the image preprocessing phase, the main operations of feature extraction are conducted in the ciphertext domain, which means only the SP has access to the users' private data. The extraction algorithm is used to obtain a multi-dimensional histogram descriptor as the image feature for each image. As a result, the histogram descriptor can be extracted correctly in the encrypted domain in an acceptable time. Besides, the extracted feature can represent the image effectively because of its relatively high accuracy. Additionally, many different applications can be built on the encrypted features because of the support of our encryption scheme.
Explosive naval mines pose a threat to ocean and sea faring vessels, both military and civilian. This work applies deep neural network (DNN) methods to the problem of detecting minelike objects (MLO) on the seafloor in side-scan sonar imagery. We explored how the DNN depth, memory requirements, calculation requirements, and training data distribution affect detection efficacy. A visualization technique (class activation map) was incorporated that aids a user in interpreting the model's behavior. We found that modest DNN model sizes yielded better accuracy (98%) than very simple DNN models (93%) and a support vector machine (78%). The largest DNN models achieved <1% efficacy increase at a cost of a 17x increase of trainable parameter count and computation requirements. In contrast to DNNs popularized for many-class image recognition tasks, the models for this task require far fewer computational resources (0.3% of parameters), and are suitable for embedded use within an autonomous unmanned underwater vehicle.