Biblio
The task of attack attribution, i.e., identifying the entity responsible for an attack, is complicated and usually requires the involvement of an experienced security expert. Prior attempts to automate attack attribution apply various machine learning techniques on features extracted from the malware's code and behavior in order to identify other similar malware whose authors are known. However, the same malware can be reused by multiple actors, and the actor who performed an attack using a malware might differ from the malware's author. Moreover, information collected during an incident may contain many clues about the identity of the attacker in addition to the malware used. In this paper, we propose a method of attack attribution based on textual analysis of threat intelligence reports, using state of the art algorithms and models from the fields of machine learning and natural language processing (NLP). We have developed a new text representation algorithm which captures the context of the words and requires minimal feature engineering. Our approach relies on vector space representation of incident reports derived from a small collection of labeled reports and a large corpus of general security literature. Both datasets have been made available to the research community. Experimental results show that the proposed representation can attribute attacks more accurately than the baselines' representations. In addition, we show how the proposed approach can be used to identify novel previously unseen threat actors and identify similarities between known threat actors.
Phishing is typically deployed as an attack vector in the initial stages of a hacking endeavour. Due to it low-risk rightreward nature it has seen a widespread adoption, and detecting it has become a challenge in recent times. This paper proposes a novel means of detecting phishing websites using a Generative Adversarial Network. Taking into account the internal structure and external metadata of a website, the proposed approach uses a generator network which generates both legitimate as well as synthetic phishing features to train a discriminator network. The latter then determines if the features are either normal or phishing websites, before improving its detection accuracy based on the classification error. The proposed approach is evaluated using two different phishing datasets and is found to achieve a detection accuracy of up to 94%.
In recent days, Enterprises are expanding their business efficiently through web applications which has paved the way for building good consumer relationship with its customers. The major threat faced by these enterprises is their inability to provide secure environments as the web applications are prone to severe vulnerabilities. As a result of this, many security standards and tools have been evolving to handle the vulnerabilities. Though there are many vulnerability detection tools available in the present, they do not provide sufficient information on the attack. For the long-term functioning of an organization, data along with efficient analytics on the vulnerabilities is required to enhance its reliability. The proposed model thus aims to make use of Machine Learning with Analytics to solve the problem in hand. Hence, the sequence of the attack is detected through the pattern using PAA and further the detected vulnerabilities are classified using Machine Learning technique such as SVM. Probabilistic results are provided in order to obtain numerical data sets which could be used for obtaining a report on user and application behavior. Dynamic and Reconfigurable PAA with SVM Classifier is a challenging task to analyze the vulnerabilities and impact of these vulnerabilities in heterogeneous web environment. This will enhance the former processing by analysis of the origin and the pattern of the attack in a more effective manner. Hence, the proposed system is designed to perform detection of attacks. The system works on the mitigation and prevention as part of the attack prediction.
In recent years, there is a surge of interest in approaches pertaining to security issues of Internet of Things deployments and applications that leverage machine learning and deep learning techniques. A key prerequisite for enabling such approaches is the development of scalable infrastructures for collecting and processing security-related datasets from IoT systems and devices. This paper introduces such a scalable and configurable data collection infrastructure for data-driven IoT security. It emphasizes the collection of (security) data from different elements of IoT systems, including individual devices and smart objects, edge nodes, IoT platforms, and entire clouds. The scalability of the introduced infrastructure stems from the integration of state of the art technologies for large scale data collection, streaming and storage, while its configurability relies on an extensible approach to modelling security data from a variety of IoT systems and devices. The approach enables the instantiation and deployment of security data collection systems over complex IoT deployments, which is a foundation for applying effective security analytics algorithms towards identifying threats, vulnerabilities and related attack patterns.
The analysis of security-related event logs is an important step for the investigation of cyber-attacks. It allows tracing malicious activities and lets a security operator find out what has happened. However, since IT landscapes are growing in size and diversity, the amount of events and their highly different representations are becoming a Big Data challenge. Unfortunately, current solutions for the analysis of security-related events, so called Security Information and Event Management (SIEM) systems, are not able to keep up with the load. In this work, we propose a distributed SIEM platform that makes use of highly efficient distributed normalization and persists event data into an in-memory database. We implement the normalization on common distribution frameworks, i.e. Spark, Storm, Trident and Heron, and compare their performance with our custom-built distribution solution. Additionally, different tuning options are introduced and their speed advantage is presented. In the end, we show how the writing into an in-memory database can be tuned to achieve optimal persistence speed. Using the proposed approach, we are able to not only fully normalize, but also persist more than 20 billion events per day with relatively small client hardware. Therefore, we are confident that our approach can handle the load of events in even very large IT landscapes.
Data analytics is being increasingly used in cyber-security problems, and found to be useful in cases where data volumes and heterogeneity make it cumbersome for manual assessment by security experts. In practical cyber-security scenarios involving data-driven analytics, obtaining data with annotations (i.e. ground-truth labels) is a challenging and known limiting factor for many supervised security analytics task. Significant portions of the large datasets typically remain unlabelled, as the task of annotation is extensively manual and requires a huge amount of expert intervention. In this paper, we propose an effective active learning approach that can efficiently address this limitation in a practical cyber-security problem of Phishing categorization, whereby we use a human-machine collaborative approach to design a semi-supervised solution. An initial classifier is learnt on a small amount of the annotated data which in an iterative manner, is then gradually updated by shortlisting only relevant samples from the large pool of unlabelled data that are most likely to influence the classifier performance fast. Prioritized Active Learning shows a significant promise to achieve faster convergence in terms of the classification performance in a batch learning framework, and thus requiring even lesser effort for human annotation. An useful feature weight update technique combined with active learning shows promising classification performance for categorizing Phishing/malicious URLs without requiring a large amount of annotated training samples to be available during training. In experiments with several collections of PhishMonger's Targeted Brand dataset, the proposed method shows significant improvement over the baseline by as much as 12%.
Energy management systems (EMS) are used to control energy usage in buildings and campuses, by employing technologies such as supervisory control and data acquisition (SCADA) and building management systems (BMS), in order to provide reliable energy supply and maximise user comfort while minimising energy usage. Historically, EMS systems were installed when potential security threats were only physical. Nowadays, EMS systems are connected to the building network and as a result directly to the outside world. This extends the attack surface to potential sophisticated cyber-attacks, which adversely impact EMS operation, resulting in service interruption and downstream financial implications. Currently, the security systems that detect attacks operate independently to those which deploy resiliency policies and use very basic methods. We propose a novel EMS cyber-physical-security framework that executes a resilient policy whenever an attack is detected using security analytics. In this framework, both the resilient policy and the security analytics are driven by EMS data, where the physical correlations between the data-points are identified to detect outliers and then the control loop is closed using an estimated value in place of the outlier. The framework has been tested using a reduced order model of a real EMS site.
We present a novel Cyber Security analytics framework. We demonstrate a comprehensive cyber security monitoring system to construct cyber security correlated events with feature selection to anticipate behaviour based on various sensors.
Cyber-attacks have been evolved in a way to be more sophisticated by employing combinations of attack methodologies with greater impacts. For instance, Advanced Persistent Threats (APTs) employ a set of stealthy hacking processes running over a long period of time, making it much hard to detect. With this trend, the importance of big-data security analytics has taken greater attention since identifying such latest attacks requires large-scale data processing and analysis. In this paper, we present SEAS-MR (Security Event Aggregation System over MapReduce) that facilitates scalable security event aggregation for comprehensive situation analysis. The introduced system provides the following three core functions: (i) periodic aggregation, (ii) on-demand aggregation, and (iii) query support for effective analysis. We describe our design and implementation of the system over MapReduce and high-level query languages, and report our experimental results collected through extensive settings on a Hadoop cluster for performance evaluation and design impacts.
Cyber-attacks have been evolved in a way to be more sophisticated by employing combinations of attack methodologies with greater impacts. For instance, Advanced Persistent Threats (APTs) employ a set of stealthy hacking processes running over a long period of time, making it much hard to detect. With this trend, the importance of big-data security analytics has taken greater attention since identifying such latest attacks requires large-scale data processing and analysis. In this paper, we present SEAS-MR (Security Event Aggregation System over MapReduce) that facilitates scalable security event aggregation for comprehensive situation analysis. The introduced system provides the following three core functions: (i) periodic aggregation, (ii) on-demand aggregation, and (iii) query support for effective analysis. We describe our design and implementation of the system over MapReduce and high-level query languages, and report our experimental results collected through extensive settings on a Hadoop cluster for performance evaluation and design impacts.
In this paper, we present the design, architecture, and implementation of a novel analysis engine, called Feature Collection and Correlation Engine (FCCE), that finds correlations across a diverse set of data types spanning over large time windows with very small latency and with minimal access to raw data. FCCE scales well to collecting, extracting, and querying features from geographically distributed large data sets. FCCE has been deployed in a large production network with over 450,000 workstations for 3 years, ingesting more than 2 billion events per day and providing low latency query responses for various analytics. We explore two security analytics use cases to demonstrate how we utilize the deployment of FCCE on large diverse data sets in the cyber security domain: 1) detecting fluxing domain names of potential botnet activity and identifying all the devices in the production network querying these names, and 2) detecting advanced persistent threat infection. Both evaluation results and our experience with real-world applications show that FCCE yields superior performance over existing approaches, and excels in the challenging cyber security domain by correlating multiple features and deriving security intelligence.