Biblio
Severe class imbalance between the majority and minority classes in large datasets can prejudice Machine Learning classifiers toward the majority class. Our work uniquely consolidates two case studies, each utilizing three learners implemented within an Apache Spark framework, six sampling methods, and five sampling distribution ratios to analyze the effect of severe class imbalance on big data analytics. We use three performance metrics to evaluate this study: Area Under the Receiver Operating Characteristic Curve, Area Under the Precision-Recall Curve, and Geometric Mean. In the first case study, models were trained on one dataset (POST) and tested on another (SlowlorisBig). In the second case study, the training and testing dataset roles were switched. Our comparison of performance metrics shows that Area Under the Precision-Recall Curve and Geometric Mean are sensitive to changes in the sampling distribution ratio, whereas Area Under the Receiver Operating Characteristic Curve is relatively unaffected. In addition, we demonstrate that when comparing sampling methods, borderline-SMOTE2 outperforms the other methods in the first case study, and Random Undersampling is the top performer in the second case study.
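As a rough illustration of two ingredients named in this abstract, the sketch below shows Random Undersampling to a chosen majority-to-minority ratio and the Geometric Mean metric. It assumes the common definition of Geometric Mean as the square root of true positive rate times true negative rate. The function names and the toy labels are hypothetical and are not taken from the paper, which ran its experiments in Apache Spark.

```python
import random


def random_undersample(labels, minority_label, ratio, seed=0):
    """Return sorted indices kept after randomly dropping majority instances.

    `ratio` is the desired majority:minority count in the result
    (for example, 1.0 yields a 50:50 class distribution).
    All names here are illustrative assumptions.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Never ask for more majority instances than exist.
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    kept_majority = rng.sample(majority, n_keep)
    return sorted(minority + kept_majority)


def geometric_mean(y_true, y_pred, positive=1):
    """Geometric Mean, assumed here to be sqrt(TPR * TNR)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return (tpr * tnr) ** 0.5


# Toy data: 95 majority (0) and 5 minority (1) instances.
toy_labels = [0] * 95 + [1] * 5
balanced_idx = random_undersample(toy_labels, minority_label=1, ratio=1.0)
```

Because the Geometric Mean multiplies the two per-class rates, a classifier that ignores the minority class scores zero regardless of how well it handles the majority class.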
Companies analyse large amounts of data on clusters of machines using big data analytic tools such as Apache Spark and Apache Flink. These tools are tested mainly for speed and reliability; security, and with it authentication, receives attention only as an afterthought. In such tools, authentication is achieved with the Kerberos protocol, which is essentially layered on top of them. However, Kerberos is vulnerable to attacks, and it does not provide high availability when users are distributed all over the world. To improve authentication, this work first presents an analysis of authentication in Hadoop and the data analytic tools. Second, we propose a concept that deploys Transport Layer Security (TLS) not only to secure data transport but also for authentication within the big data tools. This is done by establishing connections using certificates with a short lifetime. The proof of concept is realized in Apache Spark, where Kerberos is replaced by the proposed method. We deploy new short-lived certificates for authentication that are less vulnerable to abuse. With our approach, the industry's requirements regarding multi-factor authentication and scalability are met.
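The idea of accepting only short-lived certificates can be sketched as a simple validity policy check. This is a minimal illustration under stated assumptions, not the paper's implementation: the one-hour limit, the function name, and the way the validity timestamps are obtained are all hypothetical, and a real deployment would also verify the certificate chain during the TLS handshake.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy: certificates may be valid for at most one hour.
MAX_LIFETIME = timedelta(hours=1)


def is_acceptable(not_before, not_after, now, max_lifetime=MAX_LIFETIME):
    """Check a certificate's validity window against a short-lifetime policy.

    `not_before` and `not_after` are the certificate's validity bounds,
    assumed to have been parsed into timezone-aware datetimes already.
    """
    if not_after <= not_before:
        return False  # malformed validity window
    if not_after - not_before > max_lifetime:
        return False  # issued with a lifetime longer than the policy allows
    return not_before <= now <= not_after  # must be valid right now


issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
expires = issued + timedelta(minutes=30)
```

A short validity window limits how long a stolen certificate remains useful, which is the property the abstract refers to as being less vulnerable to abuse.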
Feature extraction and feature selection are the first tasks in pre-processing input logs in order to detect cybersecurity threats and attacks using data mining techniques from the field of Artificial Intelligence. When analysing heterogeneous data derived from different sources, these tasks are time-consuming and difficult to manage efficiently. In this paper, we present an approach for handling feature extraction and feature selection that uses machine learning algorithms for security analytics of heterogeneous data derived from different network sensors. The approach is implemented in Apache Spark using its Python API, PySpark.
Volumetric DDoS attacks continue to inflict serious damage. Many proposed defenses for mitigating such attacks assume that a monitoring system has already detected the attack. However, many proposed DDoS monitoring systems do not focus on efficiently analyzing high-volume network traffic to provide important characterizations of the attack in real time to downstream traffic filtering systems. We propose a scalable real-time framework for an effective volumetric DDoS monitoring system that leverages modern big data technologies for streaming analytics of high-volume network traffic to accurately detect and characterize attacks.
Technological advancement has created a need for internet connectivity everywhere, and the power industry is no exception to this trend of making everything smarter. The smart grid is the advanced version of the traditional grid, making the system more efficient and self-healing. A synchrophasor is a device used in smart grids to measure the values of electric waves, voltages and current. The phasor measurement unit produces an immense volume of current and voltage data that is used to monitor and control the performance of the grid. These data are huge in size and vulnerable to attacks. Intrusion detection is a common technique for finding intrusions in a system. In this paper, a big data framework is designed using various machine learning techniques, and intrusions are detected based on classifications applied to the synchrophasor dataset. In this approach, machine learning techniques including deep neural networks, support vector machines, random forests, decision trees and naive Bayes classification are applied to the synchrophasor dataset, and the results are compared using the metrics of accuracy, recall, false rate, specificity, and prediction time. Feature selection and dimensionality reduction algorithms are used to reduce the prediction time of the proposed approach. This paper uses Apache Spark as the platform, which is suitable for implementing an intrusion detection system in smart grids using big data analytics.
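The comparison metrics listed in this abstract can be computed from binary confusion-matrix counts, as in the sketch below. It assumes that "false rate" means the false positive rate, which the abstract does not state. The function names, the stand-in classifier, and the toy labels are hypothetical.

```python
import time


def intrusion_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, recall, specificity and false positive rate.

    "False rate" is interpreted as the false positive rate; that reading
    is an assumption, not something the abstract confirms.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


def timed_predict(predict_fn, samples):
    """Run `predict_fn` on every sample and return (predictions, seconds)."""
    start = time.perf_counter()
    predictions = [predict_fn(sample) for sample in samples]
    return predictions, time.perf_counter() - start


# Toy example: 1 marks an intrusion, 0 marks normal traffic.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 1]
scores = intrusion_metrics(truth, predicted)
```

Measuring prediction time alongside the classification metrics makes the trade-off visible when feature selection or dimensionality reduction shrinks the input.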