Biblio
Severe class imbalance between the majority and minority classes in large datasets can bias Machine Learning classifiers toward the majority class. Our work uniquely consolidates two case studies, each utilizing three learners implemented within an Apache Spark framework, six sampling methods, and five sampling distribution ratios, to analyze the effect of severe class imbalance on big data analytics. We evaluate this study with three performance metrics: Area Under the Receiver Operating Characteristic Curve, Area Under the Precision-Recall Curve, and Geometric Mean. In the first case study, models were trained on one dataset (POST) and tested on another (SlowlorisBig). In the second case study, the training and testing dataset roles were switched. Our comparison of performance metrics shows that Area Under the Precision-Recall Curve and Geometric Mean are sensitive to changes in the sampling distribution ratio, whereas Area Under the Receiver Operating Characteristic Curve is relatively unaffected. In addition, we demonstrate that, when comparing sampling methods, borderline-SMOTE2 outperforms the others in the first case study, while Random Undersampling is the top performer in the second.
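A minimal sketch, in Spark's Scala API, of how one of the sampling methods above, Random Undersampling, can reshape a dataset toward a chosen sampling distribution ratio. The "label" column, the 0/1 class encoding, and the seed are illustrative assumptions, not details taken from the paper:

    import org.apache.spark.sql.DataFrame

    object UndersamplingSketch {
      // Keep every minority row and randomly sample the majority class down so
      // that the majority ends up at the requested fraction of the result
      // (e.g., 0.65 for a 65:35 sampling distribution ratio).
      def undersample(df: DataFrame, majorityFraction: Double): DataFrame = {
        val minority = df.filter("label = 1") // assumed minority class encoding
        val majority = df.filter("label = 0")
        val wanted = minority.count() * majorityFraction / (1.0 - majorityFraction)
        val fraction = math.min(1.0, wanted / majority.count())
        majority.sample(false, fraction, 42L).union(minority)
      }
    }

For example, undersample(train, 0.65) keeps all minority rows and retains only enough majority rows for an approximate 65:35 split.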
Community structure detection in social networks has become a significant challenge, and various methods have been presented in the literature to address it. Recently, several methods based on the MapReduce model have also been proposed, in which data and computation are divided among different processing nodes so that the time and memory complexity of community detection in large social networks is reduced. In this paper, a MapReduce model is first proposed to detect the structure of communities. The proposed framework is then rewritten according to a new mechanism called distributed cache memory, which can store different values associated with different keys and, if necessary, place them at different computational nodes. Finally, the rewritten framework has been implemented using Apache Spark, and its results are reported on several major social networks. The experiments, performed with varying parameter values, show the effectiveness of the proposed framework.
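The "distributed cache memory" described above is close in spirit to Spark's broadcast variables, which replicate a read-only key-value structure to every computational node. Below is a hedged sketch of that mechanism inside an iterative, map-reduce style label update; the toy edge list and the minimum-label rule (which finds connected components rather than a full community structure) are simplifications for illustration:

    import org.apache.spark.sql.SparkSession

    object CommunitySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("community-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Undirected edge list; each node starts in its own community.
        val edges = sc.parallelize(Seq((1L, 2L), (2L, 3L), (3L, 1L), (4L, 5L)))
        var labels = edges.flatMap { case (u, v) => Seq(u, v) }
          .distinct().map(n => (n, n))

        for (_ <- 1 to 5) {
          // "Distributed cache": ship the node -> community map to all executors.
          val cached = sc.broadcast(labels.collectAsMap())
          labels = edges
            .flatMap { case (u, v) =>
              Seq((u, cached.value(u)), (u, cached.value(v)),
                  (v, cached.value(v)), (v, cached.value(u)))
            }
            .reduceByKey((a, b) => math.min(a, b)) // keep the smallest label seen
        }
        labels.collect().sorted.foreach(println)
      }
    }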
Intrusion Detection Systems (IDS) have existed for many years, but they fall short in efficiently detecting zero-day attacks. This paper presents an organic combination of Semantic Link Networks (SLN) and dynamic semantic graph generation for the on-the-fly discovery of zero-day attacks, using the Spark Streaming platform for parallel detection. In addition, a minimum redundancy maximum relevance (MRMR) feature selection algorithm is deployed to determine the most discriminating features of the dataset. Compared to previous studies on zero-day attack identification, the described method yields better results, owing to the semantic learning and reasoning on top of the training data and to the use of collaborative classification methods. We also verified the scalability of our method in a distributed environment.
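Neither the SLN reasoning nor the semantic graph generation is public in code form, but the parallel-detection layer can be sketched, here with Spark's Structured Streaming API. A hypothetical scoreFlow function stands in for the semantic scoring of a flow record, and the socket source and alert threshold are assumptions for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    object StreamingDetectionSketch {
      // Hypothetical stand-in for SLN-based scoring of one traffic record.
      def scoreFlow(record: String): Double =
        if (record.contains("ANOMALY")) 1.0 else 0.0

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("zero-day-sketch").getOrCreate()
        import spark.implicits._

        val score = udf(scoreFlow _)
        val alerts = spark.readStream
          .format("socket")
          .option("host", "localhost").option("port", "9999")
          .load()                               // one flow record per line
          .withColumn("score", score($"value")) // scored in parallel on executors
          .filter($"score" > 0.5)               // flag suspected zero-day traffic

        alerts.writeStream.format("console").start().awaitTermination()
      }
    }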
Most organizations tend to accumulate security-related data, amounting to terabytes every month, in order to meet their security requirements. This data mostly takes the shape of logs, such as DNS logs, PCAP files, and firewall data, and can relate to any communication network: cloud, telecom, or smart grid. Generally, these logs are stored in databases or warehouses, which ultimately become gigantic in size. Such huge volumes of data raise the importance of security analytics in big data. In surveys, security experts complain about existing tools and recommend specialized tools and methods for big data security analysis. In this paper, we use Apache Spark, a general-purpose big data analysis tool, for security analysis. It offers a good machine learning library, including clustering, the main algorithm used in our work. We have developed a novel model that combines rule-based and clustering analysis for the security analysis of big datasets. Our experiments use KDD Cup 99, a widely used intrusion detection dataset; although only megabytes in size, it serves as a test case for big data security analysis.
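A hedged sketch of the hybrid idea, a rule stage followed by MLlib KMeans clustering, on a KDD Cup 99-style file. The selected column indices, the rule threshold, and k = 5 are illustrative choices, not the paper's configuration:

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object HybridAnalysisSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("kdd-sketch").getOrCreate()

        // KDD Cup 99 is comma-separated; keep a small numeric subset of fields.
        val df = spark.read.option("inferSchema", "true").csv("kddcup99.csv")
          .select(col("_c0").as("duration"),
                  col("_c4").as("src_bytes"),
                  col("_c5").as("dst_bytes"))

        // Rule stage: flag obviously suspicious records up front.
        val ruled = df.withColumn("rule_flag",
          (col("src_bytes") > 1000000).cast("int"))

        // Clustering stage: group traffic so outlying clusters can be inspected.
        val assembled = new VectorAssembler()
          .setInputCols(Array("duration", "src_bytes", "dst_bytes"))
          .setOutputCol("features")
          .transform(ruled)
        val model = new KMeans().setK(5).setSeed(1L).fit(assembled)
        model.transform(assembled).groupBy("prediction", "rule_flag").count().show()
      }
    }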
Enterprises usually provide strong controls to prevent cyberattacks and inadvertent leakage of data to external entities. However, where employees and data scientists have legitimate access to analyze and derive insights from the data, controls are insufficient: employees are usually permitted access to all information about the enterprise's customers, including sensitive and private information. Though it is important to be able to identify useful patterns in one's customers for better customization and service, customers' privacy must not be sacrificed in the process. We propose an alternative: a framework that allows privacy-preserving data analytics over big data. In this paper, we present an efficient and scalable framework for Apache Spark, a cluster computing framework, that provides strong privacy guarantees for users even in the presence of an informed adversary, while still providing high utility for analysts. The framework, titled Shade, includes two mechanisms: SparkLAP, which provides Laplacian perturbation based on a user's query, and SparkSAM, which uses the contents of the database itself to calculate the perturbation. We show that the performance of Shade is substantially better than that of earlier differential privacy systems, without loss of accuracy, particularly when run on datasets small enough to fit in memory, and find that SparkSAM can even exceed the performance of an identical non-private Spark query.
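Shade's code is not reproduced here, but the SparkLAP idea, perturbing a query's answer with Laplace noise scaled to sensitivity/epsilon, fits in a few lines. The dataset path, the predicate, and the epsilon value are assumptions for illustration:

    import org.apache.spark.sql.SparkSession
    import scala.util.Random

    object LaplaceCountSketch {
      // Draw from Laplace(0, b) by inverse transform sampling.
      def laplace(b: Double): Double = {
        val u = Random.nextDouble() - 0.5
        -b * math.signum(u) * math.log(1.0 - 2.0 * math.abs(u))
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("laplace-sketch").getOrCreate()
        val df = spark.read.parquet("customers.parquet") // hypothetical dataset

        val epsilon = 0.5     // privacy budget spent on this one query
        val sensitivity = 1.0 // a count changes by at most 1 per individual
        val trueCount = df.filter("age > 40").count()
        val noisyCount = trueCount + laplace(sensitivity / epsilon)
        println(s"Differentially private count: ${math.round(noisyCount)}")
      }
    }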
In the era of Big Data, software systems are affected by growing complexity with respect to both functional and non-functional requirements. As more and more people use software applications over the web, recognizing whether traffic is malicious or legitimate becomes a challenge. The traffic load on security controllers, as well as the complexity of the security rules used to detect attacks, can grow to levels where current solutions no longer suffice. In this work, we propose a hierarchical distributed architecture for security control that partitions responsibility and workload among many security controllers. In addition, our architecture offers a simpler way of defining security rules, allowing security to be enforced at an operational level rather than a development level.
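One way to picture the operational-level rule definition argued for above: rules as declarative data that an operator edits, evaluated identically by every controller in the hierarchy. The Rule shape, field names, and threshold below are illustrative assumptions, not the paper's rule language:

    case class Rule(name: String, metric: String, threshold: Double, action: String)

    object HierarchySketch {
      // Each child controller evaluates its local rules; a parent controller
      // can then aggregate the resulting escalations from many children.
      def evaluate(rules: Seq[Rule], metrics: Map[String, Double]): Seq[String] =
        for {
          r <- rules
          value <- metrics.get(r.metric)
          if value > r.threshold
        } yield s"${r.name}: ${r.action}"

      def main(args: Array[String]): Unit = {
        val rules = Seq(Rule("http-flood", "req_per_sec", 5000.0, "rate-limit"))
        println(evaluate(rules, Map("req_per_sec" -> 7200.0)))
        // prints: List(http-flood: rate-limit)
      }
    }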