Biblio

2021-08-17
Jaiswal, Ayshwarya, Dwivedi, Vijay Kumar, Yadav, Om Prakash.  2020.  Big Data and its Analyzing Tools : A Perspective. 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS). :560–565.
Data are generated and stored in databases at a very high speed, and hence they need to be handled and analyzed properly. Nowadays, industries use Hadoop and Spark extensively to analyze datasets. Both frameworks are used to increase processing speed when computing huge, complex datasets. Many researchers are comparing the two. The big questions now are: Is Spark a substitute for Hadoop? Is Hadoop going to be replaced by Spark in the near future? Spark is “built on top of” Hadoop and extends the model to support more types of computation, including stream processing and interactive queries. Spark's execution speed is undoubtedly much faster than Hadoop's, but in terms of fault tolerance, Hadoop is slightly more fault tolerant than Spark. In this article, various big data analytics tools are compared, and Hadoop and Spark are discussed in detail. The article also gives an overview of big data, Spark, and Hadoop issues, and discusses in detail the approaches to resolving the issues of Spark and Hadoop.
2021-06-30
Xu, Hui, Zhang, Wei, Gao, Man, Chen, Hongwei.  2020.  Clustering Analysis for Big Data in Network Security Domain Using a Spark-Based Method. 2020 IEEE 5th International Symposium on Smart and Wireless Systems within the Conferences on Intelligent Data Acquisition and Advanced Computing Systems (IDAACS-SWS). :1–4.
Considering the problem of network security against the background of big data, clustering analysis algorithms can be used to improve the correctness of network intrusion detection models for security management. As an iterative clustering analysis algorithm, the K-means algorithm is both simple and efficient, so it is widely used. However, the traditional K-means algorithm cannot adequately solve the network security problem when facing big data, due to its high complexity and limited processing ability. This paper therefore proposes to optimize the traditional K-means algorithm on the Spark platform and deploy the optimized clustering analysis algorithm in a distributed architecture, so as to improve the efficiency of the clustering algorithm for network intrusion detection in a big data environment. The experimental results show that, compared with the traditional K-means algorithm, the optimized K-means algorithm using a Spark-based method is significantly more efficient in running time.
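As a rough illustration of why a K-means iteration suits a platform such as Spark, the sketch below splits one iteration into a per-partition "map" step and a "reduce" step that merges partial sums. It is plain Python, not Spark code, and it is not the optimized algorithm from the paper, whose details the abstract does not give. The function names (`assign_partition`, `merge_sums`, `kmeans`), the two-partition layout, and the toy points are all assumptions made for the example.

```python
from math import dist

def assign_partition(points, centroids):
    """Map step: per-cluster coordinate sums and counts for one partition."""
    sums = {}
    for p in points:
        # Index of the nearest centroid (Euclidean distance).
        k = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        s, n = sums.get(k, ((0.0,) * len(p), 0))
        sums[k] = (tuple(a + b for a, b in zip(s, p)), n + 1)
    return sums

def merge_sums(a, b):
    """Reduce step: combine the partial sums of two partitions."""
    out = dict(a)
    for k, (s, n) in b.items():
        if k in out:
            s0, n0 = out[k]
            out[k] = (tuple(x + y for x, y in zip(s0, s)), n0 + n)
        else:
            out[k] = (s, n)
    return out

def kmeans(partitions, centroids, iterations=10):
    """Run a fixed number of iterations over pre-partitioned data."""
    for _ in range(iterations):
        total = {}
        for part in partitions:  # each partition could run on its own worker
            total = merge_sums(total, assign_partition(part, centroids))
        # New centroid = mean of assigned points; keep the old one if empty.
        centroids = [
            tuple(x / total[k][1] for x in total[k][0]) if k in total else c
            for k, c in enumerate(centroids)
        ]
    return centroids

# Hypothetical toy data, already split into two partitions.
partitions = [
    [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)],
    [(1.0, 0.0), (10.0, 11.0), (11.0, 10.0)],
]
centers = kmeans(partitions, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Only the small per-cluster sums and counts cross partition boundaries, never the points themselves, which is what makes the iteration cheap to distribute.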
2021-04-27
Tian, Z..  2020.  Design and Implementation of Distributed Government Audit System Based on Multidimensional Online Analysis. 2020 IEEE International Conference on Power, Intelligent Computing and Systems (ICPICS). :981–983.
With the continuous progress of the information age, e-commerce, the Internet of Things, and other Internet areas are emerging. Auditing massive amounts of structured data has become a major issue. Log files and other data can be uploaded to the cloud via the Internet to guard against potential threats. The difficulty now is how to make data audit queries online, interactive, and ad hoc. There are two main data warehouse methods: the zhang table reduction method and the basic data verification method. In the age of big data, the quantity of data keeps increasing, so audit speed, data storage design, and related aspects become problematic to some degree. If the audit task is not completed in time, the audit data cannot be stored, which causes losses to enterprises and the government. This paper focuses on the data cube physical model and distributed technical analysis, establishing an efficient distributed online auditing system to make data auditing fast and efficient.
2020-08-28
Hasanin, Tawfiq, Khoshgoftaar, Taghi M., Leevy, Joffrey L..  2019.  A Comparison of Performance Metrics with Severely Imbalanced Network Security Big Data. 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI). :83–88.

Severe class imbalance between the majority and minority classes in large datasets can prejudice Machine Learning classifiers toward the majority class. Our work uniquely consolidates two case studies, each utilizing three learners implemented within an Apache Spark framework, six sampling methods, and five sampling distribution ratios to analyze the effect of severe class imbalance on big data analytics. We use three performance metrics to evaluate this study: Area Under the Receiver Operating Characteristic Curve, Area Under the Precision-Recall Curve, and Geometric Mean. In the first case study, models were trained on one dataset (POST) and tested on another (SlowlorisBig). In the second case study, the training and testing dataset roles were switched. Our comparison of performance metrics shows that Area Under the Precision-Recall Curve and Geometric Mean are sensitive to changes in the sampling distribution ratio, whereas Area Under the Receiver Operating Characteristic Curve is relatively unaffected. In addition, we demonstrate that when comparing sampling methods, borderline-SMOTE2 outperforms the other methods in the first case study, and Random Undersampling is the top performer in the second case study.
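To make the Geometric Mean metric concrete, the sketch below uses its usual definition in imbalanced learning, the square root of the true positive rate times the true negative rate. It is a minimal plain-Python illustration, not the study's Spark pipeline; the function name `geometric_mean` and the 2-in-100 toy labels are assumptions for the example.

```python
from math import sqrt

def geometric_mean(y_true, y_pred, positive=1):
    """G-mean = sqrt(TPR * TNR) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # recall on the minority class
    tnr = tn / (tn + fp) if tn + fp else 0.0  # recall on the majority class
    return sqrt(tpr * tnr)

# Hypothetical severely imbalanced labels: 2 positives, 98 negatives.
y_true = [1] * 2 + [0] * 98

# A classifier that always predicts the majority class.
always_negative = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
gmean_trivial = geometric_mean(y_true, always_negative)

# A classifier that finds one of the two positives and raises no false alarms.
one_hit = [1, 0] + [0] * 98
gmean_one_hit = geometric_mean(y_true, one_hit)
```

The always-negative classifier reaches 98% accuracy but a Geometric Mean of zero, which is why metrics of this kind are preferred over accuracy when the minority class is the one of interest.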

2020-07-27
Tun, May Thet, Nyaung, Dim En, Phyu, Myat Pwint.  2019.  Performance Evaluation of Intrusion Detection Streaming Transactions Using Apache Kafka and Spark Streaming. 2019 International Conference on Advanced Information Technologies (ICAIT). :25–30.
In the information era, network traffic is large and complex because of massive Internet-based services and rapidly growing amounts of data. As network traffic has grown, cyberattacks have increased dramatically. Therefore, intrusion detection has been a challenging research area in recent years. An intrusion detection system must provide high-level protection and detect modern, complex attacks more accurately. Nowadays, big data analytics is key to solving marketing, security, and privacy problems in extremely competitive financial markets and in government. If a huge amount of stream data flows within a short period of time, it is difficult to analyze it for real-time decision making. Performance analysis is extremely important for administrators and developers to avoid bottlenecks. The paper aims to reduce processing time by using Apache Kafka and Spark Streaming. Experiments on the UNSW-NB15 dataset indicate that the integration of Apache Kafka and Spark Streaming can perform better in terms of processing time and fault tolerance on huge amounts of data. According to the results, fault tolerance can be provided by the multiple brokers of Kafka and the parallel recovery of Spark Streaming. In addition, the multiple partitions of Apache Kafka increase the processing time in the integration of Apache Kafka and Spark Streaming.
2020-02-18
Fattahi, Saeideh, Yazdani, Reza, Vahidipour, Seyyed Mehdi.  2019.  Discovery of Society Structure in A Social Network Using Distributed Cache Memory. 2019 5th International Conference on Web Research (ICWR). :264–269.

Community structure detection in social networks has become a big challenge. Various methods in the literature have been presented to solve this challenge. Recently, several methods have also been proposed to solve this challenge based on a mapping-reduction model, in which data and algorithms are divided between different process nodes so that the complexity of time and memory of community detection in large social networks is reduced. In this paper, a mapping-reduction model is first proposed to detect the structure of communities. Then the proposed framework is rewritten according to a new mechanism called distributed cache memory; distributed cache memory can store different values associated with different keys and, if necessary, put them at different computational nodes. Finally, the proposed rewritten framework has been implemented using SPARK tools and its implementation results have been reported on several major social networks. The performed experiments show the effectiveness of the proposed framework by varying the values of various parameters.

2019-09-23
Suriarachchi, I., Withana, S., Plale, B..  2018.  Big Provenance Stream Processing for Data Intensive Computations. 2018 IEEE 14th International Conference on e-Science (e-Science). :245–255.
In the business and research landscape of today, data analysis consumes public and proprietary data from numerous sources, and utilizes any one or more of popular data-parallel frameworks such as Hadoop, Spark and Flink. In the Data Lake setting these frameworks co-exist. Our earlier work has shown that data provenance in Data Lakes can aid with both traceability and management. The sheer volume of fine-grained provenance generated in a multi-framework application motivates the need for on-the-fly provenance processing. We introduce a new parallel stream processing algorithm that reduces fine-grained provenance while preserving backward and forward provenance. The algorithm is resilient to provenance events arriving out-of-order. It is evaluated using several strategies for partitioning a provenance stream. The evaluation shows that the parallel algorithm performs well in processing out-of-order provenance streams, with good scalability and accuracy.
2018-03-26
Pallaprolu, S. C., Sankineni, R., Thevar, M., Karabatis, G., Wang, J..  2017.  Zero-Day Attack Identification in Streaming Data Using Semantics and Spark. 2017 IEEE International Congress on Big Data (BigData Congress). :121–128.

Intrusion Detection Systems (IDS) have been in existence for many years now, but they fall short in efficiently detecting zero-day attacks. This paper presents an organic combination of Semantic Link Networks (SLN) and dynamic semantic graph generation for the on the fly discovery of zero-day attacks using the Spark Streaming platform for parallel detection. In addition, a minimum redundancy maximum relevance (MRMR) feature selection algorithm is deployed to determine the most discriminating features of the dataset. Compared to previous studies on zero-day attack identification, the described method yields better results due to the semantic learning and reasoning on top of the training data and due to the use of collaborative classification methods. We also verified the scalability of our method in a distributed environment.

2018-02-27
Lighari, S. N., Hussain, D. M. A..  2017.  Hybrid Model of Rule Based and Clustering Analysis for Big Data Security. 2017 First International Conference on Latest Trends in Electrical Engineering and Computing Technologies (INTELLECT). :1–5.

Most organizations tend to accumulate security-related data, which grows to terabytes every month. They collect this data to meet security requirements. The data is mostly in the form of logs, such as DNS logs, PCAP files, and firewall data. The data can relate to any communication network, such as a cloud, telecom, or smart grid network. Generally, these logs are stored in databases or warehouses, which ultimately become gigantic in size. Such a huge volume of data increases the importance of security analytics in big data. In surveys, security experts complain about existing tools and recommend special tools and methods for big data security analysis. In this paper, we use a big data analysis tool known as Apache Spark. Although this tool is general purpose, we have used it for security analysis. It offers a very good library of machine learning algorithms, including clustering, which is the main algorithm used in our work. In this work, we have developed a novel model that combines rule-based and clustering analysis for security analysis of big datasets. The dataset used in our experiment is KDD Cup 99, which is widely used for intrusion detection. It is only megabytes in size but can serve as a test case for big data security analysis.

2018-02-06
Heifetz, A., Mugunthan, V., Kagal, L..  2017.  Shade: A Differentially-Private Wrapper for Enterprise Big Data. 2017 IEEE International Conference on Big Data (Big Data). :1033–1042.

Enterprises usually provide strong controls to prevent cyberattacks and inadvertent leakage of data to external entities. However, in the case where employees and data scientists have legitimate access to analyze and derive insights from the data, there are insufficient controls and employees are usually permitted access to all information about the customers of the enterprise including sensitive and private information. Though it is important to be able to identify useful patterns of one's customers for better customization and service, customers' privacy must not be sacrificed to do so. We propose an alternative — a framework that will allow privacy preserving data analytics over big data. In this paper, we present an efficient and scalable framework for Apache Spark, a cluster computing framework, that provides strong privacy guarantees for users even in the presence of an informed adversary, while still providing high utility for analysts. The framework, titled Shade, includes two mechanisms — SparkLAP, which provides Laplacian perturbation based on a user's query and SparkSAM, which uses the contents of the database itself in order to calculate the perturbation. We show that the performance of Shade is substantially better than earlier differential privacy systems without loss of accuracy, particularly when run on datasets small enough to fit in memory, and find that SparkSAM can even exceed performance of an identical nonprivate Spark query.
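The idea of Laplacian perturbation can be illustrated with the textbook Laplace mechanism for a counting query: noise with scale sensitivity/epsilon is added to the true answer. The sketch below is plain Python and is not Shade's SparkLAP or SparkSAM implementation, which the abstract does not detail. The names `laplace_noise` and `private_count`, the `ages` data, and the chosen epsilon are assumptions for the example.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    # expovariate takes the rate (1 / mean), so both draws have mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(rows, predicate, epsilon, rng):
    """Noisy count of rows matching `predicate` under epsilon-DP."""
    true_count = sum(1 for r in rows if predicate(r))
    sensitivity = 1.0  # adding or removing one row changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical data and query: how many customers are 40 or older?
rng = random.Random(0)
ages = [23, 35, 41, 52, 67, 29, 38]
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
```

A smaller epsilon gives stronger privacy but a larger noise scale, so an analyst sees a less accurate count; this is the privacy/utility trade-off the abstract refers to.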

2018-01-16
Rouf, Y., Shtern, M., Fokaefs, M., Litoiu, M..  2017.  A Hierarchical Architecture for Distributed Security Control of Large Scale Systems. 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C). :118–120.

In the era of Big Data, software systems can be affected by their growing complexity, with respect to both functional and non-functional requirements. As more and more people use software applications over the web, recognizing whether some of this traffic is malicious or legitimate is a challenge. The traffic load of security controllers, as well as the complexity of the security rules used to detect attacks, can grow to levels where current solutions may not suffice. In this work, we propose a hierarchical distributed architecture for security control in order to partition responsibility and workload among many security controllers. In addition, our architecture proposes a simpler way of defining security rules, allowing security to be enforced at an operational level rather than a development level.