Biblio

Filters: Keyword is Spark
2021-08-17
Jaiswal, Ayshwarya, Dwivedi, Vijay Kumar, Yadav, Om Prakash.  2020.  Big Data and its Analyzing Tools : A Perspective. 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS). :560–565.
Data are generated and stored in databases at a very high speed, and hence they need to be handled and analyzed properly. Nowadays industries are extensively using Hadoop and Spark to analyze datasets. Both frameworks are used for increasing processing speeds in computing huge, complex datasets. Many researchers are comparing the two. Now, the big questions arising are: Is Spark a substitute for Hadoop? Is Hadoop going to be replaced by Spark in the near future? Spark is “built on top of” Hadoop and it extends the model to deploy more types of computations, which incorporate Stream Processing and Interactive Queries. No doubt, Spark's execution speed is much faster than Hadoop's, but in terms of fault tolerance, Hadoop is slightly more fault tolerant than Spark. In this article, a comparison of various big data analytics tools is done, and Hadoop and Spark are discussed in detail. This article further gives an overview of big data, Spark and Hadoop issues. In this survey paper, the approaches to resolve the issues of Spark and Hadoop are discussed elaborately.
2021-06-30
Xu, Hui, Zhang, Wei, Gao, Man, Chen, Hongwei.  2020.  Clustering Analysis for Big Data in Network Security Domain Using a Spark-Based Method. 2020 IEEE 5th International Symposium on Smart and Wireless Systems within the Conferences on Intelligent Data Acquisition and Advanced Computing Systems (IDAACS-SWS). :1–4.
Considering the problem of network security against the background of big data, clustering analysis algorithms can be utilized to improve the correctness of network intrusion detection models for security management. As a kind of iterative clustering analysis algorithm, the K-means algorithm is not only simple but also efficient, so it is widely used. However, the traditional K-means algorithm cannot adequately solve the network security problem when facing big data, due to its high complexity and limited processing ability. In this case, this paper proposes to optimize the traditional K-means algorithm based on the Spark platform and deploy the optimized clustering analysis algorithm in a distributed architecture, so as to improve the efficiency of the clustering algorithm for network intrusion detection in a big data environment. The experimental result shows that, compared with the traditional K-means algorithm, the optimized K-means algorithm using a Spark-based method significantly improves running time.
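As an illustration only, a K-means iteration can be written as a map step (assign each point to its nearest centroid) followed by a reduce step (average the points per cluster), which is the shape a distributed engine such as Spark can parallelize. The sketch below is plain single-process Python, not the optimized algorithm from the paper; the function names `assign`, `kmeans_step`, and `kmeans` are hypothetical.

```python
from collections import defaultdict
from math import dist


def assign(point, centroids):
    """Map step: return the index of the centroid nearest to the point."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def kmeans_step(points, centroids):
    """One iteration: map points to clusters, then reduce to new means."""
    sums = defaultdict(lambda: [0.0] * len(centroids[0]))
    counts = defaultdict(int)
    for p in points:
        k = assign(p, centroids)
        counts[k] += 1
        sums[k] = [s + x for s, x in zip(sums[k], p)]
    # A cluster that received no points keeps its previous centroid.
    return [
        tuple(s / counts[k] for s in sums[k]) if counts[k] else centroids[k]
        for k in range(len(centroids))
    ]


def kmeans(points, centroids, iterations=10):
    """Repeat the step until the centroids stop changing or the budget ends."""
    for _ in range(iterations):
        new = kmeans_step(points, centroids)
        if new == centroids:
            break
        centroids = new
    return centroids
```

In a distributed setting the per-point assignment would run on partitions of the data and only the per-cluster sums and counts would be combined, which is why the algorithm maps naturally onto a cluster engine.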
2020-02-18
Fattahi, Saeideh, Yazdani, Reza, Vahidipour, Seyyed Mehdi.  2019.  Discovery of Society Structure in A Social Network Using Distributed Cache Memory. 2019 5th International Conference on Web Research (ICWR). :264–269.

Community structure detection in social networks has become a big challenge. Various methods have been presented in the literature to solve this challenge. Recently, several methods have also been proposed based on a mapping-reduction (MapReduce) model, in which data and algorithms are divided between different processing nodes so that the time and memory complexity of community detection in large social networks is reduced. In this paper, a mapping-reduction model is first proposed to detect the structure of communities. Then the proposed framework is rewritten according to a new mechanism called distributed cache memory; distributed cache memory can store different values associated with different keys and, if necessary, place them at different computational nodes. Finally, the rewritten framework has been implemented using Spark tools and its results have been reported on several major social networks. The experiments performed show the effectiveness of the proposed framework under varying values of various parameters.
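To make the idea of a map/reduce pass that reads from a shared key-value cache concrete, here is a minimal sketch. It uses synchronous label propagation, which is a swapped-in technique chosen for brevity and not the framework described in the paper; the dictionary `label_cache` stands in for the distributed cache, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def map_phase(edges, label_cache):
    """Emit (node, neighbor_label) pairs; labels are read from the cache."""
    for u, v in edges:
        yield u, label_cache[v]
        yield v, label_cache[u]


def reduce_phase(pairs, label_cache):
    """Give each node its most frequent neighbor label (ties: smallest)."""
    seen = defaultdict(Counter)
    for node, label in pairs:
        seen[node][label] += 1
    new_labels = dict(label_cache)
    for node, counts in seen.items():
        best = max(counts.values())
        new_labels[node] = min(l for l, c in counts.items() if c == best)
    return new_labels


def detect_communities(edges, rounds=5):
    """Start with one label per node and iterate until labels are stable."""
    nodes = {n for e in edges for n in e}
    labels = {n: n for n in nodes}
    for _ in range(rounds):
        updated = reduce_phase(map_phase(edges, labels), labels)
        if updated == labels:
            break
        labels = updated
    return labels
```

Synchronous label propagation may oscillate on some graphs (bipartite ones, for example), so the `rounds` limit acts as a safeguard rather than a convergence guarantee.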

2019-11-26
Wang, Pengfei, Wang, Fengyu, Lin, Fengbo, Cao, Zhenzhong.  2018.  Identifying Peer-to-Peer Botnets Through Periodicity Behavior Analysis. 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications / 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE). :283–288.

Peer-to-Peer botnets have become one of the significant threats to network security due to their distributed properties. Their decentralized nature makes detection challenging. It is important to take measures to detect bots as soon as possible to minimize their harm. In this paper, we propose PeerGrep, a novel system capable of identifying P2P bots. PeerGrep starts by identifying hosts that are likely engaged in P2P communications, and then distinguishes P2P bots from P2P hosts by analyzing their active ratio, packet size and the periodicity of connections to destination IP addresses. The evaluation shows that PeerGrep can identify all P2P bots with a quite low false positive rate (FPR), even if a malicious P2P application and a benign P2P application coexist within the same host or there is only one bot in the monitored network.
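One simple way to quantify the periodicity of connections to a destination is the coefficient of variation of the gaps between connection times: values near zero indicate regular intervals. The sketch below is an assumption made for illustration, not the measure PeerGrep uses; the function names and the `threshold` default are hypothetical.

```python
from statistics import mean, pstdev


def interarrival_cv(timestamps):
    """Coefficient of variation of gaps between consecutive connections."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return None  # too little data to judge periodicity
    return pstdev(gaps) / mean(gaps)


def looks_periodic(timestamps, threshold=0.1):
    """Flag a connection series whose gaps vary little around their mean."""
    cv = interarrival_cv(timestamps)
    return cv is not None and cv < threshold
```

For example, `looks_periodic([0, 60, 120, 180, 240])` flags a host contacting the same destination every 60 seconds, while irregular timestamps are not flagged.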

2019-02-08
Alzahrani, S., Hong, L.  2018.  Detection of Distributed Denial of Service (DDoS) Attacks Using Artificial Intelligence on Cloud. 2018 IEEE World Congress on Services (SERVICES). :35–36.

This research proposes a system for detecting known and unknown Distributed Denial of Service (DDoS) attacks. The proposed system applies two different intrusion detection approaches: anomaly-based distributed artificial neural networks (ANNs) and a signature-based approach. The Amazon public cloud was used for running Spark as the fast cluster engine with a varying number of machine cores. The experimental results achieved the highest detection accuracy and detection rate compared to the signature-based or neural-network-based approach.

2017-04-24
Tall, Anne, Wang, Jun, Han, Dezhi.  2016.  Survey of Data Intensive Computing Technologies Application to Security Log Data Management. Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies. :268–273.

Data intensive computing research and technology developments offer the potential of providing significant improvements in several security log management challenges. Approaches to address the complexity, timeliness, expense, diversity, and noise issues have been identified. These improvements are motivated by the increasingly important role of analytics. Machine learning and expert systems that incorporate attack patterns are providing greater detection insights. Finding actionable indicators requires the analysis to combine security event log data with other network data such as access control lists, making the big-data problem even bigger. Automation of threat intelligence is recognized as not complete, with limited adoption of standards. With limited progress in anomaly signature detection, movement towards using expert systems has been identified as the path forward. Techniques focus on matching behaviors of attackers to patterns of abnormal activity in the network. The need to stream, parse, and analyze large volumes of small, semi-structured data files can be feasibly addressed through a variety of techniques identified by researchers. This report highlights research in key areas, including protection of the data, performance of the systems and network bandwidth utilization.

2017-03-07
Agnihotri, Lalitha, Mojarad, Shirin, Lewkow, Nicholas, Essa, Alfred.  2016.  Educational Data Mining with Python and Apache Spark: A Hands-on Tutorial. Proceedings of the Sixth International Conference on Learning Analytics & Knowledge. :507–508.

An enormous amount of educational data has been accumulated through Massive Open Online Courses (MOOCs), as well as commercial and non-commercial learning platforms. This is in addition to the educational data released by the US government since 2012 to facilitate disruption in education by making data freely available. The high volume, variety and velocity of collected data necessitate the use of big data tools and storage systems, such as distributed databases for storage and Apache Spark for analysis. This tutorial will introduce researchers and faculty to real-world applications involving data mining and predictive analytics in learning sciences. In addition, the tutorial will introduce the statistics required to validate and accurately report results. Topics will cover how big data is being used to transform education. Specifically, we will demonstrate how exploratory data analysis, data mining, predictive analytics, machine learning, and visualization techniques are being applied to educational big data to improve learning and scale insights derived from millions of students' records. The tutorial will be held over a half day and will be hands-on with pre-posted material. Due to the interdisciplinary nature of the work, the tutorial appeals to researchers from a wide range of backgrounds including big data, predictive analytics, learning sciences, educational data mining, and in general, those interested in how big data analytics can transform learning. As a prerequisite, attendees are required to have familiarity with at least one programming language.

2015-05-05
Marchal, S., Xiuyan Jiang, State, R., Engel, T.  2014.  A Big Data Architecture for Large Scale Security Monitoring. Big Data (BigData Congress), 2014 IEEE International Congress on. :56–63.

Network traffic is a rich source of information for security monitoring. However, the increasing volume of data to process raises issues, rendering holistic analysis of network traffic difficult. In this paper we propose a solution to cope with the tremendous amount of data to analyse for security monitoring purposes. We introduce an architecture dedicated to security monitoring of local enterprise networks. The application domain of such a system is mainly network intrusion detection and prevention, but it can be used as well for forensic analysis. This architecture integrates two systems, one dedicated to scalable distributed data storage and management and the other dedicated to data exploitation. DNS data, NetFlow records, HTTP traffic and honeypot data are mined and correlated in a distributed system that leverages state-of-the-art big data solutions. Data correlation schemes are proposed and their performance is evaluated against several well-known big data frameworks including Hadoop and Spark.