Biblio

Filters: Keyword is Hadoop
2022-11-18
Tall, Anne M., Zou, Cliff C., Wang, Jun.  2021.  Integrating Cybersecurity Into a Big Data Ecosystem. MILCOM 2021 - 2021 IEEE Military Communications Conference (MILCOM). :69–76.
This paper provides an overview of the security service controls that are applied in a big data processing (BDP) system to defend against cybersecurity attacks. We validate this approach by modeling attacks and the effectiveness of security service controls in a sequence of states and transitions. This Finite State Machine (FSM) approach uses the probable effectiveness of security service controls, as defined in the National Institute of Standards and Technology (NIST) Risk Management Framework (RMF). The attacks used in the model are defined in the ATT&CK™ framework. Five different BDP security architecture configurations are considered, spanning from a low-cost default BDP configuration to a more expensive, industry-supported layered security architecture. The analysis demonstrates the importance of a multi-layer approach to implementing security in BDP systems. With increasing interest in using BDP systems to analyze sensitive data sets, it is important to understand and justify BDP security architecture configurations and their significant costs. The output of the model demonstrates that, over the run time, a larger investment in security service controls results in significantly more uptime, with uptime increasing in step with a linear increase in investment. We believe that these results support our recommended BDP security architecture: a layered architecture with security service controls integrated into the user interface, the boundary, central management of security policies, and applications that incorporate privacy-preserving programs. These results support making BDP systems operational for sensitive data accessed in a multi-tenant environment.
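A minimal sketch of how such a state-transition model can be exercised, assuming illustrative stage names and control-effectiveness probabilities (the paper's actual states, ATT&CK techniques, and NIST RMF values are not reproduced here):

```python
import random

# Hypothetical per-stage control effectiveness (probability an attack step is
# blocked) for two of the architecture configurations; illustrative values only.
CONFIGS = {
    "default_low_cost": {"initial_access": 0.30, "privilege_escalation": 0.20, "exfiltration": 0.25},
    "layered":          {"initial_access": 0.85, "privilege_escalation": 0.80, "exfiltration": 0.90},
}

ATTACK_PATH = ["initial_access", "privilege_escalation", "exfiltration"]  # ATT&CK-style stages

def simulate(effectiveness, steps=100_000):
    """Walk the FSM: each time step the attacker attempts the next stage.
    A blocked attempt resets the attack; a completed path is counted as a
    downtime step (compromise and recovery)."""
    stage, uptime = 0, 0
    for _ in range(steps):
        if random.random() < effectiveness[ATTACK_PATH[stage]]:
            stage = 0              # control blocked the attempt; system stays up
            uptime += 1
        else:
            stage += 1             # attack advances one state
            if stage == len(ATTACK_PATH):
                stage = 0          # full compromise: this step counts as downtime
            else:
                uptime += 1
    return uptime / steps

for name, eff in CONFIGS.items():
    print(f"{name}: uptime fraction ~ {simulate(eff):.4f}")
```

Even this toy version reproduces the qualitative finding: raising control effectiveness across layers sharply increases the fraction of time the system stays up.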
2021-08-17
Jaiswal, Ayshwarya, Dwivedi, Vijay Kumar, Yadav, Om Prakash.  2020.  Big Data and its Analyzing Tools : A Perspective. 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS). :560–565.
Data are generated and stored in databases at very high speed, and hence need to be handled and analyzed properly. Industries today extensively use Hadoop and Spark to analyze their datasets; both frameworks increase processing speed when computing huge, complex datasets, and many researchers compare the two. The big questions are: Is Spark a substitute for Hadoop? Is Hadoop going to be replaced by Spark in the near future? Spark is built on top of Hadoop and extends the model to deploy more types of computations, incorporating stream processing and interactive queries. No doubt Spark's execution speed is much faster than Hadoop's, but in terms of fault tolerance, Hadoop is slightly more fault tolerant than Spark. In this survey article, various big data analytics tools are compared, Hadoop and Spark are discussed in detail, an overview of big data and the issues around Spark and Hadoop is given, and approaches to resolving those issues are discussed elaborately.
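To make the relationship concrete, here is a minimal sketch, assuming a running Spark installation and an illustrative HDFS path: the classic word count, which in Hadoop requires a full MapReduce job with a disk-backed shuffle, expressed as a PySpark program whose intermediate data stays in executor memory.

```python
from pyspark.sql import SparkSession

# Word count on an HDFS file; the paths are illustrative.
spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (spark.sparkContext.textFile("hdfs:///data/input.txt")
          .flatMap(lambda line: line.split())    # map: line -> words
          .map(lambda w: (w, 1))                 # map: word -> (word, 1)
          .reduceByKey(lambda a, b: a + b))      # reduce: sum counts per word

counts.saveAsTextFile("hdfs:///out/wordcount")
spark.stop()
```

The fault-tolerance difference mentioned above follows from the same design choice: Hadoop materializes every stage to HDFS, while Spark recomputes lost partitions from RDD lineage in memory.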
2020-12-11
Kumar, S., Vasthimal, D. K..  2019.  Raw Cardinality Information Discovery for Big Datasets. 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing (HPSC), and IEEE Intl Conference on Intelligent Data and Security (IDS). :200–205.
Real-time discovery of all the different types of unique attributes within unstructured data is a challenging problem when dealing with multiple petabytes of unstructured data volume every day. Popular discovery solutions, such as creating offline jobs to uniquely identify attributes or running aggregation queries on raw data sets, limit real-time discovery use cases and often result in poor resource utilization. Discovery must be treated as a problem parallel to simply storing raw data sets efficiently on back-end big data systems. Solving the discovery problem with a parallel discovery data store infrastructure has multiple benefits, as it allows search queries against the raw data to be channeled in a much more funneled manner instead of being spread across the entire data set. Such focused search queries and data separation are far more performant and require a smaller compute and memory footprint.
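A hedged sketch of that parallel discovery-store idea: alongside raw ingestion, a small index records which unique attribute values exist and in which partitions, so search is funneled to only the relevant slices of raw data. The class and names are illustrative; at petabyte scale the per-attribute set would be replaced by a probabilistic structure such as a HyperLogLog sketch.

```python
from collections import defaultdict

class DiscoveryStore:
    """Toy discovery store maintained in parallel with raw storage:
    maps each observed attribute value to the partitions containing it."""
    def __init__(self):
        self.index = defaultdict(set)

    def ingest(self, partition_id: str, record: dict):
        # Called on the ingestion path, in parallel with the raw write.
        for key, value in record.items():
            self.index[f"{key}={value}"].add(partition_id)

    def partitions_for(self, key: str, value: str) -> set:
        # Query planning: funnel the search to the partitions that matter.
        return self.index.get(f"{key}={value}", set())

store = DiscoveryStore()
store.ingest("part-0001", {"device": "ios", "region": "eu"})
store.ingest("part-0002", {"device": "android", "region": "eu"})
print(store.partitions_for("device", "ios"))   # {'part-0001'}: part-0002 is never read
```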
2020-11-02
Kralevska, Katina, Gligoroski, Danilo, Jensen, Rune E., Øverby, Harald.  2018.  HashTag Erasure Codes: From Theory to Practice. IEEE Transactions on Big Data. 4:516–529.
Minimum-Storage Regenerating (MSR) codes have emerged as a viable alternative to Reed-Solomon (RS) codes as they minimize the repair bandwidth while remaining optimal in terms of reliability and storage overhead. Although several MSR constructions exist, so far they have not been practically implemented, mainly due to the large number of I/O operations. In this paper, we analyze high-rate MDS codes that are simultaneously optimized in terms of storage, reliability, I/O operations, and repair bandwidth for single and multiple failures of the systematic nodes. The codes were recently introduced in [1] without any specific name. Due to the resemblance between the hashtag sign # and the procedure of the code construction, we call them HashTag Erasure Codes (HTECs). HTECs provide the lowest data-read and data-transfer, and thus the lowest repair time, for an arbitrary sub-packetization level α, where α ≤ r^⌈k/r⌉, among all existing MDS codes for distributed storage, including MSR codes. The repair process is linear and highly parallel. Additionally, we show that HTECs are the first high-rate MDS codes that reduce the repair bandwidth for more than one failure. Practical implementations of HTECs in Hadoop release 3.0.0-alpha2 demonstrate their great potential.
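For context, the bound that such codes meet is standard regenerating-code theory rather than anything HTEC-specific. A sketch in LaTeX, with M the file size, (n,k) the code parameters, r = n-k, and d the number of helper nodes contacted during a single-node repair:

```latex
% Per-node storage and minimum repair bandwidth at the MSR point
\alpha = \frac{M}{k}, \qquad
\gamma_{\mathrm{MSR}} = \frac{d}{d - k + 1} \cdot \frac{M}{k}
% Illustrative parameters: (n,k,d) = (14,10,13) gives
% \gamma = (13/4)(M/10) = 0.325 M, versus reading the whole file of
% size M to repair a single node with a Reed-Solomon code.
```

The repair-bandwidth savings over RS codes come from the denominator d-k+1; HTECs additionally minimize how much data is read from disk, not only how much is transferred.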
2020-08-28
Yau, Yiu Chung, Khethavath, Praveen, Figueroa, Jose A..  2019.  Secure Pattern-Based Data Sensitivity Framework for Big Data in Healthcare. 2019 IEEE International Conference on Big Data, Cloud Computing, Data Science & Engineering (BCD). :65–70.
With the exponential growth in the usage of electronic medical records (EMR), the amount of data generated by the healthcare industry has likewise increased exponentially. These large amounts of data, known as "Big Data", are mostly unstructured, and special big data analytics methods are required to process them and retrieve meaningful information. As patient information in hospitals and other healthcare facilities becomes increasingly electronic, Big Data technologies are needed now more than ever to manage and understand it. In addition, this information tends to be quite sensitive and needs a highly secure environment. However, current security algorithms are hard to implement because they would take a huge amount of time and resources, and existing security protocols for Big Data are not adequate for protecting sensitive information in healthcare. As a result, healthcare data is both heterogeneous and insecure. As a solution, we propose the Secure Pattern-Based Data Sensitivity Framework (PBDSF), which uses machine learning mechanisms to identify the common set of attributes of patient data, data frequency, and the various patterns of codes used to identify specific conditions, in order to secure sensitive information. The framework uses Hadoop and is built on the Hadoop Distributed File System (HDFS) as the basis for our clusters of machines to process Big Data and perform tasks such as identifying sensitive information in huge amounts of data and encrypting the data identified as sensitive.
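A hedged sketch of the pattern-based step, with regexes standing in for the framework's machine-learned patterns (the SSN and ICD-10-style patterns, field names, and use of Fernet are illustrative assumptions, not the PBDSF implementation):

```python
import re
from cryptography.fernet import Fernet

# Illustrative stand-ins for learned sensitive-value patterns.
PATTERNS = {
    "ssn":   re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "icd10": re.compile(r"^[A-TV-Z]\d{2}(\.\d{1,4})?$"),   # e.g. E11.9
}

key = Fernet.generate_key()
cipher = Fernet(key)

def protect(record: dict) -> dict:
    """Encrypt only the fields classified as sensitive, leaving the rest
    in the clear for analytics over HDFS."""
    out = {}
    for field, value in record.items():
        if any(p.match(str(value)) for p in PATTERNS.values()):
            out[field] = cipher.encrypt(str(value).encode()).decode()
        else:
            out[field] = value
    return out

print(protect({"name": "J. Doe", "ssn": "123-45-6789", "diagnosis": "E11.9"}))
```

Encrypting only the flagged fields, rather than whole records, is one way to keep the time and resource cost of securing big datasets manageable.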
2020-07-10
Yang, Ying, Yu, Huanhuan, Yang, Lina, Yang, Ming, Chen, Lijuan, Zhu, Guichun, Wen, Liqiang.  2019.  Hadoop-based Dark Web Threat Intelligence Analysis Framework. 2019 IEEE 3rd Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC). :1088–1091.

As network services develop, people's privacy requirements continue to increase. Beyond providing anonymous user communication, it is necessary to protect the anonymity of the server. At the same time, the dark net carries many threatening criminal messages, yet many scholars lack the ability or expertise to conduct research on dark-net threat intelligence. This paper therefore designs a Hadoop-based framework for hidden threat intelligence. The framework uses HDFS as the underlying storage system and builds an HBase-based distributed database to store and manage threat intelligence information. According to the heterogeneous type of each forum, a web crawler collects data through the anonymous TOR tool. The framework is used to identify the characteristics of key dark-net criminal networks, which is the basis for later dark-net research.
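A minimal sketch of that collection path, fetching a hidden-service page through the Tor SOCKS proxy and persisting it into HBase. The proxy port, onion URL, hostname, and table schema are illustrative assumptions; this uses the requests library (with its socks extra) and the happybase HBase client, not necessarily the paper's tooling.

```python
import requests
import happybase

# Route HTTP through a local Tor client (default SOCKS port 9050).
# 'socks5h' makes Tor resolve the .onion hostname instead of the local resolver.
TOR_PROXY = {"http": "socks5h://127.0.0.1:9050",
             "https": "socks5h://127.0.0.1:9050"}

def crawl_to_hbase(url: str, row_key: str):
    page = requests.get(url, proxies=TOR_PROXY, timeout=60)
    conn = happybase.Connection("hbase-master")        # illustrative host
    table = conn.table("threat_intel")                 # assumes column family 'raw'
    table.put(row_key.encode(), {b"raw:html": page.content,
                                 b"raw:url": url.encode()})
    conn.close()

# The onion address below is a placeholder, not a real forum.
crawl_to_hbase("http://exampleforumplaceholder.onion/threads",
               "forum|2019-01-01|0001")
```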

2020-02-10
Taneja, Shubbhi, Zhou, Yi, Chavan, Ajit, Qin, Xiao.  2019.  Improving Energy Efficiency of Hadoop Clusters using Approximate Computing. 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing (HPSC), and IEEE Intl Conference on Intelligent Data and Security (IDS). :206–211.
There is an ongoing search for energy-efficient solutions in multi-core computing platforms. Approximate computing is one such solution, leveraging the forgiving nature of applications to improve energy efficiency at different layers of the computing platform, ranging from applications to hardware. We are interested in understanding the benefits of approximate computing in the realm of Apache Hadoop and its applications. Mechanisms for introducing approximation into programming models include sampling input data, skipping selective computations, relaxing synchronization, and user-defined quality levels. We believe that it is straightforward to apply these mechanisms to conserve energy in Hadoop clusters as well. The emerging trend of approximate computing motivates us to systematically investigate thermal profiling of approximate computing strategies in this research. In particular, we design a thermal-aware approximate computing framework called tHadoop2, an extension of the tHadoop framework proposed by Chavan et al. We investigated the thermal behavior of a MapReduce application called Pi running on Hadoop clusters by varying two input parameters: the number of maps and the number of sampling points per map. Our profiling results show that Pi exhibits inherent resilience in terms of the number of precision digits present in its value.
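The Pi application's approximation knob is easy to see in miniature. A hedged sketch of the same Monte Carlo estimate (not tHadoop2 itself): accuracy, and therefore computation and heat, scales with exactly the two parameters the study varies, the number of maps and the sampling points per map.

```python
import random

def pi_map(samples: int) -> int:
    """One 'map task': count random points landing inside the unit quarter circle."""
    return sum(1 for _ in range(samples)
               if random.random() ** 2 + random.random() ** 2 <= 1.0)

def estimate_pi(num_maps: int, samples_per_map: int) -> float:
    """'Reduce' step: aggregate hits across maps. Fewer samples give a coarser
    estimate for less work, which is the approximate-computing trade-off."""
    hits = sum(pi_map(samples_per_map) for _ in range(num_maps))
    return 4.0 * hits / (num_maps * samples_per_map)

for maps, samples in [(2, 1_000), (8, 100_000)]:
    print(f"maps={maps}, samples={samples}: pi ~ {estimate_pi(maps, samples):.4f}")
```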
2019-11-25
Liang, Tyng-Yeu, Yeh, Li-Wei, Wu, Chi-Hong.  2018.  A Visual MapReduce Program Development Environment for Heterogeneous Computing on Clouds. Proceedings of the 2018 International Conference on Computing and Data Engineering. :83–87.
This paper proposes a visual MapReduce program development environment called VMR for heterogeneous computing on clouds. The environment has three main advantages. First, it allows users to drag and drop graphical blocks instead of typing text to edit programs, saving effort and time spent on MapReduce programming, especially when analyzing data on clouds through mobile devices. Second, it can automatically translate the blocks of a user's MapReduce program into three different source-code versions, Java, C, and CUDA, and select one of them according to the processor architecture of the allocated resources for execution. Consequently, users can transparently and effectively exploit heterogeneous cloud resources for executing their MapReduce programs without having to write separate programs for each processor architecture themselves. Third, it enables clouds to outsource the computation tasks of MapReduce programs to mobile devices in order to increase job throughput or program performance.
2019-02-25
Lekshmi, M. B., Deepthi, V. R..  2018.  Spam Detection Framework for Online Reviews Using Hadoop's Computational Capability. 2018 International CET Conference on Control, Communication, and Computing (IC4). :436–440.
Nowadays, online reviews have become one of the vital elements for customers doing online shopping: organizations and individuals use this information to buy the right products and make business decisions. This has influenced spammers and unethical business people to create false reviews and promote their products to beat the competition, and spammers have developed sophisticated systems that can create bulk spam reviews on any website within hours. To tackle this problem, studies have been conducted to formulate effective ways to detect spam reviews. Various spam detection methods have been introduced, most of which extract meaningful features from the text or use machine learning techniques, but these approaches gave little importance to the type of features extracted and the processing rate. NetSpam [1] defines a framework that classifies a review dataset based on spam features and maps them to a spam detection procedure that outperforms previous work in predictive accuracy. In this work, a method is proposed that improves the processing rate by applying a distributed approach to the review dataset using MapReduce, the parallel programming model used for processing big data in Hadoop. The solution parallelizes the algorithm defined in NetSpam and defines a spam detection procedure with better predictive accuracy and processing rate, as sketched below.
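A hedged Hadoop Streaming sketch of that distribution step (the two features and the threshold are illustrative placeholders, not NetSpam's actual feature set):

```python
#!/usr/bin/env python3
import sys

def mapper():
    # Input lines: reviewer_id <TAB> rating <TAB> review_text
    for line in sys.stdin:
        reviewer, rating, text = line.rstrip("\n").split("\t", 2)
        extremity = 1.0 if rating in ("1", "5") else 0.0      # toy spam feature
        brevity = 1.0 if len(text.split()) < 10 else 0.0      # toy spam feature
        print(f"{reviewer}\t{extremity}\t{brevity}")

def reducer(threshold=0.7):
    # Hadoop sorts mapper output by key, so each reviewer's rows arrive together.
    current, scores = None, []
    def flush():
        if current is not None and scores:
            avg = sum(scores) / len(scores)
            print(f"{current}\t{'spam' if avg >= threshold else 'ham'}\t{avg:.2f}")
    for line in sys.stdin:
        reviewer, extremity, brevity = line.rstrip("\n").split("\t")
        if reviewer != current:
            flush()
            current, scores = reviewer, []
        scores.append((float(extremity) + float(brevity)) / 2)
    flush()

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Submitted as a streaming job (e.g. with -mapper "netspam.py map" and -reducer "netspam.py reduce"), the same scoring logic runs in parallel across the cluster instead of on a single machine.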
2018-09-12
Mattmann, Chris A., Sharan, Madhav.  2017.  Scalable Hadoop-Based Pooled Time Series of Big Video Data from the Deep Web. Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval. :117–120.

We contribute a scalable, open source implementation of the Pooled Time Series (PoT) algorithm from CVPR 2015. The algorithm is evaluated on approximately 6,800 human trafficking (HT) videos collected from the deep and dark web, and on an open dataset, the Human Motion Database (HMDB). We describe PoT, our motivation for using it on larger data, and the issues we encountered. Our new solution reimagines PoT as an Apache Hadoop-based algorithm. We demonstrate that the new Hadoop-based algorithm successfully identifies similar videos in the HT and HMDB datasets, and we evaluate it qualitatively and quantitatively.

2018-04-02
Mamun, A. Al, Salah, K., Al-maadeed, S., Sheltami, T. R..  2017.  BigCrypt for Big Data Encryption. 2017 Fourth International Conference on Software Defined Systems (SDS). :93–99.

As data sizes grow, cloud storage is becoming a more familiar way to store significant amounts of private information, and government and private organizations need to transfer plenty of business files from one end to another. However, we lose privacy if we exchange information without data encryption and communication-mechanism security. To protect data from hacking we can use symmetric encryption, but it has a key exchange problem; and although asymmetric key encryption deals with the limitations of symmetric key encryption, it can only encrypt a limited size of data, which is not feasible for large data files. In this paper, we propose a probabilistic approach to the Pretty Good Privacy technique for encrypting large-size data, named "BigCrypt", where both symmetric and asymmetric key encryption are used. Our goal is to achieve zero-tolerance security on a significant amount of data encryption. We have experimentally evaluated our technique on three different platforms.
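A hedged sketch of the hybrid (PGP-style) pattern the abstract describes: the bulk data is encrypted under a fresh symmetric key, and only that small key is encrypted asymmetrically, which sidesteps both the symmetric key-exchange problem and the asymmetric size limit. This uses the Python cryptography library as a stand-in; it is not BigCrypt's implementation.

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Recipient's RSA key pair (generated here for the demo; normally pre-shared).
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

OAEP = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

def hybrid_encrypt(data: bytes):
    session_key = Fernet.generate_key()                   # fresh symmetric key
    ciphertext = Fernet(session_key).encrypt(data)        # bulk data: symmetric, any size
    wrapped_key = public_key.encrypt(session_key, OAEP)   # key: asymmetric, tiny payload
    return wrapped_key, ciphertext

def hybrid_decrypt(wrapped_key: bytes, ciphertext: bytes) -> bytes:
    session_key = private_key.decrypt(wrapped_key, OAEP)
    return Fernet(session_key).decrypt(ciphertext)

data = b"a large business file ..." * 100_000
wrapped, ct = hybrid_encrypt(data)
assert hybrid_decrypt(wrapped, ct) == data
```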

2017-04-24
Tall, Anne, Wang, Jun, Han, Dezhi.  2016.  Survey of Data Intensive Computing Technologies Application to Security Log Data Management. Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies. :268–273.

Data intensive computing research and technology developments offer the potential of significant improvements in several security log management challenges. Approaches to address the complexity, timeliness, expense, diversity, and noise issues have been identified. These improvements are motivated by the increasingly important role of analytics: machine learning and expert systems that incorporate attack patterns are providing greater detection insights. Finding actionable indicators requires the analysis to combine security event log data with other network data such as access control lists, making the big-data problem even bigger. Automation of threat intelligence is recognized as incomplete, with limited adoption of standards. With limited progress in anomaly signature detection, movement towards using expert systems has been identified as the path forward; these techniques focus on matching attacker behaviors to patterns of abnormal activity in the network. The need to stream, parse, and analyze large volumes of small, semi-structured data files can be feasibly addressed through a variety of techniques identified by researchers. This report highlights research in key areas, including protection of the data, performance of the systems, and network bandwidth utilization.

2017-03-08
Gupta, A., Mehrotra, A., Khan, P. M..  2015.  Challenges of Cloud Computing & Big Data Analytics. 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom). :1112–1115.

Nowadays, for most organizations across the globe, two important IT initiatives are Big Data Analytics and Cloud Computing. Big Data Analytics can provide valuable insights that create competitiveness and generate increased revenues; Cloud Computing can enhance productivity and efficiency, thus reducing cost. Cloud Computing offers groups of servers, storage, and various networking resources, and it enables a Big Data environment to process voluminous, high-velocity, and varied formats of data.

2015-05-05
Marchal, S., Xiuyan Jiang, State, R., Engel, T..  2014.  A Big Data Architecture for Large Scale Security Monitoring. 2014 IEEE International Congress on Big Data (BigData Congress). :56–63.

Network traffic is a rich source of information for security monitoring. However, the increasing volume of data to process raises issues, rendering holistic analysis of network traffic difficult. In this paper we propose a solution to cope with the tremendous amount of data to analyze for security monitoring purposes. We introduce an architecture dedicated to security monitoring of local enterprise networks. The application domain of such a system is mainly network intrusion detection and prevention, but it can be used as well for forensic analysis. This architecture integrates two systems, one dedicated to scalable distributed data storage and management and the other dedicated to data exploitation. DNS data, NetFlow records, HTTP traffic, and honeypot data are mined and correlated in a distributed system that leverages state-of-the-art big data solutions. Data correlation schemes are proposed and their performance is evaluated against several well-known big data frameworks, including Hadoop and Spark.
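A hedged PySpark sketch of one correlation scheme in that spirit: joining DNS answers against NetFlow records on the resolved/destination IP, so each flow can be labeled with the domain it was most likely talking to. The paths and column layouts are illustrative, not the paper's schemas.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("security-monitoring").getOrCreate()

# Illustrative layouts: dns.csv = client,qname,answer_ip ; netflow.csv = src,dst,bytes
dns = (spark.read.option("header", True).csv("hdfs:///logs/dns.csv")
       .select("qname", "answer_ip"))
flows = spark.read.option("header", True).csv("hdfs:///logs/netflow.csv")

# Correlate flows with the DNS answers that produced their destination IPs.
labeled = flows.join(dns, flows.dst == dns.answer_ip, "left")

# One classic detection question: outbound flows to IPs nobody resolved via
# DNS, a common indicator of beaconing to a hard-coded address.
labeled.filter("qname IS NULL").show()
spark.stop()
```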

2014-09-17
Yu, Xianqing, Ning, Peng, Vouk, Mladen A..  2014.  Securing Hadoop in Cloud. Proceedings of the 2014 Symposium and Bootcamp on the Science of Security. :26:1–26:2.

Hadoop is a map-reduce implementation that rapidly processes data in parallel, and the Cloud provides reliability, flexibility, scalability, elasticity, and cost savings to customers, so moving Hadoop into the Cloud can be beneficial to Hadoop users. However, Hadoop has two vulnerabilities that can dramatically impact its security in a Cloud: its overloaded authentication key and the lack of fine-grained access control at the data access level. We propose and develop a security enhancement for Cloud-based Hadoop.