Biblio

Filters: Keyword is MapReduce
2022-08-26
Muchhala, Yash, Singhania, Harshit, Sheth, Sahil, Devadkar, Kailas.  2021.  Enabling MapReduce based Parallel Computation in Smart Contracts. 2021 6th International Conference on Inventive Computation Technologies (ICICT). :537–543.
Smart Contract based cryptocurrencies such as Ethereum are becoming increasingly popular in various domains, but with this increase in popularity comes a significant decrease in throughput and efficiency. Smart Contracts are executed by every miner in the system serially, with no parallelism either across or within Smart Contracts. Such serial execution inhibits the scalability required to obtain extremely high throughput for computationally intensive tasks deployed in such Smart Contracts. While significant advancements have been made in the field of concurrency, from GPU architectures that enable massively parallel computation to tools such as MapReduce that distribute computation across many connected nodes to achieve higher performance in distributed systems, none are incorporated in blockchain-based distributed computing. The team proposes a novel blockchain that allows public nodes in a permission-independent blockchain to deploy and run Smart Contracts that provide concurrency-related functionalities within the Smart Contract framework. In this paper, the researchers present “ConCurrency,” a blockchain network capable of handling big-data computations. The technique is based on currently used distributed-system paradigms, such as MapReduce, while also allowing for fundamental parallelly computable problems. Concurrency is achieved using a sharding protocol incorporated with consensus mechanisms to ensure high scalability, high reliability, and better efficiency. A detailed methodology and a comprehensive analysis of the proposed blockchain further indicate a significant increase in throughput for parallelly computable tasks, as detailed in this paper.
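As a rough in-process illustration of the shard/map/reduce split this abstract describes, consider the following minimal sketch; the names (shard_map, run_job) and the round-robin sharding are invented for illustration and are not ConCurrency's API.

```python
# Minimal sketch of a sharded map/reduce job, outside any blockchain: each
# shard (in ConCurrency, a group of nodes) maps over its slice independently,
# and a reducer folds the partial results back together.
from functools import reduce

def shard_map(shard_data, map_fn):
    """Map phase run independently on one shard's slice of the data."""
    return [map_fn(x) for x in shard_data]

def run_job(data, num_shards, map_fn, reduce_fn, initial):
    shards = [data[i::num_shards] for i in range(num_shards)]   # sharding
    partials = [shard_map(s, map_fn) for s in shards]           # parallel in spirit
    flat = [y for p in partials for y in p]
    return reduce(reduce_fn, flat, initial)                     # reduce phase

if __name__ == "__main__":
    # Summing word lengths as a stand-in for a parallelly computable task.
    print(run_job(["smart", "contract", "shards"], 2, len, lambda a, b: a + b, 0))
```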
2020-08-24
Al-Odat, Zeyad A., Khan, Samee U.  2019.  Anonymous Privacy-Preserving Scheme for Big Data Over the Cloud. 2019 IEEE International Conference on Big Data (Big Data). :5711–5717.
This paper introduces an anonymous privacy-preserving scheme for big data over the cloud. The proposed design helps to enhance the encryption/decryption time of big data by utilizing the MapReduce framework. The Hadoop distributed file system and the secure hash algorithm are employed to provide the anonymity, security and efficiency requirements for the proposed scheme. The experimental results show a significant enhancement in the computational time of data encryption and decryption.
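A toy sketch of how a map phase might combine SHA-based anonymization with parallel processing, with Python's multiprocessing standing in for Hadoop; the salting scheme and record layout are assumptions for illustration, not the paper's design.

```python
# Map task: replace each record's identifier with a salted SHA-256 digest,
# processing records in parallel as a MapReduce map phase would.
import hashlib
from multiprocessing import Pool

SALT = b"per-dataset-secret-salt"  # illustrative; key management is out of scope

def anonymize_record(record):
    """Return the record with its identifier replaced by a salted digest."""
    ident, payload = record
    digest = hashlib.sha256(SALT + ident.encode()).hexdigest()
    return (digest, payload)

if __name__ == "__main__":
    records = [("alice", "reading-1"), ("bob", "reading-2")]
    with Pool(2) as pool:                 # parallel map over input splits
        print(pool.map(anonymize_record, records))
```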
2020-02-10
Taneja, Shubbhi, Zhou, Yi, Chavan, Ajit, Qin, Xiao.  2019.  Improving Energy Efficiency of Hadoop Clusters using Approximate Computing. 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS). :206–211.
There is an ongoing search for energy-efficient solutions in multi-core computing platforms. Approximate computing is one such solution, leveraging the forgiving nature of applications to improve energy efficiency at different layers of the computing platform, ranging from applications to hardware. We are interested in understanding the benefits of approximate computing in the realm of Apache Hadoop and its applications. A few mechanisms for introducing approximation in programming models include sampling input data, skipping selective computations, relaxing synchronization, and user-defined quality levels. We believe that it is straightforward to apply the aforementioned mechanisms to conserve energy in Hadoop clusters as well. The emerging trend of approximate computing motivates us to systematically investigate thermal profiling of approximate computing strategies in this research. In particular, we design a thermal-aware approximate computing framework called tHadoop2, an extension of the tHadoop proposed by Chavan et al. We investigated the thermal behavior of a MapReduce application called Pi running on Hadoop clusters by varying two input parameters: the number of maps and the number of sampling points per map. Our profiling results show that Pi exhibits inherent resilience in terms of the number of precision digits present in its value.
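To make the two knobs concrete, here is a plain Monte Carlo stand-in for Hadoop's Pi example (the shipped example uses quasi-Monte Carlo sampling, but the map/reduce shape is the same): fewer sampling points per map means less work and a coarser approximation of Pi.

```python
# Each "map" draws points_per_map random points in the unit square; the
# "reduce" sums the inside-circle counts. The sample count is the
# approximation knob that trades precision digits for energy/work.
import random

def map_task(points_per_map, seed):
    rng = random.Random(seed)
    inside = 0
    for _ in range(points_per_map):
        x, y = rng.random(), rng.random()
        inside += (x * x + y * y) <= 1.0
    return inside

def estimate_pi(num_maps, points_per_map):
    counts = [map_task(points_per_map, seed) for seed in range(num_maps)]  # map
    return 4.0 * sum(counts) / (num_maps * points_per_map)                 # reduce

if __name__ == "__main__":
    for samples in (1_000, 100_000):      # coarse vs. finer approximation
        print(samples, estimate_pi(num_maps=8, points_per_map=samples))
```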
2019-12-30
Dong, Yao, Milanova, Ana, Dolby, Julian.  2018.  SecureMR: Secure Mapreduce Computation Using Homomorphic Encryption and Program Partitioning. Proceedings of the 5th Annual Symposium and Bootcamp on Hot Topics in the Science of Security. :4:1–4:13.
In cloud computing customers upload data and computation to cloud providers. As they upload their data to the cloud provider, they typically give up data confidentiality. We develop SecureMR, a system that analyzes and transforms MapReduce programs to operate over encrypted data. SecureMR makes use of partially homomorphic encryption and a trusted client. We evaluate SecureMR on a set of complex computation-intensive MapReduce benchmarks.
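The following is a sketch of the building block SecureMR relies on, not SecureMR itself: a reduce-side sum computed over Paillier ciphertexts, so the untrusted side never sees plaintext values. It assumes the python-paillier package (pip install phe).

```python
# Additively homomorphic sum: the "cloud" adds ciphertexts; only the trusted
# client, holding the private key, can decrypt the result.
from functools import reduce
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()   # trusted client

values = [12, 7, 30]                                   # plaintexts at the client
ciphertexts = [public_key.encrypt(v) for v in values]  # uploaded encrypted

# Untrusted reduce step: operates on ciphertexts only.
encrypted_sum = reduce(lambda a, b: a + b, ciphertexts)

print(private_key.decrypt(encrypted_sum))              # client decrypts: 49
```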
2019-11-25
Liang, Tyng-Yeu, Yeh, Li-Wei, Wu, Chi-Hong.  2018.  A Visual MapReduce Program Development Environment for Heterogeneous Computing on Clouds. Proceedings of the 2018 International Conference on Computing and Data Engineering. :83–87.
This paper proposes a visual MapReduce program development environment called VMR for heterogeneous computing on clouds. This development environment has three main advantages. First, it allows users to drag and drop graphical blocks instead of typing text to edit programs. Therefore, users can save the effort and time spent on MapReduce programming, especially when they analyze data on clouds through mobile devices. Second, it can automatically translate the blocks of users' MapReduce programs into three different versions of source code, in Java, C, and CUDA, and select one of these versions according to the processor architecture of the allocated resources for execution. Consequently, users can transparently and effectively exploit heterogeneous resources in clouds for executing their MapReduce programs without needing to write a separate program for each processor architecture themselves. Third, it can enable clouds to outsource the computation tasks of MapReduce programs to mobile devices in order to increase job throughput or program performance.
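A minimal sketch of the backend-selection step described above, under the assumption (not from the paper) that selection keys off simple resource flags; the file names and detection logic are invented for illustration.

```python
# Pick one of the three generated versions of the same block program to
# match the processor architecture of the allocated resource.
GENERATED = {"java": "WordCount.java", "c": "wordcount.c", "cuda": "wordcount.cu"}

def pick_version(resource):
    if resource.get("gpu"):          # CUDA-capable resource
        return GENERATED["cuda"]
    if resource.get("jvm"):          # JVM available
        return GENERATED["java"]
    return GENERATED["c"]            # fallback: plain C

if __name__ == "__main__":
    print(pick_version({"gpu": True}))   # wordcount.cu
    print(pick_version({"jvm": True}))   # WordCount.java
```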
2019-07-01
Meryem, Amar, Samira, Douzi, Bouabid, El Ouahidi.  2018.  Enhancing Cloud Security Using Advanced MapReduce K-means on Log Files. Proceedings of the 2018 International Conference on Software Engineering and Information Management. :63–67.

Many customers rank cloud security as a major challenge that threatens their work and reduces their trust in cloud service providers. Hence, a significant improvement is required to establish better adaptations of security measures that suit recent technologies, especially distributed architectures. Given the meaningful data recorded in cloud-generated log files, analyzing them mines insightful information about hackers' activities: it identifies malicious user behaviors and predicts new suspected events. Moreover, centralizing log files prevents insiders from causing damage to the system. In this paper, we propose to move sensitive log files to a single provider server and to combine MapReduce programming and k-means in the same algorithm to cluster observed events into classes with similar features. To label unknown user behaviors and predict new suspected activities, this approach considers cosine distances and deviation metrics.
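One k-means iteration written in map/reduce form, assuming (as the abstract suggests) cosine distance for assignment; a real deployment would run this inside Hadoop rather than in-process, and the toy feature vectors are placeholders.

```python
# Map: assign each event vector to its nearest centroid by cosine distance.
# Reduce: recompute each centroid as the mean of its assigned vectors.
import math
from collections import defaultdict

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def map_assign(points, centroids):
    """Emit (closest-centroid-index, point) pairs."""
    for p in points:
        idx = min(range(len(centroids)),
                  key=lambda i: cosine_distance(p, centroids[i]))
        yield idx, p

def reduce_recompute(pairs, dim):
    """Average the points assigned to each centroid."""
    sums, counts = defaultdict(lambda: [0.0] * dim), defaultdict(int)
    for idx, p in pairs:
        counts[idx] += 1
        for d in range(dim):
            sums[idx][d] += p[d]
    return {i: [s / counts[i] for s in sums[i]] for i in sums}

if __name__ == "__main__":
    events = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0)]   # toy log-feature vectors
    print(reduce_recompute(map_assign(events, [(1.0, 0.0), (0.0, 1.0)]), dim=2))
```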

2019-02-25
Lekshmi, M. B., Deepthi, V. R.  2018.  Spam Detection Framework for Online Reviews Using Hadoop’s Computational Capability. 2018 International CET Conference on Control, Communication, and Computing (IC4). :436–440.
Nowadays, online reviews have become one of the vital elements for customers doing online shopping. Organizations and individuals use this information to buy the right products and make business decisions. This has influenced spammers and unethical businesses to create false reviews and promote their products to beat the competition. Spammers have developed sophisticated systems that can create bulk spam reviews on any website within hours. To tackle this problem, studies have been conducted to formulate effective ways to detect spam reviews. Various spam detection methods have been introduced, most of which extract meaningful features from the text or use machine learning techniques. These approaches gave little importance to the type of extracted features and the processing rate. NetSpam [1] defines a framework that classifies a review dataset based on spam features and maps them to a spam detection procedure that outperforms previous works in predictive accuracy. In this work, a method is proposed that improves the processing rate by applying a distributed approach to the review dataset using MapReduce. Parallel programming with MapReduce is used for processing big data in Hadoop. The solution parallelises the algorithm defined in NetSpam and yields a spam detection procedure with better predictive accuracy and processing rate.
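A hypothetical sketch of the shape such a parallelised scoring step could take: each map task scores the reviews in its split from precomputed spam features, and a reduce averages scores per reviewer. The feature names and weights are placeholders, not NetSpam's actual feature set.

```python
# Map: weighted spam score per review. Reduce: mean score per reviewer.
from collections import defaultdict

WEIGHTS = {"burstiness": 0.4, "rating_deviation": 0.6}   # illustrative weights

def map_score(reviews):
    for review in reviews:
        score = sum(WEIGHTS[f] * review["features"][f] for f in WEIGHTS)
        yield review["reviewer"], score

def reduce_average(pairs):
    total, count = defaultdict(float), defaultdict(int)
    for reviewer, score in pairs:
        total[reviewer] += score
        count[reviewer] += 1
    return {r: total[r] / count[r] for r in total}

if __name__ == "__main__":
    split = [{"reviewer": "u1",
              "features": {"burstiness": 0.9, "rating_deviation": 0.8}},
             {"reviewer": "u2",
              "features": {"burstiness": 0.1, "rating_deviation": 0.2}}]
    print(reduce_average(map_score(split)))
```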
2018-09-12
Nagaratna, M., Sowmya, Y.  2017.  M-sanit: Computing misusability score and effective sanitization of big data using Amazon elastic MapReduce. 2017 International Conference on Computation of Power, Energy Information and Communication (ICCPEIC). :029–035.
The advent of distributed programming frameworks like Hadoop paved the way for processing voluminous data known as big data. Due to the exponential growth of data, enterprises have started to exploit the availability of cloud infrastructure for storing and processing big data. Insider attacks on outsourced data cause leakage of sensitive data. Therefore, it is essential to sanitize data so as to preserve privacy, or non-disclosure of sensitive data. Privacy Preserving Data Publishing (PPDP) and Privacy Preserving Data Mining (PPDM) are areas in which data sanitization plays a vital role in preserving privacy. The existing anonymization techniques for MapReduce programming can be improved with a misusability measure for determining the level of sanitization to be applied to big data. To overcome this limitation we propose a framework known as M-Sanit, which has mechanisms to exploit the misusability score of big data prior to performing sanitization using the MapReduce programming paradigm. Our empirical study using a real-world cloud ecosystem, namely Amazon Elastic Compute Cloud (EC2) and Amazon Elastic MapReduce (EMR), reveals the effectiveness of misusability-score-based sanitization of big data prior to publishing or mining it.
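A toy illustration of score-then-sanitize, with a made-up misusability metric (the fraction of quasi-identifiers present); M-Sanit's actual scoring is not reproduced here.

```python
# Score a record for misusability, then suppress quasi-identifiers only when
# the score crosses a threshold, i.e., sanitization level follows the score.
QUASI_IDENTIFIERS = ("zip", "birthdate", "gender")

def misusability_score(record):
    present = sum(1 for q in QUASI_IDENTIFIERS if record.get(q))
    return present / len(QUASI_IDENTIFIERS)

def sanitize(record, threshold=0.5):
    if misusability_score(record) >= threshold:
        return {k: ("*" if k in QUASI_IDENTIFIERS else v)
                for k, v in record.items()}
    return record

if __name__ == "__main__":
    print(sanitize({"zip": "10001", "birthdate": "1990-01-01",
                    "gender": "F", "diagnosis": "flu"}))
```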
2018-04-02
Al-Zobbi, M., Shahrestani, S., Ruan, C..  2017.  Implementing A Framework for Big Data Anonymity and Analytics Access Control. 2017 IEEE Trustcom/BigDataSE/ICESS. :873–880.

Analytics in big data is maturing and moving towards mass adoption. The emergence of analytics increases the need for innovative tools and methodologies to protect data against privacy violation. Many data anonymization methods have been proposed to provide some degree of privacy protection by applying data suppression and other distortion techniques. However, currently available methods suffer from poor scalability and performance and a lack of framework standardization, and they are unable to cope with the massive size of data processing. Some of these methods were proposed specifically for the MapReduce framework to operate on big data, but they still follow conventional data management approaches, so no remarkable performance gains were achieved. We introduce a framework that operates in the MapReduce environment to benefit from its advantages, as well as from those of the Hadoop ecosystem. Our framework provides granular user access that can be tuned to different authorization levels. The proposed solution provides fine-grained alteration based on the user's authorization level to access the MapReduce domain for analytics. Using well-developed role-based access control approaches, this framework is capable of assigning roles to users and mapping them to the relevant data attributes.
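A minimal sketch of the role-to-attribute mapping described above: before a record reaches analytics, columns outside the caller's authorization level are dropped. The role names and attributes are invented for illustration.

```python
# Role-based column filtering: each role maps to the set of attributes it may
# see; everything else is stripped from the record.
ROLE_ATTRIBUTES = {
    "analyst": {"age_range", "region", "diagnosis"},
    "auditor": {"age_range", "region"},
}

def filter_record(record, role):
    allowed = ROLE_ATTRIBUTES.get(role, set())
    return {k: v for k, v in record.items() if k in allowed}

if __name__ == "__main__":
    row = {"age_range": "30-39", "region": "NSW", "diagnosis": "flu", "name": "x"}
    print(filter_record(row, "auditor"))   # {'age_range': '30-39', 'region': 'NSW'}
```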

2017-12-20
Li, S., Wang, B.  2017.  A Method for Hybrid Bayesian Network Structure Learning from Massive Data Using MapReduce. 2017 IEEE 3rd International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS). :272–276.
The Bayesian Network is a popular and important data mining model for representing uncertain knowledge. For large-scale data it is often too costly to learn an accurate structure. To resolve this problem, much work has been done on migrating structure learning algorithms to the MapReduce framework. In this paper, we introduce a distributed hybrid structure learning algorithm that combines the advantages of constraint-based and score-and-search-based algorithms. By reusing the intermediate results of MapReduce, the algorithm greatly simplifies the computation and achieves good results in both efficiency and accuracy.
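A sketch of the constraint-based half only, under simplifying assumptions: map tasks count joint occurrences for a candidate variable pair over their data split, and the pooled counts yield mutual information used as an independence test. The score-and-search half and the MapReduce plumbing are omitted.

```python
# Map: joint counts for variables x and y over one split.
# Reduce: pool the counts, then compute mutual information.
import math
from collections import Counter

def map_counts(rows, x, y):
    return Counter((row[x], row[y]) for row in rows)

def mutual_information(joint):
    n = sum(joint.values())
    px, py = Counter(), Counter()
    for (a, b), c in joint.items():
        px[a] += c
        py[b] += c
    return sum((c / n) * math.log((c * n) / (px[a] * py[b]))
               for (a, b), c in joint.items())

if __name__ == "__main__":
    splits = [[{"X": 0, "Y": 0}, {"X": 1, "Y": 1}],
              [{"X": 0, "Y": 0}, {"X": 1, "Y": 1}]]
    pooled = sum((map_counts(s, "X", "Y") for s in splits), Counter())  # reduce
    print(mutual_information(pooled))   # high MI suggests keeping edge X-Y
```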
2017-10-13
Aydin, Kevin, Bateni, MohammadHossein, Mirrokni, Vahab.  2016.  Distributed Balanced Partitioning via Linear Embedding. Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. :387–396.

Balanced partitioning is often a crucial first step in solving large-scale graph optimization problems: in some cases, a big graph is chopped into pieces that fit on one machine to be processed independently before stitching the results together, leading to certain suboptimality from the interaction among different pieces. In other cases, links between different parts may show up in the running time and/or network communication cost, hence the desire for a small cut size. We study a distributed balanced partitioning problem where the goal is to partition the vertices of a given graph into k pieces, minimizing the total cut size. Our algorithm is composed of a few steps that are easily implementable in distributed computation frameworks, e.g., MapReduce. The algorithm first embeds nodes of the graph onto a line, and then processes nodes in a distributed manner guided by the linear embedding order. We examine various ways to find the initial embedding, e.g., via hierarchical clustering or Hilbert curves. We then apply four different techniques: local swaps, minimum cuts on partition boundaries, contraction, and dynamic programming. Our empirical study compares these techniques with each other and with previous work in distributed algorithms, e.g., a label propagation method, FENNEL, and Spinner. We report our results both on a private map graph and on several public social networks, and show that our results beat previous distributed algorithms: we observe, e.g., a 15-25% reduction in cut size over [UB13]. We also observe that our algorithms allow for a scalable distributed implementation for any number of partitions. Finally, we apply our techniques to Google Maps Driving Directions to minimize the number of multi-shard queries with the goal of saving CPU usage. During live experiments, we observe an ≈ 40% drop in the number of multi-shard queries when comparing our method with a standard geography-based method.
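A bare-bones, in-process rendering of the pipeline's final steps: given some linear embedding (here just a supplied 1-D coordinate per node, standing in for a Hilbert-curve or clustering order), cut the order into k equal pieces, then greedily swap nodes across adjacent boundaries when that lowers the cut size. This is a sketch of the idea, not the paper's distributed algorithm.

```python
# Partition by linear-embedding order, then one balance-preserving
# local-swap pass over adjacent partition boundaries.
def cut_size(parts, edges):
    where = {v: i for i, p in enumerate(parts) for v in p}
    return sum(1 for u, v in edges if where[u] != where[v])

def partition(embedding, edges, k):
    order = sorted(embedding, key=embedding.get)          # embedding order
    size = -(-len(order) // k)                            # ceil division
    parts = [set(order[i:i + size]) for i in range(0, len(order), size)]
    for i in range(len(parts) - 1):
        for u in list(parts[i]):
            for v in list(parts[i + 1]):
                base = cut_size(parts, edges)
                parts[i].remove(u); parts[i + 1].add(u)   # trial swap
                parts[i + 1].remove(v); parts[i].add(v)
                if cut_size(parts, edges) >= base:        # revert if no gain
                    parts[i].add(u); parts[i + 1].remove(u)
                    parts[i + 1].add(v); parts[i].remove(v)
                else:
                    break                                 # u moved; next u
    return parts

if __name__ == "__main__":
    emb = {"a": 0.1, "b": 0.2, "c": 0.8, "d": 0.9}
    print(partition(emb, [("a", "b"), ("c", "d"), ("b", "c")], k=2))
```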

2017-08-02
Agarwal, Pankaj K., Fox, Kyle, Munagala, Kamesh, Nath, Abhinandan.  2016.  Parallel Algorithms for Constructing Range and Nearest-Neighbor Searching Data Structures. Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. :429–440.

With the massive amounts of data available today, it is common to store and process data using multiple machines. Parallel programming platforms such as MapReduce and its variants are popular frameworks for handling such large data. We present the first provably efficient algorithms to compute, store, and query data structures for range queries and approximate nearest neighbor queries in a popular parallel computing abstraction that captures the salient features of MapReduce and other massively parallel communication (MPC) models. In particular, we describe algorithms for kd-trees, range trees, and BBD-trees that only require O(1) rounds of communication for both preprocessing and querying while staying competitive in terms of running time and workload to their classical counterparts. Our algorithms are randomized, but they can be made deterministic at some increase in their running time and workload while keeping the number of rounds of communication constant.
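A one-round flavour of the sample-based idea, not the paper's exact algorithm: pick a splitter from a small random sample (one communication round), bucket all points by it as separate "machines" would, then build an ordinary kd-tree locally per bucket.

```python
# Sample-based top split followed by local kd-tree construction per bucket.
import random

def build_kdtree(points, depth=0):
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def distributed_build(points, sample_size=32):
    sample = sorted(random.sample(points, min(sample_size, len(points))))
    splitter = sample[len(sample) // 2][0]            # x-median of the sample
    left = [p for p in points if p[0] <= splitter]    # bucketed "machines"
    right = [p for p in points if p[0] > splitter]
    return {"splitter": splitter,                     # top of the tree
            "left": build_kdtree(left, 1),
            "right": build_kdtree(right, 1)}

if __name__ == "__main__":
    pts = [(random.random(), random.random()) for _ in range(200)]
    print(distributed_build(pts)["splitter"])
```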

2017-04-24
Zhang, Xuyun, Leckie, Christopher, Dou, Wanchun, Chen, Jinjun, Kotagiri, Ramamohanarao, Salcic, Zoran.  2016.  Scalable Local-Recoding Anonymization Using Locality Sensitive Hashing for Big Data Privacy Preservation. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. :1793–1802.

While cloud computing has become an attractive platform for supporting data intensive applications, a major obstacle to the adoption of cloud computing in sectors such as health and defense is the privacy risk associated with releasing datasets to third-parties in the cloud for analysis. A widely-adopted technique for data privacy preservation is to anonymize data via local recoding. However, most existing local-recoding techniques are either serial or distributed without directly optimizing scalability, thus rendering them unsuitable for big data applications. In this paper, we propose a highly scalable approach to local-recoding anonymization in cloud computing, based on Locality Sensitive Hashing (LSH). Specifically, a novel semantic distance metric is presented for use with LSH to measure the similarity between two data records. Then, LSH with the MinHash function family can be employed to divide datasets into multiple partitions for use with MapReduce to parallelize computation while preserving similarity. By using our efficient LSH-based scheme, we can anonymize each partition through the use of a recursive agglomerative k-member clustering algorithm. Extensive experiments on real-life datasets show that our approach significantly improves the scalability and time-efficiency of local-recoding anonymization by orders of magnitude over existing approaches.
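A sketch of the partitioning step only: MinHash signatures over each record's attribute set send similar records to the same partition (in the paper, a MapReduce key), after which each partition would be anonymized by k-member clustering. The hash count and toy records are illustrative, and using the full signature as the key is a simplification of LSH banding.

```python
# MinHash signature per record; records sharing a signature land in the same
# partition, preserving similarity across the parallel split.
import hashlib
from collections import defaultdict

NUM_HASHES = 4

def minhash(items):
    sig = []
    for i in range(NUM_HASHES):
        sig.append(min(hashlib.sha1(f"{i}:{x}".encode()).hexdigest()
                       for x in items))
    return tuple(sig)

def partition(records):
    parts = defaultdict(list)
    for rec in records:
        parts[minhash(rec["attrs"])].append(rec)   # signature = partition key
    return parts

if __name__ == "__main__":
    rows = [{"id": 1, "attrs": {"flu", "30s", "NSW"}},
            {"id": 2, "attrs": {"flu", "30s", "NSW"}},
            {"id": 3, "attrs": {"asthma", "60s", "VIC"}}]
    for members in partition(rows).values():
        print([m["id"] for m in members])
```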

2017-02-14
Hong, K. F., Chen, C. C., Chiu, Y. T., Chou, K. S.  2015.  "Ctracer: Uncover C&C in Advanced Persistent Threats Based on Scalable Framework for Enterprise Log Data". 2015 IEEE International Congress on Big Data. :551–558.

Advanced Persistent Threats (APTs), unlike traditional hacking attempts, carry out specific attacks on a specific target to illegally collect information and data from it. These targeted attacks use specially crafted malware and infrequent activity to avoid detection, so that hackers can retain control over target systems unnoticed for long periods of time. In order to detect these stealthy activities, a large volume of traffic data generated over a period of time has to be analyzed. We propose a scalable solution, Ctracer, to detect stealthy command and control channels in large volumes of traffic data. An APT uses multiple command and control (C&C) channels and changes them frequently to avoid detection, but there are common signatures in those C&C sessions. By identifying common network signatures, Ctracer is able to group the C&C sessions, so we can detect an APT and all the C&C sessions used in that attack. Ctracer was evaluated in a large enterprise for four months, and twenty C&C servers and three APT attacks were reported. After investigation by the enterprise's Security Operations Center (SOC), the forensic report confirmed specific enterprise-targeted APT cases that had gone undiscovered for over 120 days.
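A guess at the shape of the grouping step: derive a coarse signature per session and group sessions that share it, so infrequent but consistent C&C beacons cluster together. The chosen signature fields are placeholders, not Ctracer's actual feature set.

```python
# Group sessions by a coarse (port, payload-size bucket, beacon-interval)
# signature and keep only recurring groups as C&C candidates.
from collections import defaultdict

def signature(session):
    return (session["dst_port"],
            session["bytes"] // 100,            # payload-size bucket
            round(session["interval_s"] / 60))  # beacon-interval bucket (min)

def group_sessions(sessions):
    groups = defaultdict(list)
    for s in sessions:
        groups[signature(s)].append(s)
    return {sig: g for sig, g in groups.items() if len(g) > 1}

if __name__ == "__main__":
    sessions = [{"dst_port": 443, "bytes": 512, "interval_s": 3600, "host": "h1"},
                {"dst_port": 443, "bytes": 530, "interval_s": 3620, "host": "h2"},
                {"dst_port": 80, "bytes": 90_000, "interval_s": 5, "host": "h3"}]
    print(group_sessions(sessions))   # the two beacon-like sessions group
```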

2015-05-04
Shen, Yun, Thonnard, O.  2014.  MR-TRIAGE: Scalable multi-criteria clustering for big data security intelligence applications. 2014 IEEE International Conference on Big Data (Big Data). :627–635.

Security companies have recently realised that mining massive amounts of security data can help generate actionable intelligence and improve their understanding of Internet attacks. In particular, attack attribution and situational understanding are considered critical aspects of effectively dealing with emerging, increasingly sophisticated Internet attacks. This requires highly scalable analysis tools to help analysts classify, correlate and prioritise security events, depending on their likely impact and threat level. However, this security data mining process typically involves a considerable number of features interacting in non-obvious ways, which makes it inherently complex. To deal with this challenge, we introduce MR-TRIAGE, a set of distributed algorithms built on MapReduce that can perform scalable multi-criteria data clustering on large security data sets and identify complex relationships hidden in massive datasets. The MR-TRIAGE workflow consists of a scalable data summarisation followed by scalable graph clustering algorithms into which we integrate multi-criteria evaluation techniques. The theoretical computational complexity of the proposed parallel algorithms is discussed and analysed. The experimental results demonstrate that the algorithms scale well and efficiently process large security datasets on commodity hardware. Our approach can effectively cluster any type of security events (e.g., spam emails, spear-phishing attacks, etc.) that share at least some commonalities among a number of predefined features.
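A small-scale rendering of the multi-criteria idea, with invented weights and threshold: per-feature similarities are aggregated (a stand-in for the paper's multi-criteria evaluation), events above a threshold get an edge, and connected components give the clusters.

```python
# Weighted multi-criteria similarity -> similarity graph -> connected
# components via union-find.
from itertools import combinations

WEIGHTS = {"subject": 0.5, "sender_domain": 0.3, "url": 0.2}  # illustrative

def similarity(a, b):
    return sum(w * (a[f] == b[f]) for f, w in WEIGHTS.items())

def cluster(events, threshold=0.5):
    parent = list(range(len(events)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i, j in combinations(range(len(events)), 2):   # graph edges
        if similarity(events[i], events[j]) >= threshold:
            parent[find(i)] = find(j)                  # union
    groups = {}
    for i in range(len(events)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

if __name__ == "__main__":
    spam = [{"subject": "win!", "sender_domain": "a.com", "url": "u1"},
            {"subject": "win!", "sender_domain": "a.com", "url": "u2"},
            {"subject": "hi", "sender_domain": "b.org", "url": "u3"}]
    print(cluster(spam))   # the first two events share a cluster
```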