Bibliography
Many customers rank cloud security as a major challenge that threatens their work and reduces their trust in cloud service providers. Hence, significant improvement is required to establish security measures better adapted to recent technologies, and especially to distributed architectures. Given the meaningful data recorded in cloud-generated log files, analyzing them mines insightful information about hackers' activities: it identifies malicious user behaviors and predicts new suspected events. Moreover, centralizing log files prevents insiders from causing damage to the system. In this paper, we propose to move sensitive log files to a single provider server and to combine MapReduce programming and k-means in the same algorithm to cluster observed events into classes with similar features. To label unknown user behaviors and predict new suspected activities, this approach relies on cosine distances and deviation metrics.
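A minimal sketch of the clustering step described above, assuming log events have already been encoded as numeric feature vectors: the map phase assigns each event to its nearest centroid under cosine distance, and the reduce phase recomputes centroids, one k-means iteration per MapReduce round. All names and the deviation threshold are illustrative, not taken from the paper.

    # One k-means iteration expressed as map/reduce over log-event feature
    # vectors, using cosine distance; illustrative sketch only.
    import math
    from collections import defaultdict

    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return 1.0 - dot / (na * nb) if na and nb else 1.0

    def map_assign(event, centroids):
        """Map phase: emit (nearest-centroid id, event vector)."""
        cid = min(range(len(centroids)),
                  key=lambda i: cosine_distance(event, centroids[i]))
        return cid, event

    def reduce_recompute(assigned):
        """Reduce phase: average the vectors assigned to each centroid."""
        return {cid: [sum(col) / len(vecs) for col in zip(*vecs)]
                for cid, vecs in assigned.items()}

    def kmeans_iteration(events, centroids, deviation=0.5):
        assigned, suspects = defaultdict(list), []
        for event in events:
            cid, vec = map_assign(event, centroids)
            # Events far from every centroid are flagged as suspected new
            # activity (a rough stand-in for the paper's deviation metric).
            if cosine_distance(vec, centroids[cid]) > deviation:
                suspects.append(vec)
            assigned[cid].append(vec)
        return reduce_recompute(assigned), suspects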
Analytics in big data is maturing and moving towards mass adoption. The emergence of analytics increases the need for innovative tools and methodologies to protect data against privacy violation. Many data anonymization methods have been proposed to provide some degree of privacy protection by applying data suppression and other distortion techniques. However, currently available methods suffer from poor scalability and performance and from a lack of framework standardization, and they are unable to cope with massive data-processing workloads. Some of these methods were proposed specifically for the MapReduce framework to operate on big data, but they still follow conventional data management approaches, so no remarkable performance gains have resulted. We introduce a framework that operates in the MapReduce environment to benefit from its advantages, as well as from those of the Hadoop ecosystem. Our framework provides granular user access that can be tuned to different authorization levels. The proposed solution performs fine-grained alteration based on the user's authorization level for accessing the MapReduce domain for analytics. Using well-developed role-based access control approaches, the framework assigns roles to users and maps them to the relevant data attributes.
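A minimal sketch of the fine-grained, role-driven alteration idea, assuming a simple mapping from roles to the attributes they may see in the clear; the role names and policy below are hypothetical, not the paper's actual access model.

    # Fine-grained record alteration driven by the requesting user's role,
    # as could run in a map task before analytics. The roles and the
    # attribute policy are invented for illustration.
    POLICY = {
        "analyst": {"event_type", "timestamp"},
        "auditor": {"event_type", "timestamp", "user_id"},
        "admin":   {"event_type", "timestamp", "user_id", "ip_address"},
    }

    def mask_record(record, role):
        """Suppress every attribute the role is not authorized to see."""
        allowed = POLICY.get(role, set())
        return {k: (v if k in allowed else "*") for k, v in record.items()}

    record = {"event_type": "login", "timestamp": "2015-06-01T10:22:03",
              "user_id": "u1042", "ip_address": "10.0.0.7"}
    print(mask_record(record, "analyst"))
    # {'event_type': 'login', 'timestamp': '2015-06-01T10:22:03',
    #  'user_id': '*', 'ip_address': '*'}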
Balanced partitioning is often a crucial first step in solving large-scale graph optimization problems: in some cases, a big graph is chopped into pieces that fit on one machine to be processed independently before stitching the results together, leading to certain suboptimality from the interaction among different pieces. In other cases, links between different parts may show up in the running time and/or network communication cost, hence the desire for small cut size. We study a distributed balanced partitioning problem where the goal is to partition the vertices of a given graph into k pieces, minimizing the total cut size. Our algorithm is composed of a few steps that are easily implementable in distributed computation frameworks, e.g., MapReduce. The algorithm first embeds the nodes of the graph onto a line, and then processes the nodes in a distributed manner guided by the linear embedding order. We examine various ways to find the initial embedding, e.g., via hierarchical clustering or Hilbert curves. We then apply four different techniques: local swaps, minimum cuts on partition boundaries, contraction, and dynamic programming. Our empirical study compares these techniques with each other and with previous work in distributed algorithms, e.g., a label propagation method, FENNEL, and Spinner. We report results both on a private map graph and on several public social networks, and show that our results beat previous distributed algorithms, with, e.g., a 15-25% reduction in cut size over [UB13]. We also observe that our algorithms allow for scalable distributed implementation for any number of partitions. Finally, we apply our techniques to Google Maps Driving Directions to minimize the number of multi-shard queries and thereby save CPU usage. In live experiments, we observe an ≈40% drop in the number of multi-shard queries when comparing our method with a standard geography-based method.
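A minimal sketch of the first stage only: embed nodes on a line and cut the order into k near-equal pieces, here using a Z-order (Morton) index as a crude stand-in for the Hilbert-curve embedding. The refinement steps (local swaps, boundary min-cuts, contraction and dynamic programming) are not shown, and all names are illustrative.

    # Embed nodes on a line via bit interleaving, then split the order
    # into k contiguous, near-equal pieces.
    def z_order(x, y, bits=16):
        """Interleave the bits of (x, y) into a Z-order (Morton) index."""
        z = 0
        for i in range(bits):
            z |= ((x >> i) & 1) << (2 * i)
            z |= ((y >> i) & 1) << (2 * i + 1)
        return z

    def balanced_partition(nodes, k):
        """nodes: list of (node_id, x, y); returns node_id -> piece in [0, k)."""
        order = sorted(nodes, key=lambda n: z_order(n[1], n[2]))
        size = -(-len(order) // k)  # ceiling division for near-equal pieces
        return {n[0]: i // size for i, n in enumerate(order)}

    nodes = [("a", 3, 7), ("b", 3, 6), ("c", 90, 2), ("d", 91, 1)]
    print(balanced_partition(nodes, 2))  # nearby nodes share a piece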
With the massive amounts of data available today, it is common to store and process data using multiple machines. Parallel programming platforms such as MapReduce and its variants are popular frameworks for handling such large data. We present the first provably efficient algorithms to compute, store, and query data structures for range queries and approximate nearest neighbor queries in a popular parallel computing abstraction that captures the salient features of MapReduce and other massively parallel communication (MPC) models. In particular, we describe algorithms for kd-trees, range trees, and BBD-trees that require only O(1) rounds of communication for both preprocessing and querying while staying competitive in terms of running time and workload with their classical counterparts. Our algorithms are randomized, but they can be made deterministic at some increase in their running time and workload while keeping the number of communication rounds constant.
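A minimal single-process simulation of the constant-round pattern, reduced to 1-D range queries for brevity: sampled splitters route points to "machines" in one round, each machine preprocesses its shard locally, and a query consults only the shards whose interval overlaps it. The paper's kd-, range-, and BBD-tree constructions generalize this idea; the code below is an assumption-laden illustration, not the authors' algorithm.

    # Sample-based, one-round partitioning for 1-D range queries.
    import bisect, random

    def build(points, machines):
        sample = sorted(random.sample(points, min(len(points), 4 * machines)))
        splitters = [sample[i * len(sample) // machines]
                     for i in range(1, machines)]
        shards = [[] for _ in range(machines)]
        for p in points:  # one communication round: route each point
            shards[bisect.bisect_right(splitters, p)].append(p)
        return splitters, [sorted(s) for s in shards]  # local preprocessing

    def range_query(splitters, shards, lo, hi):
        first = bisect.bisect_right(splitters, lo)
        last = bisect.bisect_right(splitters, hi)
        out = []
        for i in range(first, last + 1):  # only overlapping shards
            s = shards[i]
            out.extend(s[bisect.bisect_left(s, lo):bisect.bisect_right(s, hi)])
        return out

    splitters, shards = build(list(range(1000)), machines=4)
    print(range_query(splitters, shards, 100, 110))  # points 100..110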
While cloud computing has become an attractive platform for supporting data-intensive applications, a major obstacle to its adoption in sectors such as health and defense is the privacy risk associated with releasing datasets to third parties in the cloud for analysis. A widely adopted technique for data privacy preservation is to anonymize data via local recoding. However, most existing local-recoding techniques are either serial or distributed without directly optimizing scalability, rendering them unsuitable for big data applications. In this paper, we propose a highly scalable approach to local-recoding anonymization in cloud computing, based on Locality Sensitive Hashing (LSH). Specifically, a novel semantic distance metric is presented for use with LSH to measure the similarity between two data records. LSH with the MinHash function family is then employed to divide datasets into multiple partitions for use with MapReduce to parallelize computation while preserving similarity. Using this efficient LSH-based scheme, we anonymize each partition with a recursive agglomerative k-member clustering algorithm. Extensive experiments on real-life datasets show that our approach improves the scalability and time-efficiency of local-recoding anonymization by orders of magnitude over existing approaches.
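A minimal sketch of the LSH partitioning step, treating each record as a set of attribute tokens: records whose MinHash signatures agree on a band land in the same partition, which can then be anonymized independently (e.g., by k-member clustering). The hash construction and band size are illustrative, and Python's built-in hash is only stable within one process.

    # MinHash-banded partitioning so similar records co-locate.
    from collections import defaultdict

    def minhash_signature(tokens, num_hashes=8):
        return tuple(min(hash((seed, t)) for t in tokens)
                     for seed in range(num_hashes))

    def partition_by_lsh(records, band_size=2):
        """Records sharing the first signature band share a partition."""
        partitions = defaultdict(list)
        for rec in records:
            key = minhash_signature(rec)[:band_size]
            partitions[key].append(rec)
        return partitions

    records = [frozenset({"flu", "male", "30s"}),
               frozenset({"flu", "male", "40s"}),
               frozenset({"fracture", "female", "20s"})]
    # Records with higher Jaccard similarity are more likely to share
    # a partition key, so similar records tend to be clustered together.
    for part in partition_by_lsh(records).values():
        print(len(part), "record(s) in one partition")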
Advanced Persistent Threats (APTs), unlike traditional hacking attempts, carry out specific attacks on a specific target to illegally collect information and data from it. These targeted attacks use specially crafted malware and infrequent activity to avoid detection, so that hackers can retain control over target systems unnoticed for long periods of time. In order to detect these stealthy activities, a large volume of traffic data generated over a period of time has to be analyzed. We propose a scalable solution, Ctracer, to detect stealthy command and control channels in large volumes of traffic data. An APT uses multiple command and control (C&C) channels and changes them frequently to avoid detection, but there are common signatures across those C&C sessions. By identifying common network signatures, Ctracer is able to group the C&C sessions, so we can detect an APT and all the C&C sessions used in the attack. Ctracer was evaluated in a large enterprise over four months; twenty C&C servers and three APT attacks were reported. After investigation by the enterprise's Security Operations Center (SOC), the forensic report shows a targeted APT case against the enterprise that had gone undiscovered for over 120 days.
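A minimal sketch of the session-grouping idea: compute a coarse network signature for each session and group sessions that share it, so a C&C channel that hops between servers is still linked together. The signature fields below (destination port, rounded packet-size prefix, beaconing-interval bucket) are hypothetical stand-ins for Ctracer's actual features.

    # Group traffic sessions by a common network signature.
    from collections import defaultdict

    def signature(session):
        sizes = tuple(round(s, -1) for s in session["packet_sizes"][:5])
        interval = round(session["beacon_interval_s"] / 60)  # minute bucket
        return (session["dst_port"], sizes, interval)

    def group_sessions(sessions):
        groups = defaultdict(list)
        for s in sessions:
            groups[signature(s)].append(s["dst_ip"])
        return groups

    sessions = [
        {"dst_ip": "203.0.113.5",  "dst_port": 443,
         "packet_sizes": [312, 64, 312, 64, 900], "beacon_interval_s": 300},
        {"dst_ip": "198.51.100.9", "dst_port": 443,
         "packet_sizes": [310, 60, 314, 62, 898], "beacon_interval_s": 290},
    ]
    # Both sessions map to one signature: likely one C&C channel
    # alternating between two servers.
    print(group_sessions(sessions))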
Security companies have recently realised that mining massive amounts of security data can help generate actionable intelligence and improve their understanding of Internet attacks. In particular, attack attribution and situational understanding are considered critical aspects of effectively dealing with emerging, increasingly sophisticated Internet attacks. This requires highly scalable analysis tools to help analysts classify, correlate and prioritise security events, depending on their likely impact and threat level. However, this security data mining process typically involves a considerable number of features interacting in a non-obvious way, which makes it inherently complex. To deal with this challenge, we introduce MR-TRIAGE, a set of distributed algorithms built on MapReduce that can perform scalable multi-criteria data clustering on large security datasets and identify complex relationships hidden in massive volumes of data. The MR-TRIAGE workflow consists of a scalable data summarisation step followed by scalable graph clustering algorithms into which we integrate multi-criteria evaluation techniques. The theoretical computational complexity of the proposed parallel algorithms is discussed and analysed. The experimental results demonstrate that the algorithms scale well and efficiently process large security datasets on commodity hardware. Our approach can effectively cluster any type of security events (e.g., spam emails, spear-phishing attacks) that share at least some commonalities among a number of predefined features.
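A toy version of the multi-criteria clustering idea: link two events when they agree on at least a threshold number of the predefined features, then take connected components of the resulting graph as event clusters. MR-TRIAGE does this at scale with MapReduce summarisation, graph clustering and multi-criteria aggregation; the features, threshold and union-find shortcut below are invented for illustration.

    # Cluster security events that share enough predefined features.
    from itertools import combinations

    def similar(a, b, features, threshold=2):
        return sum(a[f] == b[f] for f in features) >= threshold

    def cluster(events, features, threshold=2):
        parent = list(range(len(events)))
        def find(i):  # union-find with path halving
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i, j in combinations(range(len(events)), 2):
            if similar(events[i], events[j], features, threshold):
                parent[find(i)] = find(j)
        groups = {}
        for i in range(len(events)):
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())

    features = ["sender_domain", "subject", "attachment_hash"]
    events = [
        {"sender_domain": "evil.example", "subject": "Invoice",
         "attachment_hash": "h1"},
        {"sender_domain": "evil.example", "subject": "Invoice",
         "attachment_hash": "h2"},
        {"sender_domain": "other.example", "subject": "Hi",
         "attachment_hash": "h3"},
    ]
    print(cluster(events, features))  # [[0, 1], [2]]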