Biblio
Data deduplication [3] is able to effectively identify and eliminate redundant data and only maintain a single copy of files and chunks. Hence, it is widely used in cloud storage systems to save the users' network bandwidth for uploading data. However, the occurrence of deduplication can be easily identified by monitoring and analyzing network traffic, which leads to the risk of user privacy leakage. The attacker can carry out a very dangerous side channel attack, i.e., learn-the-remaining-information (LRI) attack, to reveal users' privacy information by exploiting the side channel of network traffic in deduplication [1]. In the LRI attack, the attacker knows a large part of the target file in the cloud and tries to learn the remaining unknown parts via uploading all possible versions of the file's content. For example, the attacker knows all the contents of the target file X except the sensitive information \texttheta. To learn the sensitive information, the attacker needs to upload m files with all possible values of \texttheta, respectively. If a file Xd with the value \textthetad is deduplicated and other files are not, the attacker knows that the information \texttheta = \textthetad. In the threat model of the LRI attack, we consider a general cloud storage service model that includes two entities, i.e., the user and cloud storage server. The attack is launched by the users who aim to steal the privacy information of other users [1]. The attacker can act as a user via its own account or use multiple accounts to disguise as multiple users. The cloud storage server communicates with the users through Internet. The connections from the clients to the cloud storage server are encrypted by SSL or TLS protocol. Hence, the attacker can monitor and measure the amount of network traffic between the client and server but cannot intercept and analyze the contents of the transmitted data due to the encryption. The attacker can then perform the sophisticated traffic analysis with sufficient computing resources. We propose a simple yet effective scheme, called randomized redundant chunk scheme (RRCS), to significantly mitigate the risk of the LRI attack while maintaining the high bandwidth efficiency of deduplication. The basic idea behind RRCS is to add randomized redundant chunks to mix up the real deduplication states of files used for the LRI attack, which effectively obfuscates the view of the attacker, who attempts to exploit the side channel of network traffic for the LRI attack. RRCS includes three key function modules, range generation (RG), secure bounds setting (SBS), and security-irrelevant redundancy elimination (SRE). When uploading the random-number redundant chunks, RRCS first uses RG to generate a fixed range [0,$łambda$N] ($łambda$ $ε$ (0,1]), in which the number of added redundant chunks is randomly chosen, where N is the total number of chunks in a file and $łambda$ is a system parameter. However, the fixed range may cause a security issue. SBS is used to deal with the bounds of the fixed range to avoid the security issue. There may exist security-irrelevant redundant chunks in RRCS. SRE reduces the security-irrelevant redundant chunks to improve the deduplication efficiency. The design details are presented in our technical report [5]. Our security analysis demonstrates RRCS can significantly reduce the risk of the LRI attack [5]. We examine the performance of RRCS using three real-world trace-based datasets, i.e., Fslhomes [2], MacOS [2], and Onefull [4], and compare RRCS with the randomized threshold scheme (RTS) [1]. Our experimental results show that source-based deduplication eliminates 100% data redundancy which however has no security guarantee. File-level (chunk-level) RTS only eliminates 8.1% – 16.8% (9.8% – 20.3%) redundancy, due to only eliminating the redundancy of the files (chunks) that have many copies. RRCS with $łambda$ = 0.5 eliminates 76.1% – 78.0% redundancy and RRCS with $łambda$ = 1 eliminates 47.9% – 53.6% redundancy.
Self-describing key-value data formats such as JSON are becoming increasingly popular as application developers choose to avoid the rigidity imposed by the relational model. Database systems designed for these self-describing formats, such as MongoDB, encourage users to use denormalized, heavily nested data models so that relationships across records and other schema information need not be predefined or standardized. Such data models contribute to long-term development complexity, as their lack of explicit entity and relationship tracking burdens new developers unfamiliar with the dataset. Furthermore, the large amount of data repetition present in such data layouts can introduce update anomalies and poor scan performance, which reduce both the quality and performance of analytics over the data. In this paper we present an algorithm that automatically transforms the denormalized, nested data commonly found in NoSQL systems into traditional relational data that can be stored in a standard RDBMS. This process includes a schema generation algorithm that discovers relationships across the attributes of the denormalized datasets in order to organize those attributes into relational tables. It further includes a matching algorithm that discovers sets of attributes that represent overlapping entities and merges those sets together. These algorithms reduce data repetition, allow the use of data analysis tools targeted at relational data, accelerate scan-intensive algorithms over the data, and help users gain a semantic understanding of complex, nested datasets.
Self-describing key-value data formats such as JSON are becoming increasingly popular as application developers choose to avoid the rigidity imposed by the relational model. Database systems designed for these self-describing formats, such as MongoDB, encourage users to use denormalized, heavily nested data models so that relationships across records and other schema information need not be predefined or standardized. Such data models contribute to long-term development complexity, as their lack of explicit entity and relationship tracking burdens new developers unfamiliar with the dataset. Furthermore, the large amount of data repetition present in such data layouts can introduce update anomalies and poor scan performance, which reduce both the quality and performance of analytics over the data. In this paper we present an algorithm that automatically transforms the denormalized, nested data commonly found in NoSQL systems into traditional relational data that can be stored in a standard RDBMS. This process includes a schema generation algorithm that discovers relationships across the attributes of the denormalized datasets in order to organize those attributes into relational tables. It further includes a matching algorithm that discovers sets of attributes that represent overlapping entities and merges those sets together. These algorithms reduce data repetition, allow the use of data analysis tools targeted at relational data, accelerate scan-intensive algorithms over the data, and help users gain a semantic understanding of complex, nested datasets.
Data deduplication is a technique for eliminating duplicate copies of data, and has been widely used in cloud storage to reduce storage space and upload bandwidth. Promising as it is, an arising challenge is to perform secure deduplication in cloud storage. Although convergent encryption has been extensively adopted for secure deduplication, a critical issue of making convergent encryption practical is to efficiently and reliably manage a huge number of convergent keys. This paper makes the first attempt to formally address the problem of achieving efficient and reliable key management in secure deduplication. We first introduce a baseline approach in which each user holds an independent master key for encrypting the convergent keys and outsourcing them to the cloud. However, such a baseline key management scheme generates an enormous number of keys with the increasing number of users and requires users to dedicatedly protect the master keys. To this end, we propose Dekey , a new construction in which users do not need to manage any keys on their own but instead securely distribute the convergent key shares across multiple servers. Security analysis demonstrates that Dekey is secure in terms of the definitions specified in the proposed security model. As a proof of concept, we implement Dekey using the Ramp secret sharing scheme and demonstrate that Dekey incurs limited overhead in realistic environments.