Biblio

Filters: Keyword is Deduplication
2023-02-17
Islam, Tariqul, Hasan, Kamrul, Singh, Saheb, Park, Joon S.  2022.  A Secure and Decentralized Auditing Scheme for Cloud Ensuring Data Integrity and Fairness in Auditing. 2022 IEEE 9th International Conference on Cyber Security and Cloud Computing (CSCloud)/2022 IEEE 8th International Conference on Edge Computing and Scalable Cloud (EdgeCom). :74–79.
With the advent of cloud storage services, many users tend to store their data in the cloud to save storage cost. However, this has led to many security concerns, one of the most important being ensuring data integrity. Public verification schemes can employ a third-party auditor to perform data auditing on behalf of the user. But most public verification schemes are vulnerable to procrastinating auditors who may not perform auditing on time. These schemes also lack fair arbitration, i.e., a way to punish a malicious Cloud Service Provider (CSP) and compensate users whose data has been corrupted. On the other hand, the CSP might be storing redundant data, which increases the storage cost for the CSP and the computational cost of data auditing for the user. In this paper, we propose a Blockchain-based public auditing and deduplication scheme with a fair arbitration system against procrastinating auditors. The key idea requires auditors to record each verification using a smart contract and store the result in a Blockchain as a transaction. Our scheme can detect and punish procrastinating auditors and compensate users in the case of any data loss. Additionally, our scheme can detect and delete duplicate data, which improves storage utilization and reduces the computational cost of data verification. Experimental evaluation demonstrates that our scheme is provably secure and does not incur overhead compared to existing public auditing techniques, while offering the additional feature of verifying the auditor’s performance.
ISSN: 2693-8928
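To illustrate the core mechanism described in the abstract, recording each verification as a timestamped, hash-chained entry so that a procrastinating auditor can be detected later, here is a minimal Python sketch; the class, method, and field names are illustrative assumptions, not the paper's smart-contract interface.

```python
import hashlib
import json
import time


class AuditLedger:
    """Minimal append-only ledger simulating on-chain audit records (illustration only)."""

    def __init__(self):
        self.blocks = []  # each block holds one audit record plus the previous block's hash

    def record_audit(self, file_id, auditor_id, verified_ok):
        prev_hash = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        record = {
            "file_id": file_id,
            "auditor_id": auditor_id,
            "verified_ok": verified_ok,
            "timestamp": time.time(),
            "prev_hash": prev_hash,
        }
        # Hash-chaining makes later tampering with recorded audit times detectable.
        record["hash"] = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.blocks.append(record)

    def procrastinating(self, file_id, max_interval_seconds):
        """Flag an auditor whose consecutive audits of file_id are spaced too far apart."""
        times = [b["timestamp"] for b in self.blocks if b["file_id"] == file_id]
        return any(t2 - t1 > max_interval_seconds for t1, t2 in zip(times, times[1:]))
```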
2022-06-09
Sujatha, G., Raj, Jeberson Retna.  2021.  Digital Data Identification for Deduplication Process using Cryptographic Hashing Techniques. 2021 International Conference on Intelligent Technologies (CONIT). :1–4.
The cloud storage system is a great boon for organizations and individuals who need storage space to accommodate huge volumes of digital data. Cloud storage can handle various types of digital data such as text, image, video, and audio. Since the storage space can be shared among different users, it is possible to have duplicate copies of data in the storage space. An efficient mechanism is required to identify digital data uniquely in order to check for duplicates. There are various ways in which digital data can be identified; one such technique is hash-based identification. Using cryptographic hashing algorithms, every data item can be uniquely identified, since the collision resistance of these algorithms makes such identification reliable. In this research work, we discuss the advantage of using cryptographic hashing algorithms for digital data identification and compare various hashing algorithms.
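A minimal sketch of hash-based identification as described in the abstract, assuming SHA-256 as the cryptographic hash (the paper compares several algorithms): identical content collapses to a single stored copy.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest that identifies the content."""
    return hashlib.sha256(data).hexdigest()


def deduplicate(files: dict[str, bytes]) -> dict[str, str]:
    """Map each file name to a digest; identical content is stored only once."""
    store: dict[str, bytes] = {}   # digest -> single stored copy
    index: dict[str, str] = {}     # file name -> digest
    for name, data in files.items():
        digest = fingerprint(data)
        store.setdefault(digest, data)  # keep only the first copy of this content
        index[name] = digest
    return index


# Example: two identical reports are stored only once.
index = deduplicate({"a.txt": b"report", "b.txt": b"report", "c.txt": b"notes"})
```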
2019-11-25
Leontiadis, Iraklis, Curtmola, Reza.  2018.  Secure Storage with Replication and Transparent Deduplication. Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy. :13–23.
We seek to answer the following question: to what extent can we deduplicate replicated storage? To answer this question, we design ReDup, a secure storage system that provides users with strong integrity, reliability, and transparency guarantees about data outsourced to cloud storage providers. Users store multiple replicas of their data at different storage servers, and the data at each storage server is deduplicated across users. Remote data integrity mechanisms are used to check the integrity of replicas. We consider a strong adversarial model, in which collusions are allowed between storage servers, and also between storage servers and dishonest users of the system. A cloud storage provider (CSP) could store fewer replicas than agreed upon by contract, unbeknownst to honest users. ReDup defends against such adversaries by making replica generation time-consuming, so that a dishonest CSP cannot generate replicas on the fly when challenged by users. In addition, ReDup employs transparent deduplication, which means that users get a proof attesting to the deduplication level used for their files at each replica server, and are thus able to benefit from the storage savings provided by deduplication. The proof is obtained by aggregating individual proofs from replica servers, and has a constant size regardless of the number of replica servers. Our solution scales better than the state of the art and is provably secure under standard assumptions.
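One way to make replica generation time-consuming, as the abstract requires, is to derive each replica tag through a long sequential hash chain that cannot be shortcut when the CSP is challenged. The sketch below only illustrates that idea under assumed parameters; it is not ReDup's actual construction.

```python
import hashlib


def generate_replica_tag(data: bytes, replica_index: int, iterations: int = 1_000_000) -> bytes:
    """Derive a replica-specific tag by sequential hashing; the work is inherently serial,
    so a CSP challenged at audit time cannot recreate a missing replica on the fly."""
    state = hashlib.sha256(data + replica_index.to_bytes(4, "big")).digest()
    for _ in range(iterations):
        state = hashlib.sha256(state).digest()
    return state
```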
2018-09-12
Armknecht, Frederik, Boyd, Colin, Davies, Gareth T., Gjøsteen, Kristian, Toorani, Mohsen.  2017.  Side Channels in Deduplication: Trade-offs Between Leakage and Efficiency. Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. :266–274.
Deduplication removes redundant copies of files or data blocks stored on the cloud. Client-side deduplication, where the client only uploads the file upon the request of the server, provides major storage and bandwidth savings, but introduces a number of security concerns. Harnik et al. (2010) showed how cross-user client-side deduplication inherently gives the adversary access to a (noisy) side-channel that may divulge whether or not a particular file is stored on the server, leading to leakage of user information. We provide formal definitions for deduplication strategies and their security in terms of adversarial advantage. Using these definitions, we provide a criterion for designing good strategies and then prove a bound characterizing the necessary trade-off between security and efficiency.
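A widely cited strategy in this design space (introduced by Harnik et al.) draws a random per-file threshold and performs client-side deduplication only after that many copies have already been uploaded, trading bandwidth for reduced leakage. The sketch below illustrates that idea in Python; it is not the formal strategy definition from the paper.

```python
import random


class RandomizedThresholdDedup:
    """Per-file random threshold: skip the upload only once the file has already been
    uploaded at least `t` times, so a single probe reveals little (illustration only)."""

    def __init__(self, max_threshold: int = 4):
        self.max_threshold = max_threshold
        self.thresholds: dict[str, int] = {}
        self.copies: dict[str, int] = {}

    def upload_needed(self, file_digest: str) -> bool:
        t = self.thresholds.setdefault(file_digest, random.randint(1, self.max_threshold))
        previous_copies = self.copies.get(file_digest, 0)
        self.copies[file_digest] = previous_copies + 1
        # The server only signals "already stored" (skips the upload) once at least
        # t copies exist, so observing one upload leaks little about the file's presence.
        return previous_copies < t
```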
2018-04-11
Zuo, Pengfei, Hua, Yu, Wang, Cong, Xia, Wen, Cao, Shunde, Zhou, Yukun, Sun, Yuanyuan.  2017.  Mitigating Traffic-Based Side Channel Attacks in Bandwidth-Efficient Cloud Storage. Proceedings of the 2017 Symposium on Cloud Computing. :638–638.

Data deduplication [3] effectively identifies and eliminates redundant data, maintaining only a single copy of files and chunks. Hence, it is widely used in cloud storage systems to save users' network bandwidth for uploading data. However, the occurrence of deduplication can be easily identified by monitoring and analyzing network traffic, which leads to the risk of user privacy leakage. The attacker can carry out a very dangerous side channel attack, the learn-the-remaining-information (LRI) attack, to reveal users' private information by exploiting the side channel of network traffic in deduplication [1]. In the LRI attack, the attacker knows a large part of the target file in the cloud and tries to learn the remaining unknown parts by uploading all possible versions of the file's content. For example, the attacker knows all the contents of the target file X except the sensitive information θ. To learn the sensitive information, the attacker uploads m files, one for each possible value of θ. If the file X_d with the value θ_d is deduplicated and the other files are not, the attacker knows that θ = θ_d. In the threat model of the LRI attack, we consider a general cloud storage service model that includes two entities, the user and the cloud storage server. The attack is launched by users who aim to steal the private information of other users [1]. The attacker can act as a user via its own account or use multiple accounts to disguise as multiple users. The cloud storage server communicates with the users through the Internet. The connections from the clients to the cloud storage server are encrypted by the SSL or TLS protocol. Hence, the attacker can monitor and measure the amount of network traffic between the client and server, but cannot intercept and analyze the contents of the transmitted data due to the encryption. The attacker can then perform sophisticated traffic analysis with sufficient computing resources. We propose a simple yet effective scheme, called the randomized redundant chunk scheme (RRCS), to significantly mitigate the risk of the LRI attack while maintaining the high bandwidth efficiency of deduplication. The basic idea behind RRCS is to add randomized redundant chunks to mix up the real deduplication states of files used for the LRI attack, which effectively obfuscates the view of an attacker attempting to exploit the side channel of network traffic. RRCS includes three key function modules: range generation (RG), secure bounds setting (SBS), and security-irrelevant redundancy elimination (SRE). When uploading the randomized redundant chunks, RRCS first uses RG to generate a fixed range [0, λN] (λ ∈ (0,1]), from which the number of added redundant chunks is randomly chosen, where N is the total number of chunks in a file and λ is a system parameter. However, the fixed range may cause a security issue; SBS adjusts the bounds of the fixed range to avoid it. There may also exist security-irrelevant redundant chunks in RRCS; SRE reduces them to improve deduplication efficiency. The design details are presented in our technical report [5]. Our security analysis demonstrates that RRCS can significantly reduce the risk of the LRI attack [5].
We examine the performance of RRCS using three real-world trace-based datasets, i.e., Fslhomes [2], MacOS [2], and Onefull [4], and compare RRCS with the randomized threshold scheme (RTS) [1]. Our experimental results show that source-based deduplication eliminates 100% of data redundancy but offers no security guarantee. File-level (chunk-level) RTS eliminates only 8.1% – 16.8% (9.8% – 20.3%) of the redundancy, because it only removes the redundancy of files (chunks) that have many copies. RRCS with λ = 0.5 eliminates 76.1% – 78.0% of the redundancy, and RRCS with λ = 1 eliminates 47.9% – 53.6%.
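The range-generation step at the heart of RRCS, drawing the number of added redundant chunks uniformly from [0, λN] and uploading them alongside the genuinely new chunks, can be sketched as follows; the SBS and SRE modules are omitted and the chunk-handling details are assumptions.

```python
import random


def redundant_chunk_count(total_chunks: int, lam: float = 0.5) -> int:
    """Range generation (RG): draw the number of redundant chunks uniformly from [0, lam*N]."""
    return random.randint(0, int(lam * total_chunks))


def chunks_to_upload(duplicate_chunk_ids: list[int], unique_chunk_ids: list[int], lam: float = 0.5) -> list[int]:
    """Upload all unique chunks plus a random subset of already-stored (duplicate) chunks,
    so that the observed traffic no longer reveals the true deduplication state (illustration)."""
    n = len(duplicate_chunk_ids) + len(unique_chunk_ids)
    k = min(redundant_chunk_count(n, lam), len(duplicate_chunk_ids))
    return unique_chunk_ids + random.sample(duplicate_chunk_ids, k)
```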

2017-08-18
DiScala, Michael, Abadi, Daniel J..  2016.  Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data. Proceedings of the 2016 International Conference on Management of Data. :295–310.

Self-describing key-value data formats such as JSON are becoming increasingly popular as application developers choose to avoid the rigidity imposed by the relational model. Database systems designed for these self-describing formats, such as MongoDB, encourage users to use denormalized, heavily nested data models so that relationships across records and other schema information need not be predefined or standardized. Such data models contribute to long-term development complexity, as their lack of explicit entity and relationship tracking burdens new developers unfamiliar with the dataset. Furthermore, the large amount of data repetition present in such data layouts can introduce update anomalies and poor scan performance, which reduce both the quality and performance of analytics over the data. In this paper we present an algorithm that automatically transforms the denormalized, nested data commonly found in NoSQL systems into traditional relational data that can be stored in a standard RDBMS. This process includes a schema generation algorithm that discovers relationships across the attributes of the denormalized datasets in order to organize those attributes into relational tables. It further includes a matching algorithm that discovers sets of attributes that represent overlapping entities and merges those sets together. These algorithms reduce data repetition, allow the use of data analysis tools targeted at relational data, accelerate scan-intensive algorithms over the data, and help users gain a semantic understanding of complex, nested datasets.
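The starting point of such a transformation, flattening nested key-value records into flat attribute paths that can then be grouped into candidate relational tables, can be sketched as follows; this is a generic illustration, not the paper's schema generation or matching algorithm.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten a nested JSON-like record into dotted attribute paths,
    e.g. {"user": {"name": "a"}} -> {"user.name": "a"} (illustration only)."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Rows that share the same set of attribute paths can then be grouped into one candidate table.
row = flatten({"order": {"id": 7, "customer": {"name": "Ada", "city": "Zurich"}}})
```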

2015-05-06
Jin Li, Xiaofeng Chen, Mingqiang Li, Jingwei Li, Lee, P.P.C., Wenjing Lou.  2014.  Secure Deduplication with Efficient and Reliable Convergent Key Management. IEEE Transactions on Parallel and Distributed Systems. 25:1615–1625.

Data deduplication is a technique for eliminating duplicate copies of data, and has been widely used in cloud storage to reduce storage space and upload bandwidth. Promising as it is, an emerging challenge is to perform secure deduplication in cloud storage. Although convergent encryption has been extensively adopted for secure deduplication, a critical issue in making convergent encryption practical is to efficiently and reliably manage a huge number of convergent keys. This paper makes the first attempt to formally address the problem of achieving efficient and reliable key management in secure deduplication. We first introduce a baseline approach in which each user holds an independent master key for encrypting the convergent keys and outsourcing them to the cloud. However, such a baseline key management scheme generates an enormous number of keys as the number of users grows and requires users to dedicatedly protect their master keys. To this end, we propose Dekey, a new construction in which users do not need to manage any keys on their own but instead securely distribute the convergent key shares across multiple servers. Security analysis demonstrates that Dekey is secure in terms of the definitions specified in the proposed security model. As a proof of concept, we implement Dekey using the Ramp secret sharing scheme and demonstrate that Dekey incurs limited overhead in realistic environments.
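Convergent encryption, the primitive whose keys Dekey manages, derives the key from the content itself so that identical plaintexts encrypt to identical ciphertexts and remain deduplicable. A minimal sketch is shown below, assuming SHA-256 key derivation and AES-GCM with a key-derived nonce purely for illustration (it requires the third-party cryptography package, and a production system would need a carefully analyzed deterministic encryption mode); Dekey's Ramp secret sharing of the convergent keys is not reproduced here.

```python
import hashlib

# Third-party dependency: pip install cryptography
from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def convergent_encrypt(data: bytes) -> tuple[bytes, bytes]:
    """Derive the key from the content (convergent encryption), so identical plaintexts
    yield identical ciphertexts and the server can deduplicate them (illustration only)."""
    key = hashlib.sha256(data).digest()                    # convergent key K = H(M)
    nonce = hashlib.sha256(b"nonce" + key).digest()[:12]   # deterministic nonce, illustration only
    ciphertext = AESGCM(key).encrypt(nonce, data, None)
    return key, ciphertext


# In Dekey, the convergent key itself would then be split into shares (e.g. via Ramp secret
# sharing) and distributed across multiple key servers instead of being kept by the user.
key, ct = convergent_encrypt(b"the same plaintext always produces the same ciphertext")
```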