Biblio

Filters: Keyword is Data Sanitization
2021-02-16
Wu, J. M.-T., Srivastava, G., Pirouz, M., Lin, J. C.-W..  2020.  A GA-based Data Sanitization for Hiding Sensitive Information with Multi-Thresholds Constraint. 2020 International Conference on Pervasive Artificial Intelligence (ICPAI). :29–34.
In this work, we propose a new concept of multiple support thresholds to sanitize the database for specific sensitive itemsets. The proposed method assigns a stricter threshold to each sensitive itemset for data sanitization. Furthermore, a genetic-algorithm (GA)-based model is used in the designed algorithm to minimize side effects. In our experiments, the GA-based PPDM approach is compared with the traditional compact GA-based model, and the results clearly show that our proposed method obtains better performance with less computational cost.
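
To make the idea concrete, the following is a minimal Python sketch (not the authors' code) of a GA that chooses transactions to delete so that each sensitive itemset falls below its own, stricter support threshold; the fitness penalizes hiding failures, lost non-sensitive itemsets, and the number of deletions. All weights and parameters are illustrative.

```python
# Illustrative GA-based sanitization sketch: a chromosome is a keep/delete
# mask over transactions; lower fitness is better.
import random

def support(db, itemset):
    return sum(1 for t in db if itemset <= t)

def fitness(mask, db, sensitive, thresholds, nonsensitive, min_sup):
    kept = [t for t, keep in zip(db, mask) if keep]
    # Penalty 1: sensitive itemsets still at/above their (stricter) threshold.
    fail = sum(1 for s, th in zip(sensitive, thresholds) if support(kept, s) >= th)
    # Penalty 2: non-sensitive itemsets accidentally hidden (side effect).
    lost = sum(1 for s in nonsensitive if support(kept, s) < min_sup)
    # Penalty 3: number of deleted transactions (data distortion).
    deleted = len(db) - len(kept)
    return 10 * fail + 5 * lost + deleted

def ga_sanitize(db, sensitive, thresholds, nonsensitive, min_sup,
                pop_size=30, generations=100, mut_rate=0.05):
    score = lambda m: fitness(m, db, sensitive, thresholds, nonsensitive, min_sup)
    pop = [[random.random() > 0.1 for _ in db] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score)
        elite = pop[:pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            cut = random.randrange(len(db))            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(not g) if random.random() < mut_rate else g for g in child]
            children.append(child)
        pop = elite + children
    best = min(pop, key=score)
    return [t for t, keep in zip(db, best) if keep]

db = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"a", "c"}, {"a", "b", "c"}]
clean = ga_sanitize(db, sensitive=[{"a", "b"}], thresholds=[2],
                    nonsensitive=[{"b", "c"}], min_sup=2)
print(clean)
```
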
2021-01-11
Wang, W.-C., Ho, C.-C., Chang, Y.-M., Chang, Y.-H..  2020.  Challenges and Designs for Secure Deletion in Storage Systems. 2020 Indo – Taiwan 2nd International Conference on Computing, Analytics and Networks (Indo-Taiwan ICAN). :181–189.
Data security has risen to be one of the most critical concerns of computer professionals. Tighter legal requirements now exist to protect user data from unauthorized use and to ensure that data records are preserved or erased/sanitized in compliance with the law. To meet these requirements, many secure (data) deletion techniques have been proposed, addressing data security concerns at different system layers. This paper surveys state-of-the-art secure deletion techniques designed to pursue higher efficiency, verifiability, and portability for emerging types of hard disk drives and flash-based solid-state drives. Meanwhile, the pros and cons of implementing secure deletion at different system layers are also discussed, so as to assist in pursuing better secure deletion designs for future storage systems.
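
As a point of reference for the file-system layer the survey covers, here is a hedged sketch of the classic overwrite-then-unlink approach. As the paper discusses, this is only effective on media that overwrite in place (HDDs); an SSD's flash translation layer redirects writes to fresh pages, so the old data may survive.

```python
# Hypothetical file-layer sketch: overwrite a file's contents in place before
# unlinking it. On HDDs this replaces the sectors; on flash SSDs the FTL
# writes to fresh pages, which is one of the challenges the survey examines.
import os

def overwrite_and_delete(path, passes=3):
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))     # replace contents with random bytes
            f.flush()
            os.fsync(f.fileno())          # push the write past the page cache
    os.remove(path)
```
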
2020-11-20
Wang, X., Herwono, I., Cerbo, F. D., Kearney, P., Shackleton, M..  2018.  Enabling Cyber Security Data Sharing for Large-scale Enterprises Using Managed Security Services. 2018 IEEE Conference on Communications and Network Security (CNS). :1–7.
Large enterprises and organizations from both private and public sectors typically outsource a platform solution, as part of Managed Security Services (MSSs), from third-party providers (MSSPs) to monitor and analyze their data containing cyber security information. Sharing such data among these large entities is believed to improve their effectiveness and efficiency at tackling cybercrime, via improved analytics and insights. However, MSS platform customers currently are not able or not willing to share data among themselves for multiple reasons, including privacy and confidentiality concerns, even when they are using the same MSS platform. Therefore, any proposed mechanism or technique to address this challenge needs to ensure that sharing is achieved in a secure and controlled way. In this paper, we propose a new architecture and use-case-driven designs to enable confidential, flexible and collaborative data sharing among such organizations using the same MSS platform. The MSS platform is a complex environment where different stakeholders, including authorized MSSP personnel and customers' own users, have access to the same platform but with different types of rights and tasks. Hence we make every effort to improve the usability of the platform while supporting sharing and keeping the existing rights and tasks intact. As an innovative and pioneering attempt to address the challenge of data sharing in the MSS platform, we hope to encourage further work so that confidential and collaborative sharing eventually happens among MSS platform customers.
2020-07-09
Duan, Huayi, Zheng, Yifeng, Du, Yuefeng, Zhou, Anxin, Wang, Cong, Au, Man Ho.  2019.  Aggregating Crowd Wisdom via Blockchain: A Private, Correct, and Robust Realization. 2019 IEEE International Conference on Pervasive Computing and Communications (PerCom). :1–10.

Crowdsensing, driven by the proliferation of sensor-rich mobile devices, has emerged as a promising data sensing and aggregation paradigm. Despite its usefulness, traditional crowdsensing systems typically rely on a centralized third-party platform for data collection and processing, which leads to concerns like a single point of failure and lack of operational transparency. Such centralization hinders the wide adoption of crowdsensing by wary participants. We therefore explore an alternative design space of building crowdsensing systems atop the emerging decentralized blockchain technology. While enjoying the benefits brought by the public blockchain, we endeavor to achieve a consolidated set of desirable security properties with a proper choreography of the latest techniques and our customized designs. We allow data providers to safely contribute data to the transparent blockchain, with a confidentiality guarantee on individual data and differential privacy on the aggregation result. Meanwhile, we ensure the service correctness of data aggregation and sanitization by delicately employing hardware-assisted transparent enclaves. Furthermore, we maintain the robustness of our system against faulty data providers that submit invalid data with a customized zero-knowledge range proof scheme. The experimental results demonstrate the high efficiency of our designs on both the mobile client and the SGX-enabled server, as well as the reasonable on-chain monetary cost of running our task contract on Ethereum.

Feyisetan, Oluwaseyi, Diethe, Tom, Drake, Thomas.  2019.  Leveraging Hierarchical Representations for Preserving Privacy and Utility in Text. 2019 IEEE International Conference on Data Mining (ICDM). :210–219.

Guaranteeing a certain level of user privacy in an arbitrary piece of text is a challenging issue. However, with this challenge comes the potential of unlocking access to vast data stores for training machine learning models and supporting data-driven decisions. We address this problem through the lens of dx-privacy, a generalization of Differential Privacy to non-Hamming distance metrics. In this work, we explore word representations in Hyperbolic space as a means of preserving privacy in text. We provide a proof satisfying dx-privacy, then define a probability distribution in Hyperbolic space and describe a way to sample from it in high dimensions. Privacy is provided by perturbing vector representations of words in high-dimensional Hyperbolic space to obtain a semantic generalization. We conduct a series of experiments to demonstrate the tradeoff between privacy and utility. Our privacy experiments illustrate protections against an authorship attribution algorithm, while our utility experiments highlight the minimal impact of our perturbations on several downstream machine learning models. Compared to the Euclidean baseline, we observe >20x greater guarantees on expected privacy against comparable worst-case statistics.
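
For intuition, here is a hedged sketch of the Euclidean baseline mechanism the paper compares against (the paper's own contribution operates in Hyperbolic space): a word vector is perturbed with noise whose direction is uniform and whose magnitude is Gamma-distributed, then decoded to the nearest vocabulary word. The tiny 3-dimensional embeddings are invented for illustration.

```python
# Euclidean dx-privacy baseline sketch: perturb a word's embedding, then
# return the nearest vocabulary word (the "semantic generalization").
import numpy as np

rng = np.random.default_rng(0)
vocab = {"cat":   np.array([1.0, 0.1, 0.0]),
         "dog":   np.array([0.9, 0.2, 0.1]),
         "car":   np.array([0.0, 1.0, 0.9]),
         "truck": np.array([0.1, 0.9, 1.0])}

def dx_perturb(word, epsilon):
    v = vocab[word]
    n = v.shape[0]
    direction = rng.normal(size=n)
    direction /= np.linalg.norm(direction)          # uniform on the sphere
    magnitude = rng.gamma(shape=n, scale=1.0 / epsilon)
    noisy = v + magnitude * direction               # dx-private in the L2 metric
    # Decode: nearest neighbour in the vocabulary.
    return min(vocab, key=lambda w: np.linalg.norm(vocab[w] - noisy))

print([dx_perturb("cat", epsilon=5.0) for _ in range(5)])
```
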

Nisha, D., Sivaraman, E., Honnavalli, Prasad B..  2019.  Predicting and Preventing Malware in Machine Learning Model. 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT). :1–7.

Machine learning is a major area of artificial intelligence that enables computers to learn without being explicitly programmed. As machine learning is widely used to make decisions automatically, attackers have a strong incentive to manipulate the predictions generated by machine learning models. In this paper we study the different types of attacks on machine learning models and their countermeasures. Our research found many security threats in various algorithms, such as the K-nearest-neighbors (KNN) classifier, random forest, AdaBoost, support vector machine (SVM), and decision tree; we revisit these existing security threats and examine possible countermeasures during the training and prediction phases of a machine learning model. There are two types of attacks on a machine learning model: causative attacks, which occur during the training phase, and exploratory attacks, which occur during the prediction phase. We also discuss countermeasures for machine learning models: data sanitization, algorithm robustness enhancement, and privacy-preserving techniques.
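
As one concrete instance of the data sanitization countermeasure surveyed here, the sketch below (an assumption-laden toy, not from the paper) drops training points lying unusually far from their class centroid before fitting, a simple defense against causative (poisoning) attacks. The 2-standard-deviation threshold is an arbitrary choice for the sketch.

```python
# Toy sanitization defense against training-set poisoning: remove points far
# from their class centroid before fitting a model.
import numpy as np

def sanitize(X, y, num_std=2.0):
    keep = np.zeros(len(y), dtype=bool)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        centroid = X[idx].mean(axis=0)
        dist = np.linalg.norm(X[idx] - centroid, axis=1)
        keep[idx] = dist <= dist.mean() + num_std * dist.std()
    return X[keep], y[keep]

X = np.vstack([np.random.randn(50, 2), [[8.0, 8.0]]])   # one planted outlier
y = np.array([0] * 51)
X_clean, y_clean = sanitize(X, y)
print(len(y), "->", len(y_clean))   # the outlier is filtered out
```
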

Ashouri, Mohammadreza.  2019.  Detecting Input Sanitization Errors in Scala. 2019 Seventh International Symposium on Computing and Networking Workshops (CANDARW). :313–319.

The Scala programming language combines object-oriented and functional programming in one concise, high-level language, and the language supports static types that help to avoid bugs in complex programs. This paper proposes a dynamic taint analyzer called ScalaTaint for Scala applications. The analyzer traces the propagation of malicious inputs from untrusted sources to sensitive sink methods in programs that can be exploited by adversaries. In this work, we evaluated the accuracy of ScalaTaint with a security benchmark suite including 7 projects in Scala. As a result, our analyzer could report 49 vulnerabilities within 753,372 lines of code. Moreover, our performance measurement on ScalaBench shows a 67% runtime overhead, which demonstrates the usefulness and efficiency of our technique in comparison with similar tools.
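
ScalaTaint instruments Scala/JVM programs; the following Python toy only illustrates the source/propagation/sink model that such a dynamic taint analyzer implements.

```python
# Toy dynamic taint tracking: untrusted input is flagged at the source, the
# flag is propagated through string operations, and a sensitive sink refuses
# tainted data.
class Tainted(str):
    """A string flagged as coming from an untrusted source."""

def source(user_input):                 # taint source
    return Tainted(user_input)

def concat(a, b):                       # propagation: taint is sticky
    result = a + b                      # plain concatenation drops the flag,
    if isinstance(a, Tainted) or isinstance(b, Tainted):
        return Tainted(result)          # so we re-wrap when either side is tainted
    return result

def run_query(sql):                     # sensitive sink
    if isinstance(sql, Tainted):
        raise ValueError("tainted data reached a sensitive sink: " + sql)
    print("executing:", sql)

name = source("alice'; DROP TABLE users;--")
try:
    run_query(concat("SELECT * FROM users WHERE name = '", concat(name, "'")))
except ValueError as finding:
    print("vulnerability report:", finding)
```
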

Kassem, Ali, Ács, Gergely, Castelluccia, Claude, Palamidessi, Catuscia.  2019.  Differential Inference Testing: A Practical Approach to Evaluate Sanitizations of Datasets. 2019 IEEE Security and Privacy Workshops (SPW). :72–79.

In order to protect individuals' privacy, data have to be "well-sanitized" before sharing them, i.e. one has to remove any personal information before sharing data. However, it is not always clear when data shall be deemed well-sanitized. In this paper, we argue that the evaluation of sanitized data should be based on whether the data allows the inference of sensitive information that is specific to an individual, instead of being centered around the concept of re-identification. We propose a framework to evaluate the effectiveness of different sanitization techniques on a given dataset by measuring how much an individual's record from the sanitized dataset influences the inference of his/her own sensitive attribute. Our intent is not to accurately predict any sensitive attribute but rather to measure the impact of a single record on the inference of sensitive information. We demonstrate our approach by sanitizing two real datasets in different privacy models and evaluate/compare each sanitized dataset in our framework.
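
A hedged sketch of the framework's core measurement: train the same attribute-inference classifier on the sanitized data with and without the target's record, and compare the inferred probability of their sensitive attribute. The data and model below are placeholders, not the authors' setup.

```python
# Differential inference testing sketch: how much does one individual's
# sanitized record shift the inference of their own sensitive attribute?
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                  # sanitized quasi-identifiers
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)  # sensitive attribute

def inference_influence(X, y, i):
    target_x, target_y = X[i:i + 1], y[i]
    with_i = LogisticRegression().fit(X, y)
    mask = np.arange(len(y)) != i
    without_i = LogisticRegression().fit(X[mask], y[mask])
    p_with = with_i.predict_proba(target_x)[0, target_y]
    p_without = without_i.predict_proba(target_x)[0, target_y]
    return p_with - p_without       # large gap => the record leaks about itself

print(max(inference_influence(X, y, i) for i in range(len(y))))
```
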

Fahrenkrog-Petersen, Stephan A., van der Aa, Han, Weidlich, Matthias.  2019.  PRETSA: Event Log Sanitization for Privacy-aware Process Discovery. 2019 International Conference on Process Mining (ICPM). :1–8.

Event logs that originate from information systems enable comprehensive analysis of business processes, e.g., by process model discovery. However, logs potentially contain sensitive information about individual employees involved in process execution that is only partially hidden by an obfuscation of the event data. In this paper, we therefore address the risk of privacy-disclosure attacks on event logs with pseudonymized employee information. To this end, we introduce PRETSA, a novel algorithm for event log sanitization that provides privacy guarantees in terms of k-anonymity and t-closeness. It thereby avoids disclosure of employee identities, their membership in the event log, and their characterization based on sensitive attributes, such as performance information. Through step-wise transformations of a prefix-tree representation of an event log, we maintain its high utility for discovery of a performance-annotated process model. Experiments with real-world data demonstrate that sanitization with PRETSA yields event logs of higher utility compared to methods that exploit frequency-based filtering, while providing the same privacy guarantees.
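
The sketch below illustrates only the prefix-tree view with a k-anonymity check, in simplified form: any trace prefix supported by fewer than k cases is truncated. PRETSA itself instead merges such traces into similar ones and additionally enforces t-closeness on sensitive attributes.

```python
# Simplified prefix-based k-anonymity for an event log: count the support of
# every trace prefix, then keep each trace's longest k-anonymous prefix.
from collections import defaultdict

def sanitize_log(traces, k):
    counts = defaultdict(int)
    for trace in traces:                       # support of every prefix
        for i in range(1, len(trace) + 1):
            counts[tuple(trace[:i])] += 1
    sanitized = []
    for trace in traces:                       # keep the longest k-anonymous prefix
        kept = []
        for i in range(1, len(trace) + 1):
            if counts[tuple(trace[:i])] < k:
                break
            kept = trace[:i]
        sanitized.append(kept)
    return sanitized

log = [["register", "check", "approve"],
       ["register", "check", "approve"],
       ["register", "check", "reject"]]
print(sanitize_log(log, k=2))
# -> the rare "reject" trace is truncated to ['register', 'check']
```
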

Liu, Chuanyi, Han, Peiyi, Dong, Yingfei, Pan, Hezhong, Duan, Shaoming, Fang, Binxing.  2019.  CloudDLP: Transparent and Automatic Data Sanitization for Browser-Based Cloud Storage. 2019 28th International Conference on Computer Communication and Networks (ICCCN). :1–8.

Because cloud storage services have been broadly used in enterprises for online sharing and collaboration, sensitive information in images or documents may easily be leaked outside the trusted enterprise premises through such cloud services. Existing solutions to this problem have not fully explored the tradeoffs among application performance, service scalability, and user data privacy. Therefore, we propose CloudDLP, a generic approach for enterprises to automatically sanitize sensitive data in images and documents in browser-based cloud storage. To the best of our knowledge, CloudDLP is the first system that automatically and transparently detects and sanitizes both sensitive images and textual documents without compromising user experience or application functionality on browser-based cloud storage. To prevent sensitive information from escaping the premises, CloudDLP utilizes deep learning methods to detect sensitive information in both images and textual documents. We have evaluated the proposed method on a number of typical cloud applications. Our experimental results show that it can achieve transparent and automatic data sanitization on cloud storage services with relatively low overheads, while preserving most application functionality.
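
The following sketch shows the interception idea only; simple regular expressions stand in for the deep-learning detectors the paper actually uses, and the patterns are illustrative.

```python
# Hedged DLP sketch: sanitize a document before it leaves the premises for
# browser-based cloud storage. Regex matching stands in for learned detectors.
import re

PATTERNS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def sanitize_before_upload(text):
    for label, pattern in PATTERNS.items():
        text = pattern.sub("[REDACTED-" + label.upper() + "]", text)
    return text

doc = "Contact jane@example.com, SSN 123-45-6789, about the Q3 report."
print(sanitize_before_upload(doc))
```
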

Wang, Wei-Chen, Lin, Ping-Hsien, Li, Yung-Chun, Ho, Chien-Chung, Chang, Yu-Ming, Chang, Yuan-Hao.  2019.  Toward Instantaneous Sanitization through Disturbance-induced Errors and Recycling Programming over 3D Flash Memory. 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). :1–8.

As data security has become one of the most crucial issues in modern storage system/application designs, data sanitization techniques are regarded as a promising solution on 3D NAND flash-memory-based devices. Many excellent works have been proposed to exploit in-place reprogramming, erasure and encryption techniques to achieve and implement sanitization functionalities. However, existing sanitization approaches can incur performance and disturbance overheads, or even risk being deciphered. Different from existing works, this work aims at exploring an instantaneous data sanitization scheme by taking advantage of programming disturbance properties. Our proposed design can not only achieve instantaneous data sanitization by exploiting programming disturbance and error correction code properly, but also enhance performance with the recycling programming design. The feasibility and capability of our proposed design are evaluated by a series of experiments on 3D NAND flash memory chips, with very encouraging results. The experimental results show that the proposed design achieves instantaneous data sanitization with low overhead; besides, it improves the average response time and reduces the block erase count by up to 86.8% and 88.8%, respectively.

2019-12-16
Buenrostro, Issac, Tiwari, Abhishek, Rajamani, Vasanth, Pattuk, Erman, Chen, Zhixiong.  2018.  Single-Setup Privacy Enforcement for Heterogeneous Data Ecosystems. Proceedings of the 27th ACM International Conference on Information and Knowledge Management. :1943–1946.
Strong member privacy in technology enterprises involves, among other objectives, deleting or anonymizing various kinds of data that a company controls. Those requirements are complicated in a heterogeneous data ecosystem where data is stored in multiple stores with different semantics: different indexing or update capabilities require processes specific to a store or even a schema. In this demo we showcase a method to enforce record controls on arbitrary data stores via a three-step process: generate an offline snapshot, run a policy mechanism to select rows to delete/update, and apply the changes to the original store. The first and third steps work on any store by leveraging Apache Gobblin, an open source data integration framework. The policy computation step runs as a batch Gobblin job where each table can be customized via a dataset metadata tracking system and SQL expressions providing table-specific business logic. This setup allows enforcement of highly customizable privacy requirements in a variety of systems, from hosted databases to third-party data storage systems.
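
A miniature of the three-step flow using sqlite3 as a stand-in store (Apache Gobblin itself is a Java framework; no Gobblin API is shown here, and the table and policy are invented for illustration):

```python
# Snapshot -> policy -> apply, on a toy table.
import sqlite3

store = sqlite3.connect(":memory:")
store.execute("CREATE TABLE members (id INTEGER, email TEXT, deleted_at TEXT)")
store.executemany("INSERT INTO members VALUES (?, ?, ?)",
                  [(1, "a@x.com", None), (2, "b@x.com", "2018-01-01")])

# Step 1: generate an offline snapshot of the store.
snapshot = store.execute("SELECT id, email, deleted_at FROM members").fetchall()

# Step 2: a table-specific policy selects the rows to anonymize.
to_anonymize = [row[0] for row in snapshot if row[2] is not None]

# Step 3: apply the changes back to the original store.
store.executemany("UPDATE members SET email = 'anonymized' WHERE id = ?",
                  [(i,) for i in to_anonymize])
print(store.execute("SELECT * FROM members").fetchall())
```
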
Sun, Lin, Zhang, Lan, Ye, Xiaojun.  2018.  Randomized Bit Vector: Privacy-Preserving Encoding Mechanism. Proceedings of the 27th ACM International Conference on Information and Knowledge Management. :1263–1272.
Recently, many methods have been proposed to prevent privacy leakage in record linkage by encoding record pair data into another anonymous space. Nevertheless, they cannot perform well in some circumstances due to high computational complexity, low privacy guarantees or loss of data utility. In this paper, we propose distance-aware encoding mechanisms to compare numerical values in the anonymous space. We first embed numerical values into Hamming space by a low-computational encoding algorithm with randomized bit vectors. To provide rigorous privacy guarantees, we use randomized response based on differential privacy to keep global indistinguishability of the original data and use Laplace noise via the pufferfish mechanism to provide local indistinguishability. Besides, we provide an approach for selecting embedding and privacy-related parameters to improve data utility. Experiments on datasets from different data distributions and application contexts validate that our approaches can be used efficiently in privacy-preserving record linkage tasks compared with previous works and have excellent performance even under very small privacy budgets.
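
A hedged reconstruction of the distance-aware encoding idea: bit i of a value's encoding is 1 iff the value lies within distance d of a shared random reference point, so the Hamming distance between two encodings grows with their numeric distance, and randomized response then flips each bit. The parameters below are illustrative, and the pufferfish/Laplace component is omitted.

```python
# Randomized bit vector sketch: distance-aware encoding plus randomized
# response on each bit.
import random

random.seed(0)
REFS = [random.uniform(0, 100) for _ in range(256)]   # shared random references

def encode(value, d=10.0, flip_prob=0.1):
    bits = [abs(value - r) <= d for r in REFS]
    # Randomized response: flip each bit with probability flip_prob.
    return [(not b) if random.random() < flip_prob else b for b in bits]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

print(hamming(encode(30), encode(32)))   # close values -> small distance
print(hamming(encode(30), encode(80)))   # far values   -> large distance
```
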
Ding, Xiaofeng, Zhang, Xiaodong, Bao, Zhifeng, Jin, Hai.  2018.  Privacy-Preserving Triangle Counting in Large Graphs. Proceedings of the 27th ACM International Conference on Information and Knowledge Management. :1283–1292.
Triangle count is a critical parameter in mining relationships among people in social networks. However, directly publishing the findings obtained from triangle counts may bring potential privacy concerns, which raises great challenges and opportunities for privacy-preserving triangle counting. In this paper, we choose to use differential privacy to protect triangle counting for large scale graphs. To reduce the large sensitivity caused in large graphs, we propose a novel graph projection method that can be used to obtain an upper bound for sensitivity in different distributions. In particular, we publish the triangle counts satisfying node-differential privacy with two kinds of histograms: the triangle count distribution and the cumulative distribution. Moreover, we extend the research on privacy-preserving triangle counting to one of its applications, the local clustering coefficient. Experimental results show that the cumulative distribution can fit the real statistical information better, and our proposed mechanism achieves better accuracy for triangle counts while maintaining the requirement of differential privacy.
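
For intuition, here is a simplified, hedged version of the recipe under edge-level (not the paper's node-level) differential privacy: project the graph to a maximum degree D so the triangle count's sensitivity is bounded, then add Laplace noise. The truncation rule below is arbitrary; the paper's projection is more careful about utility.

```python
# DP triangle counting sketch: degree projection bounds sensitivity, then
# Laplace noise (sampled as a difference of exponentials) is added.
import random
from itertools import combinations

def project(adj, D):
    """Keep at most D neighbours per node, then re-symmetrize."""
    out = {u: set(sorted(nbrs)[:D]) for u, nbrs in adj.items()}
    return {u: {v for v in nbrs if u in out.get(v, set())} for u, nbrs in out.items()}

def triangles(adj):
    return sum(1 for u in adj for v, w in combinations(sorted(adj[u]), 2)
               if v > u and w in adj[v])

def dp_triangle_count(adj, D, epsilon):
    proj = project(adj, D)
    sensitivity = D - 1          # one edge change affects at most D-1 triangles
    noise = random.expovariate(epsilon / sensitivity) - \
            random.expovariate(epsilon / sensitivity)   # Laplace(0, sens/eps)
    return triangles(proj) + noise

adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
print(dp_triangle_count(adj, D=3, epsilon=1.0))   # true count is 2
```
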
Duck, Gregory J., Yap, Roland H. C..  2018.  EffectiveSan: Type and Memory Error Detection Using Dynamically Typed C/C++. Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. :181–195.
Low-level programming languages with weak/static type systems, such as C and C++, are vulnerable to errors relating to the misuse of memory at runtime, such as (sub-)object bounds overflows, (re)use-after-free, and type confusion. Such errors account for many security and other undefined behavior bugs for programs written in these languages. In this paper, we introduce the notion of dynamically typed C/C++, which aims to detect such errors by dynamically checking the "effective type" of each object before use at runtime. We also present an implementation of dynamically typed C/C++ in the form of the Effective Type Sanitizer (EffectiveSan). EffectiveSan enforces type and memory safety using a combination of low-fat pointers, type metadata and type/bounds check instrumentation. We evaluate EffectiveSan against the SPEC2006 benchmark suite and the Firefox web browser, and detect several new type and memory errors. We also show that EffectiveSan achieves high compatibility and reasonable overheads for the given error coverage. Finally, we highlight that EffectiveSan is one of only a few tools that can detect sub-object bounds errors, and uses a novel approach (dynamic type checking) to do so.
Lin, Jerry Chun-Wei, Zhang, Yuyu, Chen, Chun-Hao, Wu, Jimmy Ming-Tai, Chen, Chien-Ming, Hong, Tzung-Pei.  2018.  A Multiple Objective PSO-Based Approach for Data Sanitization. 2018 Conference on Technologies and Applications of Artificial Intelligence (TAAI). :148–151.
In this paper, a multi-objective particle swarm optimization (MOPSO)-based framework is presented to find multiple solutions rather than a single one. The presented grid-based algorithm is used to assign the probability of the non-dominated solutions for the next iteration. Based on the designed algorithm, it is unnecessary to pre-define the weights of the side effects for evaluation; instead, the non-dominated solutions can be discovered as alternative ways for data sanitization. Extensive experiments are carried out on two datasets to show that the designed grid-based algorithm achieves better performance than traditional single-objective evolutionary algorithms.
Wu, Jimmy Ming-Tai, Chun-Wei Lin, Jerry, Djenouri, Youcef, Fournier-Viger, Philippe, Zhang, Yuyu.  2019.  A Swarm-based Data Sanitization Algorithm in Privacy-Preserving Data Mining. 2019 IEEE Congress on Evolutionary Computation (CEC). :1461–1467.
In recent decades, privacy-preserving data mining (PPDM), which not only hides sensitive information but also provides information that is useful for making decisions, has become a critical concern. We present a sanitization algorithm that considers four side effects, based on multi-objective PSO and hierarchical clustering methods, to find optimized solutions for PPDM. Experiments showed that compared to existing approaches, the designed sanitization algorithm based on the hierarchical clustering method achieves satisfactory performance in terms of hiding failure, missing cost, and artificial cost.
Lin, Ping-Hsien, Chang, Yu-Ming, Li, Yung-Chun, Wang, Wei-Chen, Ho, Chien-Chung, Chang, Yuan-Hao.  2018.  Achieving Fast Sanitization with Zero Live Data Copy for MLC Flash Memory. 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). :1–8.
As data security has become the major concern in modern storage systems with low-cost multi-level-cell (MLC) flash memories, it is not trivial to realize data sanitization in such a system. Even though some existing works employ encryption or the built-in erase to achieve this requirement, they still suffer from the risk of being deciphered or from performance degradation. In contrast to existing work, a fast sanitization scheme is proposed to provide the highest degree of security for data sanitization; that is, every old version of data can be immediately sanitized with zero live-data-copy overhead once the new version of data is created/written. In particular, this scheme further considers the reliability issue of MLC flash memories; the proposed scheme includes a one-shot sanitization design to minimize the disturbance during data sanitization. The feasibility and the capability of the proposed scheme were evaluated through extensive experiments based on real flash chips. The results demonstrate that this scheme can achieve data sanitization with zero live-data-copy, with a performance overhead of less than 1%.
Zhu, Yan, Yang, Shuai, Chu, William Cheng-Chung, Feng, Rongquan.  2019.  FlashGhost: Data Sanitization with Privacy Protection Based on Frequent Colliding Hash Table. 2019 IEEE International Conference on Services Computing (SCC). :90–99.

Today's extensive use of the Internet creates huge volumes of data on both the client and server sides. Normally, users do not want to store all of this data locally or keep archives on the server. Some unwanted data, such as trash, caches and private data, needs to be deleted periodically. Explicit deletion can be applied to local data, though it is a troublesome job. Worse, there is no transparency for users regarding personal data stored on the server: we have no knowledge of whether it is cached, copied or archived by third parties, or sold by the service provider. Our research seeks to provide an automatic data sanitization system that makes data self-destructing. Specifically, we give data a life cycle: it is erased automatically at the end of its life, and the destroyed data cannot be recovered by any effort. In this paper, we present FlashGhost, a system that meets this challenge through a novel integration of cryptographic techniques with a frequent colliding hash table. In this system, data will be unreadable and rendered unrecoverable by overwriting it multiple times after its validity period has expired. Besides, system reliability is enhanced by threshold cryptography. We also present a mathematical model and verify it through a number of experiments, which demonstrate theoretically and experimentally that our system is practical to use and meets the data auto-sanitization goal described above.
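
A dependency-free sketch of the self-destruction mechanism, under stated simplifications: FlashGhost uses k-of-n threshold cryptography and a frequent colliding hash table whose entries decay naturally, whereas this toy uses n-of-n XOR key sharing and an explicit expiry time.

```python
# Self-destructing data sketch: encrypt with a random key, split the key into
# shares with a time-to-live, and let expiry make the ciphertext unrecoverable.
import hashlib, os, time

def keystream_xor(key, data):                 # toy stream cipher via SHA-256
    out, counter = bytearray(), 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

def split_key(key, n=3):                      # n-of-n XOR secret sharing
    shares = [os.urandom(len(key)) for _ in range(n - 1)]
    last = key
    for s in shares:
        last = bytes(a ^ b for a, b in zip(last, s))
    return shares + [last]                    # XOR of all shares == key

share_table = {}                              # share_id -> (share, expiry)

def store(data, lifetime_s):
    key = os.urandom(32)
    for i, share in enumerate(split_key(key)):
        share_table[("doc", i)] = (share, time.time() + lifetime_s)
    return keystream_xor(key, data)           # only the ciphertext persists

def recover(ciphertext):
    key = bytes(32)
    for i in range(3):
        share, expiry = share_table[("doc", i)]
        if time.time() > expiry:
            raise KeyError("share expired: data has self-destructed")
        key = bytes(a ^ b for a, b in zip(key, share))
    return keystream_xor(key, ciphertext)

ct = store(b"ephemeral secret", lifetime_s=1)
print(recover(ct))                            # readable within the lifetime
time.sleep(1.1)
try:
    recover(ct)
except KeyError as e:
    print(e)
```
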

2019-11-04
Gunawan, Dedi, Mambo, Masahiro.  2018.  Set-valued Data Anonymization Maintaining Data Utility and Data Property. Proceedings of the 12th International Conference on Ubiquitous Information Management and Communication. :88:1–88:8.

Set-valued database publication has been attracting much attention due to its benefits for various applications like recommendation systems and marketing analysis. However, publishing an original database directly is risky, since an unauthorized party may violate individual privacy by associating and analyzing relations between individuals and sets of items in the published database, which is known as an identity linkage attack. Generally, an attack is performed based on the attacker's background knowledge obtained by prior investigation, and such adversary knowledge should be taken into account in the data anonymization. Various data anonymization schemes have been proposed to prevent the identity linkage attack. However, in existing schemes, either data utility or data properties are reduced considerably by excessive database modification, and consequently data recipients come to distrust the released database. In this paper, we propose a new data anonymization scheme, called sibling suppression, which incurs minimal loss of data utility and maintains data properties like database size and the number of records. The scheme uses multiple sets of adversary knowledge, and items in a category of adversary knowledge are replaced by other items in the same category. Several experiments with real datasets show that our method can preserve data utility with minimal loss and maintain data properties identical to the original database.
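
A minimal sketch of the sibling-suppression step, assuming a hypothetical category map: items covered by the adversary's knowledge are replaced by sibling items from the same category, so database size and per-record lengths are preserved.

```python
# Sibling suppression sketch: swap adversary-known items for same-category
# siblings. The category map and records are invented for illustration.
import random

CATEGORIES = {"cola": "soft_drink", "lemonade": "soft_drink",
              "beer": "alcohol", "wine": "alcohol"}

def siblings(item):
    cat = CATEGORIES[item]
    return [i for i, c in CATEGORIES.items() if c == cat and i != item]

def sanitize(records, adversary_items):
    return [[random.choice(siblings(it)) if it in adversary_items else it
             for it in rec]
            for rec in records]

db = [["cola", "beer"], ["lemonade", "wine"], ["cola", "wine"]]
print(sanitize(db, adversary_items={"cola", "wine"}))
```
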

2019-01-31
Mohammady, Meisam, Wang, Lingyu, Hong, Yuan, Louafi, Habib, Pourzandi, Makan, Debbabi, Mourad.  2018.  Preserving Both Privacy and Utility in Network Trace Anonymization. Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. :459–474.

As network security monitoring grows more sophisticated, there is an increasing need for outsourcing such tasks to third-party analysts. However, organizations are usually reluctant to share their network traces due to privacy concerns over sensitive information, e.g., network and system configuration, which may potentially be exploited for attacks. In cases where data owners are convinced to share their network traces, the data are typically subjected to certain anonymization techniques, e.g., CryptoPAn, which replaces real IP addresses with prefix-preserving pseudonyms. However, most such techniques either are vulnerable to adversaries with prior knowledge about some network flows in the traces, or require heavy data sanitization or perturbation, both of which may result in a significant loss of data utility. In this paper, we aim to preserve both privacy and utility through shifting the trade-off from between privacy and utility to between privacy and computational cost. The key idea is for the analysts to generate and analyze multiple anonymized views of the original network traces; those views are designed to be sufficiently indistinguishable even to adversaries armed with prior knowledge, which preserves the privacy, whereas one of the views will yield true analysis results privately retrieved by the data owner, which preserves the utility. We formally analyze the privacy of our solution and experimentally evaluate it using real network traces provided by a major ISP. The results show that our approach can significantly reduce the level of information leakage (e.g., less than 1% of the information leaked by CryptoPAn) with comparable utility.
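
For background, the following toy mimics the prefix-preserving pseudonymization (in the style of CryptoPAn) that the paper's multi-view scheme builds on: output bit i is the input bit XORed with a keyed-hash bit of the preceding input prefix, so addresses sharing a prefix map to pseudonyms sharing a prefix of the same length. CryptoPAn uses AES; HMAC-SHA256 is substituted here for brevity.

```python
# Toy prefix-preserving IP pseudonymization: the flip decision for bit i
# depends only on the first i bits, which preserves shared prefixes.
import hmac, hashlib, ipaddress

def prefix_preserving(ip, key):
    bits = format(int(ipaddress.ip_address(ip)), "032b")
    out = ""
    for i in range(32):
        h = hmac.new(key, bits[:i].encode(), hashlib.sha256).digest()
        out += str(int(bits[i]) ^ (h[0] & 1))     # flip decided by the prefix
    return str(ipaddress.ip_address(int(out, 2)))

key = b"secret-key"
print(prefix_preserving("192.168.1.10", key))
print(prefix_preserving("192.168.1.77", key))    # shares the /24 prefix
print(prefix_preserving("10.0.0.1", key))
```
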

2018-09-12
Armknecht, Frederik, Boyd, Colin, Davies, Gareth T., Gjøsteen, Kristian, Toorani, Mohsen.  2017.  Side Channels in Deduplication: Trade-offs Between Leakage and Efficiency. Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. :266–274.
Deduplication removes redundant copies of files or data blocks stored on the cloud. Client-side deduplication, where the client only uploads the file upon the request of the server, provides major storage and bandwidth savings, but introduces a number of security concerns. Harnik et al. (2010) showed how cross-user client-side deduplication inherently gives the adversary access to a (noisy) side-channel that may divulge whether or not a particular file is stored on the server, leading to leakage of user information. We provide formal definitions for deduplication strategies and their security in terms of adversarial advantage. Using these definitions, we provide a criterion for designing good strategies and then prove a bound characterizing the necessary trade-off between security and efficiency.
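
One simple strategy in the design space the paper formalizes can be sketched as follows (parameters illustrative): the server draws a secret per-file random threshold and keeps requesting uploads until that many copies have been sent, so the absence of an upload request no longer deterministically reveals that the file already exists server-side.

```python
# Random-threshold deduplication strategy sketch: trade a little upload
# bandwidth for a noisier side channel.
import random

class DedupServer:
    def __init__(self, max_threshold=5):
        self.max_threshold = max_threshold
        self.uploads = {}       # file_hash -> times uploaded so far
        self.thresholds = {}    # file_hash -> secret per-file threshold

    def should_upload(self, file_hash):
        t = self.thresholds.setdefault(file_hash,
                                       random.randint(1, self.max_threshold))
        count = self.uploads.get(file_hash, 0)
        if count < t:
            self.uploads[file_hash] = count + 1   # client must send the data
            return True
        return False                              # safe to deduplicate now

server = DedupServer()
print([server.should_upload("h1") for _ in range(6)])
```
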
Doan, Khue, Quang, Minh Nguyen, Le, Bac.  2017.  Applied Cuckoo Algorithm for Association Rule Hiding Problem. Proceedings of the Eighth International Symposium on Information and Communication Technology. :26–33.
Nowadays, the database security problem is attracting significant interest in the data mining field: how can one exploit legitimate data while avoiding the disclosure of sensitive information? There have been many approaches to this goal, among which the outstanding one is privacy preservation in association rule mining to hide sensitive rules. In recent years, meta-heuristic algorithms have become effective for this goal, a notable example being the cuckoo optimization algorithm for association rule hiding (COA4ARH). In this paper, an improvement of COA4ARH to minimize the side effect of missing non-sensitive rules is introduced. The main contribution of this study is a new pre-processing stage that determines the minimum number of transactions necessary for initializing an initial habitat, thus restricting modification operations on the original data. To evaluate the effectiveness of the proposed method, we conducted several experiments on real datasets. The experimental results show that the improved approach achieves higher performance compared to the original algorithm.
Nagaratna, M., Sowmya, Y..  2017.  M-sanit: Computing misusability score and effective sanitization of big data using Amazon elastic MapReduce. 2017 International Conference on Computation of Power, Energy Information and Commuincation (ICCPEIC). :029–035.
The advent of distributed programming frameworks like Hadoop paved the way for processing voluminous data known as big data. Due to the exponential growth of data, enterprises have started to exploit the availability of cloud infrastructure for storing and processing big data. Insider attacks on outsourced data cause leakage of sensitive data. Therefore, it is essential to sanitize data so as to preserve privacy, i.e., the non-disclosure of sensitive data. Privacy Preserving Data Publishing (PPDP) and Privacy Preserving Data Mining (PPDM) are the areas in which data sanitization plays a vital role in preserving privacy. The existing anonymization techniques for MapReduce programming can be improved with a misusability measure for determining the level of sanitization to be applied to big data. To overcome this limitation we propose a framework known as M-Sanit, which has mechanisms to exploit the misusability score of big data prior to performing sanitization using the MapReduce programming paradigm. Our empirical study using a real-world cloud ecosystem, namely Amazon Elastic Compute Cloud (EC2) and Amazon Elastic MapReduce (EMR), reveals the effectiveness of misusability-score-based sanitization of big data prior to publishing or mining it.
2018-06-07
Wu, Xi, Li, Fengan, Kumar, Arun, Chaudhuri, Kamalika, Jha, Somesh, Naughton, Jeffrey.  2017.  Bolt-on Differential Privacy for Scalable Stochastic Gradient Descent-based Analytics. Proceedings of the 2017 ACM International Conference on Management of Data. :1307–1322.

While significant progress has been made separately on analytics systems for scalable stochastic gradient descent (SGD) and private SGD, none of the major scalable analytics frameworks have incorporated differentially private SGD. There are two inter-related issues for this disconnect between research and practice: (1) low model accuracy due to added noise to guarantee privacy, and (2) high development and runtime overhead of the private algorithms. This paper takes a first step to remedy this disconnect and proposes a private SGD algorithm to address both issues in an integrated manner. In contrast to the white-box approach adopted by previous work, we revisit and use the classical technique of output perturbation to devise a novel “bolt-on” approach to private SGD. While our approach trivially addresses (2), it makes (1) even more challenging. We address this challenge by providing a novel analysis of the L2-sensitivity of SGD, which allows, under the same privacy guarantees, better convergence of SGD when only a constant number of passes can be made over the data. We integrate our algorithm, as well as other state-of-the-art differentially private SGD, into Bismarck, a popular scalable SGD-based analytics system on top of an RDBMS. Extensive experiments show that our algorithm can be easily integrated, incurs virtually no overhead, scales well, and most importantly, yields substantially better (up to 4X) test accuracy than the state-of-the-art algorithms on many real datasets.
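
A hedged sketch of the bolt-on recipe: run ordinary SGD for L2-regularized logistic regression, then perturb the output with noise calibrated to an L2-sensitivity bound. The bound 2/(n·λ) used below is the classical output-perturbation bound for a 1-Lipschitz loss with normalized features; the paper derives tighter SGD-specific bounds.

```python
# Bolt-on output perturbation sketch: non-private SGD first, noise added once
# at the end, so no per-step changes to the training loop are needed.
import numpy as np

rng = np.random.default_rng(0)

def private_sgd_logreg(X, y, lam=0.1, lr=0.1, epochs=5, epsilon=1.0):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w)
            grad = -y[i] * X[i] / (1 + np.exp(margin)) + lam * w
            w -= lr * grad
    # Output perturbation: noise with uniform direction, Gamma-distributed norm.
    sensitivity = 2.0 / (n * lam)       # assumes 1-Lipschitz loss, ||x|| <= 1
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    w += direction * rng.gamma(shape=d, scale=sensitivity / epsilon)
    return w

X = rng.normal(size=(500, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # enforce ||x|| <= 1
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=500))
w = private_sgd_logreg(X, y)
print("train accuracy:", np.mean(np.sign(X @ w) == y))
```
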