Biblio

Filters: Keyword is Data Sanitization
2018-05-24
Mehnaz, Shagufta, Bellala, Gowtham, Bertino, Elisa.  2017.  A Secure Sum Protocol and Its Application to Privacy-Preserving Multi-Party Analytics. Proceedings of the 22nd ACM Symposium on Access Control Models and Technologies. :219–230.

Many enterprises are transitioning towards data-driven business processes. There are numerous situations where multiple parties would like to share data towards a common goal, if it were possible to simultaneously protect the privacy and security of the individuals and organizations described in the data. Existing solutions for multi-party analytics that follow the so-called Data Lake paradigm have parties transfer their raw data to a trusted third party (i.e., a mediator), which then performs the desired analysis on the global data and shares the results with the parties. However, such a solution does not fit many applications, such as healthcare, finance, and the Internet of Things, where privacy is a strong concern. Motivated by the increasing demand for data privacy, we study the problem of privacy-preserving multi-party data analytics, where the goal is to enable analytics on multi-party data without compromising the data privacy of each individual party. In this paper, we first propose a secure sum protocol with strong security guarantees. The proposed secure sum protocol is resistant to collusion attacks even with N-2 parties colluding, where N denotes the total number of collaborating parties. We then use this protocol to propose two secure gradient descent algorithms, one for horizontally partitioned data and the other for vertically partitioned data. The proposed framework is generic and applies to a wide class of machine learning problems. We demonstrate our solution for two popular use cases, regression and classification, and evaluate the performance of the proposed solution in terms of model accuracy, latency, and communication cost. In addition, we perform a scalability analysis to evaluate the performance of the proposed solution as the data size and the number of parties increase.
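
For readers unfamiliar with secure sum, the textbook construction behind such protocols is additive secret sharing: each party splits its private value into random shares that sum to the value modulo a large number, distributes one share to every participant, and only the per-party totals of received shares are published. The following is a minimal sketch of that classical idea, not the paper's hardened protocol (which provides the stronger N-2 collusion resistance described above).

import random

MOD = 2**61 - 1  # all arithmetic is done modulo a large prime

def make_shares(value, n):
    """Split a private value into n additive shares that sum to it mod MOD."""
    shares = [random.randrange(MOD) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % MOD)
    return shares

def secure_sum(private_values):
    """Each party i sends its j-th share to party j; party j sums what it
    receives; the published partial sums reveal only the global total."""
    n = len(private_values)
    all_shares = [make_shares(v, n) for v in private_values]
    partials = [sum(all_shares[i][j] for i in range(n)) % MOD for j in range(n)]
    return sum(partials) % MOD

values = [10, 25, 7]                      # each held by a different party
assert secure_sum(values) == sum(values) % MOD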

2018-01-10
Wu, Xiaotong, Dou, Wanchun, Ni, Qiang.  2017.  Game Theory Based Privacy Preserving Analysis in Correlated Data Publication. Proceedings of the Australasian Computer Science Week Multiconference. :73:1–73:10.

Privacy preservation in data publication has been an important research field over the past few decades. One of the fundamental challenges in privacy-preserving data publication is the trade-off between the privacy and utility of a single, independent data set. However, recent research has shown that even the advanced privacy mechanism of differential privacy is vulnerable when multiple data sets are correlated. In this case, the trade-off between privacy and utility evolves into a game problem, in which the payoff of each player depends not only on his own privacy parameter but also on his neighbors' privacy parameters. In this paper, we first present the definition of correlated differential privacy to evaluate the real privacy level of a single data set as influenced by the other data sets. Then, we construct a game model of multiple players, each of whom publishes a data set sanitized by differential privacy. Next, we analyze the existence and uniqueness of the pure Nash equilibrium and derive sufficient conditions for it. Finally, we use the notion of the price of anarchy to evaluate the efficiency of the pure Nash equilibrium.
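
For reference, the price of anarchy mentioned above is the standard measure of equilibrium inefficiency. In a utility game where each player i chooses a privacy parameter and receives payoff u_i, it compares the socially optimal welfare to the welfare of the worst pure Nash equilibrium (this is the textbook definition, not the paper's specific payoff model):

\[
\mathrm{PoA} \;=\; \frac{\max_{s \in S} \sum_{i} u_i(s)}{\min_{s \in \mathrm{NE}} \sum_{i} u_i(s)} \;\ge\; 1,
\]

where S is the joint strategy space and NE is the set of pure Nash equilibria; a PoA close to 1 means that selfish choices of privacy parameters lose little welfare relative to coordinated choices.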

Ping, Haoyue, Stoyanovich, Julia, Howe, Bill.  2017.  DataSynthesizer: Privacy-Preserving Synthetic Datasets. Proceedings of the 29th International Conference on Scientific and Statistical Database Management. :42:1–42:5.
To facilitate collaboration over sensitive data, we present DataSynthesizer, a tool that takes a sensitive dataset as input and generates a structurally and statistically similar synthetic dataset with strong privacy guarantees. The data owners need not release their data, while potential collaborators can begin developing models and methods with some confidence that their results will work similarly on the real dataset. The distinguishing feature of DataSynthesizer is its usability — the data owner does not have to specify any parameters to start generating and sharing data safely and effectively. DataSynthesizer consists of three high-level modules — DataDescriber, DataGenerator and ModelInspector. The first, DataDescriber, investigates the data types, correlations and distributions of the attributes in the private dataset, and produces a data summary, adding noise to the distributions to preserve privacy. DataGenerator samples from the summary computed by DataDescriber and outputs synthetic data. ModelInspector shows an intuitive description of the data summary that was computed by DataDescriber, allowing the data owner to evaluate the accuracy of the summarization process and adjust any parameters, if desired. We describe DataSynthesizer and illustrate its use in an urban science context, where sharing sensitive, legally encumbered data between agencies and with outside collaborators is reported as the primary obstacle to data-driven governance. The code implementing all parts of this work is publicly available at https://github.com/DataResponsibly/DataSynthesizer.
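
The describe-then-sample architecture can be pictured in a few lines of Python. The sketch below mimics an independent-attribute mode (per-attribute noisy histograms); it is a schematic of the pipeline, not the actual DataSynthesizer API, the function names are illustrative, and the real tool's correlated mode additionally learns a Bayesian network.

import numpy as np
import pandas as pd

def describe(df, epsilon=1.0):
    """DataDescriber-style step: per-attribute histograms with Laplace noise."""
    summary = {}
    for col in df.columns:
        counts = df[col].value_counts()
        noisy = (counts + np.random.laplace(0, 1.0 / epsilon, len(counts))).clip(lower=0)
        summary[col] = (counts.index.to_list(), (noisy / noisy.sum()).to_list())
    return summary

def generate(summary, n):
    """DataGenerator-style step: sample each attribute from its noisy distribution."""
    data = {col: np.random.choice(vals, size=n, p=probs)
            for col, (vals, probs) in summary.items()}
    return pd.DataFrame(data)

private = pd.DataFrame({"age_band": ["18-25", "26-40", "26-40", "41-65"],
                        "zip3": ["100", "100", "113", "104"]})
synthetic = generate(describe(private, epsilon=0.5), n=1000)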
Deng, Xiyue, Mirkovic, Jelena.  2017.  Commoner Privacy And A Study On Network Traces. Proceedings of the 33rd Annual Computer Security Applications Conference. :566–576.
Differential privacy has emerged as a promising mechanism for privacy-safe data mining. One popular differential privacy mechanism allows researchers to pose queries over a dataset and adds random noise to all output points to protect privacy. While differential privacy produces useful data in many scenarios, the added noise may jeopardize utility for queries posed over small populations or over long-tailed datasets. Gehrke et al. proposed crowd-blending privacy, with random noise added only to those output points where fewer than k individuals (a configurable parameter) contribute to the point in the same manner. This approach has a weaker privacy guarantee but preserves more research utility than differential privacy. We propose an even more liberal privacy goal—commoner privacy—which fuzzes (omits, aggregates, or adds noise to) only those output points where an individual's contribution to the point is an outlier. By hiding outliers, our mechanism hides the presence or absence of an individual in a dataset. We propose one mechanism that achieves commoner privacy—interactive k-anonymity. We also discuss query composition and show how we can guarantee privacy via either a pre-sampling step or query introspection. We implement interactive k-anonymity and query introspection in a system called Patrol for network trace processing. Our evaluation shows that commoner privacy prevents common attacks while preserving orders of magnitude higher research utility than differential privacy, and 9 to 49 times the utility of crowd-blending privacy.
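
The k-threshold filtering at the heart of interactive k-anonymity can be sketched as follows: an output point is released only when at least k distinct individuals contribute to it in the same manner, and is otherwise fuzzed (here, simply omitted). This is a toy illustration under that simplifying assumption; Patrol itself also supports aggregation, noising, pre-sampling, and query introspection.

def commoner_filter(contributions, k=5):
    """contributions maps each output point to the individuals (e.g., source
    IPs in a network trace) contributing to that point in the same manner."""
    released = {}
    for point, individuals in contributions.items():
        if len(set(individuals)) >= k:       # enough blending: safe to release
            released[point] = len(individuals)
        # otherwise the point is dominated by an outlier contributor: omit it
    return released

query_output = {"port 443": ["ip1", "ip2", "ip3", "ip4", "ip5"],
                "port 31337": ["ip9"]}       # a single outlier contributor
print(commoner_filter(query_output, k=5))    # only the "port 443" point survives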
Zhang, Jun, Cormode, Graham, Procopiuc, Cecilia M., Srivastava, Divesh, Xiao, Xiaokui.  2017.  PrivBayes: Private Data Release via Bayesian Networks. ACM Trans. Database Syst. 42:25:1–25:41.
Privacy-preserving data publishing is an important problem that has been the focus of extensive study. The state-of-the-art solution for this problem is differential privacy, which offers a strong degree of privacy protection without making restrictive assumptions about the adversary. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. In particular, when the input dataset contains a large number of attributes, existing methods require injecting a prohibitive amount of noise compared to the signal in the data, which renders the published data next to useless. To address the deficiency of the existing methods, this paper presents PrivBayes, a differentially private method for releasing high-dimensional data. Given a dataset D, PrivBayes first constructs a Bayesian network N, which (i) provides a succinct model of the correlations among the attributes in D and (ii) allows us to approximate the distribution of data in D using a set P of low-dimensional marginals of D. After that, PrivBayes injects noise into each marginal in P to ensure differential privacy and then uses the noisy marginals and the Bayesian network to construct an approximation of the data distribution in D. Finally, PrivBayes samples tuples from the approximate distribution to construct a synthetic dataset, and then releases the synthetic data. Intuitively, PrivBayes circumvents the curse of dimensionality, as it injects noise into the low-dimensional marginals in P instead of the high-dimensional dataset D. Private construction of Bayesian networks turns out to be significantly challenging, and we introduce a novel approach that uses a surrogate function for mutual information to build the model more accurately. We experimentally evaluate PrivBayes on real data and demonstrate that it significantly outperforms existing solutions in terms of accuracy.
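
The release phase can be illustrated on a single two-attribute marginal: inject Laplace noise into the low-dimensional contingency table, renormalize it into a distribution, and sample synthetic tuples from it. A minimal sketch of that step (PrivBayes itself also learns the network structure privately and chains many such conditional marginals):

import numpy as np

def noisy_marginal(joint_counts, epsilon):
    """Add Laplace noise to a low-dimensional contingency table and
    renormalize it into a valid probability distribution."""
    noisy = joint_counts + np.random.laplace(0, 2.0 / epsilon, joint_counts.shape)
    noisy = np.clip(noisy, 0, None)
    return noisy / noisy.sum()

# Toy 2-D marginal over (smoker in {no, yes}) x (disease in {no, yes}).
counts = np.array([[40.0, 5.0],
                   [10.0, 15.0]])
dist = noisy_marginal(counts, epsilon=1.0)

# Sample synthetic tuples from the approximated distribution.
flat = np.random.choice(dist.size, size=1000, p=dist.ravel())
tuples = np.column_stack(np.unravel_index(flat, dist.shape))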
He, Zaobo, Cai, Zhipeng, Sun, Yunchuan, Li, Yingshu, Cheng, Xiuzhen.  2017.  Customized Privacy Preserving for Inherent Data and Latent Data. Personal Ubiquitous Comput. 21:43–54.
The huge amount of sensory data collected from mobile devices offers great potential for services built on user data extracted from sensor readings. However, releasing user data could also seriously threaten user privacy. It is possible to collect sensitive information directly from released user data without user permission. Furthermore, third-party users can also infer sensitive information contained in released data in a latent manner by utilizing data mining techniques. In this paper, we formally define these two types of threats as inherent data privacy and latent data privacy, and construct a data-sanitization strategy that optimizes the tradeoff between data utility and customized levels of both types of privacy. The key novelty is that the developed strategy can defend against powerful third-party users who have broad knowledge about users and launch optimal inference attacks. We show that our strategy does not significantly reduce the benefit brought by user data, while sensitive information can still be protected. To the best of our knowledge, this is the first work that preserves both inherent data privacy and latent data privacy.
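
One way to picture the two threat types: inherent data privacy concerns attributes that are sensitive exactly as released, while latent data privacy concerns attributes from which a data-mining adversary can infer sensitive facts. The toy sanitizer below suppresses the former and randomizes the latter; the attribute names and keep-probability are invented for illustration, and the paper's actual strategy chooses such parameters by optimizing utility against an optimal inference attacker.

import random

SENSITIVE = {"hiv_status"}              # inherent: sensitive as released
CORRELATED = {"pharmacy_visits": 0.6}   # latent: enables inference; keep w.p. 0.6

def sanitize(record):
    out = {}
    for attr, value in record.items():
        if attr in SENSITIVE:
            continue                    # always suppress inherently private data
        if random.random() < CORRELATED.get(attr, 1.0):
            out[attr] = value           # randomize channels that leak latently
    return out

print(sanitize({"age": 34, "hiv_status": "+", "pharmacy_visits": 12}))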
Aissaoui, K., Ait Idar, H., Belhadaoui, H., Rifi, M.  2017.  Survey on data remanence in Cloud Computing environment. 2017 International Conference on Wireless Technologies, Embedded and Intelligent Systems (WITS). :1–4.

Cloud Computing is a developing IT concept that faces some issues slowing its evolution and adoption by users across the world. The lack of security has been the main concern. Organizations and entities need to ensure, inter alia, the integrity and confidentiality of their outsourced sensitive data within a cloud provider's servers. Solutions have been examined to strengthen security models (strong authentication, encryption and fragmentation before storing, access control policies...). Data remanence in particular is undoubtedly a major threat: how can we be sure that data, when requested, are truly and appropriately deleted from remote servers? In this paper, we present a survey of this subject and address the problem of residual data in a cloud-computing environment, which is characterized by the use of virtual machines instantiated on remote servers owned by a third party.
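
One widely cited mitigation for remote remanence, often called crypto-shredding, sidesteps physical deletion entirely: data are encrypted before upload, and "deleting" them then reduces to destroying the locally held key. A minimal sketch using the Python cryptography package; this is a generic technique offered for context, not one of the survey's specific proposals.

from cryptography.fernet import Fernet

key = Fernet.generate_key()        # kept by the data owner, never uploaded
remote_copy = Fernet(key).encrypt(b"outsourced sensitive data")  # what the cloud stores

# "Deletion" = destroying the key: residual ciphertext copies on remote disks,
# snapshots, or decommissioned drives become unrecoverable without it.
del key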

Vakilinia, I., Tosh, D. K., Sengupta, S.  2017.  3-Way game model for privacy-preserving cybersecurity information exchange framework. MILCOM 2017 - 2017 IEEE Military Communications Conference (MILCOM). :829–834.

With the growing number of cyberattack incidents, organizations need proactive knowledge of the cybersecurity landscape to defend their resources efficiently. To achieve this, organizations must develop a culture of sharing their threat information with others so that the associated risks can be assessed effectively. However, sharing cybersecurity information is costly for organizations because the information conveys sensitive and private data. Hence, deciding whether to share information is a challenging task that requires resolving the trade-off between the advantages of sharing and privacy exposure. On the other hand, cybersecurity information exchange (CYBEX) management is crucial for stabilizing the system by setting the correct values for participation fees and sharing incentives. In this work, we model the interaction of organizations, CYBEX, and attackers in a sharing system as a dynamic game. By devising appropriate payoff models for each player, we analyze the best strategies of the entities, incorporating the organizations' privacy component in the sharing model. Using best-response analysis, the simulation results demonstrate the efficiency of our proposed framework.
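
The best-response analysis mentioned above can be pictured with a stylized two-organization sharing game. The payoff below (a logarithmic sharing benefit minus a privacy-exposure cost and a participation fee) is an invented stand-in for illustration, not the paper's payoff model.

import numpy as np

FEE, EXPOSURE = 0.5, 1.2     # illustrative participation fee and privacy weight

def payoff(own, other):
    """Benefit grows with total shared information; costs grow with own exposure."""
    return np.log1p(own + other) - EXPOSURE * own**2 - FEE * (own > 0)

def best_response(other, grid=np.linspace(0, 1, 101)):
    return grid[np.argmax([payoff(s, other) for s in grid])]

s1 = s2 = 0.5
for _ in range(50):          # iterate best responses toward a fixed point
    s1, s2 = best_response(s2), best_response(s1)
print(f"approximate equilibrium sharing levels: {s1:.2f}, {s2:.2f}")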

Jeyaprabha, T. J., Sumathi, G., Nivedha, P.  2017.  Smart and secure data storage using Encrypt-interleaving. 2017 Innovations in Power and Advanced Computing Technologies (i-PACT). :1–6.

In recent years, many companies have been shifting towards the cloud to expand their business profit at the least additional cost. Cloud computing is a growing technology that has emerged from the development of grid computing, virtualization, and utility computing. It is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources, such as networks, servers, storage, applications, and services, that can be rapidly provisioned and released with minimal management effort or service provider interaction. There was a huge data loss during the Chennai floods of December 2015; had these data been stored at distributed data centers, much of the loss could have been prevented. Although such natural calamities are tempting many users to shift towards cloud storage, security threats inhibit them from doing so. Many solutions have been proposed for these security issues, but they do not give guaranteed security, by which we mean confidentiality, integrity, and availability. Existing techniques for providing security include cryptographic protocols, data sanitization, predicate logic, access control mechanisms, honeypots, sandboxing, erasure coding, RAID (Redundant Arrays of Independent Disks), homomorphic encryption, and split-key encryption. All of these techniques either cannot work alone or add computational and time complexity. An alternative scheme that combines encryption and channel coding in a single pass is proposed to increase the level of security. A hybrid encryption scheme is proposed for use in the interleaver block of a Turbo coder to avoid burst errors. Hybrid encryption avoids sharing a secret key over an unsecured channel. This provides both security and reliability by reducing the error propagation effect, with small additional cost and computational overhead. Time complexity can be reduced when encryption and encoding are done as a single process.
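
The combined step can be pictured as a keyed interleaver: the secret that drives encryption also seeds the permutation applied before channel coding, so burst errors are dispersed while an eavesdropper cannot undo the byte ordering. The toy sketch below shows how one shared key serves both roles; a hash-based XOR stream stands in for a real cipher, and this is not a cryptographically vetted design.

import hashlib, random

def keystream(key, n):
    """Toy XOR keystream derived from the key (stand-in for a real cipher)."""
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def interleaver(key, n):
    """Key-seeded permutation used as the interleaving pattern."""
    perm = list(range(n))
    random.Random(key).shuffle(perm)
    return perm

def encrypt_interleave(data, key):
    cipher = bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))
    perm = interleaver(key, len(cipher))
    return bytes(cipher[i] for i in perm)        # fed to the Turbo encoder next

def deinterleave_decrypt(block, key):
    perm = interleaver(key, len(block))
    cipher = bytearray(len(block))
    for out_pos, in_pos in enumerate(perm):
        cipher[in_pos] = block[out_pos]
    return bytes(a ^ b for a, b in zip(cipher, keystream(key, len(cipher))))

key = b"session-key"   # in a hybrid scheme, wrapped with the receiver's public key
assert deinterleave_decrypt(encrypt_interleave(b"sensor record 42", key), key) == b"sensor record 42"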

Zaman, A. N. K., Obimbo, C., Dara, R. A.  2017.  An improved differential privacy algorithm to protect re-identification of data. 2017 IEEE Canada International Humanitarian Technology Conference (IHTC). :133–138.

Recently there has been a huge increase in large data repositories held by corporations, governments, and healthcare organizations. These repositories provide opportunities to design and improve decision-making systems by mining trends and patterns from the data that can yield credible information and improve customer service (e.g., in healthcare). As a result, while data sharing is essential, maintaining the privacy of data donors is an obligation, as data custodians have legal and ethical responsibilities to secure confidentiality. This research proposes a 2-layer privacy preserving (2-LPP) data sanitization algorithm that satisfies ε-differential privacy for publishing sanitized data. The proposed algorithm also reduces the re-identification risk of the sanitized data. The proposed algorithm has been implemented and tested with two different data sets. Compared to existing works, the results obtained from the proposed algorithm show promising performance.
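
As background for the two-layer idea, a sanitizer can combine generalization of quasi-identifiers (cutting re-identification risk) with an ε-differentially-private noise step on the released counts. The sketch below illustrates that combination using the standard Laplace mechanism; the layers of the actual 2-LPP algorithm differ, and the attribute here is invented.

import numpy as np

def generalize(age):                  # layer 1: coarsen the quasi-identifier
    low = 10 * (age // 10)
    return f"{low}-{low + 9}"

def noisy_count(count, epsilon):      # layer 2: Laplace mechanism, sensitivity 1
    return max(0, round(count + np.random.laplace(0.0, 1.0 / epsilon)))

ages = [23, 27, 31, 34, 38, 41]
bands = {}
for age in ages:
    bands[generalize(age)] = bands.get(generalize(age), 0) + 1
release = {band: noisy_count(c, epsilon=0.5) for band, c in bands.items()}
print(release)                        # e.g., {'20-29': 2, '30-39': 4, '40-49': 1}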

2017-12-20
Mohammadi, M., Chu, B., Lipford, H. R.  2017.  Detecting Cross-Site Scripting Vulnerabilities through Automated Unit Testing. 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS). :364–373.

The best practice for preventing Cross-Site Scripting (XSS) attacks is to apply encoders to sanitize untrusted data. To balance security and functionality, encoders should match the web page context, such as HTML body, JavaScript, or style sheets. A common programming error is using the wrong encoder to sanitize untrusted data, leaving the application vulnerable. We present a security unit testing approach to detect XSS vulnerabilities caused by improper encoding of untrusted data. Unit tests for XSS vulnerabilities are automatically constructed from each web page and then evaluated by a unit test execution framework. A grammar-based attack generator is used to automatically generate test inputs. We evaluate our approach on a large open-source medical records application, demonstrating that we can detect many 0-day XSS vulnerabilities with very few false positives, and that the grammar-based attack generator achieves better test coverage than industry best practices.
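
A miniature version of such a unit test conveys the approach: a small set of attack strings is pushed through the page's encoder, and the test fails if any payload survives into an executable context. The sketch below is in Python with a hand-rolled payload list for brevity (the paper targets Java web applications and derives inputs from an attack grammar); all names are illustrative.

import html
import unittest

# A tiny set of attack inputs; a grammar-based generator derives many variants.
ATTACKS = ['<script>alert(1)</script>',
           '"><img src=x onerror=alert(1)>',
           "';alert(1);//"]

def render_comment(untrusted):
    """Page fragment under test: untrusted data in an HTML body context, so
    the HTML entity encoder is the correct choice for this context."""
    return f"<p>{html.escape(untrusted)}</p>"

class XssUnitTest(unittest.TestCase):
    def test_payloads_are_neutralized(self):
        for payload in ATTACKS:
            page = render_comment(payload)
            self.assertNotIn(payload, page)    # raw payload must not survive
            self.assertNotIn("<script", page)  # no executable tag leaks through

if __name__ == "__main__":
    unittest.main()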