Bibliography
We explore a new security model for secure computation on large datasets. We assume that two servers have been employed to compute on private data collected from many users, and, in order to improve the efficiency of their computation, we establish a new tradeoff with privacy. Specifically, instead of claiming that the servers learn nothing about the input values, we claim that what they do learn from the computation preserves the differential privacy of the input. Leveraging this relaxation of the security model allows us to build a protocol that leaks some information in the form of access patterns to memory, while also providing a formal bound on what is learned from the leakage. We then demonstrate that this leakage is useful in a broad class of computations. We show that computations such as histograms, PageRank, and matrix factorization, which can be performed in common graph-parallel frameworks such as MapReduce or Pregel, benefit from our relaxation. We implement a protocol for securely executing graph-parallel computations, evaluate its performance on the three examples above, and demonstrate marked improvement over prior implementations.
Nowadays, many applications involve big data, and big data analysis methods appear in many fields. As a preliminary attempt to address this challenge, this paper presents a distributed online learning algorithm based on differential privacy. Since online learning can effectively process sensitive data, we introduce differential privacy into distributed online learning algorithms, with the aim of ensuring data privacy during online learning and preventing adversarial nodes from inferring any important data information. In particular, for different adversary models, we consider different types of graphs that tolerate either a limited number of adversaries near each regular node or a globally limited number of adversaries.
The sensitivity of a function is the maximum change of its output for a unit change of its input. In this paper we present a method for determining the sensitivity of SQL queries, seen as functions from databases to datasets, where the change is measured in the number of rows that differ. Given a query, a database schema and a number, our method constructs a formula that is satisfiable only if the sensitivity of the query is bigger than this number. Our method is composable, and can also be applied to SQL workflows. Our results can be used to calibrate the amount of noise that has to be added to the output of the query to obtain a certain level of differential privacy.
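To make the last sentence concrete, here is a minimal sketch of the noise-calibration step, assuming the sensitivity bound has already been determined by an analysis like the one above; the function name and values are illustrative, not the paper's tooling.

```python
import numpy as np

# Minimal sketch: once a sensitivity analysis certifies that a query's
# row-level sensitivity is at most `sensitivity`, eps-differential privacy
# follows from Laplace noise with scale sensitivity/eps. Names and values
# are illustrative assumptions.

def laplace_release(true_answer, sensitivity, eps):
    return true_answer + np.random.laplace(scale=sensitivity / eps)

# e.g. a COUNT(*) query has sensitivity 1 in the rows-that-differ metric:
print(laplace_release(true_answer=1024, sensitivity=1, eps=0.5))
```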
In this paper, we propose a new randomized response algorithm that achieves differential privacy and utility guarantees for consumers' behaviors, and processes a batch of data at each time step. First, differing from traditional differentially private approaches, we add randomized response noise to the behavior signatures matrix to achieve an acceptable utility-privacy tradeoff. Second, a behavior signature modeling method based on sparse coding is proposed: after some lightweight training using the energy consumption data, the dictionary becomes associated with the behavior characteristics of the electric appliances. Finally, experimental results verify that our algorithm can preserve consumers' privacy without compromising utility.
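For orientation, a minimal sketch of classical binary randomized response, the primitive behind the abstract's matrix-valued mechanism; the parameters and debiasing step are textbook assumptions, not the paper's batch algorithm.

```python
import numpy as np

# Minimal sketch of binary randomized response: report the true bit with
# probability e^eps / (1 + e^eps), else flip it. This satisfies eps-local
# differential privacy; the paper's matrix/batch mechanics are not shown.

def randomized_response(bit, eps):
    p_truth = np.exp(eps) / (1.0 + np.exp(eps))
    return bit if np.random.random() < p_truth else 1 - bit

def debiased_mean(reports, eps):
    """Unbiased estimate of the true mean from perturbed bits."""
    p = np.exp(eps) / (1.0 + np.exp(eps))
    return (np.mean(reports) - (1 - p)) / (2 * p - 1)

truth = np.random.binomial(1, 0.3, size=10000)
reports = [randomized_response(b, eps=1.0) for b in truth]
print(truth.mean(), debiased_mean(reports, eps=1.0))
```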
Smart meters migrate the conventional electricity grid into a digitally enabled Smart Grid (SG), which is more reliable and efficient. Fine-grained energy consumption data collected by smart meters helps utility providers accurately predict users' demands and significantly reduce power generation cost, while it imposes severe privacy risks on consumers and may discourage them from using those “espionage meters”. To enjoy the benefits of smart meter data without compromising users' privacy, in this paper we integrate distributed differential privacy (DDP) techniques into data-driven optimization and propose a novel scheme that not only minimizes the cost for utility providers but also preserves the DDP of users' energy profiles. Briefly, we add differentially private noise to the users' energy consumption data before the smart meters send it to the utility provider. Due to the uncertainty of the users' demand distribution, the utility provider aggregates a given set of historical users' differentially private data, estimates the users' demands, and formulates the data-driven cost minimization based on the collected noisy data. We also develop algorithms for feasible solutions, and verify the effectiveness of the proposed scheme through simulations using energy consumption data generated from the utility company's real data analysis.
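One standard way to realize the "add noise before sending" step in a distributed fashion, sketched below under assumed parameters, uses the infinite divisibility of the Laplace distribution: each meter adds a small gamma-difference share, so the aggregate carries exactly Laplace noise while no single meter's report is noise-free. This is a common DDP construction, not necessarily the paper's exact scheme.

```python
import numpy as np

# Minimal sketch of distributed Laplace noise via infinite divisibility:
# the difference of two Gamma(1/n, b) draws, summed over n meters, is
# exactly Laplace(b). Sensitivity cap and budget are assumptions.

def meter_report(reading, n_meters, eps, sensitivity=1.0):
    b = sensitivity / eps                       # Laplace scale for the SUM
    share = (np.random.gamma(1.0 / n_meters, b)
             - np.random.gamma(1.0 / n_meters, b))
    return reading + share

n = 50
readings = np.random.uniform(0.0, 1.0, size=n)  # kWh in one interval
reports = [meter_report(r, n, eps=0.5) for r in readings]
print(sum(readings), sum(reports))              # aggregate = true sum + Laplace noise
```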
In rural/remote areas, resource-constrained smart micro-grid (RCSMG) architectures can offer a cost-effective power management and supply alternative to national power grid connections. RCSMG architectures handle communications over distributed lossy networks to minimize operation costs. However, the unreliable nature of lossy networks makes privacy an important consideration. Existing anonymisation work on data perturbation relies mainly on distortion with additive noise. Applying these solutions to RCSMGs is problematic, because deliberate noise additions must be distinguishable both from system-generated and adversary-generated noise. In this paper, we present a brief survey of privacy risks in RCSMGs centered on inference, and propose a method of mitigating these risks. The lesson here is that while RCSMGs give users more control over power management and distribution, good anonymisation is essential to protecting personal information on RCSMGs.
Many applications, such as anonymous communication systems, privacy-enhancing database queries, or privacy-enhancing machine-learning methods, require robust guarantees under thousands and sometimes millions of observations. The notion of r-fold approximate differential privacy (ADP) offers a well-established framework with a precise characterization of the degree of privacy after r observations by an attacker. However, existing bounds for r-fold ADP are loose and, if used for estimating the required degree of noise for an application, can lead to over-cautious choices for perturbation randomness and thus to suboptimal utility or overly high costs. We present a numerical and widely applicable method for capturing the privacy loss of differentially private mechanisms under composition, which we call privacy buckets. With privacy buckets we compute provable upper and lower bounds on ADP for a given number of observations. We compare our bounds with state-of-the-art bounds for r-fold ADP, including Kairouz, Oh, and Viswanath's composition theorem (KOV), concentrated differential privacy, and the moments accountant. While KOV proved optimal bounds for heterogeneous adaptive k-fold composition, we show that for concrete sequences of mechanisms tighter bounds can be derived by taking the mechanisms' structure into account. We compare previous bounds for the Laplace mechanism, the Gaussian mechanism, a timing leakage reduction mechanism, and stochastic gradient descent, and we significantly improve over their results (except that we match the KOV bound for the Laplace mechanism, for which it seems tight). Our lower bounds almost meet our upper bounds, showing that no significantly tighter bounds are possible.
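To illustrate the flavor of privacy-loss accounting under composition, here is a simplified sketch in the spirit of privacy buckets: discretize the privacy loss distribution (PLD) of one Laplace-mechanism run, convolve it r times, and read off delta. The grid, the closed-form PLD, and the rounding are my assumptions; the paper's construction computes provable upper and lower bounds rather than this plain approximation.

```python
import numpy as np

# Simplified PLD accounting for r-fold composition of a sensitivity-1
# Laplace mechanism with per-run budget eps. Not the paper's algorithm.

def laplace_pld(eps, grid):
    """Discretized privacy loss distribution; losses live in [-eps, eps]."""
    n = int(np.ceil(eps / grid))
    losses = np.arange(-n, n + 1) * grid
    mass = np.zeros_like(losses)
    # Continuous part: f(l) = (1/4) exp(-(eps - l)/2) on (-eps, eps).
    mass[1:-1] = 0.25 * np.exp(-(eps - losses[1:-1]) / 2.0) * grid
    mass[-1] += 0.5                       # point mass at loss = +eps
    mass[0] += 0.5 * np.exp(-eps)         # point mass at loss = -eps
    return losses, mass / mass.sum()      # absorb discretization error

def compose(mass, r):
    """r-fold convolution of the discretized PLD (losses add up)."""
    out = np.array([1.0])
    for _ in range(r):
        out = np.convolve(out, mass)
    return out

def adp_delta(eps_total, per_run_eps, r, grid=1e-3):
    """Approximate delta such that r runs are (eps_total, delta)-ADP."""
    _, mass = laplace_pld(per_run_eps, grid)
    composed = compose(mass, r)
    n = int(np.ceil(per_run_eps / grid))
    total_losses = np.arange(-r * n, r * n + 1) * grid
    tail = total_losses > eps_total
    return float(np.sum(composed[tail]
                        * (1 - np.exp(eps_total - total_losses[tail]))))

print(adp_delta(eps_total=1.0, per_run_eps=0.1, r=100))
```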
Private communication over the Internet remains a challenging problem. Even if messages are encrypted, it is hard to deliver them without revealing metadata about which pairs of users are communicating. Scalable anonymity systems, such as Tor, are susceptible to traffic analysis attacks that leak metadata. In contrast, the largest-scale systems with metadata privacy require passing all messages through a small number of providers, requiring a high operational cost for each provider and limiting their deployability in practice. This paper presents Stadium, a point-to-point messaging system that provides metadata and data privacy while scaling its work efficiently across hundreds of low-cost providers operated by different organizations. Much like Vuvuzela, the current largest-scale metadata-private system, Stadium achieves its provable guarantees through differential privacy and the addition of noisy cover traffic. The key challenge in Stadium is limiting the information revealed from the many observable traffic links of a highly distributed system, without requiring an overwhelming amount of noise. To solve this challenge, Stadium introduces techniques for distributed noise generation and differentially private routing as well as a verifiable parallel mixnet design where the servers collaboratively check that others follow the protocol. We show that Stadium can scale to support 4x more users than Vuvuzela using servers that cost an order of magnitude less to operate than Vuvuzela nodes.
Nowadays, data has become more important as the core resource of the information society. However, with the development of data analysis techniques, privacy violations such as leakage of sensitive data and exposure of personal identities are also increasing. Differential privacy is a technique that satisfies the requirement that no additional information be disclosed beyond what comes from the database itself, and it is well known for protecting privacy against arbitrary attacks. However, recent research argues that there are several ways to infer sensitive information from data even when differential privacy is applied. One such inference method uses correlations between the data. In this paper, we investigate new privacy threats exploiting attribute correlation that are not covered by traditional studies, and propose a privacy-preserving technique that configures the differential privacy noise parameter to address this new threat. In experiments, we show the weaknesses of the traditional differential privacy method and validate that the proposed noise parameter configuration provides sufficient privacy protection while maintaining data utility.
As more and more technologies to store and analyze massive amounts of data become available, it is extremely important to de-identify privacy-sensitive data so that further analysis can be conducted by different parties. For example, data needs to go through a de-identification process before being transferred to institutes for further value-added analysis. As such, privacy protection issues associated with the release of data and data mining have become a popular field of study in the domain of big data. As a strict and verifiable definition of privacy, differential privacy has attracted noteworthy attention and widespread research in recent years. Nevertheless, differential privacy is not practical for most applications due to the performance cost of synthetic dataset generation for data queries. Moreover, the definition of data protection by randomized noise in native differential privacy is abstract to users. Therefore, we design a pragmatic DP-based data de-identification protection and data disclosure risk estimation system, in which a DP-based noise addition mechanism is applied to generate synthetic datasets. Furthermore, the risk of data disclosure for these synthetic datasets can be evaluated before releasing them to buyers/consumers.
Differential privacy is an approach that preserves patient privacy while permitting researchers access to medical data. This paper presents mechanisms proposed to satisfy differential privacy while answering a given workload of range queries. Representing the input data as a vector of counts, these methods partition the vector according to relationships between the data and the ranges of the given queries. After partitioning the vector into buckets, the count of each bucket is estimated privately and split among the bucket's positions to answer the given query set. The performance of the proposed method was evaluated using different workloads over several attributes. The results show that partitioning the vector based on the data can produce more accurate answers, while partitioning the vector based on the given workload improves privacy. This paper's two main contributions are: (1) improving earlier work on partitioning mechanisms by building a greedy algorithm that partitions the counts vector efficiently, and (2) providing an adaptive algorithm that considers the sensitivity of the given queries before producing results.
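A minimal sketch of the bucket-then-perturb idea just described, with a fixed-width placeholder partition standing in for the paper's greedy and adaptive algorithms:

```python
import numpy as np

# Minimal sketch: partition a vector of counts into buckets, add Laplace
# noise to each bucket total, and split the noisy total uniformly across
# the bucket's positions. One record affects exactly one bucket sum by 1,
# so per-bucket Laplace(1/eps) gives eps-DP by parallel composition.
# Fixed-width buckets are a placeholder, not the paper's partitioning.

def bucketized_counts(counts, bucket_width, eps):
    counts = np.asarray(counts, dtype=float)
    out = np.empty_like(counts)
    for start in range(0, len(counts), bucket_width):
        bucket = counts[start:start + bucket_width]
        noisy_total = bucket.sum() + np.random.laplace(scale=1.0 / eps)
        out[start:start + bucket_width] = noisy_total / len(bucket)
    return out

hist = [12, 15, 11, 40, 43, 2, 1, 0]
print(bucketized_counts(hist, bucket_width=2, eps=0.5))
```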
It is a challenging problem to preserve the friendly correlations between individuals when publishing social-network data. To alleviate this problem, the uncertain graph has recently been proposed. The main idea of the uncertain graph is to convert an original graph into an uncertain form, where each correlation between individuals becomes an associated probability. However, existing uncertain-graph methods lack rigorous privacy guarantees and rely on assumptions about the adversary's knowledge. In this paper, we first introduce a general model for constructing uncertain graphs. Then, we propose an algorithm under this model based on differential privacy and analyze its privacy. Our algorithm provides rigorous privacy guarantees and resists background knowledge attacks. Experiments show that the proposed algorithm satisfies differential privacy and is feasible in practice. We also compare our algorithm with the (k, ε)-obfuscation algorithm in terms of data utility: the importance of nodes for the network under our algorithm is similar to that under (k, ε)-obfuscation.
To enhance privacy protection and improve data availability, a differentially private data protection method, ICMD-DP, is proposed. Based on an insensitive clustering algorithm, ICMD-DP applies differential privacy to the results of ICMD (insensitive clustering method for mixed data). The combination of clustering and differential privacy shifts query sensitivity from single records to group records, and at the same time reduces the risk of information loss and information disclosure. In addition, to satisfy the requirement of maintaining differential privacy for mixed data, ICMD-DP uses different methods to calculate the distance and centroid of categorical and numerical attributes. Finally, experiments illustrate the availability of the method.
Compared with the traditional grid, the energy internet collects data more widely and connects more broadly. Analysis of electrical data using Non-intrusive Load Monitoring (NILM) can infer private user behavior, so considering both data security and availability is a problem that must be addressed. Due to its rigid and provable privacy guarantee, differential privacy has been widely applied to privacy-preserving data release and data mining. However, because of the high sensitivity in this setting, adding noise directly renders the data unusable. In this paper, we propose a differentially private mechanism to protect energy internet privacy. Our focus is on aggregated data released by the data owner after noise has been added to the disaggregated data. Theoretical proofs and experiments show that our scheme achieves both privacy preservation and data availability.
To address the risk of data privacy disclosure in the classification process, a Random Forest algorithm under differential privacy, named DPRF-gini, is proposed in this paper. In the process of building each decision tree, the algorithm first perturbs feature selection and attribute partitioning using the exponential mechanism, and then meets the requirement of differential privacy by adding Laplace noise to the leaf nodes. Compared with the original algorithm, empirical results show that data privacy protection is further enhanced while the accuracy of the algorithm is only slightly reduced.
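A minimal sketch of the two primitives the abstract names, the exponential mechanism for choosing a split and Laplace noise on leaf counts; the scores, sensitivities, and budgets are illustrative placeholders, not DPRF-gini itself:

```python
import numpy as np

# Minimal sketch of the two mechanisms named above. The quality scores
# and sensitivity values below are assumed for illustration.

def exponential_mechanism(scores, eps, sensitivity):
    """Sample an index with probability proportional to exp(eps*q/(2*dq))."""
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(eps * scores / (2.0 * sensitivity))
    return int(np.random.choice(len(scores), p=weights / weights.sum()))

def noisy_leaf_count(count, eps):
    """Laplace-perturbed class count for a leaf node (sensitivity 1)."""
    return count + np.random.laplace(scale=1.0 / eps)

gini_gains = [0.12, 0.40, 0.05]          # per-feature quality scores
feature = exponential_mechanism(gini_gains, eps=0.5, sensitivity=0.5)
print(feature, noisy_leaf_count(30, eps=0.5))
```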
Distributed data aggregation via summation (counting) helps us learn the insights behind raw data. However, such computation suffers from a high privacy risk of malicious collusion attacks, in which colluding adversaries infer a victim's private data from the gaps between the aggregation outputs and their own source data. Among the solutions against such collusion attacks, Distributed Differential Privacy (DDP) shows a significant privacy-preservation effect. Specifically, a DDP scheme guarantees global differential privacy (the presence or absence of any data curator barely impacts the aggregation outputs) by ensuring local differential privacy at the end of each data curator. To guarantee the overall privacy performance of a distributed data aggregation system against malicious collusion attacks, part of the existing work on such DDP schemes aims to provide an estimated lower bound on the privacy budget for global differential privacy. However, there are two main problems: low data utility from using a large global function sensitivity, and an unknown privacy guarantee when the aggregation sensitivity of the whole system is less than the sum of the data curators' aggregation sensitivities. To address these problems while ensuring distributed differential privacy, we provide a new lower bound on the privacy budget, which works with an unconditional aggregation sensitivity of the whole distributed system. Moreover, we study the performance of our privacy bound under different scenarios of data updates. Both theoretical and experimental evaluations show that our privacy bound offers better global privacy performance than the existing work.
Ubiquitous deployment of low-cost mobile positioning devices and the widespread use of high-speed wireless networks enable massive collection of large-scale trajectory data of individuals moving on road networks. Trajectory data mining finds numerous applications including understanding users' historical travel preferences and recommending places of interest to new visitors. Privacy-preserving trajectory mining is an important and challenging problem as exposure of sensitive location information in the trajectories can directly invade the location privacy of the users associated with the trajectories. In this paper, we propose a differentially private trajectory analysis algorithm for points-of-interest recommendation to users that aims at maximizing the accuracy of the recommendation results while protecting the privacy of the exposed trajectories with differential privacy guarantees. Our algorithm first transforms the raw trajectory dataset into a bipartite graph with nodes representing the users and the points-of-interest and the edges representing the visits made by the users to the locations, and then extracts the association matrix representing the bipartite graph to inject carefully calibrated noise to meet ε-differential privacy guarantees. A post-processing of the perturbed association matrix is performed to suppress noise prior to performing a Hyperlink-Induced Topic Search (HITS) on the transformed data that generates an ordered list of recommended points-of-interest. Extensive experiments on a real trajectory dataset show that our algorithm is efficient, scalable and demonstrates high recommendation accuracy while meeting the required differential privacy guarantees.
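A minimal sketch of the pipeline described above, perturbing a user-by-POI visit matrix with Laplace noise, clipping negatives as a simple post-processing step, and ranking POIs with HITS power iteration; the budget, matrix, and clipping rule are assumptions for illustration:

```python
import numpy as np

# Minimal sketch: one visit changes one matrix cell by 1, so per-cell
# Laplace(1/eps) noise covers unit L1 sensitivity; clipping negatives is
# a simple stand-in for the paper's post-processing.

def private_poi_ranking(visits, eps, iters=50):
    noisy = visits + np.random.laplace(scale=1.0 / eps, size=visits.shape)
    noisy = np.clip(noisy, 0.0, None)       # suppress negative counts
    hubs = np.ones(noisy.shape[0])          # user hub scores
    for _ in range(iters):                  # HITS power iteration
        auth = noisy.T @ hubs               # POI authority scores
        auth /= np.linalg.norm(auth)
        hubs = noisy @ auth
        hubs /= np.linalg.norm(hubs)
    return np.argsort(-auth)                # POI indices, best first

visits = np.random.poisson(2.0, size=(100, 20))  # 100 users, 20 POIs
print(private_poi_ranking(visits, eps=1.0)[:5])
```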
While significant progress has been made separately on analytics systems for scalable stochastic gradient descent (SGD) and private SGD, none of the major scalable analytics frameworks have incorporated differentially private SGD. There are two inter-related issues for this disconnect between research and practice: (1) low model accuracy due to added noise to guarantee privacy, and (2) high development and runtime overhead of the private algorithms. This paper takes a first step to remedy this disconnect and proposes a private SGD algorithm to address both issues in an integrated manner. In contrast to the white-box approach adopted by previous work, we revisit and use the classical technique of output perturbation to devise a novel “bolt-on” approach to private SGD. While our approach trivially addresses (2), it makes (1) even more challenging. We address this challenge by providing a novel analysis of the L2-sensitivity of SGD, which allows, under the same privacy guarantees, better convergence of SGD when only a constant number of passes can be made over the data. We integrate our algorithm, as well as other state-of-the-art differentially private SGD, into Bismarck, a popular scalable SGD-based analytics system on top of an RDBMS. Extensive experiments show that our algorithm can be easily integrated, incurs virtually no overhead, scales well, and most importantly, yields substantially better (up to 4X) test accuracy than the state-of-the-art algorithms on many real datasets.
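A minimal sketch of the bolt-on output-perturbation idea: run plain SGD, then add high-dimensional Laplace noise scaled to an assumed L2-sensitivity of the final model. The sensitivity constant is a placeholder; the paper derives the actual bound from the loss's smoothness and the number of passes.

```python
import numpy as np

# Minimal sketch of output perturbation for SGD. Noise with density
# proportional to exp(-eps * ||b|| / sensitivity) (uniform direction,
# Gamma-distributed norm) yields eps-DP for an L2-sensitivity bound.
# `l2_sensitivity` is an assumed placeholder, not the paper's bound.

def bolt_on_sgd(X, y, eps, l2_sensitivity, lr=0.1, passes=3):
    w = np.zeros(X.shape[1])
    for _ in range(passes):
        for xi, yi in zip(X, y):            # plain logistic-loss SGD step
            w -= lr * (1.0 / (1.0 + np.exp(-xi @ w)) - yi) * xi
    direction = np.random.normal(size=w.shape)
    direction /= np.linalg.norm(direction)
    norm = np.random.gamma(shape=w.size, scale=l2_sensitivity / eps)
    return w + norm * direction

X = np.random.normal(size=(500, 5))
y = (X[:, 0] > 0).astype(float)
print(bolt_on_sgd(X, y, eps=1.0, l2_sensitivity=0.05))
```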
Differential privacy is a rigorous privacy standard that has been applied to a range of data analysis tasks. To broaden the application scenarios of differential privacy when data records have dependencies, the notion of Bayesian differential privacy has recently been proposed. However, it is unknown whether Bayesian differential privacy preserves three nice properties of differential privacy: sequential composability, parallel composability, and post-processing. In this paper, we provide an affirmative answer to this question; i.e., Bayesian differential privacy still has these properties. The idea behind sequential composability is that if we have $m$ algorithms $Y_1, Y_2, \ldots, Y_m$, where $Y_\ell$ is independently $\varepsilon_\ell$-Bayesian differentially private for $\ell = 1, 2, \ldots, m$, then by feeding the result of $Y_1$ into $Y_2$, the result of $Y_2$ into $Y_3$, and so on, we will finally have a $\left(\sum_{\ell=1}^{m} \varepsilon_\ell\right)$-Bayesian differentially private algorithm. For parallel composability, we consider the situation where a database is partitioned into $m$ disjoint subsets. The $\ell$-th subset is input to a Bayesian differentially private algorithm $Y_\ell$, for $\ell = 1, 2, \ldots, m$. Then the parallel composition of $Y_1, Y_2, \ldots, Y_m$ will be $\left(\max_{\ell=1}^{m} \varepsilon_\ell\right)$-Bayesian differentially private. The post-processing property means that a data analyst, without additional knowledge about the private database, cannot compute a function of the output of a Bayesian differentially private algorithm and reduce its privacy guarantee.
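A compact restatement of the two composition claims, as a hedged sketch in the abstract's notation (with $D_1, \ldots, D_m$ the disjoint parts):

```latex
% Hedged restatement of the two composition properties described above;
% Y_ell and the disjoint parts D_ell follow the abstract's notation.
\begin{align*}
  \text{Sequential: } & Y_m \circ \cdots \circ Y_1 \text{ is }
    \Big(\sum_{\ell=1}^{m} \varepsilon_\ell\Big)\text{-Bayesian differentially private.}\\
  \text{Parallel: }   & \big(Y_1(D_1), \ldots, Y_m(D_m)\big) \text{ is }
    \Big(\max_{1 \le \ell \le m} \varepsilon_\ell\Big)\text{-Bayesian differentially private.}
\end{align*}
```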
Private record linkage (PRL) is the problem of identifying pairs of records that are similar as per an input matching rule from databases held by two parties that do not trust one another. We identify three key desiderata that a PRL solution must ensure: (1) perfect precision and high recall of matching pairs, (2) a proof of end-to-end privacy, and (3) communication and computational costs that scale subquadratically in the number of input records. We show that all of the existing solutions for PRL – including secure 2-party computation (S2PC) and its variants that use non-private or differentially private (DP) blocking to ensure subquadratic cost – violate at least one of the three desiderata. In particular, S2PC techniques guarantee end-to-end privacy but have either low recall or quadratic cost. In contrast, no end-to-end privacy guarantee has been formalized for solutions that achieve subquadratic cost. This is true even for solutions that compose DP and S2PC: DP does not permit the release of any exact information about the databases, while S2PC algorithms for PRL allow the release of matching records. In light of this deficiency, we propose a novel privacy model, called output constrained differential privacy, that shares the strong privacy protection of DP but allows for the truthful release of the output of a certain function applied to the data. We apply this to PRL, and show that protocols satisfying this privacy model permit the disclosure of the true matching records, but their execution is insensitive to the presence or absence of a single non-matching record. We find that prior work combining DP and S2PC techniques fails to satisfy even this end-to-end privacy model. Hence, we develop novel protocols that provably achieve this end-to-end privacy guarantee, together with the other two desiderata of PRL. Our empirical evaluation also shows that our protocols obtain high recall, scale near-linearly in the size of the input databases and the output set of matching pairs, and have communication and computational costs at least two orders of magnitude smaller than those of S2PC baselines.