Bibliography
In order to protect individuals' privacy, data have to be "well-sanitized" before being shared, i.e., any personal information must be removed first. However, it is not always clear when data should be deemed well-sanitized. In this paper, we argue that the evaluation of sanitized data should be based on whether the data allow the inference of sensitive information specific to an individual, rather than being centered on the concept of re-identification. We propose a framework to evaluate the effectiveness of different sanitization techniques on a given dataset by measuring how much an individual's record in the sanitized dataset influences the inference of his or her own sensitive attribute. Our intent is not to accurately predict any sensitive attribute but rather to measure the impact of a single record on the inference of sensitive information. We demonstrate our approach by sanitizing two real datasets under different privacy models and by evaluating and comparing each sanitized dataset within our framework.
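As a rough illustration of this kind of record-level influence measure (a sketch under assumed data structures and a toy nearest-neighbor classifier, not the paper's actual metric), one can compare a simple classifier's prediction of a record's sensitive attribute with and without that record in the sanitized data:

```python
from collections import Counter

def predict(features, sensitive, query, k=3):
    """Majority vote of the sensitive attribute among the k records closest to `query`."""
    ranked = sorted(range(len(features)),
                    key=lambda i: sum(a != b for a, b in zip(features[i], query)))
    return Counter(sensitive[i] for i in ranked[:k]).most_common(1)[0][0]

def record_influence(features, sensitive, idx, k=3):
    """1 if withholding the record changes the inference of its own sensitive attribute."""
    with_record = predict(features, sensitive, features[idx], k)
    without_record = predict(features[:idx] + features[idx + 1:],
                             sensitive[:idx] + sensitive[idx + 1:],
                             features[idx], k)
    return int(with_record != without_record)

# toy "sanitized" dataset: generalized quasi-identifiers plus a sensitive attribute
features = [("30s", "F"), ("30s", "F"), ("30s", "M"), ("40s", "M"), ("40s", "M")]
sensitive = ["flu", "flu", "cold", "cold", "cold"]
print([record_influence(features, sensitive, i) for i in range(len(features))])
```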
With the growth of the Internet into many different aspects of life, users are required to share private information more than ever. Hence, users need a privacy management tool that can enforce complex and customized privacy policies. In this paper, we propose a privacy management system that not only allows users to define complex privacy policies for data sharing actions, but also monitors users' behavior and relationships to generate realistic policies. In addition, the proposed system uses a formal modeling and model-checking approach to prove that information disclosures are valid and that privacy policies are consistent with one another.
We present CARiSMA, a tool that was originally designed to support model-based security analysis of IT systems. In our recent work, we added several new functionalities to CARiSMA to support the privacy of personal data. Moreover, we introduced a mechanism to assist system designers in performing a CARiSMA analysis by automatically initializing an appropriate CARiSMA analysis concerning security and privacy requirements. The motivation for our work is Article 25 of Regulation (EU) 2016/679, which requires that appropriate technical and organizational controls be implemented to ensure that, by default, the processing of personal data complies with the principles on processing of personal data. This implies that IT systems must initially be analyzed to verify that such principles are respected. System models allow system developers to handle the complexity of systems and to focus on key aspects such as privacy and security. CARiSMA is available at http://carisma.umlsec.de and our screencast at https://youtu.be/b5zeHig3ARw.
Many enterprises are transitioning towards data-driven business processes. There are numerous situations where multiple parties would like to share data towards a common goal if it were possible to simultaneously protect the privacy and security of the individuals and organizations described in the data. Existing solutions for multi-party analytics that follow the so-called Data Lake paradigm have parties transfer their raw data to a trusted third party (i.e., a mediator), which then performs the desired analysis on the global data and shares the results with the parties. However, such a solution does not fit many applications, such as healthcare, finance, and the Internet of Things, where privacy is a strong concern. Motivated by the increasing demands for data privacy, we study the problem of privacy-preserving multi-party data analytics, where the goal is to enable analytics on multi-party data without compromising the data privacy of each individual party. In this paper, we first propose a secure sum protocol with strong security guarantees. The proposed secure sum protocol is resistant to collusion attacks even with N-2 parties colluding, where N denotes the total number of collaborating parties. We then use this protocol to propose two secure gradient descent algorithms, one for horizontally partitioned data and the other for vertically partitioned data. The proposed framework is generic and applies to a wide class of machine learning problems. We demonstrate our solution for two popular use cases, regression and classification, and evaluate the performance of the proposed solution in terms of the obtained model accuracy, latency, and communication cost. In addition, we perform a scalability analysis to evaluate the performance of the proposed solution as the data size and the number of parties increase.
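For context, a standard way to build an N-party secure sum that tolerates up to N-2 colluding parties is additive secret sharing over a public modulus; the sketch below illustrates that classic construction and is not necessarily the protocol proposed in the paper:

```python
import random

MODULUS = 2**61 - 1  # arithmetic is done modulo a large public constant

def make_shares(value, n):
    """Split `value` into n additive shares that sum to `value` mod MODULUS."""
    shares = [random.randrange(MODULUS) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def secure_sum(private_values):
    """Simulate the protocol: party i sends one share of its value to each party,
    every party publishes only the sum of the shares it received, and the global
    total is recovered from those published partial sums."""
    n = len(private_values)
    # share_matrix[i][j] is the share party i sends to party j
    share_matrix = [make_shares(v, n) for v in private_values]
    partial_sums = [sum(share_matrix[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    return sum(partial_sums) % MODULUS

values = [12, 7, 30, 5]
assert secure_sum(values) == sum(values) % MODULUS
```

With this construction, N-1 colluders can combine the shares they sent to and received from the single honest party to recover its input, which is why the collusion-resistance threshold is N-2.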
Deep Learning has recently become hugely popular in machine learning for its ability to realize end-to-end learning systems, in which the features and the classifiers are learned simultaneously, providing significant improvements in classification accuracy in the presence of highly structured and large databases. Its success is due to a combination of recent algorithmic breakthroughs, increasingly powerful computers, and access to significant amounts of data. Researchers have also considered the privacy implications of deep learning. Models are typically trained in a centralized manner, with all the data being processed by the same training algorithm. If the data is a collection of users' private data, including habits, personal pictures, geographical positions, interests, and more, the centralized server will have access to sensitive information that could potentially be mishandled. To tackle this problem, collaborative deep learning models have recently been proposed where parties locally train their deep learning structures and only share a subset of the parameters in an attempt to keep their respective training sets private. Parameters can also be obfuscated via differential privacy (DP) to make information extraction even more challenging, as proposed by Shokri and Shmatikov at CCS'15. Unfortunately, we show that any privacy-preserving collaborative deep learning is susceptible to a powerful attack that we devise in this paper. In particular, we show that a distributed, federated, or decentralized deep learning approach is fundamentally broken and does not protect the training sets of honest participants. The attack we developed exploits the real-time nature of the learning process, which allows the adversary to train a Generative Adversarial Network (GAN) that generates prototypical samples of the targeted training set that was meant to be private (the samples generated by the GAN are intended to come from the same distribution as the training data). Interestingly, we show that record-level differential privacy applied to the shared parameters of the model, as suggested in previous work, is ineffective (i.e., record-level DP is not designed to address our attack).
This paper presents an approach to formalizing and enforcing a class of use privacy properties in data-driven systems. In contrast to prior work, we focus on use restrictions on proxies (i.e., strong predictors) of protected information types. Our definition relates proxy use to intermediate computations that occur in a program, and identifies two essential properties that characterize this behavior: 1) its result is strongly associated with the protected information type in question, and 2) it is likely to causally affect the final output of the program. For a specific instantiation of this definition, we present a program analysis technique that detects instances of proxy use in a model and provides a witness that identifies which parts of the corresponding program exhibit the behavior. Recognizing that not all instances of proxy use of a protected information type are inappropriate, we make use of a normative judgment oracle that makes this inappropriateness determination for a given witness. Our repair algorithm uses the witness of an inappropriate proxy use to transform the model into one that provably does not exhibit proxy use, while avoiding changes that unduly affect classification accuracy. Using a corpus of social datasets, our evaluation shows that these algorithms are able to detect proxy use instances that would be difficult to find using existing techniques, and subsequently remove them while maintaining acceptable classification performance.
User privacy is an important issue in blockchain-based transaction systems. Bitcoin, one of the most widely used blockchain-based transaction systems, fails to provide enough protection of users' privacy. Many subsequent studies focus on establishing a system that hides the linkage between the identities (pseudonyms) of users and the transactions they carry out in order to provide a high level of anonymity; examples include Zerocoin and Zerocash. It thus becomes an interesting question whether such new transaction systems do provide enough protection of users' privacy. In this paper, we propose a novel and effective approach for de-anonymizing these transaction systems by leveraging information in the system that is not directly related to identities, including the number of transactions made by each identity and the time stamps of sending and receiving. Combining probabilistic analysis with optimization tools, we establish a model which allows us to determine, among all possible ways of linking transactions and identities, the one that is most likely to be true. Subsequent transaction graph analysis can then be carried out, leading to the de-anonymization of the system. To solve the model, we provide exact algorithms based on mixed integer linear programming. Our research also establishes interesting relationships between the de-anonymization problem and other problems studied in the theoretical computer science literature, e.g., the graph matching problem and the scheduling problem.
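The linkage idea can be illustrated with a toy maximum-likelihood assignment between pseudonyms and identities. The sketch below brute-forces permutations and assumes a Poisson model of per-identity transaction counts purely for illustration; the paper formulates the real problem as a mixed integer linear program over richer features such as time stamps:

```python
import itertools
import math

def poisson_log_likelihood(count, rate):
    """Exact log pmf of Poisson(rate) at `count`."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)

def best_linkage(observed_counts, candidate_rates):
    """Brute-force the identity-to-pseudonym assignment maximizing the joint likelihood."""
    best, best_ll = None, float("-inf")
    for perm in itertools.permutations(range(len(candidate_rates))):
        ll = sum(poisson_log_likelihood(observed_counts[j], candidate_rates[i])
                 for j, i in enumerate(perm))
        if ll > best_ll:
            best, best_ll = perm, ll
    return best, best_ll

# pseudonyms made 3, 12, and 7 transactions; identities have expected rates 10, 2, 8
assignment, ll = best_linkage([3, 12, 7], [10.0, 2.0, 8.0])
print(assignment)  # (1, 0, 2): pseudonym 0 -> identity 1, pseudonym 1 -> identity 0, ...
```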
Personalization, recommendations, and user modeling can be powerful tools to improve people's experiences with technology and to help them find information. However, we also know that people underestimate how much of their personal information is used by our technology and they generally do not understand how much algorithms can discover about them. Both privacy and ethical technology have issues of consent at their heart. This talk will look at how to consider issues of privacy and consent when users cannot explicitly state their preferences, The Creepy Factor, and how to balance users' concerns with the benefits personalized technology can offer.
The security of Wireless Sensor Networks (WSNs) is vital in several applications, such as the tracking and monitoring of endangered species like pandas in a national park or of soldiers on a battlefield. This kind of application requires anonymity of the source, known as Source Location Privacy (SLP). The main aim is to prevent an adversary from tracing a real event back to its originator by analyzing the network traffic. Previous techniques such as Dummy Uniform Distribution (DUD), Dummy Adaptive Distribution (DAD), and Controlled Dummy Adaptive Distribution (CAD) have achieved high anonymity. However, these techniques increase the overall overhead of the network. To overcome this shortcoming, a new technique is presented: Exponential Dummy Adaptive Distribution (EDAD). In this technique, an exponential distribution is used instead of the uniform distribution to reduce the overhead without sacrificing the anonymity of the source. The exponential distribution improves the lifetime of the network since it decreases the number of transmitted packets within the network. It is straightforward and easy to implement because it has only one parameter, $λ$, that controls the transmission rate of the network nodes. The adversary model considered is global: it has a full view of the network and is able to perform sophisticated attacks such as rate monitoring and time correlation. The simulation results show that the proposed technique provides less overhead and high anonymity with reasonable delay and delivery ratio. Three different analysis models are developed to validate our technique: a visualization model, a neural network model, and a steganography model.
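A minimal sketch of the core idea, assuming a single rate parameter λ and ignoring the paper's adaptive control logic: a node draws dummy-packet send times with exponentially distributed gaps, which injects fewer dummy packets than a dense uniform schedule while still masking real traffic:

```python
import random

LAMBDA = 0.5           # assumed rate parameter controlling dummy traffic
SIMULATION_TIME = 100  # seconds

def dummy_schedule(lam, horizon):
    """Generate dummy-packet send times with exponentially distributed gaps."""
    t, times = 0.0, []
    while True:
        t += random.expovariate(lam)   # next inter-dummy gap
        if t > horizon:
            return times
        times.append(t)

schedule = dummy_schedule(LAMBDA, SIMULATION_TIME)
print(f"{len(schedule)} dummy packets in {SIMULATION_TIME}s "
      f"(expected about {LAMBDA * SIMULATION_TIME:.0f})")
```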
Generating synthetic data is a well-known option to limit disclosure risk in sensitive data releases. The usual approach is to build a model for the population and then generate a synthetic data set solely based on the model. We argue that building an accurate population model is difficult, and we propose instead to approximate the original data as closely as privacy constraints permit. To enforce an ex ante privacy level when generating synthetic data, we introduce a new privacy model called $ε$-synthetic privacy. Then, we describe a synthetic data generation method that satisfies $ε$-synthetic privacy. Finally, we evaluate the utility of the synthetic data generated with our method.
The main issue with big data in the cloud is that it always has to be processed or used by a third party. It is very important for the owners of the data, or clients, to trust and to have a guarantee of privacy for the information stored in the cloud or analyzed as big data. The privacy models studied in previous research showed that privacy infringement for big data happens because of limitations, the privacy guarantee rate, or the dissemination of accurate data obtainable in the data set. In addition, there are various privacy models. In order to determine the best and most appropriate model to be applied in the future, one that also guarantees big data privacy, further research and study are necessary. In what follows, we survey some of the privacy models in order to determine the advantages and disadvantages of each model for privacy assurance for big data in the cloud. The present study also proposes a combined Diff-Anonym algorithm (k-anonymity and differential privacy models) to provide data anonymity while keeping a balance between the ambiguity of private data and the clarity of general data.
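As a loose illustration of how the two ingredients can be combined (not the Diff-Anonym algorithm itself), one can first check k-anonymity over the quasi-identifiers and then release aggregate statistics perturbed with Laplace noise calibrated to a differential privacy budget:

```python
import math
import random
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values appears in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(size >= k for size in groups.values())

def laplace(scale):
    """Sample from Laplace(0, scale) via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Release a count with Laplace noise calibrated to epsilon-differential privacy."""
    return true_count + laplace(sensitivity / epsilon)

records = [{"zip": "130**", "age": "20-30", "disease": "flu"},
           {"zip": "130**", "age": "20-30", "disease": "cold"},
           {"zip": "130**", "age": "20-30", "disease": "flu"}]

if is_k_anonymous(records, ["zip", "age"], k=3):
    flu_count = sum(r["disease"] == "flu" for r in records)
    print("noisy flu count:", dp_count(flu_count, epsilon=0.5))
```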
Privacy preservation in data publication has been an important research field over the past few decades. One of the fundamental challenges in privacy-preserving data publication is the trade-off between privacy and utility of a single, independent data set. However, recent research has shown that an advanced privacy mechanism, namely differential privacy, is vulnerable when multiple data sets are correlated. In this case, the trade-off problem between privacy and utility evolves into a game problem, in which the payoff of each player depends not only on his own privacy parameter, but also on his neighbors' privacy parameters. In this paper, we first present the definition of correlated differential privacy to evaluate the real privacy level of a single data set influenced by the other data sets. Then, we construct a game model of multiple players, each of whom publishes a data set sanitized by differential privacy. Next, we analyze the existence and uniqueness of the pure Nash Equilibrium and demonstrate sufficient conditions for it in the game. Finally, we use the notion of the price of anarchy to evaluate the efficiency of the pure Nash Equilibrium.
Smart Spaces are composed of heterogeneous sensors and devices that collect and share information. This information may contain personal information about the users. Thus, securing the data and preserving privacy are of paramount importance. In this paper, we propose techniques for information security and privacy protection in Smart Spaces based on the Smart-M3 platform. We propose a) a security framework and b) a context-aware role-based access control scheme. We model our access control scheme using ontological techniques and the Web Ontology Language (OWL), and implement it via CLIPS rules. To evaluate the efficiency of our access control scheme, we measure the time it takes to check the access rights of access requests. The results demonstrate that the highest response time is approximately 0.2 seconds for a set of 100,000 triples. We conclude that the proposed access control scheme produces low overhead and is therefore an efficient approach for Smart Spaces.
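The paper implements its scheme with OWL ontologies and CLIPS rules; the hypothetical snippet below only illustrates the general shape of a context-aware role-based access check, with made-up roles and context attributes:

```python
# hypothetical policy store: role, resource, action, plus a context predicate
policies = [
    {"role": "nurse", "resource": "vitals", "action": "read",
     "context": lambda ctx: ctx.get("location") == "ward"},
    {"role": "doctor", "resource": "vitals", "action": "write",
     "context": lambda ctx: ctx.get("shift") == "on-duty"},
]

def allowed(role, resource, action, ctx):
    """Grant access only if some policy matches the request and its context predicate holds."""
    return any(p["role"] == role and p["resource"] == resource
               and p["action"] == action and p["context"](ctx)
               for p in policies)

print(allowed("nurse", "vitals", "read", {"location": "ward"}))   # True
print(allowed("nurse", "vitals", "read", {"location": "home"}))   # False
```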
We introduce a Markov-model-based framework for Moving Target Defense (MTD) analysis. The framework allows modeling of a broad range of MTD strategies, provides general theorems about how the probability of a successful adversary defeating an MTD strategy is related to the amount of time/cost spent by the adversary, and shows how a multilevel composition of MTD strategies can be analyzed by a straightforward combination of the analyses of each of these strategies. Within the proposed framework we define the concept of security capacity, which measures the strength or effectiveness of an MTD strategy: the security capacity depends on MTD-specific parameters and more general system parameters. We apply our framework to two concrete MTD strategies.
We study the value of data privacy in a game-theoretic model of trading private data, where a data collector purchases private data from strategic data subjects (individuals) through an incentive mechanism. The private data of each individual represents her knowledge about an underlying state, which is the information that the data collector desires to learn. Unlike most of the existing work on privacy-aware surveys, our model does not assume the data collector to be trustworthy. Instead, an individual takes full control of her own data privacy and reports only a privacy-preserving version of her data. In this paper, the value of ε units of privacy is measured by the minimum payment over all nonnegative payment mechanisms under which an individual's best response at a Nash equilibrium is to report the data with a privacy level of ε. The higher ε is, the less private the reported data is. We derive lower and upper bounds on the value of privacy which are asymptotically tight as the number of data subjects becomes large. Specifically, the lower bound assures that it is impossible to buy ε units of privacy with a smaller payment, and the upper bound is given by an achievable payment mechanism that we design. Based on these fundamental limits, we further derive lower and upper bounds on the minimum total payment for the data collector to achieve a given learning accuracy target, and show that the total payment of the designed mechanism is at most one individual's payment away from the minimum.
Maintaining and updating signature databases is a tedious task that normally requires a large amount of user effort. The problem becomes harder when features can be distorted by observation noise, which we call volatility. To address this issue, we propose algorithms and models to automatically generate signatures in the presence of noise, with a focus on stack fingerprinting, which is a research area that aims to discover the operating system (OS) of remote hosts using TCP/IP packets. Armed with this framework, we construct a database with 420 network stacks, label the signatures, develop a robust classifier for this database, and fingerprint 66M visible webservers on the Internet.
In this paper, we introduce Entropy/IP: a system that discovers Internet address structure based on analyses of a subset of IPv6 addresses known to be active, i.e., training data, gleaned by readily available passive and active means. The system is completely automated and employs a combination of information-theoretic and machine learning techniques to probabilistically model IPv6 addresses. We present results showing that our system is effective in exposing structural characteristics of portions of the active IPv6 Internet address space, populated by clients, services, and routers. In addition to visualizing the address structure for exploration, the system uses its models to generate candidate addresses for scanning. For each of 15 evaluated datasets, we train on 1K addresses and generate 1M candidates for scanning. We achieve some success in 14 datasets, finding up to 40% of the generated addresses to be active. In 11 of these datasets, we find active network identifiers (e.g., /64 prefixes or "subnets") not seen in training. Thus, we provide the first evidence that it is practical to discover subnets and hosts by scanning probabilistically selected areas of the IPv6 address space not known to contain active hosts a priori.
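The entropy-profiling step at the heart of such an approach can be sketched as follows (per-nybble Shannon entropy over a set of known-active IPv6 addresses); the full system additionally learns a probabilistic model over the address structure in order to generate candidate addresses:

```python
import ipaddress
import math
from collections import Counter

def nybble_entropies(addresses):
    """Shannon entropy (bits) of each of the 32 hex nybble positions across the addresses."""
    expanded = [ipaddress.IPv6Address(a).exploded.replace(":", "") for a in addresses]
    entropies = []
    for position in range(32):
        counts = Counter(addr[position] for addr in expanded)
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

sample = ["2001:db8::1", "2001:db8::2:5", "2001:db8:0:1::9"]
print([round(h, 2) for h in nybble_entropies(sample)])
```

Low-entropy positions reveal fixed structure (e.g., a common prefix), while high-entropy positions mark the parts of the address space worth exploring when generating scan candidates.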
We introduce a model for differentially private analysis of weighted graphs in which the graph topology (V, E) is assumed to be public and the private information consists only of the edge weights w : E → ℝ+. This can express, for example, hiding congestion patterns in a known system of roads. Differential privacy requires that the output of an algorithm provides little advantage, measured by privacy parameters ε and δ, for distinguishing between neighboring inputs, which are thought of as inputs that differ on the contribution of one individual. In our model, two weight functions w, w' are considered to be neighboring if they have ℓ1 distance at most one. We study the problems of privately releasing a short path between a pair of vertices and of privately releasing approximate distances between all pairs of vertices. We are concerned with the approximation error, the difference between the length of the released path or released distance and the length of the shortest path or actual distance. For the problem of privately releasing a short path between a pair of vertices, we prove a lower bound of Ω(|V|) on the additive approximation error for fixed privacy parameters ε, δ. We provide a differentially private algorithm that matches this error bound up to a logarithmic factor and releases paths between all pairs of vertices, not just a single pair. The approximation error achieved by our algorithm can be bounded by the number of edges on the shortest path, so we achieve better accuracy than the worst-case bound for pairs of vertices that are connected by a low-weight path consisting of o(|V|) vertices. For the problem of privately releasing all-pairs distances, we show that for trees we can release all-pairs distances with approximation error O(log^2.5 |V|) for fixed privacy parameters. For arbitrary bounded-weight graphs with edge weights in [0, M] we can release all distances with approximation error Õ(√(|V|·M)).
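A naive baseline consistent with this privacy model (not the paper's algorithm, whose error is much smaller for short paths) is to perturb every edge weight with Laplace(1/ε) noise, which satisfies ε-differential privacy because neighboring weight functions differ by ℓ1 distance at most one, and then answer shortest-path queries on the noisy weights:

```python
import heapq
import math
import random

def laplace(scale):
    """Sample from Laplace(0, scale)."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_weights(weights, epsilon):
    """Laplace mechanism on the weight vector (l1 sensitivity 1); clipping is post-processing."""
    return {edge: max(0.0, w + laplace(1.0 / epsilon)) for edge, w in weights.items()}

def dijkstra(nodes, weights, source):
    """Shortest-path distances from `source` on an undirected weighted graph."""
    dist = {v: math.inf for v in nodes}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for (a, b), w in weights.items():
            if u in (a, b):
                v = b if u == a else a
                if d + w < dist[v]:
                    dist[v] = d + w
                    heapq.heappush(heap, (dist[v], v))
    return dist

nodes = ["s", "a", "b", "t"]
weights = {("s", "a"): 1.0, ("a", "t"): 1.0, ("s", "b"): 2.5, ("b", "t"): 0.2}
print(dijkstra(nodes, noisy_weights(weights, epsilon=1.0), "s"))
```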
Randomness is a vital resource for modern-day information processing, especially for cryptography. A wide range of applications critically rely on abundant, high-quality random numbers generated securely. Here, we show how to expand a random seed at an exponential rate without trusting the underlying quantum devices. Our approach is secure against the most general adversaries, and has the following new features: cryptographic level of security, tolerating a constant level of imprecision in devices, requiring only unit size quantum memory (for each device component) in an honest implementation, and allowing a large natural class of constructions for the protocol. In conjunction with a recent work by Chung et al. [2014], it also leads to robust unbounded expansion using just 2 multipart devices. When adapted for distributing cryptographic keys, our method achieves, for the first time, exponential expansion combined with cryptographic security and noise tolerance. The proof proceeds by showing that the Rényi divergence of the outputs of the protocol (for a specific bounding operator) decreases linearly as the protocol iterates. At the heart of the proof are a new uncertainty principle on quantum measurements and a method for simulating trusted measurements with untrusted devices.
Tracking and maintaining satisfactory QoE for video streaming services is becoming a greater challenge for mobile network operators than ever before. Downloading and watching video content on mobile devices is currently a growing trend among users, which is causing a demand for higher bandwidth and better provisioning throughout the network infrastructure. At the same time, popular demand for privacy has led many online streaming services to adopt end-to-end encryption, leaving providers with only a handful of indicators for identifying QoE issues. In order to address these challenges, we propose a novel methodology for detecting video streaming QoE issues from encrypted traffic. We develop predictive models for detecting different levels of QoE degradation caused by three key influence factors, i.e., stalling, the average video quality, and the quality variations. The models are then evaluated on the production network of a large-scale mobile operator, where we show that despite encryption our methodology is able to accurately detect QoE problems with 72%-92% accuracy, while even higher performance is achieved when dealing with cleartext traffic.
Given a set D of tuples defined on a domain Ω, we study differentially private algorithms for constructing a histogram over Ω to approximate the tuple distribution in D. Existing solutions for the problem mostly adopt a hierarchical decomposition approach, which recursively splits Ω into sub-domains and computes a noisy tuple count for each sub-domain, until all noisy counts are below a certain threshold. This approach, however, requires that we (i) impose a limit h on the recursion depth in the splitting of Ω and (ii) set the noise in each count to be proportional to h. The choice of h is a serious dilemma: a small h makes the resulting histogram too coarse-grained, while a large h leads to excessive noise in the tuple counts used in deciding whether sub-domains should be split. Furthermore, h cannot be directly tuned based on D; otherwise, the choice of h itself reveals private information and violates differential privacy. To remedy the deficiency of existing solutions, we present PrivTree, a histogram construction algorithm that adopts hierarchical decomposition but completely eliminates the dependency on a pre-defined h. The core of PrivTree is a novel mechanism that (i) exploits a new analysis on the Laplace distribution and (ii) enables us to use only a constant amount of noise in deciding whether a sub-domain should be split, without worrying about the recursion depth of splitting. We demonstrate the application of PrivTree in modelling spatial data, and show that it can be extended to handle sequence data (where the decision in sub-domain splitting is not based on tuple counts but on a more sophisticated measure). Our experiments on a variety of real datasets show that PrivTree considerably outperforms the state of the art in terms of data utility.
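For reference, the fixed-depth baseline criticized here can be sketched as follows: each level spends ε/h of the budget, so the per-count Laplace noise scale grows proportionally to the depth limit h. This is the conventional approach that PrivTree eliminates, not PrivTree itself:

```python
import math
import random

def laplace(scale):
    """Sample from Laplace(0, scale)."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_histogram(points, lo, hi, epsilon, max_depth, threshold, depth=0):
    """Recursively split [lo, hi) while the noisy count stays above `threshold`,
    spending epsilon/max_depth of the budget per level (noise proportional to h)."""
    count = sum(lo <= p < hi for p in points)
    noisy = count + laplace(max_depth / epsilon)
    if depth >= max_depth or noisy <= threshold:
        return [(lo, hi, noisy)]
    mid = (lo + hi) / 2.0
    return (noisy_histogram(points, lo, mid, epsilon, max_depth, threshold, depth + 1)
            + noisy_histogram(points, mid, hi, epsilon, max_depth, threshold, depth + 1))

data = [random.gauss(0.3, 0.05) for _ in range(1000)]
for lo, hi, c in noisy_histogram(data, 0.0, 1.0, epsilon=1.0, max_depth=6, threshold=20):
    print(f"[{lo:.3f}, {hi:.3f}) ~ {c:.1f}")
```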
Anonymous messaging applications have recently gained popularity as a means for sharing opinions without fear of judgment or repercussion. These messages propagate anonymously over a network, typically defined by social connections or physical proximity. However, recent advances in rumor source detection show that the source of such an anonymous message can be inferred by certain statistical inference attacks. Adaptive diffusion was recently proposed as a solution that achieves optimal source obfuscation over regular trees. However, in real social networks, the degrees differ from node to node, and adaptive diffusion can be significantly sub-optimal. This gap increases as the degrees become more irregular.
In order to quantify this gap, we model the underlying network as coming from standard branching processes with i.i.d. degree distributions. Building upon the analysis techniques from branching processes, we give an analytical characterization of the dependence of the probability of detection achieved by adaptive diffusion on the degree distribution. Further, this analysis provides a key insight: passing a rumor to a friend who has many friends makes the source more ambiguous. This leads to a new family of protocols that we call Preferential Attachment Adaptive Diffusion (PAAD). When messages are propagated according to PAAD, we give both the MAP estimator for finding the source and also an analysis of the probability of detection achieved by this adversary. The analytical results are not directly comparable, since the adversary's observed information has a different distribution under adaptive diffusion than under PAAD. Instead, we present results from numerical experiments that suggest that PAAD achieves a lower probability of detection, at the cost of increased communication for coordination.