Biblio
We present a novel, use-case-agnostic method for identifying and mitigating private data exposure in distributed, high-dimensional data repositories. Examples of such repositories include medical research and treatment data, where records often carry more than 300 descriptive attributes. Providing strong guarantees of data anonymity in these repositories is therefore a hard requirement for adhering to privacy legislation. Yet, when applied to distributed high-dimensional data, existing anonymisation algorithms incur high levels of information loss and do not guarantee privacy, defeating the purpose of anonymisation. In this paper, we address this issue by using Bayesian networks to handle data transformation for anonymisation. We evaluate every attribute combination to determine its privacy exposure risk by computing the conditional probability linking each attribute pair. Pairs with a high conditional probability pose a de-anonymisation risk similar to quasi-identifiers and can be separated instead of deleted, as in previous algorithms. Attribute separation removes the risk of privacy exposure, and avoiding deletion yields a significant reduction in information loss. In other words, assimilating the conditional probability of outliers directly into the adjacency matrix in a greedy fashion is fast and thwarts de-anonymisation. Since identifying every privacy-violating attribute combination is a W[2]-complete problem, we optimise the procedure with a multigrid solver method: we evaluate the conditional probabilities between attribute pairs and tame the resulting state-space explosion through manifold learning. Finally, new data is processed incrementally through inexpensive, continuous (delta) learning.
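To make the pairwise risk check concrete, the following is a minimal sketch, not the authors' implementation: it estimates, for every attribute pair (A, B) of a tabular dataset, how strongly a value of A determines the value of B, and flags pairs above a chosen threshold as quasi-identifier-like risks that could be separated. The column names, the threshold, and the use of pandas are illustrative assumptions.

```python
# Sketch only: flag attribute pairs whose conditional probability P(B | A)
# is high enough that A could be used to re-identify B. Threshold and data
# are illustrative, not taken from the paper.
from itertools import permutations

import pandas as pd


def risky_attribute_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return attribute pairs (A, B) where knowing A almost determines B."""
    risky = []
    for a, b in permutations(df.columns, 2):
        # For each value of A, take the relative frequency of the most
        # frequent B value; the maximum over A's values approximates the
        # worst-case conditional probability P(B | A).
        cond = (
            df.groupby(a)[b]
            .apply(lambda s: s.value_counts(normalize=True).iloc[0])
        )
        if cond.max() >= threshold:
            risky.append((a, b, float(cond.max())))
    return risky


if __name__ == "__main__":
    data = pd.DataFrame(
        {
            "zip": ["10115", "10115", "20095", "20095"],
            "diagnosis": ["flu", "flu", "asthma", "asthma"],
            "age": [34, 35, 52, 53],
        }
    )
    for a, b, p in risky_attribute_pairs(data):
        print(f"P({b} | {a}) reaches {p:.2f} -> consider separating {a} and {b}")
```

In the paper's terms, such flagged pairs would be separated into different partitions rather than deleted, which is where the reduction in information loss comes from.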
Inclusion dependencies form one of the most fundamental classes of integrity constraints. Their importance in classical data management is reinforced by modern applications such as data profiling, data cleaning, entity resolution, and schema matching. Discovering them in an unknown dataset is at the core of any data analysis effort. Several research approaches have therefore focused on their efficient discovery in a given, static dataset. However, none of these approaches is appropriate for dynamic datasets, such as transactional data, scientific applications, and social networks. In these cases, discovery techniques should be able to update the inclusion dependencies efficiently after each change to the dataset, without reprocessing the entire dataset. We present the first approach for incrementally updating unary inclusion dependencies. In particular, our approach is based on the concept of attribute clustering, from which the unary inclusion dependencies are efficiently derivable. We incrementally update the clusters after each update of the dataset. Updating the clusters does not require access to the dataset, thanks to special data structures designed to support the updating process efficiently. We analyse our approach exhaustively by applying it to large datasets with several hundred attributes and more than 116,200,000 tuples. The results show that incremental discovery significantly reduces the runtime needed by static discovery; the reduction is up to 99.9996 % for both inserts and deletes.
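The idea of deriving unary inclusion dependencies from attribute clusters can be illustrated with a small sketch, assuming the following simplified reading: a cluster is the set of attributes in which a given value occurs, and A ⊆ B holds exactly when B is a member of every cluster that contains A. The class and method names are illustrative, not the paper's data structures.

```python
# Sketch only: maintain value -> attribute clusters under inserts and derive
# unary inclusion dependencies from them, without re-reading the dataset.
from collections import defaultdict


class IncrementalINDs:
    def __init__(self):
        # value -> cluster of attributes in which this value occurs
        self.clusters: dict[object, set[str]] = defaultdict(set)

    def insert(self, attribute: str, value) -> None:
        """Process one inserted cell; only the cluster of `value` is touched."""
        self.clusters[value].add(attribute)

    def unary_inds(self) -> set[tuple[str, str]]:
        """Return all pairs (A, B) with A ⊆ B over the values seen so far."""
        if not self.clusters:
            return set()
        attributes = set().union(*self.clusters.values())
        inds = set()
        for a in attributes:
            candidates = attributes - {a}
            for cluster in self.clusters.values():
                if a in cluster:
                    # B must co-occur with A in every cluster containing A.
                    candidates &= cluster
            inds.update((a, b) for b in candidates)
        return inds


if __name__ == "__main__":
    state = IncrementalINDs()
    for attr, val in [("A", 1), ("A", 2), ("B", 1), ("B", 2), ("B", 3)]:
        state.insert(attr, val)
    print(state.unary_inds())  # {('A', 'B')}: every value of A also occurs in B
```

Handling deletes, as the paper does, would additionally require per-value occurrence counts so that an attribute is only removed from a cluster when its last occurrence of that value disappears; the sketch omits this.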
Identity data leaks on the Internet are more relevant than ever. Almost every month, the news reports the leakage of a database with more than a million users; smaller but no less dangerous leaks occur several times a day. The public availability of such leaked data is a major threat to the victims, but it also creates the opportunity to learn not only about the security of service providers but also about how users choose passwords. Our goal is to analyse this data and generate knowledge that can be used to increase security awareness and, in turn, security itself. This paper presents a novel approach to the automatic analysis of the vast majority of leaks, large and small. Our contribution is the concept and a prototype implementation of a parser, composed of a syntactic and a semantic module, together with a data analyzer for identity leaks. In this context, we deal with two major challenges: the huge variety of formats and the recognition of unknown data types in the leaks. Based on the data collected, this paper shows how easy it is for criminals to gather large numbers of passwords that are stored in plain text or only weakly hashed.
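A minimal sketch of the two-stage idea, under the assumption that the syntactic module splits a leak line on common delimiters and the semantic module guesses the data type of each field. The delimiters, regular expressions, and type labels are illustrative and are not taken from the prototype described in the paper.

```python
# Sketch only: split a leak line syntactically, then assign a best-guess
# semantic type (e-mail address, hexadecimal hash, or plain-text field).
import re

EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
HEX_HASH = re.compile(r"^[0-9a-fA-F]{32}$|^[0-9a-fA-F]{40}$|^[0-9a-fA-F]{64}$")


def parse_leak_line(line: str) -> list[tuple[str, str]]:
    # Syntactic module: try common delimiters until the line splits into fields.
    for delimiter in (";", ":", ",", "\t"):
        fields = line.strip().split(delimiter)
        if len(fields) > 1:
            break
    else:
        fields = [line.strip()]

    # Semantic module: assign a best-guess type to every field.
    typed = []
    for field in fields:
        if EMAIL.match(field):
            typed.append(("email", field))
        elif HEX_HASH.match(field):
            typed.append(("hash", field))       # length suggests MD5/SHA-1/SHA-256
        else:
            typed.append(("plaintext", field))  # e.g. an unhashed password
    return typed


if __name__ == "__main__":
    print(parse_leak_line("alice@example.com:secret123"))
    print(parse_leak_line("bob@example.com;5f4dcc3b5aa765d61d8327deb882cf99"))
```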