Identifying Data Exposure Across Distributed High-Dimensional Health Data Silos through Bayesian Networks Optimised by Multigrid and Manifold

Title: Identifying Data Exposure Across Distributed High-Dimensional Health Data Silos through Bayesian Networks Optimised by Multigrid and Manifold
Publication Type: Conference Paper
Year of Publication: 2019
Authors: Podlesny, Nikolai J.; Kayem, Anne V.D.M.; Meinel, Christoph
Conference Name: 2019 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech)
Date Published: Aug. 2019
Publisher: IEEE
ISBN Number: 978-1-7281-3024-8
Keywords: anonymisation algorithms, attribute combination, attribute pairs, attribute separation, Bayes methods, Bayesian networks, Bayesian networks optimised, belief networks, case agnostic method, Data Anonymisation, data anonymity, data deletion, Data models, data privacy, data transformation, Distributed databases, distributed high-dimensional data repositories, distributed high-dimensional health data silos, high conditional probability, identifying data exposure, information loss, learning (artificial intelligence), manifold, Manifolds, medical information systems, medical research, multigrid, multigrid solver method, privacy, privacy exposure risk, privacy legislation, private data exposure, Probabilistic logic, probability, pubcrawl, quasiidentifiers, Scalability, security of data, strong guarantees, thwarts de-anonymisation, treatment data
Abstract

We present a novel, use-case-agnostic method of identifying and circumventing private data exposure across distributed, high-dimensional data repositories. Examples of such repositories include medical research and treatment data, which often contain more than 300 describing attributes. As such, providing strong guarantees of data anonymity in these repositories is a hard constraint in adhering to privacy legislation. Yet, when applied to distributed high-dimensional data, existing anonymisation algorithms incur high levels of information loss and do not guarantee privacy, defeating the purpose of anonymisation. In this paper, we address this issue by using Bayesian networks to handle data transformation for anonymisation. To determine the privacy exposure risk, every attribute combination is evaluated and the conditional probability linking attribute pairs is computed. Pairs with a high conditional probability carry a risk of de-anonymisation similar to that of quasi-identifiers, and can be separated rather than deleted as in previous algorithms. Attribute separation removes the risk of privacy exposure, and avoiding deletion significantly reduces information loss. In other words, assimilating the conditional probability of outliers directly in the adjacency matrix in a greedy fashion is quick and thwarts de-anonymisation. Since identifying every privacy-violating attribute combination is a W[2]-complete problem, we optimise the procedure with a multigrid solver method that evaluates the conditional probabilities between attribute pairs, and we aggregate the state space explosion of attribute pairs through manifold learning. Finally, incremental processing of new data is achieved through inexpensive, continuous (delta) learning.
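The pairwise step mentioned in the abstract, computing a conditional probability for each attribute pair and flagging the pairs where it is high, can be illustrated with a minimal sketch. This is not the authors' implementation: it builds no Bayesian network and uses neither the multigrid solver nor manifold learning. The function names, the scoring rule (the expected value of max_y P(b = y | a = x)), the 0.9 threshold, and the sample records are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def conditional_table(rows, a, b):
    """Empirical P(b = y | a = x) for every observed value pair (x, y)."""
    joint = Counter((row[a], row[b]) for row in rows)
    marginal = Counter(row[a] for row in rows)
    return {(x, y): count / marginal[x] for (x, y), count in joint.items()}


def pair_score(rows, a, b):
    """Hypothetical score: expected value over x of max_y P(b = y | a = x).

    A score near 1.0 means that knowing attribute a almost determines b.
    """
    table = conditional_table(rows, a, b)
    marginal = Counter(row[a] for row in rows)
    best = defaultdict(float)
    for (x, _y), p in table.items():
        best[x] = max(best[x], p)
    n = len(rows)
    return sum(marginal[x] / n * best[x] for x in best)


def risky_pairs(rows, threshold=0.9):
    """Ordered attribute pairs (a, b) whose score reaches the threshold."""
    attributes = list(rows[0])
    return [
        (a, b)
        for a, b in permutations(attributes, 2)
        if pair_score(rows, a, b) >= threshold
    ]


# Invented sample records, not data from the paper.
sample = [
    {"zip": "10115", "diagnosis": "flu", "sex": "f"},
    {"zip": "10115", "diagnosis": "flu", "sex": "m"},
    {"zip": "14482", "diagnosis": "asthma", "sex": "f"},
    {"zip": "14482", "diagnosis": "asthma", "sex": "m"},
]

print(risky_pairs(sample))
```

In this toy data, `zip` and `diagnosis` determine each other, so both orderings of that pair are flagged, while pairs involving `sex` score 0.5 and are not. In the approach the abstract describes, such flagged pairs would be candidates for separation rather than deletion.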

URL: https://ieeexplore.ieee.org/document/8890399
DOI: 10.1109/DASC/PiCom/CBDCom/CyberSciTech.2019.00110
Citation Key: podlesny_identifying_2019