Identifying Data Exposure Across Distributed High-Dimensional Health Data Silos through Bayesian Networks Optimised by Multigrid and Manifold

Submitted by grigby1 on Fri, 07/10/2020 - 11:40am

Title	Identifying Data Exposure Across Distributed High-Dimensional Health Data Silos through Bayesian Networks Optimised by Multigrid and Manifold
Publication Type	Conference Paper
Year of Publication	2019
Authors	Podlesny, Nikolai J., Kayem, Anne V.D.M., Meinel, Christoph
Conference Name	2019 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech)
Date Published	Aug. 2019
Publisher	IEEE
ISBN Number	978-1-7281-3024-8
Keywords	anonymisation algorithms, attribute combination, attribute pairs, attribute separation, Bayes methods, Bayesian networks, Bayesian networks optimised, belief networks, case agnostic method, Data Anonymisation, data anonymity, data deletion, Data models, data privacy, data transformation, Distributed databases, distributed high-dimensional data repositories, distributed high-dimensional health data silos, high conditional probability, identifying data exposure, information loss, learning (artificial intelligence), manifold, Manifolds, medical information systems, medical research, multigrid, multigrid solver method, privacy, privacy exposure risk, privacy legislation, private data exposure, Probabilistic logic, probability, pubcrawl, quasi-identifiers, Scalability, security of data, strong guarantees, thwarts de-anonymisation, treatment data
Abstract	We present a novel, use-case-agnostic method of identifying and circumventing private data exposure across distributed and high-dimensional data repositories. Examples of distributed high-dimensional data repositories include medical research and treatment data, where oftentimes more than 300 describing attributes appear. As such, providing strong guarantees of data anonymity in these repositories is a hard constraint in adhering to privacy legislation. Yet, when applied to distributed high-dimensional data, existing anonymisation algorithms incur high levels of information loss and do not guarantee privacy, defeating the purpose of anonymisation. In this paper, we address this issue by using Bayesian networks to handle data transformation for anonymisation. By evaluating every attribute combination to determine the privacy exposure risk, the conditional probability linking attribute pairs is computed. Pairs with a high conditional probability expose the risk of de-anonymisation, similar to quasi-identifiers, and can be separated instead of deleted, as in previous algorithms. Attribute separation removes the risk of privacy exposure, and deletion avoidance results in a significant reduction in information loss. In other words, assimilating the conditional probability of outliers directly in the adjacency matrix in a greedy fashion is quick and thwarts de-anonymisation. Since identifying every privacy-violating attribute combination is a W[2]-complete problem, we optimise the procedure with a multigrid solver method by evaluating the conditional probabilities between attribute pairs, and aggregating state space explosion of attribute pairs through manifold learning. Finally, incremental processing of new data is achieved through inexpensive, continuous (delta) learning.
URL	https://ieeexplore.ieee.org/document/8890399
DOI	10.1109/DASC/PiCom/CBDCom/CyberSciTech.2019.00110
Citation Key	podlesny_identifying_2019
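
The abstract describes computing the conditional probability that links attribute pairs and treating pairs with a high conditional probability as a de-anonymisation risk. The following is only a minimal sketch of that counting step, not the authors' implementation: the function names, the `threshold` parameter, the rule of flagging a pair by its largest conditional probability, and the toy records are all assumptions made for illustration. It omits the Bayesian network, the multigrid solver, and the manifold learning described in the paper.

```python
from collections import Counter
from itertools import permutations


def conditional_probabilities(records, given, target):
    """Estimate P(target = t | given = g) by counting co-occurrences.

    Returns a dict keyed by (g, t). Hypothetical helper, not from the paper.
    """
    joint = Counter((r[given], r[target]) for r in records)
    marginal = Counter(r[given] for r in records)
    return {(g, t): n / marginal[g] for (g, t), n in joint.items()}


def high_probability_pairs(records, threshold=0.9):
    """List ordered attribute pairs whose largest conditional probability
    reaches `threshold` (an assumed, illustrative flagging rule)."""
    attributes = sorted(records[0])
    flagged = []
    for given, target in permutations(attributes, 2):
        table = conditional_probabilities(records, given, target)
        peak = max(table.values())
        if peak >= threshold:
            flagged.append((given, target, peak))
    return flagged


# Toy records invented for this sketch; they are not data from the paper.
records = [
    {"zip": "10115", "diagnosis": "flu", "sex": "f"},
    {"zip": "10115", "diagnosis": "flu", "sex": "m"},
    {"zip": "10117", "diagnosis": "asthma", "sex": "f"},
    {"zip": "10117", "diagnosis": "flu", "sex": "m"},
]

for given, target, peak in high_probability_pairs(records):
    print(f"{given} -> {target}: max conditional probability {peak:.2f}")
```

In this sketch a flagged pair would be a candidate for the attribute separation the abstract mentions, rather than deletion. The loop visits every ordered pair, so it grows quadratically with the number of attributes, and a single rare value can flag a pair on its own; a realistic version would need a minimum-support rule at least.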