Visible to the public Biblio

Filters: Keyword is data lineage  [Clear All Filters]
2020-07-10
Cai, Zhipeng, Miao, Dongjing, Li, Yingshu.  2019.  Deletion Propagation for Multiple Key Preserving Conjunctive Queries: Approximations and Complexity. 2019 IEEE 35th International Conference on Data Engineering (ICDE). :506—517.

This paper studies the deletion propagation problem in terms of minimizing view side-effect. It is a problem funda-mental to data lineage and quality management which could be a key step in analyzing view propagation and repairing data. The investigated problem is a variant of the standard deletion propagation problem, where given a source database D, a set of key preserving conjunctive queries Q, and the set of views V obtained by the queries in Q, we try to identify a set T of tuples from D whose elimination prevents all the tuples in a given set of deletions on views △V while preserving any other results. The complexity of this problem has been well studied for the case with only a single query. Dichotomies, even trichotomies, for different settings are developed. However, no results on multiple queries are given which is a more realistic case. We study the complexity and approximations of optimizing the side-effect on the views, i.e., find T to minimize the additional damage on V after removing all the tuples of △V. We focus on the class of key-preserving conjunctive queries which is a dichotomy for the single query case. It is surprising to find that except the single query case, this problem is NP-hard to approximate within any constant even for a non-trivial set of multiple project-free conjunctive queries in terms of view side-effect. The proposed algorithm shows that it can be approximated within a bound depending on the number of tuples of both V and △V. We identify a class of polynomial tractable inputs, and provide a dynamic programming algorithm to solve the problem. Besides data lineage, study on this problem could also provide important foundations for the computational issues in data repairing. Furthermore, we introduce some related applications of this problem, especially for query feedback based data cleaning.

2019-09-23
Yazici, I. M., Karabulut, E., Aktas, M. S..  2018.  A Data Provenance Visualization Approach. 2018 14th International Conference on Semantics, Knowledge and Grids (SKG). :84–91.
Data Provenance has created an emerging requirement for technologies that enable end users to access, evaluate, and act on the provenance of data in recent years. In the era of Big Data, the amount of data created by corporations around the world has grown each year. As an example, both in the Social Media and e-Science domains, data is growing at an unprecedented rate. As the data has grown rapidly, information on the origin and lifecycle of the data has also grown. In turn, this requires technologies that enable the clarification and interpretation of data through the use of data provenance. This study proposes methodologies towards the visualization of W3C-PROV-O Specification compatible provenance data. The visualizations are done by summarization and comparison of the data provenance. We facilitated the testing of these methodologies by providing a prototype, extending an existing open source visualization tool. We discuss the usability of the proposed methodologies with an experimental study; our initial results show that the proposed approach is usable, and its processing overhead is negligible.
2017-02-23
Y. Cao, J. Yang.  2015.  "Towards Making Systems Forget with Machine Unlearning". 2015 IEEE Symposium on Security and Privacy. :463-480.

Today's systems produce a rapidly exploding amount of data, and the data further derives more data, forming a complex data propagation network that we call the data's lineage. There are many reasons that users want systems to forget certain data including its lineage. From a privacy perspective, users who become concerned with new privacy risks of a system often want the system to forget their data and lineage. From a security perspective, if an attacker pollutes an anomaly detector by injecting manually crafted data into the training data set, the detector must forget the injected data to regain security. From a usability perspective, a user can remove noise and incorrect entries so that a recommendation engine gives useful recommendations. Therefore, we envision forgetting systems, capable of forgetting certain data and their lineages, completely and quickly. This paper focuses on making learning systems forget, the process of which we call machine unlearning, or simply unlearning. We present a general, efficient unlearning approach by transforming learning algorithms used by a system into a summation form. To forget a training data sample, our approach simply updates a small number of summations – asymptotically faster than retraining from scratch. Our approach is general, because the summation form is from the statistical query learning in which many machine learning algorithms can be implemented. Our approach also applies to all stages of machine learning, including feature selection and modeling. Our evaluation, on four diverse learning systems and real-world workloads, shows that our approach is general, effective, fast, and easy to use.