Biblio
Data cleaning is frequently an iterative process tailored to the requirements of a specific analysis task. The design and implementation of iterative data cleaning tools presents novel challenges, both technical and organizational, to the community. In this paper, we present results from a user survey (N = 29) of data analysts and infrastructure engineers from industry and academia. We highlight three important themes: (1) the iterative nature of data cleaning, (2) the lack of rigor in evaluating the correctness of data cleaning, and (3) the disconnect between the analysts who query the data and the infrastructure engineers who design the cleaning pipelines. We conclude by presenting a number of recommendations for future work in which we envision an interactive data cleaning system that accounts for the observed challenges.
We propose a new view on data cleaning: Not data itself but the degrees of uncertainty attributed to data are dirty. Applying possibility theory, tuples are assigned degrees of possibility with which they occur, and constraints are assigned degrees of certainty that say to which tuples they apply. Classical data cleaning modifies some minimal set of tuples. Instead, we marginally reduce their degrees of possibility. This reduction leads to a new qualitative version of the vertex cover problem. Qualitative vertex cover can be mapped to a linear-weighted constraint satisfaction problem. However, any off-the-shelf solver cannot solve the problem more efficiently than classical vertex cover. Instead, we utilize the degrees of possibility and certainty to develop a dedicated algorithm that is fixed parameter tractable in the size of the qualitative vertex cover. Experiments show that our algorithm is faster than solvers for the classical vertex cover problem by several orders of magnitude, and performance improves with higher numbers of uncertainty degrees.
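For illustration, a toy sketch of the "reduce possibility degrees instead of deleting tuples" idea described above: each constraint violation links two tuples and carries the certainty degree of the violated constraint, and a repair lowers one tuple's possibility just below that certainty. The integer degrees, the example violations, and the greedy victim choice are assumptions for illustration, not the paper's fixed-parameter-tractable algorithm.

    # Toy example: possibility degrees and constraint certainties are illustrative integers.
    possibility = {"t1": 3, "t2": 3, "t3": 2, "t4": 1}       # higher = more possible
    violations = [("t1", "t2", 3), ("t2", "t3", 2)]          # (tuple, tuple, certainty of violated constraint)

    def qualitative_repair(possibility, violations):
        poss = dict(possibility)
        for a, b, certainty in violations:
            if poss[a] < certainty or poss[b] < certainty:
                continue                                     # violation already resolved
            victim = min((a, b), key=lambda t: poss[t])      # greedy choice (illustrative, not the FPT algorithm)
            poss[victim] = certainty - 1                     # marginal reduction below the constraint's certainty
        return poss

    print(qualitative_repair(possibility, violations))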
An increasing number of applications in all aspects of society rely on data. Despite the long line of research in data cleaning and repairs, data correctness has been an elusive goal. Errors in the data can be extremely disruptive, and are detrimental to the effectiveness and proper function of data-driven applications. Even when data is cleaned, new errors can be introduced by applications and users who interact with the data. Subsequent valid updates can obscure these errors and propagate them through the dataset causing more discrepancies. Any discovered errors tend to be corrected superficially, on a case-by-case basis, further obscuring the true underlying cause, and making detection of the remaining errors harder. In this demo proposal, we outline the design of QFix, a query-centric framework that derives explanations and repairs for discrepancies in relational data based on potential errors in the queries that operated on the data. This is a marked departure from traditional data-centric techniques that directly fix the data. We then describe how users will use QFix in a demonstration scenario. Participants will be able to select from a number of transactional benchmarks, introduce errors into the queries that are executed, and compare the fixes to the queries proposed by QFix as well as existing alternative algorithms such as decision trees.
Poor data quality has become a persistent challenge for organizations as data continues to grow in complexity and size. Existing data cleaning solutions focus on identifying repairs to the data to minimize either a cost function or the number of updates. These techniques, however, fail to consider underlying data privacy requirements that exist in many real data sets containing sensitive and personal information. In this demonstration, we present PARC, a Privacy-AwaRe data Cleaning system that corrects data inconsistencies with respect to a set of functional dependencies (FDs) and limits the disclosure of sensitive values during the cleaning process. The system core contains modules that evaluate three key metrics during the repair search, and solves a multi-objective optimization problem to identify repairs that balance the privacy vs. utility tradeoff. This demonstration will enable users to understand: (1) the characteristics of a privacy-preserving data repair; (2) how to customize data cleaning and data privacy requirements using two real datasets; and (3) the distinctions among the repair recommendations via visualization summaries.
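For illustration, a toy sketch of trading off repair cost against disclosure of sensitive values when ranking candidate repairs; the candidate repairs, the weights, and the scoring function are illustrative stand-ins, not PARC's actual multi-objective search.

    # Hypothetical repair candidates for one FD violation (zip -> city); numbers are illustrative.
    candidates = [
        {"changed_cells": 1, "sensitive_cells_disclosed": 1},   # overwrite the sensitive city value
        {"changed_cells": 2, "sensitive_cells_disclosed": 0},   # generalize both conflicting values instead
    ]

    def repair_score(c, w_cost=1.0, w_privacy=3.0):
        """Lower is better: weighted sum of update cost and privacy leakage (weights assumed)."""
        return w_cost * c["changed_cells"] + w_privacy * c["sensitive_cells_disclosed"]

    best = min(candidates, key=repair_score)
    print("chosen repair:", best)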
Recent advances in differential privacy make it possible to guarantee user privacy while preserving the main characteristics of the data. However, most differential privacy mechanisms assume that the underlying dataset is clean. This paper explores the link between data cleaning and differential privacy in a framework we call PrivateClean. PrivateClean includes a technique for creating private datasets of numerical and discrete-valued attributes, a formalism for privacy-preserving data cleaning, and techniques for answering sum, count, and avg queries after cleaning. We show: (1) how the degree of privacy affects subsequent aggregate query accuracy, (2) how privacy potentially amplifies certain types of errors in a dataset, and (3) how this analysis can be used to tune the degree of privacy. The key insight is to maintain a bipartite graph relating dirty values to clean values and use this graph to estimate biases due to the interaction between cleaning and privacy. We validate these results on four datasets with a variety of well-studied cleaning techniques including using functional dependencies, outlier filtering, and resolving inconsistent attributes.
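For illustration, a minimal sketch of privatizing a categorical column and debiasing a COUNT query after cleaning, using generalized randomized response as a stand-in for the paper's mechanism; the value domain, the privacy parameter p, and the dirty-to-clean mapping are assumptions rather than PrivateClean's actual components.

    # Generalized randomized response over an assumed three-value domain; p is illustrative.
    import random
    random.seed(0)

    domain = ["NY", "N.Y.", "CA"]          # "N.Y." is a dirty spelling of "NY"
    data = ["NY"] * 400 + ["N.Y."] * 100 + ["CA"] * 500
    p = 0.75                               # probability of reporting the true value

    def privatize(v):
        return v if random.random() < p else random.choice(domain)

    private = [privatize(v) for v in data]
    cleaning_map = {"NY": "NY", "N.Y.": "NY", "CA": "CA"}   # dirty value -> clean value

    def estimated_count(clean_value):
        n, k = len(private), len(domain)
        total = 0.0
        for dirty_v, clean_v in cleaning_map.items():
            if clean_v != clean_value:
                continue
            obs = sum(1 for x in private if x == dirty_v)
            total += (obs - (1 - p) * n / k) / p            # unbias the randomized response
        return total

    print("estimated COUNT(state = 'NY'):", round(estimated_count("NY")))
    print("true COUNT:", sum(1 for v in data if cleaning_map[v] == "NY"))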
Dealing with increasing amounts of data creates the need to deal with redundant, inconsistent, and/or complementary repositories, which may differ in their data models and/or in their schemas. Current data cleaning techniques developed to tackle data quality problems are only suitable for scenarios where all repositories share the same model and schema. Recently, an ontology-based methodology was proposed to overcome this limitation. In this paper, this methodology is briefly described and applied to a real scenario in the health domain with data quality problems.
We present Falcon, an interactive, deterministic, and declarative data cleaning system, which uses SQL update queries as the language to repair data. Falcon does not rely on the existence of a set of pre-defined data quality rules. On the contrary, it encourages users to explore the data, identify possible problems, and make updates to fix them. Bootstrapped by one user update, Falcon guesses a set of possible SQL update queries that can be used to repair the data. The main technical challenge addressed in this paper is to find a set of SQL update queries that is minimal in size and at the same time fixes the largest number of errors in the data. We formalize this problem as a search in a lattice-shaped space. To guarantee that the chosen updates are semantically correct, Falcon navigates the lattice by interacting with users to gradually validate the set of SQL update queries. Besides traditional one-hop traversal algorithms (e.g., BFS or DFS), we describe novel multi-hop search algorithms such that Falcon can dive into the lattice and conduct the search efficiently. Our novel search strategy is coupled with a number of optimization techniques to further prune the search space and efficiently maintain the lattice. We have conducted extensive experiments using both real-world and synthetic datasets to show that Falcon can effectively communicate with users in data repairing.
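For illustration, a minimal sketch of generalizing one user update into candidate SQL UPDATE queries by enumerating conjunctions of equality predicates drawn from the updated row (a small lattice of attribute subsets); the in-memory table, the scoring, and the lack of pruning are illustrative simplifications, not Falcon's algorithms.

    # A tiny in-memory table; ranking candidates by rows matched is an illustrative heuristic.
    from itertools import combinations

    rows = [
        {"name": "Anna",  "city": "NYC", "zip": "10001", "state": "NJ"},
        {"name": "Bob",   "city": "NYC", "zip": "10001", "state": "NJ"},
        {"name": "Carol", "city": "LA",  "zip": "90001", "state": "CA"},
    ]

    # The user fixed one cell: rows[0]["state"] should be "NY".
    updated_row, target_col, new_val = rows[0], "state", "NY"
    context_cols = [c for c in updated_row if c != target_col]

    candidates = []
    for r in range(1, len(context_cols) + 1):
        for cols in combinations(context_cols, r):
            pred = {c: updated_row[c] for c in cols}
            matched = [row for row in rows if all(row[c] == v for c, v in pred.items())]
            where = " AND ".join(f"{c} = '{v}'" for c, v in pred.items())
            candidates.append((len(matched), f"UPDATE t SET {target_col} = '{new_val}' WHERE {where}"))

    # Prefer queries that repair many rows; the user then validates the suggestions.
    for count, sql in sorted(candidates, key=lambda x: -x[0])[:3]:
        print(count, "rows:", sql)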
Errors are prevalent in data sequences, such as GPS trajectories or sensor readings. Existing methods on cleaning sequential data employ a constraint on value-changing speeds and perform constraint-based repairing. While such speed constraints are effective in identifying large spike errors, small errors that do not significantly deviate from the truth and indeed satisfy the speed constraints can hardly be identified and repaired. To handle such small errors, in this paper, we propose a statistics-based cleaning method. Rather than declaring a broad constraint of max/min speeds, we model the probability distribution of speed changes. The repairing problem is thus to maximize the likelihood of the sequence w.r.t. the probability of speed changes. We formalize the likelihood-based cleaning problem, show its NP-hardness, devise exact algorithms, and propose several approximate/heuristic methods to trade off effectiveness for efficiency. Experiments on real data sets (in various applications) demonstrate the superiority of our proposal.
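For illustration, a minimal sketch of likelihood-based repair under an assumed Gaussian model of speed changes; the model choice, the coordinate grid search, and all parameter values are placeholders, not the paper's exact or approximate algorithms.

    # Gaussian model of speed changes and a greedy local search; all settings are illustrative.
    import math

    def log_likelihood(seq, times, mu=0.0, sigma=1.0):
        """Log-likelihood of the speed changes of a sequence under N(mu, sigma)."""
        speeds = [(seq[i + 1] - seq[i]) / (times[i + 1] - times[i]) for i in range(len(seq) - 1)]
        changes = [speeds[i + 1] - speeds[i] for i in range(len(speeds) - 1)]
        return sum(-0.5 * ((c - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi)) for c in changes)

    def repair(seq, times, candidates_per_point=11, radius=2.0, passes=3):
        """Greedy coordinate search: move each point within +/- radius to raise the likelihood."""
        seq = list(seq)
        for _ in range(passes):
            for i in range(len(seq)):
                best_val, best_ll = seq[i], log_likelihood(seq, times)
                for k in range(candidates_per_point):
                    cand = seq[i] - radius + 2 * radius * k / (candidates_per_point - 1)
                    trial = seq[:i] + [cand] + seq[i + 1:]
                    ll = log_likelihood(trial, times)
                    if ll > best_ll:
                        best_val, best_ll = cand, ll
                seq[i] = best_val
        return seq

    dirty = [1.0, 2.0, 3.4, 4.0, 5.0]     # a small, plausibly erroneous reading at index 2
    print(repair(dirty, times=[0, 1, 2, 3, 4]))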
Services in modern data center networks pose growing performance demands. However, prevalent special traffic patterns, such as micro-bursts, highly concurrent flows, and on-off flow transmission, degrade the performance of transport protocols. In this work, a clean-slate explicit transport control mechanism, called Token Flow Control (TFC), is proposed for data center networks to achieve high link utilization, ultra-low latency, fast convergence, and rare packet drops. TFC uses tokens to represent the link bandwidth resource and defines the concept of effective flows to stand for consumers. The total tokens are explicitly allocated to each consumer every time slot. TFC excludes in-network buffer space from the flow pipeline and thus achieves zero queueing. In addition, a packet delay function is added at switches to prevent packet drops under highly concurrent flows. The performance of TFC is evaluated using both experiments on a small real testbed and large-scale simulations. The results show that TFC achieves high throughput, fast convergence, near-zero queueing, and rare packet loss in various scenarios.
The Internet has shown itself to be a catalyst for economic growth and social equity, but its potency is thwarted by the fact that the Internet is off limits for the vast majority of human beings. Mobile phones, the fastest-growing technology in the world and one that now reaches around 80% of humanity, can enable universal Internet access if the coverage problems that have historically plagued previous cellular architectures (2G, 3G, and 4G) can be resolved. These conventional architectures have not been able to sustain universal service provisioning since they depend on having enough users per cell for their economic viability and thus are not well suited to rural areas (which are by definition sparsely populated). The new generation of mobile cellular technology (5G), currently in a formative phase and expected to be finalized around 2020, is aimed at orders-of-magnitude performance enhancements. 5G offers a clean slate to network designers and can be molded into an architecture also amenable to universal Internet provisioning. Keeping in mind the great social benefits of democratizing the Internet and connectivity, we believe that the time is ripe for emphasizing universal Internet provisioning as an important goal on the 5G research agenda. In this paper, we investigate the opportunities and challenges in utilizing 5G for global access to the Internet for all (GAIA). We have also identified the major technical issues involved in a 5G-based GAIA solution and have set up a future research agenda by defining open research problems.
Today's cellular core, which connects the radio access network to the Internet, relies on fixed hardware appliances placed at a few dedicated locations and uses relatively static routing policies. As such, today's core design has key limitations: it induces inefficient provisioning tradeoffs and is poorly equipped to handle overload, failure scenarios, and diverse application requirements. To address these limitations, ongoing efforts envision "clean slate" solutions that depart from cellular standards and routing protocols; e.g., via programmable switches at base stations and per-flow SDN-like orchestration. The driving question of this work is to ask whether a clean-slate redesign is necessary and, if not, how we can design a flexible cellular core that is minimally disruptive. We propose KLEIN, a design that stays within the confines of current cellular standards and addresses the above limitations by combining network functions virtualization with smart resource management. We address key challenges w.r.t. scalability and responsiveness in realizing KLEIN via backwards-compatible orchestration mechanisms. Our evaluations through data-driven simulations and real prototype experiments using OpenAirInterface show that KLEIN can scale to billions of devices and is close to optimal for a wide variety of traffic and deployment parameters.
Cleaning spreadsheet data types is a common problem faced by millions of spreadsheet users. Data types such as date, time, name, and units are ubiquitous in spreadsheets, and cleaning transformations on these data types involve parsing and pretty printing their string representations. This presents many challenges to users because cleaning such data requires some background knowledge about the data itself, and moreover this data is typically non-uniform, unstructured, and ambiguous. Spreadsheet systems and programming languages provide some UI-based and programmatic solutions for this problem, but they are either insufficient for the user's needs or beyond their expertise. In this paper, we present a programming-by-example methodology for cleaning data types that learns the desired transformation from a few input-output examples. We propose a domain-specific language with probabilistic semantics that is parameterized with declarative data type definitions. The probabilistic semantics is based on three key aspects: (i) approximate predicate matching, (ii) joint learning of data type interpretation, and (iii) weighted branches. This probabilistic semantics enables the language to handle non-uniform, unstructured, and ambiguous data. We then present a synthesis algorithm that learns the desired program in this language from a set of input-output examples. We have implemented our algorithm as an Excel add-in and present its successful evaluation on 55 benchmark problems obtained from online help forums and the Excel product team.
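For illustration, a toy sketch of the joint-interpretation idea: ambiguous date strings in a column are parsed under candidate formats, and the format that consistently explains the most cells wins. The candidate formats and the counting-based score are illustrative assumptions, not the paper's DSL or synthesis algorithm.

    # Joint disambiguation of a date column by scoring candidate formats over all cells.
    from datetime import datetime

    column = ["2/3/2021", "14/3/2021", "7/12/2020"]
    candidate_formats = ["%d/%m/%Y", "%m/%d/%Y"]

    def score(fmt):
        """Count how many cells the format parses; ties go to the earlier candidate."""
        parsed = 0
        for cell in column:
            try:
                datetime.strptime(cell, fmt)
                parsed += 1
            except ValueError:
                pass
        return parsed

    best = max(candidate_formats, key=score)
    cleaned = [datetime.strptime(c, best).strftime("%Y-%m-%d") for c in column]
    print(best, cleaned)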
Twitter provides a public streaming API that is strictly limited, making it difficult to simultaneously achieve good coverage and relevance when monitoring tweets for a specific topic of interest. In this paper, we address the tweet acquisition challenge to enhance monitoring of tweets based on the client/application needs in an online adaptive manner such that the quality and quantity of the results improves over time. We propose a Tweet Acquisition System (TAS) that iteratively selects phrases to track based on an explore-exploit strategy. Our experimental studies show that TAS significantly improves recall of relevant tweets and that performance improves when the topics are more specific.
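For illustration, a toy epsilon-greedy sketch of the explore-exploit idea behind iterative phrase selection; the phrases, the simulated relevance rates, and the epsilon value are assumptions, not TAS's actual strategy.

    # Epsilon-greedy phrase selection over simulated relevance feedback; all values are illustrative.
    import random
    random.seed(1)

    phrases = {"data cleaning": [1, 1.0], "dirty data": [1, 1.0], "etl": [1, 1.0]}  # [trials, relevant hits], optimistic start
    true_relevance = {"data cleaning": 0.6, "dirty data": 0.4, "etl": 0.1}          # unknown in practice, simulated here
    epsilon = 0.1

    for _ in range(500):
        if random.random() < epsilon:
            phrase = random.choice(list(phrases))                                   # explore
        else:
            phrase = max(phrases, key=lambda p: phrases[p][1] / phrases[p][0])      # exploit best estimate
        relevant = random.random() < true_relevance[phrase]                         # simulate whether a matched tweet is relevant
        phrases[phrase][0] += 1
        phrases[phrase][1] += relevant

    for p, (n, r) in phrases.items():
        print(p, "tracked", n, "times, estimated relevance", round(r / n, 2))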
Semantic Big Data is about the creation of new applications exploiting the richness and flexibility of declarative semantics combined with scalable and highly distributed data management systems. In this work, we present an application scenario in which a domain ontology, Open Refine, and the Okkam Entity Name System enable a frictionless and scalable data integration process leading to a knowledge base for tax assessment. Further, we introduce the concept of Entiton as a flexible and efficient data model suitable for large-scale data inference and analytic tasks. We successfully tested our data processing pipeline on a real-world dataset, supporting ACI Informatica in the investigation of Vehicle Excise Duty (VED) evasion in the Aosta Valley region (Italy). Besides useful business intelligence indicators, we implemented a distributed temporal inference engine to unveil VED evasion and circulation ban violations. The results of the integration are presented to the tax agents in a powerful Siren Solution KiBi dashboard, enabling seamless data exploration and business intelligence.
We analyze the workload from a multi-year deployment of a database-as-a-service platform targeting scientists and data scientists with minimal database experience. Our hypothesis was that relatively minor changes to the way databases are delivered can increase their use in ad hoc analysis environments. The web-based SQLShare system emphasizes easy dataset-at-a-time ingest, relaxed schemas and schema inference, easy view creation and sharing, and full SQL support. We find that these features have helped attract workloads typically associated with scripts and files rather than relational databases: complex analytics, routine processing pipelines, data publishing, and collaborative analysis. Quantitatively, these workloads are characterized by shorter dataset "lifetimes", higher query complexity, and higher data complexity. We report on usage scenarios that suggest SQL is being used in place of scripts for one-off data analysis and ad hoc data sharing. The workload suggests that a new class of relational systems emphasizing short-term, ad hoc analytics over engineered schemas may improve uptake of database technology in data science contexts. Our contributions include a system design for delivering databases into these contexts, a description of a public research query workload dataset released to advance research in analytic data systems, and an initial analysis of the workload that provides evidence of new use cases under-supported in existing systems.
Garbage is an endemic problem in developing cities due to the continual influx of migrants from rural areas coupled with deficient municipal capacity planning. In cities like Dhaka, open waste dumps contribute to the prevalence of disease, environmental contamination, catastrophic flooding, and deadly fires. Recent interest in the garbage problem has prompted cursory proposals to introduce technology solutions for mapping and fundraising. Yet, the role of technology and its potential benefits are unexplored in this large-scale problem. In this paper, we contribute to the understanding of the waste ecology in Dhaka and how the various actors acquire, perform, negotiate, and coordinate their roles. Within this context, we explore design opportunities for using computing technologies to support collaboration between waste pickers and residents of these communities. We find opportunities in the presence of technology and the absence of mechanisms to facilitate coordination of community funding and crowd work.
This paper presents an unsupervised method for systematically identifying anomalies in music datasets. The model integrates categorical regression and robust estimation techniques to infer anomalous scores in music clips. When applied to a music genre recognition dataset, the new method is able to detect corrupted, distorted, or mislabeled audio samples based on commonly used features in music information retrieval. The evaluation results show that the algorithm outperforms other anomaly detection methods and is capable of finding problematic samples identified by human experts. The proposed method introduces a preliminary framework for anomaly detection in music data that can serve as a useful tool to improve data integrity in the future.
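For illustration, a toy sketch of robust anomaly scoring via median-absolute-deviation z-scores, used here as a stand-in for the paper's categorical-regression and robust-estimation model; the clips, features, and values are made up.

    # Robust (median/MAD) anomaly scores per clip over a tiny made-up feature table.
    import statistics

    clips = {                      # e.g., [tempo, spectral centroid] per clip
        "clip1": [120, 0.31],
        "clip2": [118, 0.29],
        "clip3": [121, 0.30],
        "clip4": [40,  0.95],      # a plausibly corrupted sample
    }

    def robust_scores(features_by_clip):
        names = list(features_by_clip)
        dims = len(next(iter(features_by_clip.values())))
        scores = {n: 0.0 for n in names}
        for d in range(dims):
            col = [features_by_clip[n][d] for n in names]
            med = statistics.median(col)
            mad = statistics.median(abs(x - med) for x in col) or 1e-9
            for n in names:
                scores[n] += abs(features_by_clip[n][d] - med) / mad
        return scores

    for name, score in sorted(robust_scores(clips).items(), key=lambda kv: -kv[1]):
        print(name, round(score, 1))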
FeatureIDE is an open-source framework to model, develop, and analyze feature-oriented software product lines. It is mainly developed in a cooperation between the University of Magdeburg and Metop GmbH. Nevertheless, many other institutions have contributed to it in the past decade. The goal of this tutorial is to illustrate how FeatureIDE can be used to clean variable code, with a focus on dependencies in feature models and on variability implemented with preprocessors. The hands-on tutorial will be highly interactive and is devoted to practitioners facing problems with variability, lecturers teaching product lines, and researchers who want to save resources in building product-line tools.
Repairing erroneous or conflicting data that violate a set of constraints is an important problem in data management. Many automatic or semi-automatic data-repairing algorithms have been proposed in the last few years, each with its own strengths and weaknesses. Bart is an open-source error-generation system conceived to support thorough experimental evaluations of these data-repairing systems. The demo is centered around three main lessons. To start, we discuss how generating errors in data is a complex problem, with several facets. We introduce the important notions of detectability and repairability of an error, which stand at the core of Bart. Then, we show how, by changing the features of errors, it is possible to influence quite significantly the performance of the tools. Finally, we concretely put to work five data-repairing algorithms on dirty data of various kinds generated using Bart, and discuss their performance.
In recent years, many researchers have focused on log-structured file systems (LFS), because they gracefully enhance random write performance and efficiently resolve the consistency issue. However, the write policy of LFS can cause a file fragmentation problem, which degrades the sequential read performance of the file system. In this paper, we analyze the relationship between file fragmentation and sequential read performance, considering the characteristics of underlying storage devices. We also propose a novel file defragmentation scheme on LFS to effectively address the file fragmentation problem. Our scheme reorders the valid data blocks belonging to a victim segment based on their inode numbers during the cleaning process of LFS. In our experiments, our scheme eliminates up to 98.5% of file fragmentation when compared with traditional LFS.
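For illustration, a minimal sketch of the reordering step: when the cleaner migrates the valid blocks of a victim segment, it writes them out grouped by inode number and block offset so that each file's blocks become sequential again. The block record and the segment layout below are simplified assumptions, not the paper's on-disk format.

    # Reorder a victim segment's live blocks by (inode, offset) before rewriting them.
    from collections import namedtuple

    Block = namedtuple("Block", ["inode", "offset", "data"])

    victim_segment = [
        Block(inode=7, offset=2, data="c"),
        Block(inode=3, offset=0, data="x"),
        Block(inode=7, offset=0, data="a"),
        Block(inode=3, offset=1, data="y"),
        Block(inode=7, offset=1, data="b"),
    ]

    def clean_segment(blocks, is_valid=lambda b: True):
        """Keep valid blocks and reorder them by (inode, offset) before rewriting."""
        live = [b for b in blocks if is_valid(b)]
        return sorted(live, key=lambda b: (b.inode, b.offset))

    for b in clean_segment(victim_segment):
        print(b)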
Detecting and repairing dirty data is one of the perennial challenges in data analytics, and failure to do so can result in inaccurate analytics and unreliable decisions. Over the past few years, there has been a surge of interest from both industry and academia on data cleaning problems including new abstractions, interfaces, approaches for scalability, and statistical techniques. To better understand the new advances in the field, we will first present a taxonomy of the data cleaning literature in which we highlight the recent interest in techniques that use constraints, rules, or patterns to detect errors, which we call qualitative data cleaning. We will describe the state-of-the-art techniques and also highlight their limitations with a series of illustrative examples. While traditionally such approaches are distinct from quantitative approaches such as outlier detection, we also discuss recent work that casts such approaches into a statistical estimation framework including: using Machine Learning to improve the efficiency and accuracy of data cleaning and considering the effects of data cleaning on statistical analysis.
Databases can be corrupted with various errors such as missing, incorrect, or inconsistent values. Increasingly, modern data analysis pipelines involve Machine Learning, and the effects of dirty data can be difficult to debug. Dirty data is often sparse, and naive sampling solutions are not suited for high-dimensional models. We propose ActiveClean, a progressive framework for training Machine Learning models with data cleaning. Our framework updates a model iteratively as the analyst cleans small batches of data, and includes numerous optimizations such as importance weighting and dirty data detection. We designed a visual interface to wrap around this framework and demonstrate ActiveClean for a video classification problem and a topic modeling problem.
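For illustration, a toy sketch of the progressive-cleaning loop: a model is updated as small batches of records are cleaned, with sampled-and-cleaned records reweighted by the inverse of their sampling probability. The simulated dirt, the uniform sampler, and the logistic-regression model are stand-ins, not ActiveClean's actual components or optimizations.

    # Importance-weighted model updates interleaved with batch-by-batch cleaning (simulated dirt).
    import numpy as np

    rng = np.random.default_rng(0)

    def sgd_step(w, X, y, weights, lr=0.1):
        """One importance-weighted gradient step for logistic regression."""
        preds = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (weights * (preds - y)) / weights.sum()
        return w - lr * grad

    # Dirty dataset: labels of some records are flipped (simulated dirt).
    X = rng.normal(size=(500, 3))
    y_clean = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
    dirty_mask = rng.random(500) < 0.2
    y_dirty = np.where(dirty_mask, 1 - y_clean, y_clean)

    w = np.zeros(3)
    uncleaned = set(range(500))
    for _ in range(20):                          # each iteration = analyst cleans one batch
        batch = rng.choice(sorted(uncleaned), size=25, replace=False)
        uncleaned -= set(batch)
        p = 25 / (len(uncleaned) + 25)           # sampling probability of this batch
        y_dirty[batch] = y_clean[batch]          # "cleaning" restores the true labels
        weights = np.full(len(batch), 1.0 / p)   # importance weights
        w = sgd_step(w, X[batch], y_dirty[batch], weights)

    acc = ((1 / (1 + np.exp(-X @ w)) > 0.5) == y_clean).mean()
    print("accuracy after progressive cleaning:", round(float(acc), 3))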
As the number of devices that gain connectivity and join the category of smart objects increases every year, reaching unprecedented numbers, new challenges are imposed on our networks. While specialized solutions for certain use cases have been proposed, more flexible and scalable new approaches to networking will be required to deal with billions or trillions of smart objects connected to the Internet. With this paper, we take a step back and look at the set of basic problems that are posed by this group of devices. In order to develop an analysis of how these issues could be approached, we define which fundamental abstractions might help solve or at least reduce their impact on the network by offering support for fundamental matters such as mobility, group-based delivery, and support for distributed computing resources. Based on the concept of named objects, we propose a set of network solutions and show how this approach can address both scalability and functional requirements. Finally, we describe a comprehensive clean-slate network architecture (MobilityFirst) which attempts to realize the proposed capabilities.
Cellular networks play a dominant role in how we communicate. But the current cellular architecture and protocols are overly complex. The 'control plane' protocol includes setting up explicit tunnels for every session and exchanging a large number of packets among the different entities (mobile device, base station, packet gateways, and mobility management) to ensure that state is exchanged in a consistent manner. This limits scalability. As we evolve to having to support an increasing number of users, cell sites (e.g., 5G) and the consequent mobility, and the incoming wave of IoT devices, a re-thinking of the architecture and control protocols is required. In this work we propose CleanG, a simplified software-based architecture for the Mobile Core Network (MCN) and a simplified control protocol for cellular networks. Network Function Virtualization enables dynamic management of capacity in the cloud to support the MCN of future cellular networks. We develop a simplified protocol that substantially reduces the number of control messages exchanged to support the various events, while retaining the current functionality expected from the network. We believe CleanG will scale better and have lower latency.
Web archives about school shootings consist of webpages that may or may not be relevant to the events of interest. This work has three main goals. The first is to clean the webpages, which involves removing stop words and non-relevant parts of each page. The second is to select just the webpages relevant to the events of interest. The third is to upload the cleaned, relevant webpages to Apache Solr so that they are easily accessible. We show the details of all the steps required to achieve these goals. The results show that representative Web archives are noisy, with 2%-40% relevant content. By cleaning the archives, we help researchers focus on relevant content for their analysis.
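For illustration, a toy sketch of the first two steps (stop-word removal and keyword-based relevance filtering); the stop-word list, the event keywords, and the relevance threshold are assumptions, and the Apache Solr upload step is omitted here.

    # Strip stop words, then keep pages mentioning enough event-related keywords.
    import re

    STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for"}
    EVENT_KEYWORDS = {"shooting", "school", "victims", "gunman"}

    def clean_page(text):
        tokens = re.findall(r"[a-z']+", text.lower())
        return [t for t in tokens if t not in STOP_WORDS]

    def is_relevant(tokens, min_hits=2):
        return sum(1 for t in tokens if t in EVENT_KEYWORDS) >= min_hits

    pages = [
        "Breaking: a school shooting in the town left several victims.",
        "Top 10 recipes for the best apple pie of the season.",
    ]
    relevant = [p for p in pages if is_relevant(clean_page(p))]
    print(relevant)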