Bibliography
The zero trust security model has been gaining adoption in organizations because of its advantages. Data quality remains one of the fundamental challenges in data curation: in many organizations, data consumers do not trust data because of associated quality issues, and as a result there is a lack of confidence in making business decisions based on that data. We design a model based on the zero trust security model to demonstrate how the trust of data consumers can be established, and we present a sample application to distinguish the traditional approach from the zero-trust-based data quality framework.
Data value (DV) is a novel concept introduced as one of the features of the Big Data phenomenon. Continuing an investigation of the DV ontology and its relationship with data quality (DQ) at the conceptual level, this paper examines possible applications of DV in the practical design of security and privacy protection systems and tools. We present a novel approach to DV evaluation that maps DQ metrics to a DV value. The developed methods allow DV and DQ to be used in a wide range of application domains. To demonstrate the employment of the DQ and DV concepts in real tasks, we present two real-life scenarios. The first use case demonstrates the use of DV in crowdsensing application design; it shows how DV can be calculated by integrating various metrics characterizing data application functionality, accuracy, and security. The second incorporates privacy considerations into the DV calculus by exploring the relationship between privacy, DQ, and DV in the defense against website fingerprinting in The Onion Router (TOR) network. These examples demonstrate how our methods of DV and DQ evaluation may be employed in the design of real systems with security and privacy considerations.
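A minimal sketch of how DQ metrics might be aggregated into a single DV score, assuming a simple weighted-sum mapping with illustrative metric names and weights (not the paper's actual calculus):

```python
# Minimal sketch: map data quality (DQ) metrics to a data value (DV) score.
# Metric names and weights are illustrative assumptions, not the paper's model.

def data_value(dq_metrics, weights):
    """Weighted aggregation of normalized DQ metrics (each in [0, 1]) into a DV score."""
    total_weight = sum(weights.values())
    return sum(weights[m] * dq_metrics[m] for m in weights) / total_weight

dq = {"functionality": 0.9, "accuracy": 0.75, "security": 0.6}
w = {"functionality": 0.3, "accuracy": 0.5, "security": 0.2}
print(f"DV score: {data_value(dq, w):.3f}")
```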
Cyber-physical systems (CPS) depend on cybersecurity to ensure functionality, data quality, cyberattack resilience, and more. Known and unknown cyber threats and attacks pose significant risks, making information assurance and information security critical; many systems remain vulnerable to intelligence exploitation and cyberattacks. By investigating cybersecurity risks and the formal representation of CPS using spatiotemporal dynamic graphs and networks, this paper examines topics and solutions aimed at examining and empowering: (1) cybersecurity capabilities; (2) information assurance and system vulnerabilities; (3) detection of cyber threats and attacks; and (4) situational awareness. We introduce statistically characterized dynamic graphs and novel entropy-centric algorithms and calculi that promise to ensure near-real-time capabilities.
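As a hedged illustration of entropy-centric analysis of dynamic graphs, the sketch below computes the Shannon entropy of the degree distribution for two CPS graph snapshots; the graphs, the entropy measure, and the anomaly interpretation are assumptions, not the paper's algorithms:

```python
# Sketch: entropy-centric characterization of dynamic graph snapshots, using
# Shannon entropy of the degree distribution (an assumed illustration of
# "entropy-centric" analysis, not the paper's calculus).
import math
import networkx as nx

def degree_entropy(g: nx.Graph) -> float:
    degrees = [d for _, d in g.degree()]
    total = sum(degrees)
    if total == 0:
        return 0.0
    probs = [d / total for d in degrees if d > 0]
    return -sum(p * math.log2(p) for p in probs)

# Two snapshots of a CPS communication graph; a sudden entropy shift could
# flag anomalous (possibly malicious) structural change.
g_t0 = nx.erdos_renyi_graph(50, 0.1, seed=1)
g_t1 = nx.erdos_renyi_graph(50, 0.4, seed=1)
print(degree_entropy(g_t0), degree_entropy(g_t1))
```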
Due to safety concerns and legislation implemented by various governments, the maritime sector adopted the Automatic Identification System (AIS). While governments and state agencies increasingly rely on AIS data, the underlying technology is fundamentally insecure. This study identifies and describes a number of potential attack vectors and suggests conceptual countermeasures to mitigate such attacks. Given AIS's role in interception by navies and coast guards as well as in marine navigation and obstacle avoidance, its vulnerabilities call into question the multiple overlapping AIS networks already deployed and what the future holds for the protocol.
The large amounts of synchrophasor data obtained by Phasor Measurement Units (PMUs) provide dynamic visibility into power systems. Extracting reliable information from the data can enhance power system situational awareness. The data quality often suffers from data losses, bad data, and cyber data attacks. Data privacy is also an increasing concern. In this paper, we discuss our recently proposed framework of data recovery, error correction, data privacy enhancement, and event identification methods by exploiting the intrinsic low-dimensional structures in the high-dimensional spatial-temporal blocks of PMU data. Our data-driven approaches are computationally efficient with provable analytical guarantees. The data recovery method can recover the ground-truth data even if simultaneous and consecutive data losses and errors happen across all PMU channels for some time. We can identify PMU channels that are under false data injection attacks by locating abnormal dynamics in the data. The data recovery method for the operator can extract the information accurately by collectively processing the privacy-preserving data from many PMUs. A cyber intruder with access to partial measurements cannot recover the data correctly even using the same approach. A real-time event identification method is also proposed, based on the new idea of characterizing an event by the low-dimensional subspace spanned by the dominant singular vectors of the data matrix.
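A minimal sketch of recovering missing entries by exploiting low-rank structure in a spatial-temporal data block, using iterative truncated-SVD imputation on synthetic data; the rank, data, and iteration count are assumptions, not the authors' method:

```python
# Sketch: recover missing PMU measurements by exploiting the low-rank structure
# of the spatial-temporal data block via iterative truncated-SVD imputation.
import numpy as np

rng = np.random.default_rng(0)
U, V = rng.standard_normal((200, 3)), rng.standard_normal((3, 40))
X_true = U @ V                          # low-rank "ground truth" block (time x channels)
mask = rng.random(X_true.shape) > 0.3   # ~30% of entries missing

X = np.where(mask, X_true, 0.0)         # initialize missing entries with zeros
for _ in range(100):
    u, s, vt = np.linalg.svd(X, full_matrices=False)
    X_lr = (u[:, :3] * s[:3]) @ vt[:3]  # project onto a rank-3 approximation
    X = np.where(mask, X_true, X_lr)    # keep observed entries, refill missing ones

err = np.linalg.norm((X - X_true)[~mask]) / np.linalg.norm(X_true[~mask])
print(f"relative recovery error on missing entries: {err:.4f}")
```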
Cybersecurity plays a critical role in protecting sensitive information and the structural integrity of networked systems. As networked systems continue to expand in number as well as in complexity, so does the threat of malicious activity and the necessity for advanced cybersecurity solutions. Furthermore, both the quantity and quality of available data on malicious content, as well as the fact that malicious activity continuously evolves, make automated protection systems for this type of environment particularly challenging. Not only is data quality a concern, but the volume of data can be quite small for some of the classes. This creates a class imbalance in the data used to train a classifier, and many classifiers are not well equipped to deal with class imbalance. One such example is detecting malicious HTML files from static features. Unfortunately, collecting malicious HTML files is extremely difficult, and the data can be quite noisy because HTML files are sometimes mislabeled. This paper evaluates a specific application that is afflicted by these modern cybersecurity challenges: detection of malicious HTML files. Previous work presented a general framework for malicious HTML file classification, which we modify in this work to use a $\chi^2$ feature selection technique and the synthetic minority oversampling technique (SMOTE). We experiment with different classifiers (i.e., AdaBoost, GentleBoost, RobustBoost, RUSBoost, and Random Forest) and a pure detection model (i.e., Isolation Forest). We benchmark the different classifiers using SMOTE on a real dataset that contains a limited number of malicious files (40) relative to the normal files (7,263). The modified framework performed better than the previous framework. However, additional evidence implies that algorithms which train on both the normal and malicious samples are likely overtraining to the malicious distribution. We demonstrate the likely overtraining by determining that a subset of the malicious files, while suspicious, did not come from a malicious source.
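The sketch below illustrates the kind of pipeline described here, combining $\chi^2$ feature selection, SMOTE oversampling, a Random Forest classifier, and an Isolation Forest detector on synthetic data; the features and parameters are assumptions, not the paper's HTML feature set:

```python
# Sketch: chi-squared feature selection plus SMOTE oversampling before a
# classifier, compared against a pure detector (Isolation Forest).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=3000, n_features=50, weights=[0.99],
                           n_informative=10, random_state=0)
X = MinMaxScaler().fit_transform(X)          # chi2 requires non-negative features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

selector = SelectKBest(chi2, k=20).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr_sel, y_tr)  # oversample minority
clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
print("Random Forest F1:", f1_score(y_te, clf.predict(X_te_sel)))

iso = IsolationForest(random_state=0).fit(X_tr_sel[y_tr == 0])     # train on benign only
print("Isolation Forest F1:", f1_score(y_te, (iso.predict(X_te_sel) == -1).astype(int)))
```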
Community Health Workers (CHWs) have been using Mobile Health Data Collection Systems (MDCSs) to support the delivery of primary healthcare and to carry out public health surveys, feeding national-level databases with families' personal data. Such systems are used for public surveillance and manage sensitive data (i.e., health data), so addressing privacy issues is crucial for successfully deploying MDCSs. In this paper we present a comprehensive privacy threat analysis for MDCSs, discuss the privacy challenges, and provide recommendations that are especially useful to health managers and developers. We ground our analysis on a large-scale MDCS used for primary care (GeoHealth) and a well-known Privacy Impact Assessment (PIA) methodology. The threat analysis is based on a compilation of relevant privacy threats from the literature as well as brainstorming sessions with privacy and security experts. Among the main findings, we observe that existing MDCSs do not employ adequate controls for achieving transparency and intervenability, thus threatening fundamental privacy principles such as data quality, the right to access, and the right to object. Furthermore, although there has been significant research addressing data security issues, attention to privacy in its multiple dimensions is notably lacking.
Knowledge Graphs (KG) are becoming core components of most artificial intelligence applications. Linked Data, as a method of publishing KGs, allows applications to traverse within, and even out of, the graph thanks to global dereferenceable identifiers denoting entities, in the form of IRIs. However, as we show in this work, after analyzing several popular datasets (namely DBpedia, LOD Cache, and Web Data Commons JSON-LD data) many entities are being represented using literal strings where IRIs should be used, diminishing the advantages of using Linked Data. To remedy this, we propose an approach for identifying such strings and replacing them with their corresponding entity IRIs. The proposed approach is based on identifying relations between entities based on both ontological axioms as well as data profiling information and converting strings to entity IRIs based on the types of entities linked by each relation. Our approach showed 98% recall and 76% precision in identifying such strings and 97% precision in converting them to their corresponding IRI in the considered KG. Further, we analyzed how the connectivity of the KG is increased when new relevant links are added to the entities as a result of our method. Our experiments on a subset of the Spanish DBpedia data show that it could add 25% more links to the KG and improve the overall connectivity by 17%.
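A hedged sketch of the string-to-IRI conversion idea using rdflib; the property, label lookup, and example triples are assumptions for illustration rather than the paper's pipeline:

```python
# Sketch: replace literal strings that actually denote entities with their
# entity IRIs, guided by the property they appear on.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

DBO = Namespace("http://dbpedia.org/ontology/")
DBR = Namespace("http://dbpedia.org/resource/")

g = Graph()
g.add((DBR["Spain"], DBO["capital"], Literal("Madrid")))   # literal where an IRI belongs
g.add((DBR["Madrid"], RDF.type, DBO["City"]))

# Candidate entity IRIs indexed by label (in practice derived from the KG itself).
label_to_iri = {"Madrid": DBR["Madrid"]}

for s, p, o in list(g.triples((None, DBO["capital"], None))):
    if isinstance(o, Literal) and str(o) in label_to_iri:
        g.remove((s, p, o))
        g.add((s, p, label_to_iri[str(o)]))   # convert the string to an entity IRI

for triple in g.triples((None, DBO["capital"], None)):
    print(triple)
```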
A database is a large collection of data that helps us collect, retrieve, organize, and manage data efficiently and effectively. Databases are critical assets: they store client details, financial information, personal files, company secrets, and other data necessary for business. Today people depend increasingly on corporate data for decision making, customer service management, and supply chain management, so any loss, corruption, or unavailability of data may seriously affect operations. Database security should provide protected access to the contents of a database and should preserve the integrity, availability, consistency, and quality of the data. This paper describes an architecture based on placing an Elliptic Curve Cryptography (ECC) module inside the database management software (DBMS), just above the database cache. Using this method, only selected parts of the database are encrypted instead of the whole database. This architecture allows us to achieve very strong data security using ECC while increasing performance using the cache.
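A minimal sketch of ECC-based selective encryption of sensitive fields, assuming an ECDH key exchange with an HKDF-derived AES-GCM key; the key management and the DBMS integration point are simplified assumptions, not the architecture's actual design:

```python
# Sketch: derive a symmetric key from an elliptic-curve (ECDH) exchange and
# encrypt only chosen columns before they reach the database cache.
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

db_key = ec.generate_private_key(ec.SECP256R1())          # DBMS-side EC key
client_key = ec.generate_private_key(ec.SECP256R1())      # client-side EC key

shared = db_key.exchange(ec.ECDH(), client_key.public_key())
aes_key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
               info=b"column-encryption").derive(shared)

def encrypt_field(value: str) -> bytes:
    nonce = os.urandom(12)
    return nonce + AESGCM(aes_key).encrypt(nonce, value.encode(), None)

row = {"name": "Alice", "salary": "85000"}
row["salary"] = encrypt_field(row["salary"])               # encrypt only the sensitive column
print(row)
```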
Poor data quality has become a persistent challenge for organizations as data continues to grow in complexity and size. Existing data cleaning solutions focus on identifying repairs to the data that minimize either a cost function or the number of updates. These techniques, however, fail to consider underlying data privacy requirements that exist in many real data sets containing sensitive and personal information. In this demonstration, we present PARC, a Privacy-AwaRe data Cleaning system that corrects data inconsistencies with respect to a set of functional dependencies (FDs) and limits the disclosure of sensitive values during the cleaning process. The system core contains modules that evaluate three key metrics during the repair search and solves a multi-objective optimization problem to identify repairs that balance the privacy vs. utility tradeoff. This demonstration will enable users to understand: (1) the characteristics of a privacy-preserving data repair; (2) how to customize data cleaning and data privacy requirements using two real datasets; and (3) the distinctions among the repair recommendations via visualization summaries.
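A hedged sketch of the privacy vs. utility tradeoff during repair selection, scoring candidate repairs by a weighted combination of cells changed and sensitive values disclosed; the metrics and weights are assumptions, not PARC's actual objective:

```python
# Sketch: choose among candidate repairs by trading off utility loss
# (cells changed) against privacy loss (sensitive values disclosed).
def repair_score(candidate, alpha=0.6):
    """Lower is better: alpha weights utility loss against privacy disclosure."""
    return alpha * candidate["cells_changed"] + (1 - alpha) * candidate["sensitive_disclosed"]

candidates = [
    {"id": "r1", "cells_changed": 3, "sensitive_disclosed": 0},
    {"id": "r2", "cells_changed": 1, "sensitive_disclosed": 2},
    {"id": "r3", "cells_changed": 2, "sensitive_disclosed": 1},
]
best = min(candidates, key=repair_score)
print("chosen repair:", best["id"])
```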
Twitter provides a public streaming API that is strictly limited, making it difficult to simultaneously achieve good coverage and relevance when monitoring tweets for a specific topic of interest. In this paper, we address the tweet acquisition challenge to enhance monitoring of tweets based on client/application needs in an online, adaptive manner such that the quality and quantity of the results improve over time. We propose a Tweet Acquisition System (TAS) that iteratively selects phrases to track based on an explore-exploit strategy. Our experimental studies show that TAS significantly improves recall of relevant tweets and that performance improves further when the topics are more specific.
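A minimal sketch of an explore-exploit phrase selector in the spirit of TAS, using a simple epsilon-greedy policy; the phrases, reward model, and policy are assumptions, not the system's actual strategy:

```python
# Sketch: with probability epsilon try a random candidate phrase (explore),
# otherwise track the phrase with the best observed relevance rate (exploit).
import random

class PhraseSelector:
    def __init__(self, phrases, epsilon=0.2):
        self.stats = {p: {"picked": 0, "relevant": 0} for p in phrases}
        self.epsilon = epsilon

    def select(self):
        if random.random() < self.epsilon:          # explore
            return random.choice(list(self.stats))
        return max(self.stats, key=lambda p:        # exploit best rate so far
                   self.stats[p]["relevant"] / max(self.stats[p]["picked"], 1))

    def update(self, phrase, relevant_tweets, total_tweets):
        self.stats[phrase]["picked"] += total_tweets
        self.stats[phrase]["relevant"] += relevant_tweets

selector = PhraseSelector(["data quality", "data cleaning", "dirty data"])
for _ in range(100):
    phrase = selector.select()
    selector.update(phrase, relevant_tweets=random.randint(0, 10), total_tweets=10)
print(selector.stats)
```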
Detecting and repairing dirty data is one of the perennial challenges in data analytics, and failure to do so can result in inaccurate analytics and unreliable decisions. Over the past few years, there has been a surge of interest from both industry and academia on data cleaning problems including new abstractions, interfaces, approaches for scalability, and statistical techniques. To better understand the new advances in the field, we will first present a taxonomy of the data cleaning literature in which we highlight the recent interest in techniques that use constraints, rules, or patterns to detect errors, which we call qualitative data cleaning. We will describe the state-of-the-art techniques and also highlight their limitations with a series of illustrative examples. While traditionally such approaches are distinct from quantitative approaches such as outlier detection, we also discuss recent work that casts such approaches into a statistical estimation framework including: using Machine Learning to improve the efficiency and accuracy of data cleaning and considering the effects of data cleaning on statistical analysis.
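As an example of qualitative error detection with a constraint, the sketch below flags rows that violate an assumed functional dependency (zip determines city) using pandas:

```python
# Sketch: flag rows violating the functional dependency zip -> city
# (an assumed example rule, not from the tutorial).
import pandas as pd

df = pd.DataFrame({
    "zip":  ["10001", "10001", "60601", "60601"],
    "city": ["New York", "Newark", "Chicago", "Chicago"],
})

# A zip code mapped to more than one city violates the FD zip -> city.
violating_zips = df.groupby("zip")["city"].nunique()
violating_zips = violating_zips[violating_zips > 1].index
print(df[df["zip"].isin(violating_zips)])
```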
With the increasing incentive for enterprises to ingest as much data as they can into what are commonly referred to as "data lakes", and with the recent development of multiple technologies to support this "load-first" paradigm, the new environment presents serious data management challenges. Among them, assessing data quality and cleaning large volumes of heterogeneous data sources become essential tasks in unveiling the value of big data. The desired use of unstructured and semi-structured data in large volumes makes current data cleaning tools (primarily designed for relational data) not directly adoptable. We present CLAMS, a system to discover and enforce expressive integrity constraints from large amounts of lake data with very limited schema information (e.g., represented as RDF triples). This demonstration shows how CLAMS is able to discover the constraints and the schemas they are defined on simultaneously. CLAMS also introduces a scale-out solution to efficiently detect errors in the raw data. CLAMS interacts with human experts to both validate the discovered constraints and suggest data repairs. CLAMS has been deployed in a real large-scale enterprise data lake and was evaluated on a real data set of 1.2 billion triples. It was able to spot multiple obscure data inconsistencies and errors early in the data processing stack, providing great value to the enterprise.
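A hedged sketch of enforcing one integrity constraint over RDF triples with a SPARQL query; the constraint and data are assumptions for illustration, not constraints CLAMS actually discovered:

```python
# Sketch: detect violations of a discovered constraint over lake data stored as
# RDF triples, e.g. "an employee has exactly one hireDate" (assumed example).
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX["emp1"], EX["hireDate"], Literal("2019-04-01")))
g.add((EX["emp1"], EX["hireDate"], Literal("2021-07-15")))   # conflicting value -> error
g.add((EX["emp2"], EX["hireDate"], Literal("2020-01-10")))

violations = g.query("""
    SELECT DISTINCT ?s WHERE {
        ?s <http://example.org/hireDate> ?d1, ?d2 .
        FILTER (?d1 != ?d2)
    }
""")
for row in violations:
    print("constraint violation:", row.s)
```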
Wikidata is the new, large-scale knowledge base of the Wikimedia Foundation. Its knowledge is increasingly used within Wikipedia itself and in various other kinds of information systems, imposing high demands on its integrity. Wikidata can be edited by anyone and, unfortunately, it frequently gets vandalized, exposing all information systems that use it to the risk of spreading vandalized and falsified information. In this paper, we present a new machine learning-based approach to detecting vandalism in Wikidata. We propose a set of 47 features that exploit both content and context information, and we report on four classifiers of increasing effectiveness tailored to this learning task. Our approach is evaluated on the recently published Wikidata Vandalism Corpus WDVC-2015 and achieves an area under the receiver operating characteristic curve (ROC-AUC) of 0.991. It significantly outperforms the state of the art represented by the rule-based Wikidata Abuse Filter (0.865 ROC-AUC) and a prototypical vandalism detector recently introduced by Wikimedia within the Objective Revision Evaluation Service (0.859 ROC-AUC).
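A minimal sketch of a feature-based vandalism classifier evaluated with ROC-AUC; the handful of content/context features and the synthetic labels are assumptions, not the paper's 47 features or corpus:

```python
# Sketch: a few assumed content/context features fed to a gradient-boosting
# model, scored with ROC-AUC as in the paper's evaluation.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
features = np.column_stack([
    rng.integers(0, 2, n),          # editor is anonymous (context)
    rng.integers(0, 300, n),        # length of the edited value (content)
    rng.integers(0, 50, n),         # number of prior edits by the editor (context)
])
# Synthetic labels: anonymous editors with few prior edits are more often vandals.
labels = ((features[:, 0] == 1) & (features[:, 2] < 10) &
          (rng.random(n) < 0.7)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print("ROC-AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```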
Background: The NASA datasets have previously been used extensively in studies of software defects. In 2013, Shepperd et al. presented an essential set of rules for removing erroneous data from the NASA datasets, making this data more reliable to use. Objective: We have now found additional rules, not identified by Shepperd et al., that are necessary for removing problematic data. Results: In this paper, we demonstrate the level of erroneous data still present even after cleaning using Shepperd et al.'s rules, and we apply our new rules to remove this erroneous data. Conclusion: Even after systematic data cleaning of the NASA MDP datasets, we found new erroneous data. Data quality should always be explicitly considered by researchers before use.
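A hedged sketch of rule-based cleaning checks in the spirit of such rules, flagging implausible metric combinations with pandas; the column names, thresholds, and rules are assumptions, not the published rule set:

```python
# Sketch: flag rows with implausible software-metric combinations
# (illustrative rules, not Shepperd et al.'s or the authors' exact rules).
import pandas as pd

df = pd.DataFrame({
    "LOC_TOTAL":      [120, 45, 0, 80],
    "LOC_EXECUTABLE": [100, 60, 0, 70],
    "CYCLOMATIC":     [10, 4, 1, 0],
})

rules = {
    "executable LOC exceeds total LOC": df["LOC_EXECUTABLE"] > df["LOC_TOTAL"],
    "module has zero total LOC":        df["LOC_TOTAL"] == 0,
    "zero complexity but non-zero LOC": (df["CYCLOMATIC"] == 0) & (df["LOC_TOTAL"] > 0),
}
for name, violated in rules.items():
    print(name, "-> rows", df.index[violated].tolist())
```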