Biblio

Filters: Keyword is data quality
2022-12-02
Mohammed, Mahmood, Talburt, John R., Dagtas, Serhan, Hollingsworth, Melissa.  2021.  A Zero Trust Model Based Framework For Data Quality Assessment. 2021 International Conference on Computational Science and Computational Intelligence (CSCI). :305–307.

The zero trust security model has been gaining adoption in many organizations due to its advantages. Data quality is still one of the fundamental challenges in data curation in many organizations, where data consumers don’t trust data due to associated quality issues. As a result, there is a lack of confidence in making business decisions based on data. We design a model based on the zero trust security model to demonstrate how the trust of data consumers can be established. We present a sample application to distinguish the traditional approach from the zero-trust-based data quality framework.
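The abstract does not include an implementation; as a minimal sketch of the underlying idea, in which every request for data is treated as untrusted and must pass quality checks at access time, the following Python fragment uses hypothetical check names and thresholds:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical illustration: every access to a dataset is verified against
# quality checks at request time, instead of trusting data once ingested.

@dataclass
class QualityCheck:
    name: str
    check: Callable[[list[dict]], float]   # returns a score in [0, 1]
    threshold: float

def zero_trust_fetch(records: list[dict], checks: list[QualityCheck]) -> list[dict]:
    """Release records only if every quality check passes its threshold."""
    for qc in checks:
        score = qc.check(records)
        if score < qc.threshold:
            raise PermissionError(f"quality check '{qc.name}' failed: {score:.2f} < {qc.threshold}")
    return records

# Example checks: completeness of a required field and uniqueness of a key.
completeness = QualityCheck(
    "email_completeness",
    lambda rs: sum(1 for r in rs if r.get("email")) / max(len(rs), 1),
    0.95,
)
uniqueness = QualityCheck(
    "id_uniqueness",
    lambda rs: len({r["id"] for r in rs}) / max(len(rs), 1),
    1.0,
)

data = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": "b@x.com"}]
print(zero_trust_fetch(data, [completeness, uniqueness]))
```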

2021-06-24
Habib ur Rehman, Muhammad, Mukhtar Dirir, Ahmed, Salah, Khaled, Svetinovic, Davor.  2020.  FairFed: Cross-Device Fair Federated Learning. 2020 IEEE Applied Imagery Pattern Recognition Workshop (AIPR). :1–7.
Federated learning (FL) is a rapidly developing machine learning technique used to perform collaborative model training over decentralized datasets. FL enables privacy-preserving model development whereby the datasets are scattered over a large set of data producers (i.e., devices and/or systems). These data producers train the learning models, encapsulate the model updates with differential privacy techniques, and share them with centralized systems for global aggregation. However, these centralized models are always prone to adversarial attacks (such as data-poisoning and model-poisoning attacks) due to the large number of data producers. Hence, FL methods need to ensure fairness and high-quality model availability across all the participants in the underlying AI systems. In this paper, we propose a novel FL framework, called FairFed, to meet fairness and high-quality data requirements. FairFed provides a fairness mechanism to detect adversaries across the devices and datasets in the FL network and reject their model updates. We use a Python-simulated FL framework to enable large-scale training over the MNIST dataset. We simulate a cross-device model training setting to detect adversaries in the training network. We use TensorFlow Federated and Python to implement the fairness protocol, the deep neural network, and the outlier detection algorithm. We thoroughly test the proposed FairFed framework with random and uniform data distributions across the training network and compare our initial results with the baseline fairness scheme. Our proposed work shows promising results in terms of model accuracy and loss.
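The abstract does not reproduce the FairFed protocol itself; the sketch below only illustrates the general idea of rejecting adversarial model updates before aggregation, using distance-to-median scoring as an assumed stand-in for the paper's outlier detection:

```python
import numpy as np

def fair_aggregate(client_updates: list[np.ndarray], z_thresh: float = 2.5) -> np.ndarray:
    """Aggregate client model updates, rejecting statistical outliers.

    Illustrative only: distance-to-median scoring stands in for the
    paper's fairness/outlier-detection protocol.
    """
    updates = np.stack(client_updates)            # shape: (clients, params)
    median = np.median(updates, axis=0)
    dists = np.linalg.norm(updates - median, axis=1)
    z = (dists - dists.mean()) / (dists.std() + 1e-12)
    keep = z < z_thresh                           # reject suspiciously distant updates
    return updates[keep].mean(axis=0)

# Nine honest clients plus one poisoned update with a huge norm.
honest = [np.random.normal(0, 0.1, size=100) for _ in range(9)]
poisoned = np.full(100, 50.0)
global_update = fair_aggregate(honest + [poisoned])
print(global_update.shape)   # (100,)
```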
2021-05-05
Zhao, Bushi, Zhang, Hao, Luo, Yixi.  2020.  Automatic Error Correction Technology for the Same Field in the Same Kind of Power Equipment Account Data. 2020 IEEE 3rd International Conference of Safe Production and Informatization (IICSPI). :153–157.
Account data of the electrical power system links all businesses across the whole life cycle of equipment. Improving the quality of power equipment account data is of great significance for improving the information level of power enterprises. Previously, the only error correction technology available checked whether a field was empty or contained garbled code. This paper proposes an error correction technique for the same field across the same kind of power equipment account data. Combined with the characteristics of the production business, potentially similar power equipment can be found through the functional location type and other fields of power equipment account data. Based on the principle of search scoring, horizontal comparison is used to search and score in turn. Finally, potential spare parts and existing data quality issues are identified according to the scores, and a judgment is made on whether inspection and maintenance are necessary.
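As an illustration of the search-and-score idea (not the authors' implementation), one might compare candidate records that share a functional location type field by field and flag high-scoring pairs; all field names and weights below are hypothetical:

```python
# Hypothetical sketch of the search-and-score idea: candidate records that share
# a functional location type are compared field by field, and high scores flag
# potentially duplicated (spare-part) equipment or inconsistent fields.
FIELDS = {"voltage_level": 0.4, "manufacturer": 0.3, "model": 0.3}  # assumed weights

def score(rec_a: dict, rec_b: dict) -> float:
    return sum(w for f, w in FIELDS.items() if rec_a.get(f) == rec_b.get(f))

def find_similar(target: dict, records: list[dict], threshold: float = 0.6) -> list[tuple[dict, float]]:
    candidates = [r for r in records
                  if r is not target and r.get("function_location_type") == target.get("function_location_type")]
    scored = [(r, score(target, r)) for r in candidates]
    return [(r, s) for r, s in scored if s >= threshold]

accounts = [
    {"id": "T-001", "function_location_type": "transformer", "voltage_level": "110kV",
     "manufacturer": "ACME", "model": "TX-9"},
    {"id": "T-002", "function_location_type": "transformer", "voltage_level": "110kV",
     "manufacturer": "ACME", "model": "TX-9"},   # likely the same physical unit or a spare
]
print(find_similar(accounts[0], accounts))
```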
2021-04-27
Khokhlov, I., Reznik, L..  2020.  What is the Value of Data Value in Practical Security Applications. 2020 IEEE Systems Security Symposium (SSS). :1–8.

Data value (DV) is a novel concept introduced as one of the features of the Big Data phenomenon. While continuing an investigation of the DV ontology and its relationship with data quality (DQ) at the conceptual level, this paper investigates possible applications and uses of DV in the practical design of security and privacy protection systems and tools. We present a novel approach to DV evaluation that maps DQ metrics into a DV value. The developed methods allow DV and DQ to be used in a wide range of application domains. To demonstrate how the DQ and DV concepts can be employed in real tasks, we present two real-life scenarios. The first use case demonstrates the use of DV in crowdsensing application design. It shows how DV can be calculated by integrating various metrics characterizing data application functionality, accuracy, and security. The second incorporates privacy considerations into the DV calculus by exploring the relationship between privacy, DQ, and DV in the defense against website fingerprinting in The Onion Router (TOR) networks. These examples demonstrate how our methods of DV and DQ evaluation may be employed in the design of real systems with security and privacy considerations.
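The exact DQ-to-DV mapping is not given in the abstract; the following sketch assumes a simple weighted aggregation of DQ metrics into a single DV score, purely for illustration:

```python
# Assumed illustration: data value (DV) computed as a weighted combination of
# data quality (DQ) metrics; weights reflect how much each metric matters to
# the application (e.g., a crowdsensing task).
def data_value(dq_metrics: dict[str, float], weights: dict[str, float]) -> float:
    total_w = sum(weights.values())
    return sum(weights[m] * dq_metrics.get(m, 0.0) for m in weights) / total_w

dq = {"accuracy": 0.92, "timeliness": 0.70, "completeness": 0.85, "security": 0.60}
w = {"accuracy": 0.4, "timeliness": 0.2, "completeness": 0.2, "security": 0.2}
print(f"DV = {data_value(dq, w):.3f}")
```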

2021-04-09
Lyshevski, S. E., Aved, A., Morrone, P..  2020.  Information-Centric Cyberattack Analysis and Spatiotemporal Networks Applied to Cyber-Physical Systems. 2020 IEEE Microwave Theory and Techniques in Wireless Communications (MTTW). 1:172–177.

Cyber-physical systems (CPS) depend on cybersecurity to ensure functionality, data quality, cyberattack resilience, etc. There are known and unknown cyber threats and attacks that pose significant risks. Information assurance and information security are critical. Many systems are vulnerable to intelligence exploitation and cyberattacks. By investigating cybersecurity risks and the formal representation of CPS using spatiotemporal dynamic graphs and networks, this paper investigates topics and solutions aimed at examining and empowering: (1) cybersecurity capabilities; (2) information assurance and system vulnerabilities; (3) detection of cyber threats and attacks; (4) situational awareness; etc. We introduce statistically characterized dynamic graphs and novel entropy-centric algorithms and calculi, which promise to ensure near-real-time capabilities.

2021-03-22
Fan, X., Zhang, F., Turamat, E., Tong, C., Wu, J. H., Wang, K..  2020.  Provenance-based Classification Policy based on Encrypted Search. 2020 2nd International Conference on Industrial Artificial Intelligence (IAI). :1–6.
As an important type of cloud data, digital provenance is attracting increasing attention for improving system performance. Currently, provenance is employed to provide cues regarding access control and to estimate data quality. However, provenance itself might also be sensitive information. Therefore, provenance might be encrypted and stored in the Cloud. In this paper, we provide a mechanism to classify cloud documents by searching specific keywords from their encrypted provenance, and we prove our scheme achieves semantic security. In terms of application of the proposed techniques, since files are classified and stored separately in the cloud to facilitate regulation and security protection, classification policies can use provenance as conditions to determine the category of a document. For example, a simple policy might state that documents that have been reviewed twice can be classified as “publicly accessible”, i.e., accessible by the public.
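A rough sketch of classifying documents from encrypted provenance might use keyed keyword tokens, so the matching party never sees plaintext keywords. This stand-in is not the paper's semantically secure scheme and omits the protections a real construction requires:

```python
import hmac, hashlib, os

# Illustrative stand-in for searchable encryption: provenance keywords are stored
# only as keyed HMAC tokens, so the cloud can match a keyword trapdoor against a
# document's provenance without seeing the plaintext keywords.
KEY = os.urandom(32)

def token(keyword: str) -> bytes:
    return hmac.new(KEY, keyword.encode(), hashlib.sha256).digest()

def index_provenance(keywords: list[str]) -> set[bytes]:
    return {token(k) for k in keywords}

def classify(doc_index: set[bytes], policy: dict[str, list[str]]) -> str:
    """Assign the first category whose required keywords all appear in the provenance."""
    for category, required in policy.items():
        if all(token(k) in doc_index for k in required):
            return category
    return "restricted"

doc = index_provenance(["reviewed", "approved-by-editor"])
policy = {"public": ["reviewed", "approved-by-editor"]}
print(classify(doc, policy))   # public
```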
2020-11-16
Choudhury, O., Sylla, I., Fairoza, N., Das, A..  2019.  A Blockchain Framework for Ensuring Data Quality in Multi-Organizational Clinical Trials. 2019 IEEE International Conference on Healthcare Informatics (ICHI). :1–9.
The cost and complexity of conducting multi-site clinical trials have significantly increased over time, with site monitoring, data management, and Institutional Review Board (IRB) amendments being key drivers. Trial sponsors, such as pharmaceutical companies, are also increasingly outsourcing trial management to multiple organizations. Enforcing compliance with standard operating procedures, such as preserving data privacy for human subject protection, is crucial for upholding the integrity of a study and its findings. Current efforts to ensure the quality of data collected at multiple sites and by multiple organizations lack a secure, trusted, and efficient framework for fragmented data capture. To address this challenge, we propose a novel data management infrastructure based on a permissioned blockchain with private channels, smart contracts, and distributed ledgers. We use an example multi-organizational clinical trial to design and implement a blockchain network: we generate activity-specific private channels to segregate data flow for confidentiality, write channel-specific smart contracts to enforce regulatory guidelines, monitor the immutable transaction log to detect protocol breaches, and auto-generate an audit trail. Through a comprehensive experimental study, we demonstrate that our system handles high-throughput transactions, exhibits low latency, and constitutes a trusted, scalable solution.
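As a toy illustration of the design (not Hyperledger-style chaincode), the sketch below keeps a hash-chained, append-only audit trail and applies a smart-contract-like rule that rejects entries violating a protocol constraint; the record fields and the consent rule are assumptions:

```python
import hashlib, json, time

# Toy stand-in for the paper's permissioned-ledger design: each trial activity is
# appended as a hash-chained record, and a "smart contract"-style rule rejects
# entries that violate a protocol constraint. Real deployments would use a
# framework such as Hyperledger Fabric.
class AuditTrail:
    def __init__(self):
        self.chain = [{"index": 0, "data": "genesis", "prev": "0" * 64}]
        self.chain[0]["hash"] = self._digest(self.chain[0])

    @staticmethod
    def _digest(block: dict) -> str:
        payload = {k: v for k, v in block.items() if k != "hash"}
        return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

    def append(self, record: dict):
        if record.get("consent") is not True:          # protocol rule: no data without consent
            raise ValueError("protocol breach: missing subject consent")
        block = {"index": len(self.chain), "data": record,
                 "ts": time.time(), "prev": self.chain[-1]["hash"]}
        block["hash"] = self._digest(block)
        self.chain.append(block)

    def verify(self) -> bool:
        return all(b["prev"] == p["hash"] and b["hash"] == self._digest(b)
                   for p, b in zip(self.chain, self.chain[1:]))

trail = AuditTrail()
trail.append({"site": "A", "subject": "S-017", "event": "enrolled", "consent": True})
print(trail.verify())   # True
```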
2020-08-28
McFadden, Danny, Lennon, Ruth, O’Raw, John.  2019.  AIS Transmission Data Quality: Identification of Attack Vectors. 2019 International Symposium ELMAR. :187–190.

Due to safety concerns and legislation implemented by various governments, the maritime sector adopted the Automatic Identification System (AIS). Whilst governments and state agencies have an increasing reliance on AIS data, the underlying technology is fundamentally insecure. This study identifies and describes a number of potential attack vectors and suggests conceptual countermeasures to mitigate such attacks. With AIS used for interception by navies and coast guards as well as for marine navigation and obstacle avoidance, the vulnerabilities within AIS call into question the multiple overlapping AIS networks already deployed and what the future holds for the protocol.

2020-04-20
Liu, Kai-Cheng, Kuo, Chuan-Wei, Liao, Wen-Chiuan, Wang, Pang-Chieh.  2018.  Optimized Data de-Identification Using Multidimensional k-Anonymity. 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE). :1610–1614.
In the globalized knowledge economy, big data analytics have been widely applied in diverse areas. A critical issue in big data analysis on personal information is the possible leak of personal privacy. Therefore, it is necessary to have an anonymization-based de-identification method to avoid undesirable privacy leaks. Such a method can prevent published data from being traced back to personal privacy. Prior empirical research has provided approaches to reduce privacy leak risk, e.g., Maximum Distance to Average Vector (MDAV), the Condensation Approach, and Differential Privacy. However, previous methods inevitably generate synthetic data of different sizes and are thus unsuitable for general use. To satisfy the need for general use, k-anonymity can be chosen as the privacy protection mechanism in the de-identification process to ensure the data are not distorted, because k-anonymity is strong in both protecting privacy and preserving data authenticity. Accordingly, this study proposes an optimized multidimensional method for anonymizing data based on both the priority weight-adjusted method and the mean difference recommending tree method (MDR tree method). The results of this study reveal that the new method generates more reliable anonymous data and reduces the information loss rate.
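Independently of the paper's MDR-tree generalization step, the k-anonymity property itself is easy to check: every combination of quasi-identifier values must occur at least k times. A minimal pandas sketch, with invented columns:

```python
import pandas as pd

# Minimal sketch of verifying k-anonymity after generalization: every combination
# of quasi-identifiers must occur at least k times. The generalization step itself
# (e.g., the paper's MDR-tree method) is not reproduced here.
def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    return df.groupby(quasi_identifiers).size().min() >= k

records = pd.DataFrame({
    "age_range":  ["20-29", "20-29", "20-29", "30-39", "30-39", "30-39"],
    "zip_prefix": ["940**", "940**", "940**", "941**", "941**", "941**"],
    "diagnosis":  ["flu", "cold", "flu", "asthma", "flu", "cold"],
})
print(is_k_anonymous(records, ["age_range", "zip_prefix"], k=3))   # True
```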
2020-03-02
Wang, Meng, Chow, Joe H., Hao, Yingshuai, Zhang, Shuai, Li, Wenting, Wang, Ren, Gao, Pengzhi, Lackner, Christopher, Farantatos, Evangelos, Patel, Mahendra.  2019.  A Low-Rank Framework of PMU Data Recovery and Event Identification. 2019 International Conference on Smart Grid Synchronized Measurements and Analytics (SGSMA). :1–9.

The large amounts of synchrophasor data obtained by Phasor Measurement Units (PMUs) provide dynamic visibility into power systems. Extracting reliable information from the data can enhance power system situational awareness. The data quality often suffers from data losses, bad data, and cyber data attacks. Data privacy is also an increasing concern. In this paper, we discuss our recently proposed framework of data recovery, error correction, data privacy enhancement, and event identification methods by exploiting the intrinsic low-dimensional structures in the high-dimensional spatial-temporal blocks of PMU data. Our data-driven approaches are computationally efficient with provable analytical guarantees. The data recovery method can recover the ground-truth data even if simultaneous and consecutive data losses and errors happen across all PMU channels for some time. We can identify PMU channels that are under false data injection attacks by locating abnormal dynamics in the data. The data recovery method for the operator can extract the information accurately by collectively processing the privacy-preserving data from many PMUs. A cyber intruder with access to partial measurements cannot recover the data correctly even using the same approach. A real-time event identification method is also proposed, based on the new idea of characterizing an event by the low-dimensional subspace spanned by the dominant singular vectors of the data matrix.
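The authors' recovery method is not reproduced in the abstract; the generic low-rank completion sketch below (iterative truncated-SVD imputation with NumPy) only illustrates how missing PMU samples can be filled in by exploiting low-dimensional structure:

```python
import numpy as np

# Generic low-rank matrix-completion sketch (iterative truncated-SVD imputation),
# illustrating the idea of exploiting low-dimensional structure in PMU blocks.
# This is not the authors' algorithm and carries no optimality guarantees.
def lowrank_impute(M: np.ndarray, mask: np.ndarray, rank: int, iters: int = 50) -> np.ndarray:
    X = np.where(mask, M, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U[:, :rank] * s[:rank]) @ Vt[:rank]       # best rank-r approximation
        X = np.where(mask, M, X)                       # keep observed entries fixed
    return X

rng = np.random.default_rng(0)
truth = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 200))   # rank-3 "PMU block"
mask = rng.random(truth.shape) > 0.3                           # 30% of samples lost
recovered = lowrank_impute(truth, mask, rank=3)
print(np.abs(recovered - truth)[~mask].max())                  # reconstruction error on the lost samples
```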

2019-03-06
Hess, S., Satam, P., Ditzler, G., Hariri, S..  2018.  Malicious HTML File Prediction: A Detection and Classification Perspective with Noisy Data. 2018 IEEE/ACS 15th International Conference on Computer Systems and Applications (AICCSA). :1-7.

Cybersecurity plays a critical role in protecting sensitive information and the structural integrity of networked systems. As networked systems continue to expand in number as well as in complexity, so does the threat of malicious activity and the necessity for advanced cybersecurity solutions. Furthermore, both the quantity and quality of available data on malicious content, as well as the fact that malicious activity continuously evolves, make automated protection systems for this type of environment particularly challenging. Not only is the data quality a concern, but the volume of the data can be quite small for some of the classes. This creates a class imbalance in the data used to train a classifier; however, many classifiers are not well equipped to deal with class imbalance. One such example is detecting malicious HTML files from static features. Unfortunately, collecting malicious HTML files is extremely difficult, and the data can be quite noisy due to HTML files being mislabeled. This paper evaluates a specific application that is afflicted by these modern cybersecurity challenges: detection of malicious HTML files. Previous work presented a general framework for malicious HTML file classification that we modify in this work to use a χ² feature selection technique and the synthetic minority oversampling technique (SMOTE). We experiment with different classifiers (i.e., AdaBoost, GentleBoost, RobustBoost, RUSBoost, and Random Forest) and a pure detection model (i.e., Isolation Forest). We benchmark the different classifiers using SMOTE on a real dataset that contains a limited number of malicious files (40) with respect to the normal files (7,263). It was found that the modified framework performed better than the previous framework's results. However, additional evidence was found to imply that algorithms which train on both the normal and malicious samples are likely overtraining to the malicious distribution. We demonstrate the likely overtraining by determining that a subset of the malicious files, while suspicious, did not come from a malicious source.
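A hedged sketch of such a pipeline with scikit-learn and imbalanced-learn, using synthetic data in place of the paper's HTML feature set, might look like this (the feature counts and classifier settings are assumptions):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative pipeline (not the paper's exact setup): chi-squared feature
# selection plus SMOTE oversampling of the minority (malicious) class,
# followed by a Random Forest. Counts and features are synthetic stand-ins.
X, y = make_classification(n_samples=2000, n_features=40, n_informative=8,
                           weights=[0.99, 0.01], random_state=0)
X = np.abs(X)                                   # chi2 requires non-negative features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = Pipeline([
    ("select", SelectKBest(chi2, k=15)),
    ("smote", SMOTE(random_state=0)),
    ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
])
clf.fit(X_tr, y_tr)
print("ROC-AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```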

2018-12-10
Oyekanlu, E..  2018.  Distributed Osmotic Computing Approach to Implementation of Explainable Predictive Deep Learning at Industrial IoT Network Edges with Real-Time Adaptive Wavelet Graphs. 2018 IEEE First International Conference on Artificial Intelligence and Knowledge Engineering (AIKE). :179–188.
Challenges associated with developing analytics solutions at the edge of large-scale Industrial Internet of Things (IIoT) networks, close to where data is being generated, in most cases involve developing analytics solutions from the ground up. However, this approach increases IoT development costs and system complexity, delays time to market, and ultimately lowers the competitive advantages associated with delivering next-generation IoT designs. To overcome these challenges, existing, widely available hardware can be utilized to successfully participate in distributed edge computing for IIoT systems. In this paper, an osmotic computing approach is used to illustrate how distributed osmotic computing and existing low-cost hardware may be utilized to solve complex, compute-intensive Explainable Artificial Intelligence (XAI) deep learning problems from the edge, through the fog, to the network cloud layer of IIoT systems. At the edge layer, the C28x digital signal processor (DSP), an existing low-cost, embedded, real-time DSP that has very wide deployment and integration in several IoT industries, is used as a case study for constructing real-time graph-based Coiflet wavelets that could be used for several analytic applications, including deep learning pre-processing applications at the edge and fog layers of IIoT networks. Our implementation is the first known application of the fixed-point C28x DSP to construct Coiflet wavelets. Coiflet wavelets are constructed in the form of an osmotic microservice, using embedded low-level machine language to program the C28x at the network edge. With the graph-based approach, it is shown that an entire Coiflet wavelet distribution can be generated from only one wavelet stored in the C28x-based edge device, which could lead to significant savings in memory at the edge of IoT networks. The Pearson correlation coefficient is used to select an edge-generated Coiflet wavelet, and the selected wavelet is used at the fog layer for pre-processing and denoising IIoT data to improve data quality for the fog-layer deep learning application. Parameters for implementing deep learning at the fog layer using LSTM networks have been determined in the cloud. For XAI, communication network noise is shown to have a significant impact on the results of predictive deep learning at the IIoT network fog layer.
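As a desktop-Python stand-in for the fog-layer denoising step (not the fixed-point C28x implementation), Coiflet-wavelet soft-threshold denoising can be sketched with PyWavelets; the signal, wavelet order, and threshold rule below are assumptions:

```python
import numpy as np
import pywt

# Sketch of the fog-layer pre-processing step: denoise a noisy IIoT signal with a
# Coiflet wavelet using soft thresholding. PyWavelets stands in for the paper's
# C28x DSP implementation; the threshold below is the common universal threshold.
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 1024)
clean = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 17 * t)
noisy = clean + rng.normal(0, 0.4, t.size)      # communication-channel noise

coeffs = pywt.wavedec(noisy, "coif3", level=5)
sigma = np.median(np.abs(coeffs[-1])) / 0.6745  # noise estimate from the finest detail level
thresh = sigma * np.sqrt(2 * np.log(noisy.size))
coeffs = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
denoised = pywt.waverec(coeffs, "coif3")

print("noise RMS before:", np.sqrt(np.mean((noisy - clean) ** 2)))
print("noise RMS after: ", np.sqrt(np.mean((denoised[:clean.size] - clean) ** 2)))
```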
2018-11-14
Iwaya, L. H., Fischer-Hübner, S., Åhlfeldt, R., Martucci, L. A..  2018.  mHealth: A Privacy Threat Analysis for Public Health Surveillance Systems. 2018 IEEE 31st International Symposium on Computer-Based Medical Systems (CBMS). :42–47.

Community Health Workers (CHWs) have been using Mobile Health Data Collection Systems (MDCSs) for supporting the delivery of primary healthcare and carrying out public health surveys, feeding national-level databases with families' personal data. Such systems are used for public surveillance and manage sensitive data (i.e., health data), so addressing the privacy issues is crucial for successfully deploying MDCSs. In this paper we present a comprehensive privacy threat analysis for MDCSs, discuss the privacy challenges, and provide recommendations that are especially useful to health managers and developers. We ground our analysis on a large-scale MDCS used for primary care (GeoHealth) and a well-known Privacy Impact Assessment (PIA) methodology. The threat analysis is based on a compilation of relevant privacy threats from the literature as well as brainstorming sessions with privacy and security experts. Among the main findings, we observe that existing MDCSs do not employ adequate controls for achieving transparency and intervenability, thus threatening fundamental privacy principles such as data quality, the right to access, and the right to object. Furthermore, although there has been significant research dealing with data security issues, attention to privacy in its multiple dimensions is notably lacking.

2018-03-26
Mihindukulasooriya, Nandana, Rico, Mariano, Santana-Pérez, Idafen, García-Castro, Raúl, Gómez-Pérez, Asunción.  2017.  Repairing Hidden Links in Linked Data: Enhancing the Quality of RDF Knowledge Graphs. Proceedings of the Knowledge Capture Conference. :6:1–6:8.

Knowledge Graphs (KG) are becoming core components of most artificial intelligence applications. Linked Data, as a method of publishing KGs, allows applications to traverse within, and even out of, the graph thanks to global dereferenceable identifiers denoting entities, in the form of IRIs. However, as we show in this work, after analyzing several popular datasets (namely DBpedia, LOD Cache, and Web Data Commons JSON-LD data) many entities are being represented using literal strings where IRIs should be used, diminishing the advantages of using Linked Data. To remedy this, we propose an approach for identifying such strings and replacing them with their corresponding entity IRIs. The proposed approach is based on identifying relations between entities based on both ontological axioms as well as data profiling information and converting strings to entity IRIs based on the types of entities linked by each relation. Our approach showed 98% recall and 76% precision in identifying such strings and 97% precision in converting them to their corresponding IRI in the considered KG. Further, we analyzed how the connectivity of the KG is increased when new relevant links are added to the entities as a result of our method. Our experiments on a subset of the Spanish DBpedia data show that it could add 25% more links to the KG and improve the overall connectivity by 17%.
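A minimal rdflib sketch of the repair idea, replacing a string literal with the IRI of an entity whose name matches; exact name matching is a simplification of the paper's axiom- and profiling-based approach, and the example data is invented:

```python
from rdflib import Graph, Literal, URIRef, Namespace
from rdflib.namespace import RDF

# Minimal sketch of the repair idea: where an object-valued relation holds a plain
# string, look up an existing entity with a matching name and replace the literal
# with its IRI.
EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Madrid, RDF.type, EX.City))
g.add((EX.Madrid, EX.name, Literal("Madrid")))
g.add((EX.Spain, EX.capital, Literal("Madrid")))     # hidden link: should point to ex:Madrid

def repair_hidden_links(graph: Graph, relation: URIRef, expected_type: URIRef):
    # Index entities of the expected type by their name literal.
    by_name = {str(name): s for s in graph.subjects(RDF.type, expected_type)
               for name in graph.objects(s, EX.name)}
    for s, o in list(graph.subject_objects(relation)):
        if isinstance(o, Literal) and str(o) in by_name:
            graph.remove((s, relation, o))
            graph.add((s, relation, by_name[str(o)]))

repair_hidden_links(g, EX.capital, EX.City)
print(list(g.objects(EX.Spain, EX.capital)))   # [URIRef('http://example.org/Madrid')]
```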

2017-03-08
Mahajan, S., Katti, J., Walunj, A., Mahalunkar, K..  2015.  Designing a database encryption technique for database security solution with cache. 2015 IEEE International Advance Computing Conference (IACC). :357–360.

A database is a vast collection of data which helps us to collect, retrieve, organize and manage the data in an efficient and effective manner. Databases are critical assets. They store client details, financial information, personal files, company secrets and other data necessary for business. Today people depend more and more on corporate data for decision making, customer service management, supply chain management, etc. Any loss, corruption, or unavailability of data may seriously affect performance. Database security should provide protected access to the contents of a database and should preserve the integrity, availability, consistency, and quality of the data. This paper describes an architecture based on placing an Elliptic Curve Cryptography (ECC) module inside the database management software (DBMS), just above the database cache. Using this method, only selected parts of the database can be encrypted instead of the whole database. This architecture allows us to achieve very strong data security using ECC and to increase performance using the cache.
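A rough sketch of selective column encryption with an ECC-derived key and a decryption cache, using an ECIES-style hybrid (ECDH plus AES-GCM) from the Python cryptography package rather than the paper's in-DBMS module; the schema, column choice, and cache policy are assumptions:

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

# Hybrid ECC (ECDH + AES-GCM) key setup: the symmetric column key is derived
# from an elliptic-curve key agreement rather than stored directly.
db_key = ec.generate_private_key(ec.SECP256R1())              # long-term key for the encryption module
ephemeral = ec.generate_private_key(ec.SECP256R1())
shared = ephemeral.exchange(ec.ECDH(), db_key.public_key())
aes_key = HKDF(hashes.SHA256(), 32, None, b"column-encryption").derive(shared)
aead = AESGCM(aes_key)

ENCRYPTED_COLUMNS = {"ssn", "card_number"}                    # only sensitive columns are encrypted
cache: dict[tuple[str, str], str] = {}                        # (row_id, column) -> plaintext

def store(row_id: str, column: str, value: str) -> bytes:
    if column not in ENCRYPTED_COLUMNS:
        return value.encode()
    nonce = os.urandom(12)
    return nonce + aead.encrypt(nonce, value.encode(), None)

def read(row_id: str, column: str, stored: bytes) -> str:
    if column not in ENCRYPTED_COLUMNS:
        return stored.decode()
    if (row_id, column) not in cache:                         # decrypt once, then serve from cache
        cache[(row_id, column)] = aead.decrypt(stored[:12], stored[12:], None).decode()
    return cache[(row_id, column)]

blob = store("row-1", "ssn", "123-45-6789")
print(read("row-1", "ssn", blob))
```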

2017-03-07
Huang, Dejun, Gairola, Dhruv, Huang, Yu, Zheng, Zheng, Chiang, Fei.  2016.  PARC: Privacy-Aware Data Cleaning. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. :2433–2436.

Poor data quality has become a persistent challenge for organizations as data continues to grow in complexity and size. Existing data cleaning solutions focus on identifying repairs to the data to minimize either a cost function or the number of updates. These techniques, however, fail to consider underlying data privacy requirements that exist in many real data sets containing sensitive and personal information. In this demonstration, we present PARC, a Privacy-AwaRe data Cleaning system that corrects data inconsistencies w.r.t. a set of FDs, and limits the disclosure of sensitive values during the cleaning process. The system core contains modules that evaluate three key metrics during the repair search, and solves a multi-objective optimization problem to identify repairs that balance the privacy vs. utility tradeoff. This demonstration will enable users to understand: (1) the characteristics of a privacy-preserving data repair; (2) how to customize data cleaning and data privacy requirements using two real datasets; and (3) the distinctions among the repair recommendations via visualization summaries.

Sadri, Mehdi, Mehrotra, Sharad, Yu, Yaming.  2016.  Online Adaptive Topic Focused Tweet Acquisition. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. :2353–2358.

Twitter provides a public streaming API that is strictly limited, making it difficult to simultaneously achieve good coverage and relevance when monitoring tweets for a specific topic of interest. In this paper, we address the tweet acquisition challenge to enhance monitoring of tweets based on the client/application needs in an online adaptive manner such that the quality and quantity of the results improves over time. We propose a Tweet Acquisition System (TAS), that iteratively selects phrases to track based on an explore-exploit strategy. Our experimental studies show that TAS significantly improves recall of relevant tweets and the performance improves when the topics are more specific.
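A sketch of an explore-exploit phrase selector in the spirit of TAS, using an epsilon-greedy rule as an assumed stand-in for the paper's strategy; phrases and feedback values are invented:

```python
import random

# Sketch of explore-exploit phrase selection: each tracked phrase is scored by the
# fraction of relevant tweets it has returned; most iterations exploit the best
# phrases while some explore others.
class PhraseSelector:
    def __init__(self, phrases: list[str], epsilon: float = 0.2):
        self.stats = {p: {"relevant": 0, "total": 0} for p in phrases}
        self.epsilon = epsilon

    def select(self, k: int) -> list[str]:
        if random.random() < self.epsilon:
            return random.sample(list(self.stats), k)          # explore
        def rate(p):                                           # exploit: highest observed relevance rate
            s = self.stats[p]
            return s["relevant"] / s["total"] if s["total"] else 0.0
        return sorted(self.stats, key=rate, reverse=True)[:k]

    def update(self, phrase: str, relevant: int, total: int):
        self.stats[phrase]["relevant"] += relevant
        self.stats[phrase]["total"] += total

sel = PhraseSelector(["flood", "earthquake", "road closure", "evacuation"])
tracked = sel.select(k=2)
sel.update(tracked[0], relevant=12, total=40)   # feedback from the relevance model
print(sel.select(k=2))
```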

Chu, Xu, Ilyas, Ihab F., Krishnan, Sanjay, Wang, Jiannan.  2016.  Data Cleaning: Overview and Emerging Challenges. Proceedings of the 2016 International Conference on Management of Data. :2201–2206.

Detecting and repairing dirty data is one of the perennial challenges in data analytics, and failure to do so can result in inaccurate analytics and unreliable decisions. Over the past few years, there has been a surge of interest from both industry and academia on data cleaning problems including new abstractions, interfaces, approaches for scalability, and statistical techniques. To better understand the new advances in the field, we will first present a taxonomy of the data cleaning literature in which we highlight the recent interest in techniques that use constraints, rules, or patterns to detect errors, which we call qualitative data cleaning. We will describe the state-of-the-art techniques and also highlight their limitations with a series of illustrative examples. While traditionally such approaches are distinct from quantitative approaches such as outlier detection, we also discuss recent work that casts such approaches into a statistical estimation framework including: using Machine Learning to improve the efficiency and accuracy of data cleaning and considering the effects of data cleaning on statistical analysis.

Farid, Mina, Roatis, Alexandra, Ilyas, Ihab F., Hoffmann, Hella-Franziska, Chu, Xu.  2016.  CLAMS: Bringing Quality to Data Lakes. Proceedings of the 2016 International Conference on Management of Data. :2089–2092.

With the increasing incentive of enterprises to ingest as much data as they can in what is commonly referred to as "data lakes", and with the recent development of multiple technologies to support this "load-first" paradigm, the new environment presents serious data management challenges. Among them, the assessment of data quality and the cleaning of large volumes of heterogeneous data sources become essential tasks in unveiling the value of big data. The coveted use of unstructured and semi-structured data in large volumes makes current data cleaning tools (primarily designed for relational data) not directly adoptable. We present CLAMS, a system to discover and enforce expressive integrity constraints from large amounts of lake data with very limited schema information (e.g., represented as RDF triples). This demonstration shows how CLAMS is able to discover the constraints and the schemas they are defined on simultaneously. CLAMS also introduces a scale-out solution to efficiently detect errors in the raw data. CLAMS interacts with human experts to both validate the discovered constraints and to suggest data repairs. CLAMS has been deployed in a real large-scale enterprise data lake and was evaluated on a real data set of 1.2 billion triples. It has been able to spot multiple obscure data inconsistencies and errors early in the data processing stack, providing huge value to the enterprise.

Heindorf, Stefan, Potthast, Martin, Stein, Benno, Engels, Gregor.  2016.  Vandalism Detection in Wikidata. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. :327–336.

Wikidata is the new, large-scale knowledge base of the Wikimedia Foundation. Its knowledge is increasingly used within Wikipedia itself and various other kinds of information systems, imposing high demands on its integrity. Wikidata can be edited by anyone and, unfortunately, it frequently gets vandalized, exposing all information systems using it to the risk of spreading vandalized and falsified information. In this paper, we present a new machine learning-based approach to detect vandalism in Wikidata. We propose a set of 47 features that exploit both content and context information, and we report on 4 classifiers of increasing effectiveness tailored to this learning task. Our approach is evaluated on the recently published Wikidata Vandalism Corpus WDVC-2015 and it achieves an area under curve value of the receiver operating characteristic, ROC-AUC, of 0.991. It significantly outperforms the state of the art represented by the rule-based Wikidata Abuse Filter (0.865 ROC-AUC) and a prototypical vandalism detector recently introduced by Wikimedia within the Objective Revision Evaluation Service (0.859 ROC-AUC).
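As a toy illustration of classifying revisions from content and context features (not the paper's 47-feature model or the WDVC-2015 corpus), one can train a classifier on a few hand-crafted features and score it with ROC-AUC; all data below is synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy feature set for a revision: whether the editor is anonymous, account age,
# edit size, and comment length. Features and labels are synthetic stand-ins.
rng = np.random.default_rng(0)
n = 5000
is_anon = rng.integers(0, 2, n)
account_age_days = rng.exponential(300, n) * (1 - is_anon)
edit_size = rng.exponential(40, n)
comment_len = rng.integers(0, 120, n)
# Synthetic label: vandalism more likely for anonymous, new, large, uncommented edits.
logit = 1.5 * is_anon - 0.01 * account_age_days + 0.02 * edit_size - 0.02 * comment_len - 2
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([is_anon, account_age_days, edit_size, comment_len])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("ROC-AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```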

Petrić, Jean, Bowes, David, Hall, Tracy, Christianson, Bruce, Baddoo, Nathan.  2016.  The Jinx on the NASA Software Defect Data Sets. Proceedings of the 20th International Conference on Evaluation and Assessment in Software Engineering. :13:1–13:5.

Background: The NASA datasets have previously been used extensively in studies of software defects. In 2013 Shepperd et al. presented an essential set of rules for removing erroneous data from the NASA datasets making this data more reliable to use. Objective: We have now found additional rules necessary for removing problematic data which were not identified by Shepperd et al. Results: In this paper, we demonstrate the level of erroneous data still present even after cleaning using Shepperd et al.'s rules and apply our new rules to remove this erroneous data. Conclusion: Even after systematic data cleaning of the NASA MDP datasets, we found new erroneous data. Data quality should always be explicitly considered by researchers before use.
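As an illustration of rule-based cleaning of defect data (these generic plausibility rules are not Shepperd et al.'s or the authors' actual rule sets), a pandas sketch might look like this:

```python
import pandas as pd

# Illustrative data-quality checks in the spirit of such cleaning rules: drop
# duplicate cases, cases with missing values, and cases whose metrics are
# internally inconsistent. Column names follow common NASA MDP attributes.
def clean_defect_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates().dropna()
    df = df[df["LOC_TOTAL"] > 0]                                # a module must have code
    df = df[df["LOC_EXECUTABLE"] <= df["LOC_TOTAL"]]            # executable LOC cannot exceed total
    df = df[df["CYCLOMATIC_COMPLEXITY"] >= 1]                   # complexity of 0 is implausible
    return df

raw = pd.DataFrame({
    "LOC_TOTAL":             [120, 0, 80, 80],
    "LOC_EXECUTABLE":        [100, 0, 95, 60],
    "CYCLOMATIC_COMPLEXITY": [7, 0, 3, 3],
    "DEFECTIVE":             [1, 0, 0, 0],
})
print(clean_defect_data(raw))     # only the internally consistent rows remain
```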