The Jinx on the NASA Software Defect Data Sets

Title: The Jinx on the NASA Software Defect Data Sets
Publication Type: Conference Paper
Year of Publication: 2016
Authors: Petrić, Jean; Bowes, David; Hall, Tracy; Christianson, Bruce; Baddoo, Nathan
Conference Name: Proceedings of the 20th International Conference on Evaluation and Assessment in Software Engineering
Publisher: ACM
Conference Location: New York, NY, USA
ISBN Number: 978-1-4503-3691-8
Keywords: data quality, machine learning, pubcrawl170201, software defect prediction
Abstract

Background: The NASA datasets have previously been used extensively in studies of software defects. In 2013 Shepperd et al. presented an essential set of rules for removing erroneous data from the NASA datasets making this data more reliable to use. Objective: We have now found additional rules necessary for removing problematic data which were not identified by Shepperd et al. Results: In this paper, we demonstrate the level of erroneous data still present even after cleaning using Shepperd et al.'s rules and apply our new rules to remove this erroneous data. Conclusion: Even after systematic data cleaning of the NASA MDP datasets, we found new erroneous data. Data quality should always be explicitly considered by researchers before use.

URL: http://doi.acm.org/10.1145/2915970.2916007
DOI: 10.1145/2915970.2916007
Citation Key: petric_jinx_2016