Bibliography
Code-graph-based software defect prediction methods have become a research focus in the SDP field. Among them, the Code Property Graph is widely used to represent defective code because it captures the structural features and dependencies of defect-prone code. However, owing to the coarse granularity of the Code Property Graph, redundant information unrelated to defects is often attached to the characterization of software defects. How to locate software defects at a finer granularity within the Code Property Graph therefore remains an open problem. Static analysis is a technique for identifying software defects using predefined defect rules, and many proven static analysis tools exist in industry. In this paper, we propose a method for locating specific types of defects in the Code Property Graph based on the results of a static analysis tool. Experiments show that this location method can effectively predict the location of specific defect types in real software programs.
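The abstract does not spell out how static-analysis results are mapped onto the Code Property Graph, but the core idea can be sketched as matching warning locations to graph nodes. Below is a minimal, hypothetical illustration assuming the CPG is held in a networkx graph whose nodes carry file and line attributes, and that warnings arrive as (file, line, defect_type) tuples; the paper's actual graph construction and matching rules may differ.

import networkx as nx

def locate_defect_nodes(cpg: nx.DiGraph, warnings):
    # Source locations reported by the static analyzer: {(file, line), ...}
    locations = {(w_file, w_line) for w_file, w_line, _ in warnings}
    hits = []
    # Keep only CPG nodes whose recorded location matches a warning.
    for node, attrs in cpg.nodes(data=True):
        if (attrs.get("file"), attrs.get("line")) in locations:
            hits.append(node)
    return hits

The matched nodes (and their neighborhoods in the graph) would then serve as the finer-grained defect locations the abstract refers to.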
Defect prediction is an active topic in software quality assurance that can help developers find potential bugs and allocate resources more effectively. To improve prediction performance, this paper introduces cross-entropy, a common measure for natural language, as a new code metric for defect prediction tasks and proposes a framework called DefectLearner for this process. We first build a recurrent neural network language model that learns regularities in source code from a software repository. Based on the trained model, the cross-entropy of each component can be calculated. To evaluate its power to discriminate defect-proneness, cross-entropy is compared with 20 widely used metrics on 12 open-source projects. The experimental results show that the cross-entropy metric is more discriminative than 50% of the traditional metrics. In addition, we combine cross-entropy with traditional metric suites for more accurate defect prediction. With cross-entropy added, the performance of prediction models improves by an average of 2.8% in F1-score.
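As a rough illustration of the cross-entropy metric, the sketch below computes the average per-token cross-entropy of a code component under a trained language model. The model.prob(token, context) interface is a placeholder assumption; DefectLearner's actual RNN implementation is not reproduced here.

import math

def cross_entropy(tokens, model):
    # Average negative log2-probability the language model assigns to the
    # sequence: H = -(1/n) * sum_i log2 P(t_i | t_1..t_{i-1}).
    total = 0.0
    for i, token in enumerate(tokens):
        p = model.prob(token, context=tokens[:i])  # hypothetical interface
        total += -math.log2(p)
    return total / len(tokens)

Components whose token sequences the model finds surprising (high cross-entropy) are, on this paper's hypothesis, the more defect-prone ones.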
Code churn has been used successfully to identify defect-inducing changes in software development. Our recent analysis of cross-release code churn showed that several design metrics exhibit moderate correlation with the number of defects in complex systems. The goal of this paper is to explore whether cross-release code churn can be used to identify critical design changes and contribute to defect prediction for evolving software. In our case study, we used two types of data from consecutive releases of open-source projects, with and without cross-release code churn, to build standard prediction models. The prediction models were trained on earlier releases and tested on the following ones, with performance evaluated in terms of AUC, GM, and the effort-aware measure Pop. The comparison of their performance was used to answer our research question. The obtained results showed that the prediction model performs better when cross-release code churn is included. A practical implication of this research is that cross-release code churn can aid in the safe planning of the next release in software development.
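The evaluation protocol described here (train on release N, test on release N+1, compare models with and without the churn feature) can be sketched as follows. The logistic regression learner and the column names are illustrative assumptions, not the paper's exact setup.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def release_auc(train_df, test_df, features, label="defective"):
    # Train on the earlier release, score the following one.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_df[features], train_df[label])
    scores = clf.predict_proba(test_df[features])[:, 1]
    return roc_auc_score(test_df[label], scores)

# auc_without = release_auc(rel_n, rel_n1, design_metrics)
# auc_with    = release_auc(rel_n, rel_n1, design_metrics + ["cross_release_churn"])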
In software quality estimation research, software defect prediction is a key topic. A defect prediction model is generally constructed using a variety of software attributes, and each attribute may have a positive, negative, or neutral effect on a specific model. Selecting an optimal set of attributes for model development remains a vital yet largely unexplored issue. In this paper, we introduce a new feature space transformation process with a normalization technique to improve defect prediction accuracy. We propose a feature space transformation technique and classify the instances using a Support Vector Machine (SVM) with a histogram intersection kernel. The proposed method is evaluated on data sets from the NASA metrics data repository and demonstrates acceptable accuracy.
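The histogram intersection kernel named in the abstract is K(x, z) = sum_k min(x_k, z_k). A minimal sketch of its use with scikit-learn's SVC is shown below, assuming the paper's feature space transformation and normalization have already been applied to the inputs; this is an illustration, not the authors' exact pipeline.

import numpy as np
from sklearn.svm import SVC

def histogram_intersection(X, Z):
    # Gram matrix K[i, j] = sum_k min(X[i, k], Z[j, k]).
    return np.minimum(X[:, None, :], Z[None, :, :]).sum(axis=2)

clf = SVC(kernel=histogram_intersection)
# clf.fit(X_train, y_train); y_pred = clf.predict(X_test)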
Software defects can lead to runtime errors and system crashes. To detect defects as early as possible in software development, a series of machine learning approaches have been studied and applied to predict defects in software modules. Unfortunately, the class imbalance of software defect datasets poses a great challenge to training software defect prediction models. In this paper, a manifold-learning-based subspace learning algorithm, Discriminative Locality Alignment (DLA), is introduced for software defect prediction. Experimental results demonstrate that DLA is consistently superior to LDA (Linear Discriminant Analysis) and PCA (Principal Component Analysis) in terms of extracting discriminative information and prediction performance. In addition, DLA exhibits attractive intrinsic properties for numerical computation; for example, it can overcome the matrix singularity and small-sample-size problems in software defect prediction.
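The comparison implied above (project onto a learned subspace, then classify) can be framed with a common transformer interface. PCA and LDA are available in scikit-learn; DLA has no standard library implementation and would be supplied as a custom transformer exposing the same fit_transform/transform methods. The k-NN classifier is an illustrative choice, not necessarily the paper's.

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

def subspace_then_classify(reducer, X_train, y_train, X_test):
    # Learn the subspace on training data, project both sets, then classify.
    Xr_train = reducer.fit_transform(X_train, y_train)
    Xr_test = reducer.transform(X_test)
    clf = KNeighborsClassifier().fit(Xr_train, y_train)
    return clf.predict(Xr_test)

# preds_pca = subspace_then_classify(PCA(n_components=5), X_tr, y_tr, X_te)
# preds_lda = subspace_then_classify(LinearDiscriminantAnalysis(), X_tr, y_tr, X_te)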
Background: The NASA datasets have been used extensively in previous studies of software defects. In 2013, Shepperd et al. presented an essential set of rules for removing erroneous data from the NASA datasets, making the data more reliable to use. Objective: We have found additional rules, not identified by Shepperd et al., that are necessary for removing problematic data. Results: In this paper, we demonstrate the level of erroneous data still present even after cleaning with Shepperd et al.'s rules and apply our new rules to remove it. Conclusion: Even after systematic data cleaning of the NASA MDP datasets, we found new erroneous data. Data quality should always be explicitly considered by researchers before use.
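To make rule-based cleaning concrete, the sketch below applies a few checks of the kind such rules encode (duplicate cases, implausible metric values). These particular checks are illustrative only; the authoritative rule sets are those of Shepperd et al. and of this paper.

import pandas as pd

def clean_mdp(df: pd.DataFrame) -> pd.DataFrame:
    # Drop identical cases.
    df = df.drop_duplicates()
    # Drop rows with a negative count in any numeric metric.
    df = df[(df.select_dtypes("number") >= 0).all(axis=1)]
    # Drop rows where executable LOC implausibly exceeds total LOC.
    if {"LOC_EXECUTABLE", "LOC_TOTAL"} <= set(df.columns):
        df = df[df["LOC_EXECUTABLE"] <= df["LOC_TOTAL"]]
    return df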