Software defect prediction aims to identify defect-prone software modules so that testing resources can be allocated efficiently. Testing plays a vital role in the software development life cycle, especially as software systems become increasingly complex and therefore more susceptible to defects. Several researchers have striven to develop models that can identify defective modules, with the aim of reducing the time and cost of software testing. Such models are typically trained on software measurements, known as software metrics, which describe the characteristics of a software project in terms of, e.g., dimension and complexity. These metrics reduce the subjectivity of software quality assessment and can support decision making, e.g., deciding where to focus software tests.
The aim of our work is to combine feature selection or construction techniques with machine learning techniques to build software defect prediction models on different kinds of software metrics datasets (derived from various software projects available in the NASA and Eclipse repositories), and to assess their performance in terms of accuracy, precision, recall and area under the curve (AUC). We have used non-parametric tests to assess the statistical significance of the obtained results. The collected metrics belong to three main categories: dimension, complexity and object orientation. The datasets involved contain class labels, i.e., information on whether each software module is defective.
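As an illustration of the four evaluation measures named above, the following is a minimal Python sketch that computes accuracy, precision, recall and AUC from the class labels and the scores of a binary defect predictor. It is not taken from the R application described in this work; the function names, the 0.5 threshold and the toy data are assumptions made for the example, and AUC is computed here as the area under the ROC curve through its pairwise-ranking formulation.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision and recall for binary labels (1 = defective, 0 = clean)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when no module is predicted or labelled defective.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve: the probability that a randomly chosen
    defective module is scored higher than a randomly chosen clean one
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one defective and one clean module")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true labels and predicted defect probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]
# Assumed decision threshold of 0.5 to turn scores into class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
metrics["auc"] = roc_auc(y_true, scores)
```

Accuracy, precision and recall depend on the chosen threshold, whereas AUC is computed from the raw scores and summarizes ranking quality across all thresholds, which is one reason the two kinds of measure are commonly reported together.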
To make our study available to the research community, we have developed an open-source, extensible R application that helps researchers load the selected kinds of datasets, filter them according to their features, and apply machine learning techniques.