Machine Learning Techniques for Software Analysis of Unlabelled Program Modules

Ronchieri, Elisabetta; Canaparo, Marco; Salomoni, Davide

doi:10.22323/1.351.0018

Abstract

Machine Learning (ML) has proven to be of great value in a variety of Software Engineering (SE) tasks, such as software defect prediction and estimation, and test code generation. To accomplish these tasks, software datasets (i.e. collections of modules, such as files and classes, together with their features, such as software metrics and defect data) have to be gathered and properly preprocessed before ML techniques are applied.

In SE practice, software datasets may lack classification data for some features: for example, defect data are often missing because they are difficult to collect in new projects or in projects with partial historical data. Such datasets are called unlabelled datasets, and they make up the vast majority of software datasets. Extracting the complete set of features (defectiveness included) and classifying the various instances requires effort and time.

In the literature, there are various approaches to building a prediction model on unlabelled datasets, and they entail a large number of time-consuming permutations. Cloud computing infrastructure, GPU-equipped resources and adequate ML frameworks offer a way to overcome this problem.

In this study, we present the usefulness of the Clustering, LAbeling, Metric selection, Instance selection (CLAMI) approach in high energy physics by applying it to an unlabelled Geant4 software dataset as a case study; by implementing models in different available frameworks, such as TensorFlow and Keras; and by running them in Java, Python and R. We intend to narrow the gap between theory and practice by describing the strengths and limitations of the considered frameworks, so that users can assess their suitability for their own requirements.