TY - JOUR
T1 - An empirical study of software change classification with imbalance data-handling methods
AU - Zhu, Xiaoyan
AU - Niu, Binbin
AU - Whitehead, E. James
AU - Sun, Zhongbin
N1 - Publisher Copyright: © 2018 John Wiley & Sons, Ltd.
PY - 2018/11
Y1 - 2018/11
N2 - Bug prediction in software code changes can help developers find and fix bugs as soon as they are introduced, thereby improving the effectiveness and validity of bug fixing. In data mining, this problem can be regarded as a change classification task. However, one of its key characteristics, i.e., class imbalance, holds back the performance of standard classification methods. In this paper, we consider a number of imbalance data-handling methods and extract a more comprehensive group of change features, aiming to achieve better change classification performance. Two different types of imbalance data-handling methods, namely, resampling and ensemble learning methods, are employed. In particular, we explore the performance of their combination. To compare the performance of different imbalance data-handling methods, an experiment with 10 open source projects is conducted. Four classification methods, namely J48, Naïve Bayes, SMO, and Random Forest, are used as standard classifiers and as the base classifiers, respectively. Moreover, the contribution of different groups of change features is evaluated. Experimental results show that imbalance data-handling methods can improve the performance of change classification and that the combination methods, which take advantage of both ensemble learning and resampling, perform better than ensemble learning methods or resampling methods used individually. Of the studied imbalance data-handling methods, the combination of Bagging and random undersampling with J48 as the base classifier yields better prediction results than those achieved by other methods. Additionally, of the collected change features, text vector features account for a larger proportion than others.
AB - Bug prediction in software code changes can help developers find and fix bugs as soon as they are introduced, thereby improving the effectiveness and validity of bug fixing. In data mining, this problem can be regarded as a change classification task. However, one of its key characteristics, i.e., class imbalance, holds back the performance of standard classification methods. In this paper, we consider a number of imbalance data-handling methods and extract a more comprehensive group of change features, aiming to achieve better change classification performance. Two different types of imbalance data-handling methods, namely, resampling and ensemble learning methods, are employed. In particular, we explore the performance of their combination. To compare the performance of different imbalance data-handling methods, an experiment with 10 open source projects is conducted. Four classification methods, namely J48, Naïve Bayes, SMO, and Random Forest, are used as standard classifiers and as the base classifiers, respectively. Moreover, the contribution of different groups of change features is evaluated. Experimental results show that imbalance data-handling methods can improve the performance of change classification and that the combination methods, which take advantage of both ensemble learning and resampling, perform better than ensemble learning methods or resampling methods used individually. Of the studied imbalance data-handling methods, the combination of Bagging and random undersampling with J48 as the base classifier yields better prediction results than those achieved by other methods. Additionally, of the collected change features, text vector features account for a larger proportion than others.
KW - bug prediction
KW - change classification
KW - ensemble learning
KW - imbalance data
KW - resampling
UR - https://www.scopus.com/pages/publications/85054832061
U2 - 10.1002/spe.2606
DO - 10.1002/spe.2606
M3 - Article
AN - SCOPUS:85054832061
SN - 0038-0644
VL - 48
SP - 1968
EP - 1999
JO - Software - Practice and Experience
JF - Software - Practice and Experience
IS - 11
ER -