Randomizing outputs to increase variable selection accuracy

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

Variable selection plays a key role in explanatory modeling; its aim is to identify the variables that are truly important to the outcome. Recently, ensemble learning techniques have shown great potential for improving the performance of traditional methods such as the lasso, genetic algorithms, and stepwise search. Following the main principle for building a variable selection ensemble, we propose in this paper a novel approach that randomizes the outputs (i.e., adds random noise to the response) to increase variable selection accuracy. To generate multiple, slightly different importance measures for each variable, Gaussian noise is artificially added to the response. The new training set (i.e., the original design matrix together with the new response vector) is then fed into a genetic algorithm to perform variable selection. By repeating this process for a number of trials and fusing the results by simple averaging, a more reliable importance measure is obtained for each candidate variable. The variables are then ranked, and a thresholding rule determines which of them are important. The performance of the proposed method is studied with simulated and real-world data in the framework of linear and logistic regression models. The results demonstrate that it compares favorably with several other existing methods.
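The ensemble procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: plain least-squares coefficient magnitudes stand in for the genetic-algorithm selector, and the noise scale, trial count, and mean-importance threshold are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated linear-regression data: 5 of 20 predictors are truly important.
n, p, B = 200, 20, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0
y = X @ beta + rng.standard_normal(n)

# Output randomization: perturb the response with Gaussian noise, re-run
# the base selector, and average the resulting importance measures.
# (Absolute least-squares coefficients serve here as a stand-in importance
# measure; the paper uses a genetic algorithm as the base selector.)
sigma = 0.5 * np.std(y)  # noise scale -- a hypothetical choice
importances = np.zeros(p)
for _ in range(B):
    y_noisy = y + rng.normal(0.0, sigma, size=n)
    coef, *_ = np.linalg.lstsq(X, y_noisy, rcond=None)
    importances += np.abs(coef)
importances /= B

# Thresholding rule: keep variables whose averaged importance exceeds
# the mean importance (a simple stand-in for the paper's rule).
selected = np.flatnonzero(importances > importances.mean())
print("selected variables:", selected)
```

With the signal strength used above, the averaged importances of the five true predictors clearly separate from the noise variables, so the simple mean-importance threshold recovers them.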

Original language: English
Pages (from-to): 91-102
Number of pages: 12
Journal: Neurocomputing
Volume: 218
DOIs
State: Published - 19 Dec 2016

Keywords

  • Ensemble learning
  • Genetic algorithm
  • Output smearing
  • Selection accuracy
  • Variable selection
  • Variable selection ensemble
