Comparison of variable selection methods in predictive models applied to near-infrared and genomic data
Datasets in many research areas are both high-dimensional and multicollinear. Although existing methods are efficient for fitting a complete model, it is often necessary to select the most important explanatory variables to obtain a more parsimonious model. We built and evaluated models using three variable selection methods applied to single nucleotide polymorphism (SNP) marker data and near-infrared (NIR) spectroscopy data, and assessed the change in prediction quality relative to models fitted to the complete data. The methods were ordered predictors selection combined with partial least squares regression (PLS-OPS), sparse partial least squares regression (SPLS), and Supervised BLasso, an adaptation of the Bayesian Lasso (BLasso) for variable selection. We used simulated datasets evaluated in two scenarios and three real datasets: one SNP dataset and two NIR datasets. The predictive quality of each model was evaluated by the mean correlation coefficient between predicted and observed values and by the root mean squared error. For the simulated data in the first scenario, the predictive capacity of the models after variable selection was similar to that of the complete data model. In the second scenario, the models performed better on average after variable selection, with SPLS outperforming the other methods. In the real SNP dataset, PLS-OPS performed well, attesting to the usefulness of this method for this kind of data. In the NIR datasets, the predictive quality of the models after variable selection was close to that obtained with the complete data. In general, models built with the selection methods maintained good predictive capacity and became simpler because of the considerable reduction in the number of variables.
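
The two predictive-quality criteria named above can be illustrated with a minimal sketch. It assumes the correlation is Pearson's and that both criteria are averaged over validation folds or replicates; neither detail is stated in the abstract. The function names (`pearson_r`, `rmse`, `summarize`) and the toy values are hypothetical and are not taken from the study.

```python
import numpy as np

def pearson_r(y_true, y_pred):
    """Pearson correlation between observed and predicted values (assumed metric)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def summarize(folds):
    """Average both criteria over (y_true, y_pred) pairs, one pair per fold.

    Averaging over folds is an assumption made for illustration.
    """
    rs = [pearson_r(t, p) for t, p in folds]
    errors = [rmse(t, p) for t, p in folds]
    return float(np.mean(rs)), float(np.mean(errors))

# Toy values, not data from the study.
folds = [
    ([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]),  # perfect predictions
    ([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]),  # constant offset of 1
]
mean_r, mean_rmse = summarize(folds)
print(mean_r, mean_rmse)
```

The second toy fold shows why the two criteria are reported together: a constant offset leaves the correlation at 1 while the root mean squared error rises to 1.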