An Empirical Map of Feature Selection Algorithms

Feature selection (or model selection in more general terms) is a critical — and perhaps one of the most opaque — component of the predictive workflow.1 Burnham and Anderson (2004) frame the problem using the familiar language of a bias-variance trade-off: on the one hand, a more parsimonious model has fewer parameters and hence a lower risk of overfitting; on the other hand, more features increase the amount of information incorporated into the fitting process. How to select the appropriate features remains a matter of some debate, with an almost unmanageable host of different algorithms to navigate.

In this analysis, I throw the proverbial kitchen sink at a macroeconomic feature selection problem: from correlation filtering all the way to Bayesian model averaging, from lasso regression to random forest importance, from genetic algorithms to Laplacian scores. The aim is to explore relationships and (dis)agreements among a multidisciplinary array of feature selection algorithms (23 in total), drawn from several of what Molnar (2022) calls “modeling mindsets”, and to examine the comparative robustness, breadth and out-of-sample relevance of the selected information.

As it turns out, there are a few things to learn — particularly in the way algorithms are naturally partitioned into 4 distinct clusters. In the next section, I outline key findings.

Executive summary

Feature selection is not flashy and tends to attract less attention than the sophisticated estimation techniques to which it is often no more than a prelude. Nonetheless, the problem of selecting the best subset of features remains hugely complex, with an exponential search space and a vast number of potential solution algorithms to choose from. The multitude of feature selection algorithms proposed across various statistical sub-domains reflects the variety of philosophical approaches that can be taken when solving the problem. Some approaches emphasize speed, some aim for asymptotic consistency, some emphasize tests of statistical significance, others predictive performance. I have taken a set of 23 algorithms from different domains, representing distinct approaches to feature selection, and applied them to an empirical problem. An overview of the methods is provided below, while Fig. 1 visualizes an approximate relative popularity score for (a subset of) methods across different statistical domains:

Figure 1: Popularity of feature selection algorithms

Fig. 1 is based on the number of mentions in document titles on Google Scholar. The most popular feature selection algorithms are genetic algorithms (GA), lasso, correlation, random forest importance and information gain. However, the result varies across fields. While the lasso is hugely popular in econometrics, for instance, the field also favors some less well-known methods such as Bayesian model averaging (BMA) or general-to-specific (GETS). Partial least squares (PLSR) enjoys significant popularity in chemometrics, while social sciences in general tend to make use of a broader set of algorithms than core statistical disciplines.

In this analysis, each of the 23 algorithms is used to select features from 100 bootstrap samples of the publicly available FRED-MD US macroeconomic data set, with the objective of identifying leading indicators for 4 different target variables from a set of 134 candidate features. Based on the resulting models, I measure the proximity in the selection behavior of the algorithms using a metric that can account for the correlated nature of the underlying data (see below). Then I project the result onto a two-dimensional graph (upper left panel of Fig. 2), with each point (node) in the graph representing one bootstrap realization of the respective algorithm, and with adjacent nodes selecting similar information.

Figure 2: Feature Selection Map with target variable: US CPI inflation (CPIAUCSL), and \(k\leq 30\)

The selection map broadly comprises 4 clusters of algorithms that can be observed again and again across different target variables and model sizes (\(k\)):

  1. Penalized regressions, including (adaptive) lasso, elastic net and gradient boosting regressions (correlation, MRMR and random forest importance (RF) sometimes hover in the vicinity);
  2. Model sampling methods, including the three Bayesian techniques, forward selection and the genetic algorithm (GA);
  3. Backward selection, including recursive feature elimination (RFE), GETS, and — somewhat less intuitively — PLSR;
  4. Peripheral methods that produce very different models including, for instance, RReliefF and the unsupervised (Laplacian, FOS-MOD) or semi-supervised (PCR) algorithms.

Grouping along these 4 clusters appears to be highly robust and suggests a type of fundamental affinity between these methods that is not always obvious.

Fig. 2 displays two additional metrics: the out-of-bootstrap performance rank and the diversity of selected information. The latter measures the effective percentage of principal components contained in the selected model (i.e. how many orthogonal information sources the model taps into). For both measures, a higher value is favorable. Two important observations are (i) that the penalized regression methods perform consistently well on both metrics, and (ii) that there is a distinct positive correlation between the measures. If a feature selection algorithm selects features that represent a wide variety of distinct information sources, it is more likely to rank higher on out-of-sample performance.

Reproducing the above plots for different target variables and different model sizes (\(k\)) reveals remarkable robustness of the results, as shown in the results section. For the economically interested reader, that section also discusses which leading indicators are ultimately identified as the most predictive features for each target. Before examining these results, however, the following sections provide an overview of the algorithms, methodology and data used in this analysis.

Feature selection algorithms

Feature selection is a much discussed topic in several statistical domains and the landscape of available methods is accordingly complex and multifaceted (see Li et al. (2018) for a decent overview). Generally speaking, feature selection aims to strike a balance between the desire to build a parsimonious model for prediction or inference and the desire to leverage as much of the available information as possible. The data set used in this analysis, for instance, contains 134 monthly time series from which we would like to select the \(k \leq 30\) features with the highest predictive power. This selection problem can be solved using a wide variety of algorithms, leading to a variety of competing candidate models.

In order to develop a sense of the relative popularity of different feature selection algorithms, I decided to count the number of publications available on Google Scholar based on queries that use keywords relating to different algorithms in the context of feature selection as well as the statistical domain.2 Fig. 3 plots the result in the form of an approximate popularity score:

Figure 3: Popularity of feature selection algorithms

The plot reveals some similarities and some disparities. The most popular methods (genetic algorithms, correlation, information gain, lasso, random forest) are consistently applied across many domains. Conversely, some fields, such as econometrics or chemometrics, chart a slightly different course. Of the categories of feature selection algorithms discussed below, none seems to dominate outright.

Apart from research domains, algorithms can be grouped along several additional axes, as summarized in the tables below. One of the most generally accepted classifications distinguishes between filter, wrapper and embedded methods (Kumar 2014; Galli 2022). However, as with most statistical learning techniques, there are distinctions between supervised and unsupervised methods, between different types of objective functions, asymptotic properties, complexity and many other attributes.

Algorithms included in the analysis

In this article, I will not discuss the individual methods in great detail — a lot has been written on each (see the list of references, or the descriptions in the tables below). Instead, the aim is to evaluate a multidisciplinary subset of methods and explore possible similarities. Note that — as mentioned — the methods analyzed here represent a mere subset. Unlike many comparable studies (e.g. Porwal and Raftery (2022)), I have emphasized diversity, attempting to cover as many of the broader classifications as possible without getting bogged down in the proliferation of variants associated with each individual method. Furthermore, the scope of the analysis is limited to the regression case, and algorithms used exclusively in the context of discrete data analysis are omitted.

Filter methods

Filter methods are model-agnostic and encompass the simplest (and typically fastest) group of algorithms. Each feature is individually compared to the target, the degree of correspondence is scored, and the \(k\) features with the highest scores are chosen.
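To make this concrete before turning to the individual methods in Table 1, the following is a minimal sketch of a correlation filter. It assumes a data.frame df whose column y holds the target; the function name and defaults are illustrative only.

# Rank features by absolute Pearson correlation with the target and keep the top k
select_by_correlation <- function(df, target = "y", k = 30) {
  X <- df[, setdiff(names(df), target), drop = FALSE]
  scores <- sapply(X, function(x) abs(cor(x, df[[target]], use = "complete.obs")))
  names(sort(scores, decreasing = TRUE))[seq_len(min(k, length(scores)))]
}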

Table 1: Overview of feature selection algorithms: Filter methods

| # | Method | Description | Objective | Model Size | R-Package |
|---|--------|-------------|-----------|------------|-----------|
| 1 | Pearson Correlation Coefficient | Linear Gaussian comparison of the target and each feature. The most correlated features are selected. | best fit | exogenous | stats |
| 2 | Information Gain (Mutual Information) | Nonlinear nonparametric comparison of the target and each feature based on information theory. | best (nonlinear) fit | exogenous | CORElearn |
| 3 | RReliefF | Selects features that best separate the target space using a nearest neighbors approach. The method accounts for feature interactions. See Robnik-Šikonja and Kononenko (2003). | best (nonlinear) fit | exogenous | CORElearn |
| 4 | Laplacian Score | Unsupervised method that uses a nearest neighbors approach to select features based on a distance graph. See He et al. (2005). | unsupervised | exogenous | Rdimtools |
Wrapper methods

Wrapper methods fit a model in a sequential manner, eliminating or adding features based on some criterion of model fit or predictive accuracy. The algorithms may begin with a single feature and add features to the model; they may begin with all features and reduce the model by removing the feature with the lowest explanatory content; or they may use some mixture of the two.
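To illustrate the wrapper logic before the full list in Table 2, the sketch below runs a forward selection with the AIC using base R's step(). The analysis itself relies on the leaps package, so this is a simplified stand-in rather than the exact procedure; a data.frame df with target column y is assumed.

# Forward selection: start from the intercept-only model and add the feature
# that most improves the AIC until no further improvement is possible
null_model <- lm(y ~ 1, data = df)
full_model <- lm(y ~ ., data = df)
fs_fit <- step(null_model, scope = formula(full_model),
               direction = "forward", trace = FALSE)
selected <- attr(terms(fs_fit), "term.labels")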

Table 2: Overview of feature selection algorithms: Wrapper methods

| # | Method | Description | Objective | Model Size | R-Packages |
|---|--------|-------------|-----------|------------|------------|
| 1 | Recursive Feature Elimination (RFE) using AIC | Begins with a large model and sequentially removes the least important feature until the Akaike information criterion (AIC) is no longer improved. | best fit | endogenous | leaps |
| 2 | Recursive Feature Elimination (RFE) using LOO-CV | Begins with a large model and sequentially removes the least important feature until the leave-one-out validation error is no longer improved. | best prediction | endogenous | leaps |
| 3 | General-to-specific (GETS) | A recursive feature elimination algorithm widely used in the econometric setting that removes features based on multiple hypothesis tests (Pretis et al. 2018). | best fit | endogenous | gets |
| 4 | Forward Selection (FS) using AIC | Begins with a single-feature model and sequentially adds the next best feature until the Akaike information criterion (AIC) is no longer improved. | best fit | endogenous | leaps |
| 5 | Forward Selection (FS) using LOO-CV | Begins with a single-feature model and sequentially adds the next best feature until the leave-one-out validation error is no longer improved. | best prediction | endogenous | leaps |
| 6 | Sequential Replacement | Attempts to replace each feature in a model with another to improve fit. This proceeds sequentially until the fit can no longer be improved (Miller 2002). | best fit | endogenous | leaps |
| 7 | Genetic Algorithm (GA) | Evolutionary algorithm that begins with a ‘population’ of randomly sampled models, evaluates their ‘fitness’ using maximum likelihood and iteratively spawns new ‘generations’ of models by mixing the fittest parent models. | best fit | endogenous | gaselect |
| 8 | Minimum Redundancy, Maximum Relevance (MRMR) | Selects features using dual criteria of maximizing relevance with respect to the target while minimizing redundancy with respect to the model (Ding & Peng 2005). | best fit | exogenous | mRMRe |
| 9 | Forward Orthogonal Search (FOS-MOD) | An unsupervised algorithm that selects a desired number of features in a forward manner by ranking the features using the squared correlation coefficient and sequential orthogonalization. See Wei and Billings (2007). | unsupervised | exogenous | leaps |
Embedded methods

Embedded methods combine model selection and estimation into a single step, for instance by forcing a subset of the parameter weights to be exactly zero. Unlike wrapper methods, this approach is not subject to path dependence. While the results are often superior, estimation is typically more computationally costly than for filter or wrapper methods.
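The sketch below illustrates the embedded principle with a lasso fit via glmnet: features whose coefficients are shrunk exactly to zero are dropped. For brevity it uses cv.glmnet's default k-fold cross-validation, whereas the analysis tunes the penalty with a time series block-CV; a data.frame df with target column y is again assumed.

library(glmnet)

# Fit the lasso path and pick the penalty by cross-validation
X <- as.matrix(df[, setdiff(names(df), "y")])
cv_fit <- cv.glmnet(X, df$y, alpha = 1)            # alpha = 1 -> pure L1 penalty

# Selected features are those with nonzero coefficients at the chosen penalty
coefs <- as.matrix(coef(cv_fit, s = "lambda.min"))
selected <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")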

Table 3: Overview of feature selection algorithms: Embedded methods

| # | Method | Description | Objective | Model Size | R-Packages |
|---|--------|-------------|-----------|------------|------------|
| 1 | Random Forest (RF) Importance | Flexible nonlinear estimation based on decision trees and feature selection using the highest overall feature importance (based on MSE reduction). | combined* | exogenous | randomForest |
| 2 | L1-Penalty (Lasso) | Lasso regression of Tibshirani (1996). Several variants exist, such as grouped or exclusive lasso, but have not been included here. Hyperparameters are determined using time series block-CV. | combined* | endogenous | glmnet |
| 3 | Adaptive Lasso | Weighted version of the lasso that ensures ‘oracle’ properties (Zou 2006). Hyperparameters are determined using time series block-CV. | combined* | endogenous | glmnet |
| 4 | L1-L2 Penalty (ElasticNet) | Combined L1-L2 penalty to induce sparsity and feature grouping. See Zou and Hastie (2005). Hyperparameters (including the L1-L2 mixing parameter) are determined using time series block-CV. | combined* | endogenous | glmnet |
| 5 | Bayesian Lasso | Bayesian Lasso regression of Park and Casella (2008). | best fit | endogenous | monomvn |
| 6 | Bayesian Model Averaging (BMA) | Bayesian MCMC sampling of regression models to calculate a posterior inclusion probability for each feature. See Sala-I-Martin et al. (2004). | best fit | endogenous | BMS |
| 7 | Bayesian Spike & Slab Regression | Bayesian MCMC algorithm for linear regression models with a ‘spike-and-slab’ prior that places some amount of posterior probability at zero for a subset of the regression coefficients. See George and McCulloch (1997). | best fit | endogenous | BoomSpikeSlab |
| 8 | Gradient Boosting Regression | Machine learning method that iteratively fits a linear regression to reduce the error in each step. | combined* | endogenous | mboost |
| 9 | Partial Least Squares Regression (PLSR) | Latent factor method described in Martens (2001). Feature selection is performed using jackknife t-statistics. See Mehmood et al. (2012) for a review of feature selection methods in factor models. Optimal number of components determined using time series block-CV. | combined* | exogenous | pls |
| 10 | Principal Components Regression (PCR) | Two-stage semi-supervised approach that calculates principal components in the first stage, and regresses the target on the components in the second stage. Feature selection is performed using jackknife t-statistics. See Mehmood et al. (2012) for a review of feature selection methods in factor models. Optimal number of components determined using time series block-CV. | combined* | exogenous | pls |

Notes:
* Combines best fit for model parameters and best prediction for hyperparameters.
The PLSR and PCR are not strictly feature selection techniques, but rather latent factor methods. They are included here for two reasons: firstly, they can be used for feature selection, for instance, by filtering on jackknife t-statistics; secondly, the methods are commonly used for feature selection in some domains.


Hyperparameter tuning and automated workflows

Several of the methods listed in the above tables have hyperparameters (e.g. lasso, elastic net, gradient boosting regression, PLSR). The optimal hyperparameter settings are determined using a time series block-CV procedure, as described in Racine (2000). The data set is divided into 10 validation blocks that account for the temporal ordering of the data, and hyperparameters are selected by minimizing the validation mean-squared error using a standard grid search. Note that the blocks are constructed in such a way as to ensure that no data leakage (overlap) occurs on account of bootstrap re-sampling.
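As a rough sketch of what such a procedure can look like (the exact block construction and grids used in the analysis are not reproduced here), the data can be cut into 10 contiguous, time-ordered blocks and the penalty of a lasso chosen by grid search over the blocked validation error. The target y is assumed to sit in the first column of df.

library(glmnet)

# 10 contiguous validation blocks that respect the temporal ordering
n <- nrow(df)
block_id <- cut(seq_len(n), breaks = 10, labels = FALSE)

lambda_grid <- 10^seq(-4, 0, length.out = 25)      # illustrative grid
cv_mse <- sapply(lambda_grid, function(lambda) {
  mean(sapply(1:10, function(b) {
    train <- df[block_id != b, ]
    valid <- df[block_id == b, ]
    fit <- glmnet(as.matrix(train[, -1]), train$y, alpha = 1, lambda = lambda)
    mean((valid$y - predict(fit, as.matrix(valid[, -1])))^2)
  }))
})
best_lambda <- lambda_grid[which.min(cv_mse)]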

The workflow developed in this article makes extensive use of tidyfit — an R package that streamlines regression and classification workflows with tidy data. The package provides efficient modelling wrappers for all of the supervised feature selection algorithms explored in this article, handles hyperparameter optimization, and makes working with multiple target variables almost trivially simple.

install.packages("tidyfit")
library(tidyfit)

# This is useful to show progress bars
# progressr::handlers(global = TRUE)

# Usage example for tidyfit::regress
data %>% 
  group_by(target_variable_name) %>% 
  regress(target_variable_value ~ .,   # Model specification
          m("lasso"),                  # Lasso regression (default hyperparameter grid)
          m("cor"),                    # Correlation
          m("relief"),                 # RReliefF
          .cv = "vfold_cv")

Code for this analysis is made available on Github.

Model comparison with correlated features

Up to this point, I have discussed how models are selected from a large number of potential feature candidates using a variety of different algorithms. Statements about the relative behavior of models, however, are only possible if a systematic framework of model comparison is available. Model comparisons could be done indirectly (e.g. examining the out-of-sample performance of the models) or directly (e.g. counting the number of overlapping features selected). Direct comparisons are arguably more informative (two models could perform equally poorly for vastly different reasons), but are complicated by correlation in the underlying data set.

To understand this, let’s begin by formalizing a notation for direct comparisons of candidate models. Suppose feature selection algorithm \(a \in \mathcal{A}\) generates a selection vector \(\mathbf{m}_a = \{m^{i}_a\}_{i\in 1,...,p}\), where

\[ m^i_a = \begin{cases} 1 \;\; \text{when feature $i$ is selected}\\ 0 \;\; \text{otherwise} \end{cases}, \] and \(p\) is the total number of available features.

We’ll measure the correlation of two selection vectors \(\mathbf{m}_a\) and \(\mathbf{m}_b\), with \(a,b\in\mathcal{A}\), using the percentage overlap \(\mathcal{J}(a,b)\) (Jaccard index), which ranges between 0 and 1 and can be defined as

\[ \mathcal{J}(a, b) = \frac{\sum \mathbf{m}_{a} \mathbf{m}_{b}}{\sqrt{\sum \mathbf{m}_{a} \sum \mathbf{m}_{b}}}. \]
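In code, the overlap measure is a one-liner, with m_a and m_b being 0/1 selection vectors of length p:

# Percentage overlap of two 0/1 selection vectors, as defined above
overlap <- function(m_a, m_b) sum(m_a * m_b) / sqrt(sum(m_a) * sum(m_b))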

When features are correlated, however, this measure begins to deteriorate, since different algorithms deal differently with multicollinearity.

Consider the case of two highly correlated features \(x_1\) and \(x_2\). A lasso regression may (somewhat arbitrarily) select only \(x_1\), while a correlation filtering approach may select both \(x_1\) and \(x_2\). Both algorithms have selected virtually the same information, yet receive an unjustifiably low value of \(\mathcal{J}(\cdot)\). To address this issue, I cluster the features and transform \(\mathbf{m}_a\) so that

\[ m^i_a = \begin{cases} 1 \;\; \text{when any feature from the same cluster as $i$ is selected}\\ 0 \;\; \text{otherwise} \end{cases}. \]

Calculating \(\mathcal{J}_g(a,b)\) using the revised definition of \(\mathbf{m}_a\) and \(\mathbf{m}_b\) leads to a similarity curve over the number of clusters (\(g\)). I use complete-linkage hierarchical clustering of the correlation matrix to generate \(p-1\) potential cluster vectors, and then calculate \(\mathcal{J}_g(a,b)\) for all \(g\in 1,...,p-1, p\) (here \(g=p\) is simply the unclustered case).
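A possible implementation of the clustered version is sketched below. Using 1 minus the absolute correlation as the distance for the complete-linkage clustering is an assumption on my part; X denotes the n x p feature matrix and m_a, m_b the selection vectors.

# Complete-linkage clustering of the feature correlation structure
hc <- hclust(as.dist(1 - abs(cor(X))), method = "complete")
p <- ncol(X)

# Similarity curve over the number of clusters g (g = p is the unclustered case)
similarity_curve <- sapply(1:p, function(g) {
  cl <- cutree(hc, k = g)
  ma_g <- ave(m_a, cl, FUN = max)   # 1 if any feature in the same cluster is selected
  mb_g <- ave(m_b, cl, FUN = max)
  sum(ma_g * mb_g) / sqrt(sum(ma_g) * sum(mb_g))
})

# Total similarity = area under the curve (used further below)
total_similarity <- sum(similarity_curve)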

Fig. 4 illustrates the concept by plotting \(\mathcal{J}_g(\text{Lasso}, \text{Grad. Boost})\) (left) and \(\mathcal{J}_g(\text{Lasso}, \text{RReliefF})\) (right) over the number of clusters. Lasso and gradient boosting select similar models, leading to high scores even when the number of clusters is high. The RReliefF algorithm, conversely, selects a highly distinct model. However, as the granularity of the clustering is decreased, the selection profiles of the two algorithms converge.

Figure 4: Similarity between Lasso regression and alternative feature selection algorithms

The total similarity between two methods is now defined as the area under the above Jaccard index curve, \(\sum_g\mathcal{J}_g(a,b)\). This ensures that while RReliefF has a comparatively low similarity, the overall score is increased by the informational overlap revealed when examining clusters of correlated features in a joint manner.

Macroeconomic data set

The empirical problem addressed in this analysis is the selection of leading indicators from a large macroeconomic data set (FRED-MD). The FRED-MD data set contains 134 variables characterizing the US macro-economy with a monthly frequency and values (across all features) since the early 1990s. The data set is often used in academic research — primarily for the development of high-dimensional forecasting and nowcasting techniques. All variables have conveniently been transformed to ensure stationarity. A description of the data as well as transformations can be found in McCracken and Ng (2016).

The aim of the analysis is to select \(k \leq 10\), \(k \leq 20\) and \(k \leq 30\) features that can be used to forecast 4 different economic indicators one month ahead. The model size is chosen ex ante because not all feature selection algorithms can estimate \(k\) endogenously. For those few algorithms that do not permit an exact ex ante determination of model size, appropriate measures are used to select the most important features whenever the selected features exceed \(k\).3

The economic indicators that serve as targets are displayed in Table 4. I chose these specific indicators to cover three scenarios: (i) the broad case, with industrial production driven by many different economic factors, (ii) the narrow case, with unemployment and inflation depending to a large degree on labor market and monetary factors, respectively, and (iii) the noisy case, with the stock market not yielding well to any form of forecasting.

Table 4: Target variables used in feature selection analysis

| Target | Description |
|--------|-------------|
| INDPRO | IP Index |
| CPIAUCSL | CPI: All Items |
| UNRATE | Civilian Unemployment Rate |
| S&P 500 | S&P's Common Stock Price Index: Composite |

Since the aim is to forecast the target variables, they are shifted forward by one month. The target variables are also standardized.4 It is important to be aware that the context of the data is very likely to matter. The data set comprises monthly macroeconomic time series, and the core findings discussed in the following sub-sections may well fail to apply in other domains where the structure and statistical properties of the data are different.
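Concretely, the target construction amounts to something like the following sketch, which assumes a monthly data.frame data ordered by a date column, here with CPIAUCSL as the target:

library(dplyr)

data <- data %>%
  arrange(date) %>%
  mutate(target = as.numeric(scale(lead(CPIAUCSL, 1)))) %>%  # one-month-ahead, standardized
  filter(!is.na(target))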

Selection is performed on \(100\) bootstrap samples of the original data set, in order to explore not only algorithm but also sample variation in the selected models. Letting \(\mathbf{x}\) be the data set, \(\mathbf{x}_{b\in\mathcal{B}}\) represents one bootstrap sample with \(\mathcal{B} = \{1,...,100\}\).
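The resampling step itself is simple; the sketch below draws 100 plain bootstrap samples of the rows (the exact resampling scheme used in the analysis may differ in its details):

set.seed(42)  # illustrative seed
bootstrap_samples <- lapply(1:100, function(b) {
  data[sample(nrow(data), replace = TRUE), ]
})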

Models are selected using each of the four respective targets from the entire data set of 134 economic indicators and using each of the 23 feature selection algorithms, with \(\mathcal{A}\) denoting the algorithm space. Thus, ultimately I generate \(|\mathcal{A}|\cdot |\mathcal{B}| = 2300\) selection vectors for each target.

Results

Perhaps unsurprisingly, feature selection is very much an art as well as a science, with the 23 algorithms producing just as many distinct (often substantially so) models. In this section, I nonetheless attempt to distill some robust facts from the feature selection maps produced for all four target variables and three different settings of \(k\). I look at a direct comparison of the algorithms using feature selection maps, at indirect comparisons of performance and informational diversity, and finally at the actual leading indicators selected by the models.

Feature selection maps

The feature selection map displayed below is a graphical representation of the similarity matrix generated by applying \(\mathcal{J}_g(\cdot)\) to all pairwise combinations of selection vectors. Each point (node) in the graph is one of 2300 selection vectors, while the distance between any two points approximately reflects their similarity \(\mathcal{J}_g(\cdot)\) (approximate because the graph is projected onto the two-dimensional space). Within-method variation reflects the bootstrap samples, while between-method variation reflects differences in the selection algorithms. In order to improve two-dimensional visualization, between-method edges are filtered using a 75% quantile threshold, removing low similarity values.
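Under stated assumptions (igraph with a force-directed layout; the original projection may use a different layout algorithm), the map construction can be sketched as follows, with S denoting the 2300 x 2300 pairwise similarity matrix:

library(igraph)

# Drop low-similarity edges using a 75% quantile threshold
# (in the analysis, only between-method edges are filtered)
threshold <- quantile(S[upper.tri(S)], 0.75)
S[S < threshold] <- 0

# Build the weighted graph and project it into two dimensions
g <- graph_from_adjacency_matrix(S, mode = "undirected", weighted = TRUE, diag = FALSE)
layout_xy <- layout_with_fr(g, weights = E(g)$weight)   # Fruchterman-Reingold layout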

The selection map uncovers 4 broad clusters of algorithms that can be observed again and again across different target variables and model sizes (\(k\)):

  1. Penalized regressions, including (adaptive) lasso, elastic net and gradient boosting regressions (correlation, MRMR and random forest importance (RF) sometimes hover in the vicinity);
  2. Model sampling methods, including the three Bayesian techniques, forward selection and the genetic algorithm (GA);
  3. Backward selection, including recursive feature elimination (RFE), GETS, and — somewhat less intuitively — PLSR;
  4. Peripheral methods that produce very different models including, for instance, RReliefF and the unsupervised (Laplacian, FOS-MOD) or semi-supervised (PCR) algorithms.

Grouping along these 4 clusters appears to be highly robust and suggests a type of fundamental affinity between these methods that is not obvious in all cases. The plots below show all feature selection maps by target variable and model size. While there is some variation, the above four clusters appear to emerge more often than not:


Some interesting additional observations include the proximity (often, not always) of MRMR and RF, or of the Bayesian methods, which have less in common than one might initially assume. Furthermore, correlation filtering is (often, not always) located close to the methodologically quite distinct cluster of penalized regressions, while the latent factor methods PLSR and PCR tend to produce very different results. The algorithms that produce the most idiosyncratic models are the unsupervised algorithms (FOS-MOD, Laplacian), as well as RReliefF and information gain (both of which are more commonly applied in the context of classification problems).

Performance and diversity

The plots above include two additional metrics of indirect model comparison. The first is the out-of-bootstrap performance rank, which is the inverse rank of the mean squared error (MSE) over the observations not included in each bootstrap sample. The second is the diversity of selected information, which is the effective percentage of principal components contained in the selected model (i.e. how many orthogonal information sources the model taps into). The metric is calculated by multiplying the selection vector with the PCA rotation matrix and calculating an entropy score of the resulting rotated vector, which is equal to zero when only one principal component is loaded and equal to one when all principal components have the same absolute loading.
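The diversity calculation described above can be sketched as follows; the normalization by log(p), which maps the entropy to the [0, 1] range, is my assumption, with X the feature matrix and m a 0/1 selection vector:

# Project the selection vector through the PCA rotation and compute a
# normalized entropy of the absolute component loadings
rotation <- prcomp(X, scale. = TRUE)$rotation
loadings <- abs(as.numeric(m %*% rotation))
w <- loadings / sum(loadings)
diversity <- -sum(w[w > 0] * log(w[w > 0])) / log(length(w))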

Fig. 5 summarizes the results for these two indirect metrics across all targets and model sizes. The result is a striking positive relationship: a higher degree of information diversity is associated with better out-of-sample performance. Furthermore, the algorithms that generally perform best on both metrics are those contained in the cluster of penalized regressions, while those in the backward selection cluster perform worst.