An Empirical Map of Feature Selection Algorithms

Feature selection (or model selection in more general terms) is one of the most critical — and perhaps one of the most opaque — components of the predictive workflow. Burnham and Anderson (2004) frame the problem in the familiar language of a bias-variance trade-off: on the one hand, a more parsimonious model has fewer parameters and hence reduces the risk of overfitting; on the other hand, more features increase the amount of information incorporated into the fitting process. How to select the appropriate features remains a matter of some debate, with an almost unmanageable host of different algorithms to navigate.

In this analysis, I throw the proverbial kitchen sink at a macroeconomic feature selection problem: from correlation filtering to Bayesian model averaging, from lasso regression to random forest importance, from genetic algorithms to Laplacian scores. The aim is to explore relationships and (dis)agreements among a multidisciplinary array of feature selection algorithms (23 in total), drawn from several of what Molnar (2022) calls “modeling mindsets”, and to examine the comparative robustness, breadth and out-of-sample relevance of the selected information.
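
To make this concrete, here is a minimal sketch of three of these selection mindsets in R. It uses mtcars as a toy stand-in for the macroeconomic data set, and the settings are illustrative rather than those used in the analysis:

```r
# Three feature selection mindsets on a toy data set (mtcars stands in
# for the macroeconomic panel; settings are illustrative only).
library(glmnet)        # lasso regression
library(randomForest)  # permutation importance

x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg

# 1. Correlation filtering: rank features by absolute Pearson correlation
cor_rank <- sort(abs(cor(x, y)[, 1]), decreasing = TRUE)

# 2. Lasso: keep features with nonzero coefficients at the CV-optimal lambda
cv_fit <- cv.glmnet(x, y, alpha = 1)
betas <- as.matrix(coef(cv_fit, s = "lambda.min"))
lasso_sel <- rownames(betas)[betas[, 1] != 0][-1]  # drop the intercept

# 3. Random forest: rank features by permutation importance (%IncMSE)
rf_fit <- randomForest(x, y, importance = TRUE)
rf_rank <- sort(importance(rf_fit)[, "%IncMSE"], decreasing = TRUE)
```

Each approach produces a ranking or subset of features, and it is the (dis)agreement among such outputs that the analysis maps.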

As it turns out, there are a few things to learn — particularly in the way the algorithms naturally partition into four distinct clusters. In the next section, I outline the key findings.

Inference in Neural Networks using an Explainable Parameter Encoder Network

A Parameter Encoder Neural Network (PENN) (Pfitzinger 2021) is an explainable machine learning technique that solves two problems associated with traditional XAI algorithms:

  1. It permits the calculation of local parameter distributions. Parameter distributions are often more interesting than feature contributions — particularly in economic and financial applications — since the parameters disentangle the effect from the observation (roughly speaking, the contribution is the demeaned product of effect and observation; see the sketch after this list).
  2. It solves a problem of biased contributions that is inherent to many traditional XAI algorithms. Precisely in the settings where neural networks are powerful — interactive, dependent processes — traditional XAI can be biased, because it attributes an effect to each feature independently.
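
As a rough formalization of the first point (my notation, not necessarily the paper's): let $\beta_j(x)$ denote the local parameter of feature $j$ at input $x$. The contribution of the feature is then the demeaned product of effect and observation,

$$\phi_j(x) = \beta_j(x)\,x_j - \mathbb{E}\left[\beta_j(X)\,X_j\right],$$

so the parameter function $\beta_j(\cdot)$ isolates the effect itself, while the contribution $\phi_j(\cdot)$ entangles it with the magnitude of the observation.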

By the end of the tutorial, I will have estimated highly nonlinear parameter functions for a simulated regression with three variables.

A GitHub version of the code can be found here.

tidyfit: Benchmarking regularized regression methods

This workflow demonstrates how tidyfit can be used to compare a large number of regularized regression methods in R with minimal effort. Using the Boston house prices data set, the analysis shows that Bayesian methods strongly outperform most alternatives.
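
The core of the setup looks roughly like the sketch below. Method identifiers and cross-validation arguments follow my reading of the tidyfit documentation and cover only a small subset of the methods compared in the post:

```r
# Rough sketch of the benchmarking setup (a small subset of methods).
library(tidyfit)

boston <- MASS::Boston

fit <- regress(
  boston, medv ~ .,
  m("lasso"), m("ridge"), m("enet"), m("bayes"),  # illustrative subset
  .cv = "vfold_cv", .cv_args = list(v = 10)       # 10-fold CV for tuning
)

coef(fit)  # tidy coefficient frame, one row per method and term
```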

Evolving Themescapes: Powerful Auto-ML for Thematic Investment with tidyfit

Recent years have been marked by an unusual degree of geopolitical upheaval and crisis. In this post, I explore how this period has changed the importance of different investment themes. Which trends have grown in importance? What can be discovered about evolving market priorities and the brave new world ahead?

To explore these questions, I draw on a data set of MSCI Thematic and Sector index returns and calculate the regression-based importance of each theme for each sector over time. The analytical workflow is typical of the quantitative finance setting, essentially requiring the estimation of a large number of linear regressions that provide orthogonal exposures to the different investment themes. Here the R package tidyfit (available on CRAN) can be extremely helpful, since it automates much of the machine learning pipeline for regularized regressions (Pfitzinger 2022).
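
Stripped down, the estimation step looks something like the following sketch. The data construction is a toy placeholder rather than the MSCI series, and the method and cross-validation identifiers follow my reading of the tidyfit documentation:

```r
# Toy stand-in for the MSCI panel: sector returns regressed on theme returns.
library(dplyr)
library(tidyfit)

set.seed(123)
returns <- tibble::tibble(
  sector       = rep(c("Energy", "Financials"), each = 60),
  theme_health = rnorm(120),
  theme_cyber  = rnorm(120),
  ret          = rnorm(120)   # sector index return
)

fit <- returns |>
  group_by(sector) |>              # one regression per sector
  regress(ret ~ .,
          m("enet"),               # elastic net, hyperparameters tuned by CV
          .cv = "rolling_origin")  # time-ordered CV for return data

coef(fit)  # theme betas by sector
```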

MSCI provides thematic equity indexes for 17 different themes, ranging from digital health and cybersecurity to millennials and future education. The following plot shows the average change in each theme's importance — measured as the change in the absolute standardized beta — from before the COVID-19 pandemic to after it. The regression betas are estimated using an elastic net regression (discussed below). A positive value suggests that the theme's importance has, on average, grown in recent years.

tidyfit: Extending the tidyverse with AutoML

tidyfit is an R package that facilitates and automates linear regression and classification modeling in a tidy environment. The package includes several methods, such as Lasso, PLS and ElasticNet regressions, and can be augmented with custom methods. tidyfit builds on the tidymodels suite, but emphasizes automated modeling with a focus on the linear regression and classification coefficients, which are the package's primary output.
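
A minimal usage sketch (method identifiers follow the tidyfit documentation; mtcars is simply a placeholder data set):

```r
library(tidyfit)

# Fit several regularized regressions side by side, with CV-based tuning
fit <- regress(mtcars, mpg ~ .,
               m("lasso"), m("plsr"), m("enet"),
               .cv = "vfold_cv")

coef(fit)             # tidy coefficients for each method
predict(fit, mtcars)  # stacked predictions, one block per method
```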