Deep Learning Autoencoders at Sawtooth Research Conference, New Orleans LA
May 7-9, 2025
Be sure to join me at the Sawtooth Software Research Conference for my presentation, “Better Segmentation Results With Deep Learning”. A brief abstract is given below:
Feature engineering is widely recognized as a critical component of supervised learning and is often considered more crucial than the choice of algorithm itself. Its significance in unsupervised learning, however, is less frequently acknowledged. This presentation explores the essential role of feature engineering in unsupervised learning and introduces advanced methods to address the challenges unique to this area.
Common feature engineering tasks for unsupervised learning include:
- standardization of numeric features,
- ensuring features are approximately normally distributed,
- scaling numeric features to a common range,
- encoding categorical features (e.g., one-hot encoding, target encoding) for algorithms that require numeric input such as k-means, and
- handling missing values.
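The presentation implements these steps with R tidyverse workflows; as a language-agnostic illustration, the sketch below shows standardization, min-max scaling, and one-hot encoding in Python with NumPy. The data values and column names are invented for the example.

```python
import numpy as np

# Toy data: one numeric feature (income) and one categorical feature
# (region). Values and names are illustrative only.
income = np.array([35_000.0, 52_000.0, 48_000.0, 110_000.0])
region = np.array(["north", "south", "north", "east"])

# Standardize the numeric feature to mean 0, standard deviation 1
# (distance-based algorithms such as k-means are scale-sensitive).
income_z = (income - income.mean()) / income.std()

# Alternatively, scale to a common [0, 1] range.
income_01 = (income - income.min()) / (income.max() - income.min())

# One-hot encode the categorical feature so that algorithms requiring
# numeric input can consume it.
levels = np.unique(region)                       # sorted unique levels
one_hot = (region[:, None] == levels).astype(float)

print(np.round(income_z, 2))
print(one_hot)
```

Each row of `one_hot` contains a single 1 marking that observation's category, so the encoded columns can sit alongside the scaled numeric features in one input matrix.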
These tasks are essential to preparing data for effective unsupervised learning. A challenge unique to unsupervised learning is high data dimensionality, which can lead to poor-quality partitions. In high-dimensional spaces, objects may appear equidistant, resulting in segments that differ little from one another, which in turn leads to predictive models with unacceptably high error rates. High dimensionality often arises from redundant measures of underlying constructs, necessitating methods to detect and resolve these redundancies. Methods commonly used to detect redundant measures include:
- correlation matrix visualization via heat maps,
- variable clustering to group similar variables,
- mutual information to measure shared information between variables,
- Principal Component Analysis (PCA) to identify components that contribute minimally to variance, indicating potential redundancy, and
- domain knowledge to identify redundant measures.
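Two of these checks, correlation inspection and PCA, can be sketched in a few lines. The simulated data below is an assumption made for the example: two variables measure the same underlying construct, so the off-diagonal correlation is near 1 and the smallest principal component explains almost no variance.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate survey-style data where x1 and x2 are redundant measures of
# one underlying construct; x3 is independent (names are illustrative).
n = 500
construct = rng.normal(size=n)
x1 = construct + 0.1 * rng.normal(size=n)
x2 = construct + 0.1 * rng.normal(size=n)   # near-duplicate of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Correlation matrix: off-diagonal values near 1 flag redundancy
# (this is what a heat map would display visually).
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))

# PCA via eigendecomposition of the correlation matrix: a component
# explaining little variance indicates a redundant direction.
eigvals = np.linalg.eigvalsh(corr)[::-1]     # sorted, largest first
explained = eigvals / eigvals.sum()
print(np.round(explained, 3))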
Once identified, redundant measures can be resolved by selecting a subset of the most informative variables or by creating new variables that capture the underlying constructs with fewer features. While PCA is a common method for dimensionality reduction, it has limitations, including an inability to capture non-linear relationships and potential loss of information when selecting a subset of principal components.
An alternative approach is neural network-based feature autoencoding. Compared with PCA, autoencoding offers several advantages:
- It can reproduce the original data and capture non-linear relationships.
- It facilitates anomaly detection and can be applied to a variety of data types.
- It can retain more of the original information than PCA at the same reduced dimensionality.
- Some autoencoding methods avoid PCA's covariance-matrix and eigenvalue computations, making them less computationally expensive and more scalable.
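To make the idea concrete, here is a minimal single-hidden-layer autoencoder written from scratch in Python/NumPy; the presentation itself uses deep learning tooling from the R ecosystem, and the data, dimensions, and hyperparameters below are assumptions chosen for illustration. The non-linear bottleneck compresses 8 observed features into 2 codes, and the per-row reconstruction error is the quantity used for anomaly detection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy high-dimensional data: 8 observed features driven non-linearly
# by 2 latent factors (dimensions are illustrative).
n, d, k = 400, 8, 2
latent = rng.normal(size=(n, k))
W_true = rng.normal(size=(k, d))
X = np.tanh(latent @ W_true) + 0.05 * rng.normal(size=(n, d))

def train_autoencoder(X, k, lr=0.05, epochs=800):
    """Full-batch gradient descent on mean squared reconstruction error."""
    n, d = X.shape
    W1 = 0.1 * rng.normal(size=(d, k)); b1 = np.zeros(k)   # encoder
    W2 = 0.1 * rng.normal(size=(k, d)); b2 = np.zeros(d)   # decoder
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)          # non-linear bottleneck codes
        X_hat = H @ W2 + b2               # reconstruction
        err = X_hat - X
        # Backpropagate the reconstruction error through both layers.
        gW2 = H.T @ err / n; gb2 = err.mean(axis=0)
        gH = err @ W2.T * (1 - H**2)      # tanh derivative
        gW1 = X.T @ gH / n; gb1 = gH.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2

W1, b1, W2, b2 = train_autoencoder(X, k)
H = np.tanh(X @ W1 + b1)                  # 2-D codes usable for clustering
X_hat = H @ W2 + b2
row_error = ((X - X_hat) ** 2).mean(axis=1)   # high values flag anomalies
mse = row_error.mean()
baseline = ((X - X.mean(axis=0)) ** 2).mean() # error of predicting the mean
print(round(mse, 4), round(baseline, 4))
```

The learned codes in `H` replace the original 8 columns as input to cluster analysis, and `row_error` ranks observations by how poorly they fit the learned structure.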
The presentation will demonstrate time-saving tools within the R tidyverse ecosystem that facilitate the identification and resolution of redundant constructs, as well as useful feature engineering techniques for cluster analysis data. Specifically, it will discuss a variety of new R packages designed for Exploratory Data Analysis (EDA), such as corrplot, janitor, GWalkR, and dataxray, which provide excellent insights into data relationships. It will also illustrate feature engineering techniques implemented using R tidyverse workflows, allowing for simple, effective, and reproducible coding. Additionally, it will examine how deep learning autoencoding can be leveraged for dimensionality reduction and anomaly detection, contributing to higher-quality partitions in cluster analysis datasets.
Findings:
- Reproducible feature engineering and dimensionality reduction are straightforward using R tidyverse workflows.
- Deep learning autoencoding effectively addresses the challenge of high dimensionality in unsupervised learning.
- Autoencoding offers greater flexibility, captures non-linear relationships, and better preserves local structures compared to PCA.
Key Takeaways:
- This presentation highlights the importance of advanced feature engineering techniques and the use of deep learning autoencoding in enhancing partition quality and predictive accuracy of unsupervised learning scoring models.
- It will illustrate feature engineering with R tidyverse workflows for pre-cluster analysis and data cleaning.
- The advantages of deep learning autoencoding for dimensionality reduction and anomaly detection will be presented.
- The role of deep learning autoencoding in improving the quality of partitions in unsupervised learning will be examined.