Materials Genomics
Unit 9: Representation Learning and Feature Discovery

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


01. Title: Representation Learning and Feature Discovery

  • Core Goal: Move from engineered features to learned embeddings that discover hidden structure in materials.
  • Workflow Role: Replaces manual descriptor engineering with automated, data-driven feature extraction.
  • “Along the lips of the mystic portal he discovered writings which after a little study he was able to decipher.” — Nathanael West (McClarren 2021)
  • This unit explores how models “decipher” the language of crystal structures.

02. Learning outcomes for Unit 9

By the end of this unit, students can:

  • explain the bottleneck principle and the role of the latent space in autoencoders,
  • distinguish between linear (PCA) and nonlinear (autoencoder) dimensionality reduction,
  • evaluate embedding quality using separability, transferability, and probe tests,
  • identify failure modes such as shortcut learning and over-compression in materials tasks,
  • implement a representation-learning pipeline for spectral or structural data.

03. Recap: Where we are in the curriculum

  • Unit 7: Regression on fixed features (descriptors).
  • Unit 8: Neural surrogates (MLPs) on fixed features.
  • Unit 9 (Today): The representation itself is now learned from the data.
  • Dependency: Builds on neural networks (MFML) and crystal structure fundamentals (MG Unit 2).

04. The bottleneck of hand-crafted descriptors

  • Many structure-property relations are too complex for fixed fingerprints (e.g., Magpie, SOAP).
  • Engineered features often saturate in performance or miss subtle structural interactions.
  • Feature Discovery: Instead of telling the model what to look for, we let the model find the most informative features (Neuer et al. 2024; Sandfeld et al. 2024).

05. Representation learning as an unsupervised task

  • Most materials data is unlabeled (structure exists, but property \(y\) is unknown).
  • Unsupervised learning seeks to uncover structure within the data itself.
  • Common paradigms:
    • Principal Component Analysis (PCA): Linear transformation.
    • Autoencoders: Nonlinear neural network-based compression.
    • Manifold Learning: t-SNE, UMAP.

06. The Autoencoder (AE) Topology

  • An AE is a neural network trained to reproduce its own input: \(f(x) \approx x\).
  • Encoder \(\mathcal{E}(x) = z\): Compresses input to a low-dimensional “code” \(z\).
  • Decoder \(\mathcal{D}(z) = \hat{x}\): Reconstructs the input from the code.
  • The Bottleneck: A hidden layer with fewer neurons than the input, forcing information compression (Neuer et al. 2024; McClarren 2021).

07. Formalizing the Identity Mapping

  • Training objective: Minimize reconstruction error (loss \(J\)): \[ J = \sum_i ||x_i - \mathcal{D}(\mathcal{E}(x_i))||^2 \]
  • If reconstruction is successful, the latent vector \(z\) must contain all essential information about \(x\).
  • \(z\) is the learned representation or embedding (Bishop 2006).
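The objective above can be sketched numerically. A minimal toy example, assuming a linear encoder/decoder with arbitrary, untrained weights (`W_enc`, `W_dec` are illustrative placeholders, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 50 samples with 8 features, assumed standardized.
X = rng.normal(size=(50, 8))

# Hypothetical linear encoder/decoder with a bottleneck of size 2.
W_enc = rng.normal(size=(8, 2)) * 0.1
W_dec = rng.normal(size=(2, 8)) * 0.1

def encode(x):
    return x @ W_enc            # z = E(x)

def decode(z):
    return z @ W_dec            # x_hat = D(z)

def reconstruction_loss(X):
    # J = sum_i ||x_i - D(E(x_i))||^2
    X_hat = decode(encode(X))
    return np.sum((X - X_hat) ** 2)

J = reconstruction_loss(X)
print(J)   # a non-negative scalar; training adjusts the weights to minimize it
```

Training would update `W_enc` and `W_dec` by gradient descent on \(J\); here only the loss evaluation is shown.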

08. PCA vs. Autoencoders: Linear vs. Nonlinear

  • PCA: A linear projection onto the leading eigenvectors of the covariance matrix: \[ \mathbf{z}_i = \mathbf{x}_i \mathbf{W}_k, \qquad \hat{\mathbf{x}}_i = \mathbf{z}_i \mathbf{W}_k^{\top} \] where the columns of \(\mathbf{W}_k\) are the top-\(k\) eigenvectors.
  • Autoencoder: Uses nonlinear activation functions (ReLU, Sigmoid) to “unwrap” complex manifolds.
  • An AE with a single hidden layer and purely linear activations learns the same subspace as PCA (Bishop 2006; McClarren 2021).
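The PCA side of this comparison fits in a few lines of numpy. A sketch (the eigendecomposition route via `numpy.linalg.eigh` is one of several equivalent implementations; the toy data is constructed to be exactly rank-2):

```python
import numpy as np

rng = np.random.default_rng(1)

# Data lying exactly on a 2-D linear subspace of R^6.
Z_true = rng.normal(size=(200, 2))
A = rng.normal(size=(2, 6))
X = Z_true @ A

def pca_reconstruct(X, k):
    """Project onto the top-k eigenvectors of the covariance matrix."""
    mu = X.mean(axis=0)
    Xc = X - mu
    C = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
    W_k = eigvecs[:, -k:]                   # top-k principal directions
    return (Xc @ W_k) @ W_k.T + mu          # z = x W_k, x_hat = z W_k^T

X_hat = pca_reconstruct(X, k=2)
err = np.mean((X - X_hat) ** 2)
print(err)   # ~0, because the data is exactly rank-2
```

A nonlinear AE is needed precisely when the data manifold is *not* such a flat subspace.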

09. The Latent Space as a Coordinate System

  • The values in the bottleneck layer form the latent space \(\mathbb{L}\).
  • We treat \(\mathbb{L}\) as a continuous, searchable coordinate system for materials.
  • Intuition: Similar materials should lie close together in \(\mathbb{L}\), forming a “materials manifold” (Sandfeld et al. 2024; Murphy 2012).

10. Feature Discovery: Interpreting Latent Dimensions

  • What does \(z_1\) or \(z_2\) actually mean physically?
  • Latent Traversal: Vary one dimension of \(z\) while keeping others fixed and observe the decoded output \(\hat{x}\).
  • Example: Dimension \(z_1\) might align with atomic volume, while \(z_2\) captures octahedral tilting, discovered without explicit labels (McClarren 2021).
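A latent traversal can be sketched with a toy, hand-made decoder; the mapping `decode` below is purely hypothetical, standing in for a trained network with a 2-D latent space:

```python
import numpy as np

# Hypothetical "trained" decoder: a fixed nonlinear map from a
# 2-D latent space to a 5-point output "spectrum".
W = np.array([[1.0, 0.5, 0.0, -0.5, -1.0],
              [0.0, 1.0, 2.0, 1.0, 0.0]])

def decode(z):
    return np.tanh(z @ W)

# Latent traversal: sweep z1 while holding z2 fixed at 0,
# and observe how the decoded output changes.
z2 = 0.0
for z1 in [-1.0, 0.0, 1.0]:
    x_hat = decode(np.array([z1, z2]))
    print(z1, x_hat.round(3))
```

In practice one plots the decoded outputs side by side and looks for a physically interpretable trend along each latent axis.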

11. Case Study: Compressing Leaf Spectra (McClarren 8.2)

  • Data: Reflectance/transmittance spectra (2051 wavelengths).
  • Goal: Reduce 4102 features to a latent space of size \(\ell=2\) or \(\ell=4\).
  • Result: \(\ell=4\) captures shape and intensity accurately (MAE 0.005).
  • Observation: Changing \(z_1\) changes the overall spectrum level nonlinearly (not just scaling).

12. Self-supervised learning in Materials Genomics

  • “Self-supervised” means the data provides the label.
  • Leverages massive unlabeled databases (e.g., Materials Project).
  • Masked Atom Modeling: Predict missing atoms in a crystal structure.
  • Contrastive Learning: Learn that a rotated or translated crystal is the same material.
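Masked-input training pairs can be generated in a few lines; the zero-masking convention and the `make_masked_pair` helper below are illustrative assumptions, not a fixed standard:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_masked_pair(x, mask_frac=0.25):
    """Self-supervised pair: the corrupted vector is the model input,
    the original x is the training target ("the data provides the label")."""
    x = np.asarray(x, dtype=float)
    n_mask = max(1, int(mask_frac * x.size))
    idx = rng.choice(x.size, size=n_mask, replace=False)
    x_masked = x.copy()
    x_masked[idx] = 0.0     # zero-masking convention (an assumption)
    return x_masked, x, idx

# Toy "site occupancy" vector for an 8-site cell.
x = np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=float)
x_in, target, masked_idx = make_masked_pair(x)
print(masked_idx, x_in)
```

The model is then trained to predict `target` from `x_in`; no property labels are involved.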

13. Embedding quality: separability and structure

  • A good embedding is more than just low reconstruction error.
  • Separability Test: Do chemistry families or prototypes cluster naturally?
  • Neighborhood Consistency: Do physical neighbors in “property space” remain neighbors in “latent space”?
  • Visualization tools: t-SNE and UMAP help diagnose these properties (Sandfeld et al. 2024).
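Neighborhood consistency can be quantified as the average overlap of the k-nearest-neighbor sets in the two spaces. A small numpy sketch (the helper names and the chance-level baseline are ours):

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row (self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def neighborhood_consistency(Z, Y, k=5):
    """Mean fraction of shared neighbors between space Z and space Y."""
    nz, ny = knn_indices(Z, k), knn_indices(Y, k)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nz, ny)]))

rng = np.random.default_rng(3)
Z = rng.normal(size=(60, 4))        # latent codes
Y_good = 2.0 * Z + 1.0              # "property space": distance-preserving rescaling
Y_bad = rng.normal(size=(60, 4))    # "property space" unrelated to Z

good, bad = neighborhood_consistency(Z, Y_good), neighborhood_consistency(Z, Y_bad)
print(good, bad)   # good = 1.0; bad sits near chance level
```

Values near 1 mean neighbors in property space stay neighbors in latent space; values near k/(n−1) indicate no relationship.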

14. Transferability: Embeddings as pre-trained featurizers

  • Workflow:
    1. Train AE on 1,000,000 unlabeled structures (e.g., MP).
    2. Freeze the Encoder \(\mathcal{E}\).
    3. Use \(\mathcal{E}(x)\) as input for a small-data property task (e.g., thermal conductivity).
  • The encoder has learned the “language” of crystal chemistry before seeing any property labels (Murphy 2012).

15. Probing embeddings: The Linear Readout Test

  • How “ready” is the representation for prediction?
  • Linear Probe: Train a simple linear regressor on \(z\) to predict property \(y\).
  • If \(z\) can predict \(y\) linearly, the autoencoder has successfully “linearized” the complex physics.
  • This is a standard diagnostic for representation quality.
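A linear probe is a few lines of least squares. The sketch below evaluates on the training data for brevity, whereas a real probe would use a held-out split; the toy property is linear in \(z\) by construction:

```python
import numpy as np

rng = np.random.default_rng(4)

# Pretend embeddings z (n x d) and a property y that, in this toy case,
# really is a linear function of z plus noise.
Z = rng.normal(size=(200, 6))
w_true = rng.normal(size=6)
y = Z @ w_true + 0.1 * rng.normal(size=200)

def linear_probe_r2(Z, y):
    """Fit y ~ Z w + b by least squares and report R^2."""
    Z1 = np.column_stack([Z, np.ones(len(Z))])   # append a bias column
    w, *_ = np.linalg.lstsq(Z1, y, rcond=None)
    resid = y - Z1 @ w
    return 1.0 - resid.var() / y.var()

r2 = linear_probe_r2(Z, y)
print(r2)   # close to 1: the representation "linearizes" the target
```

A low probe \(R^2\) does not prove the embedding is bad, only that the property is not linearly decodable from it.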

16. The Latent Space of Microstructures (Sandfeld 15.6)

  • Input: Microscopy images (TEM, SEM).
  • AE learns latent factors corresponding to:
    • Phase fraction
    • Grain size distribution
    • Orientation motifs
  • Bridges the “pixel world” and “parameter world” of constitutive models (Sandfeld et al. 2024).

17. Visualization Pitfalls: t-SNE and UMAP

  • Warning: Projections to 2D can create “hallucinated” clusters or hide real distances.
  • t-SNE preserves local neighborhoods but not global distances; UMAP retains more global structure but is still subject to artifacts.
  • Rule: Pretty pictures are for hypothesis generation, not scientific proof (Neuer et al. 2024; Sandfeld et al. 2024).

18. Failure Mode: Shortcut Learning

  • The AE might “cheat” by memorizing dataset indices or artifacts of the simulation pipeline.
  • Example: Reconstructing a crystal by memorizing the order in the file, not the chemistry.
  • Detection: Test on an out-of-distribution (OOD) chemical family.

19. Failure Mode: Over-compression

  • If the bottleneck is too narrow, critical physical nuances are lost.
  • A stable and an unstable polymorph might map to the same point \(z\).
  • Tradeoff: Compression vs. Fidelity.
  • Choose \(\text{dim}(z)\) by monitoring the elbow in reconstruction loss.
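The elbow heuristic can be illustrated with a linear proxy: scan the bottleneck size \(k\) and watch the reconstruction error flatten once \(k\) reaches the intrinsic dimensionality (here 3 by construction; a nonlinear AE would be scanned the same way, just at higher cost):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data with intrinsic dimensionality 3, embedded in R^10 plus noise.
Z = rng.normal(size=(300, 3))
X = Z @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(300, 10))

def recon_error(X, k):
    """Mean squared reconstruction error of a rank-k linear bottleneck."""
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T
    return float(np.mean((Xc - (Xc @ W) @ W.T) ** 2))

errors = [recon_error(X, k) for k in range(1, 7)]
for k, e in enumerate(errors, start=1):
    print(k, e)
# The error drops sharply up to k=3 and then flattens: the "elbow".
```

The chosen \(\text{dim}(z)\) is the smallest \(k\) past which the curve stops improving appreciably.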

20. Variational Autoencoders (VAE) (Preview)

  • VAEs add a probabilistic constraint: the latent codes are regularized toward a standard normal prior, \(p(z) = \mathcal{N}(0, I)\).
  • Benefit: Smooths the latent space, making it better for interpolation and generation.
  • Prevents “holes” in the materials manifold where the decoder fails (Bishop 2006).
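For a diagonal Gaussian posterior, the VAE's regularization term has a closed form, \(\mathrm{KL}\big(\mathcal{N}(\mu, \sigma^2 I)\,\|\,\mathcal{N}(0, I)\big) = \tfrac{1}{2}\sum_j \big(\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2\big)\). A quick numerical check:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), the VAE regularizer:
    0.5 * sum( exp(log_var) + mu^2 - 1 - log_var )."""
    mu, log_var = np.asarray(mu), np.asarray(log_var)
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# A latent code already matching the prior incurs zero penalty...
print(kl_to_standard_normal(mu=[0.0, 0.0], log_var=[0.0, 0.0]))   # 0.0
# ...while codes far from the prior are penalized.
print(kl_to_standard_normal(mu=[3.0, 0.0], log_var=[0.0, 0.0]))   # 4.5
```

This penalty is what pulls the codes together and closes the "holes" in the latent space.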

21. Embedding drift across domains

  • Embeddings are sensitive to the “source statistics” (e.g., DFT functional, relaxation settings).
  • An encoder trained on OQMD might fail on experimental data.
  • Mitigation: Domain adaptation or training on multi-source data.

22. Hybrid pipelines: Grey-Box Modeling

  • Don’t throw away human knowledge.
  • Concatenate engineered descriptors (which we trust) with learned embeddings (which capture complexity).
  • This “Grey-Box” approach balances interpretability and power (Sandfeld et al. 2024).

23. Uncertainty in the Latent Space

  • Is the prediction uncertain because the regressor is weak, or because the latent representation is ambiguous?
  • Anomaly detection: If \(x_{new}\) reconstructs poorly, the embedding is not valid for this region.
  • This is the foundation for active learning in materials discovery.
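Reconstruction-error anomaly scoring can be sketched with a linear (PCA) "autoencoder" standing in for a trained model; the setup below, including the two test samples, is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

# "Training" distribution: data on a 2-D subspace of R^8.
A = rng.normal(size=(2, 8))
X_train = rng.normal(size=(500, 2)) @ A

# Fit a rank-2 linear reconstruction model (PCA) as the stand-in AE.
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
W = Vt[:2].T

def recon_error(x):
    xc = x - mu
    return float(np.sum((xc - (xc @ W) @ W.T) ** 2))

x_in = rng.normal(size=(1, 2)) @ A       # in-distribution: on the manifold
x_ood = rng.normal(size=(1, 8)) * 3.0    # off-manifold sample

err_in, err_ood = recon_error(x_in), recon_error(x_ood)
print(err_in, err_ood)   # the off-manifold sample reconstructs far worse
```

A threshold on this error flags regions where the embedding, and hence any downstream prediction, should not be trusted.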

24. Summary of Unit 9

  • Representation learning replaces manual featurization with data-driven discovery.
  • Autoencoders use a bottleneck to extract salient structural information.
  • Latent spaces provide a continuous coordinate system for discovery and transfer.
  • Validation must go beyond reconstruction to include transfer and probe tests.

25. Bridge to Unit 10: Generative Modeling

  • If we can represent materials in a latent space (Unit 9)…
  • …we can generate new materials by sampling from that space (Unit 10).
  • Rule: Representation is the prerequisite for generation.

26. Exercise: Comparing Descriptors vs. Embeddings

  • Task: Implement a 100-32-10-32-100 AE for crystal data.
  • Comparison: Evaluate test RMSE using matminer descriptors vs. learned embeddings.
  • Goal: Identify which chemistry families benefit from learned features.
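A forward-pass sketch of the 100-32-10-32-100 architecture, with untrained random weights (a full solution would also implement the training loop and the matminer baseline):

```python
import numpy as np

rng = np.random.default_rng(7)

# Layer widths of the exercise architecture: 100-32-10-32-100.
sizes = [100, 32, 10, 32, 100]

# Hypothetical untrained parameters; training would fit them by
# minimizing the reconstruction error.
Ws = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

def forward(x):
    """Return the reconstruction x_hat and the bottleneck code z."""
    h = x
    z = None
    for i, (W, b) in enumerate(zip(Ws, bs)):
        h = h @ W + b
        if i < len(Ws) - 1:      # ReLU on hidden layers, linear output layer
            h = np.maximum(h, 0.0)
        if i == 1:               # the width-10 layer is the bottleneck
            z = h
    return h, z

x = rng.normal(size=(1, 100))
x_hat, z = forward(x)
print(x_hat.shape, z.shape)   # (1, 100) (1, 10)
```

The embeddings `z` extracted this way (after training) are what gets compared against the matminer descriptors in the exercise.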

27. Exam Checklist

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
McClarren, Ryan G. 2021. Machine Learning for Engineers: Using Data to Solve Problems for Physical Systems. Springer.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.
Sandfeld, Stefan et al. 2024. Materials Data Science. Springer.