Materials Genomics
Unit 9: Representation Learning and Feature Discovery

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg

  • Workflow Role: Replaces manual descriptor engineering with automated, data-driven feature extraction. {.fragment}
  • “Along the lips of the mystic portal he discovered writings which after a little study he was able to decipher.” — Nathanael West (McClarren 2021) {.fragment}
  • This unit explores how models “decipher” the language of crystal structures. {.fragment}

02. Learning outcomes for Unit 9

By the end of this unit, students can:

  • explain the bottleneck principle and the role of the latent space in autoencoders, {.fragment}
  • distinguish between linear (PCA) and nonlinear (Autoencoder) dimensionality reduction, {.fragment}
  • evaluate embedding quality using separability, transferability, and probe tests, {.fragment}
  • identify failure modes such as shortcut learning and over-compression in materials tasks, {.fragment}
  • implement a representation-learning pipeline for spectral or structural data. {.fragment}

03. Recap: Where we are in the curriculum

  • Unit 7: Regression on fixed features (descriptors). {.fragment}
  • Unit 8: Neural surrogates (MLPs) on fixed features. {.fragment}
  • Unit 9 (Today): The representation itself is now learned from the data. {.fragment}
  • Dependency: Builds on neural networks (MFML) and crystal structure fundamentals (MG Unit 2). {.fragment}

04. The bottleneck of hand-crafted descriptors

  • Many structure-property relations are too complex for fixed fingerprints (e.g., Magpie, SOAP). {.fragment}
  • Engineered features often saturate in performance or miss subtle structural interactions. {.fragment}
  • Feature Discovery: Instead of telling the model what to look for, we let the model find the most informative features (Neuer et al. 2024; Sandfeld et al. 2024). {.fragment}

05. Representation learning as an unsupervised task

  • Most materials data is unlabeled (structure exists, but property \(y\) is unknown). {.fragment}
  • Unsupervised learning seeks to uncover structure within the data itself. {.fragment}
  • Common paradigms: {.fragment}
    • Principal Component Analysis (PCA): Linear transformation. {.fragment}
    • Autoencoders: Nonlinear neural network-based compression. {.fragment}
    • Manifold Learning: t-SNE, UMAP. {.fragment}

06. The Autoencoder (AE) Topology

graph LR
  X[Input **x**] --> E[Encoder]
  E --> Z((Latent Space **z**))
  Z --> D[Decoder]
  D --> X_hat[Reconstruction **x̂**]
  style Z fill:#f9f,stroke:#333,stroke-width:4px

  • An AE is a neural network trained to map the input to itself: \(f(\mathbf{x}) \approx \mathbf{x}\) (the bottleneck makes an exact identity impossible). {.fragment}
  • Encoder \(\mathcal{E}(\mathbf{x}) = \mathbf{z}\): Compresses input to a low-dimensional “code” \(\mathbf{z}\). {.fragment}
  • Decoder \(\mathcal{D}(\mathbf{z}) = \hat{\mathbf{x}}\): Reconstructs the input from the code. {.fragment}
  • The Bottleneck: A hidden layer with fewer neurons than the input, forcing information compression (Neuer et al. 2024; McClarren 2021). {.fragment}

07. Formalizing the Identity Mapping

  • Training objective: Minimize reconstruction error (loss \(J\)): {.fragment} \[ J = \sum_i \|\mathbf{x}_i - \mathcal{D}(\mathcal{E}(\mathbf{x}_i))\|^2 \]
  • If reconstruction is successful, the latent vector \(\mathbf{z}\) must contain all essential information about \(\mathbf{x}\). {.fragment}
  • \(\mathbf{z}\) is the learned representation or embedding (Bishop 2006). {.fragment}
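The objective \(J\) can be sketched end-to-end with a tiny linear autoencoder in NumPy — a stand-in for the full nonlinear network; the data, layer sizes, and learning rate are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples on a 1-D line embedded in 5-D, so a 1-D bottleneck suffices.
t = rng.uniform(-1, 1, size=(200, 1))
X = t @ rng.normal(size=(1, 5))

# Linear encoder/decoder weights: a single-bottleneck AE with identity activations.
W_enc = rng.normal(scale=0.5, size=(5, 1))
W_dec = rng.normal(scale=0.5, size=(1, 5))

lr, losses = 0.05, []
for _ in range(500):
    Z = X @ W_enc                      # encoder: z = E(x)
    X_hat = Z @ W_dec                  # decoder: x_hat = D(z)
    R = X_hat - X                      # reconstruction residual
    losses.append(np.mean(np.sum(R ** 2, axis=1)))
    # Gradients of J = mean ||x - D(E(x))||^2 w.r.t. the two weight matrices.
    grad_dec = 2 * Z.T @ R / len(X)
    grad_enc = 2 * X.T @ (R @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
```

Because the data is exactly rank one, the reconstruction error drops to near zero once the bottleneck aligns with the generating direction.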

08. PCA vs. Autoencoders: Linear vs. Nonlinear

Principal Component Analysis (PCA)

  • A linear projection onto the eigenspace of the covariance matrix: {.fragment} \[ \hat{\mathbf{x}}_i = \mathbf{x}_i \mathbf{S}_C, \qquad \mathbf{S}_C = \mathbf{W}_k \mathbf{W}_k^\top, \] where the columns of \(\mathbf{W}_k\) are the top-\(k\) eigenvectors of the covariance matrix (row-vector convention, centered data).
  • Focuses on preserving variance. {.fragment}

Autoencoder (AE)

  • Uses nonlinear activation functions (ReLU, sigmoid) to “unwrap” complex manifolds. {.fragment}
  • Focuses on minimizing reconstruction error. {.fragment}

A single-layer AE with linear activations is mathematically equivalent to PCA (Bishop 2006; McClarren 2021).
| Feature    | PCA                        | Autoencoder         |
|------------|----------------------------|---------------------|
| Map        | Linear                     | Nonlinear (usually) |
| Optimizer  | Eigendecomposition         | Backpropagation     |
| Manifold   | Hyperplane                 | Curved/arbitrary    |
| Equivalent | AE with linear activations | General case        |
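The PCA column of the comparison can be made concrete. A minimal NumPy sketch, assuming the row-vector convention and synthetic near-rank-2 data (not a materials dataset):

```python
import numpy as np

rng = np.random.default_rng(1)

# Near-rank-2 data: 300 samples in 6-D plus small isotropic noise.
T = rng.normal(size=(300, 2))
X = T @ rng.normal(size=(2, 6)) + 0.01 * rng.normal(size=(300, 6))

def pca_reconstruct(X, k):
    """Project onto the top-k eigenvectors of the covariance matrix, then map back."""
    mu = X.mean(axis=0)
    Xc = X - mu
    C = Xc.T @ Xc / len(X)
    eigval, eigvec = np.linalg.eigh(C)      # eigenvalues in ascending order
    W = eigvec[:, ::-1][:, :k]              # top-k principal directions
    Z = Xc @ W                              # linear "encoder"
    return Z @ W.T + mu                     # linear "decoder"

err_k2 = np.mean((X - pca_reconstruct(X, 2)) ** 2)
err_k1 = np.mean((X - pca_reconstruct(X, 1)) ** 2)
```

With \(k=2\) the residual is pure noise; with \(k=1\) a signal direction is lost, so the error jumps — the linear analogue of choosing the bottleneck width.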

09. The Latent Space as a Coordinate System

  • The values in the bottleneck layer form the latent space \(\mathbf{z}\). {.fragment}
  • We treat \(\mathbf{z}\) as a continuous, searchable coordinate system for materials. {.fragment}
  • Intuition: Similar materials should lie close together in the latent space, forming a “materials manifold” (Sandfeld et al. 2024; Murphy 2012). {.fragment}

10. Feature Discovery: Interpreting Latent Dimensions

  • What does \(z_1\) or \(z_2\) actually mean physically? {.fragment}
  • Latent Traversal: Vary one dimension of \(\mathbf{z}\) while keeping others fixed and observe the decoded output \(\hat{\mathbf{x}}\). {.fragment}
  • Example: Dimension \(z_1\) might align with atomic volume, while \(z_2\) captures octahedral tilting, discovered without explicit labels (McClarren 2021). {.fragment}
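A latent traversal can be sketched as follows. The decoder here is a hypothetical stand-in for a trained \(\mathcal{D}\), constructed so that \(z_1\) controls overall intensity; in practice it would be the trained network:

```python
import numpy as np

grid = np.linspace(0, 1, 50)            # "wavelength" axis of the decoded signal

def decode(z):
    """Hypothetical decoder: z1 scales the level, z2 shifts the peak position."""
    z1, z2 = z
    return (1 + z1) * np.exp(-((grid - 0.5 - 0.3 * z2) ** 2) / 0.01)

# Traverse z1 on a grid while holding z2 fixed at 0.
traversal = [decode((z1, 0.0)) for z1 in np.linspace(-1, 1, 5)]

# The decoded signals change monotonically in overall intensity along z1.
levels = [s.max() for s in traversal]
```

Reading off how \(\hat{\mathbf{x}}\) changes along each axis is exactly how a physical meaning (here: intensity) is attributed to a latent dimension.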

11. Case Study: Compressing Leaf Spectra (McClarren 8.2)

  • Data: Reflectance/transmittance spectra (2051 wavelengths). {.fragment}
  • Goal: Reduce 4102 features to a latent space of size \(\ell=2\) or \(\ell=4\). {.fragment}
  • Result: \(\ell=4\) captures shape and intensity accurately (MAE 0.005). {.fragment}
  • Observation: Changing \(z_1\) changes the overall spectrum level nonlinearly (not just scaling). {.fragment}

12. Self-supervised learning in Materials Genomics

  • “Self-supervised” means the data provides the label. {.fragment}
  • Leverages massive unlabeled databases (e.g., Materials Project). {.fragment}
  • Masked Atom Modeling: Predict missing atoms in a crystal structure. {.fragment}
  • Contrastive Learning: Learn that a rotated or translated crystal is the same material. {.fragment}
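The contrastive idea can be sketched by generating a rotated "positive pair" and checking that a rotation-invariant view of the structure is unchanged (toy coordinates, not a real crystal):

```python
import numpy as np

rng = np.random.default_rng(3)

def random_rotation(rng):
    """Random orthogonal 3x3 matrix via QR (may include a reflection;
    interatomic distances are invariant either way)."""
    Q, R = np.linalg.qr(rng.normal(size=(3, 3)))
    return Q * np.sign(np.diag(R))

def pair_distances(coords):
    """Rotation-invariant view: the sorted set of interatomic distances."""
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    return np.sort(d[np.triu_indices(len(coords), k=1)])

coords = rng.uniform(size=(8, 3))                # toy 8-atom "structure"
augmented = coords @ random_rotation(rng).T      # positive pair for contrastive training

# Both views describe the same material: their distance sets coincide.
gap = np.max(np.abs(pair_distances(coords) - pair_distances(augmented)))
```

A contrastive loss would pull the embeddings of `coords` and `augmented` together while pushing other structures apart.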

13. Embedding quality: separability and structure

  • A good embedding is more than just low reconstruction error. {.fragment}
  • Separability Test: Do chemistry families or prototypes cluster naturally? {.fragment}
  • Neighborhood Consistency: Do physical neighbors in “property space” remain neighbors in “latent space” (the space of \(\mathbf{z}\))? {.fragment}
  • Visualization tools: t-SNE and UMAP help diagnose these properties (Sandfeld et al. 2024). {.fragment}
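Neighborhood consistency can be quantified as the mean overlap of \(k\)-NN index sets in the two spaces. A minimal sketch on toy data, comparing an isometric embedding against an unrelated one:

```python
import numpy as np

rng = np.random.default_rng(4)

def knn_sets(X, k):
    """Index sets of the k nearest neighbours of each row (excluding itself)."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def neighborhood_consistency(Z, Y, k=5):
    """Mean overlap between k-NN sets in latent space Z and property space Y."""
    nz, ny = knn_sets(Z, k), knn_sets(Y, k)
    return np.mean([len(set(a) & set(b)) / k for a, b in zip(nz, ny)])

Y = rng.normal(size=(60, 2))                 # toy "property space"
Q, _ = np.linalg.qr(rng.normal(size=(2, 2)))
Z_good = Y @ Q                               # isometric embedding keeps neighbours
Z_bad = rng.normal(size=(60, 2))             # unrelated embedding scrambles them

good = neighborhood_consistency(Z_good, Y)
bad = neighborhood_consistency(Z_bad, Y)
```

A score near 1 means physical neighbours stay latent neighbours; a score near \(k/(n-1)\) is chance level.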

14. Transferability: Embeddings as pre-trained featurizers

graph TD
  A[Large Unlabeled Dataset] --> B["Self-supervised Training (Autoencoder)"]
  B --> C[Extract Encoder **E**]
  C --> D[Freeze Encoder **E**]
  D --> E[Small Labeled Dataset]
  E --> F[Encoder **E** + Task-specific MLP]
  F --> G[Target Property Prediction]
  style C fill:#ccf,stroke:#333,stroke-width:2px

  • Workflow: {.fragment}
    1. Train AE on 1,000,000 unlabeled structures (e.g., MP). {.fragment}
    2. Freeze the Encoder \(\mathcal{E}\). {.fragment}
    3. Use \(\mathcal{E}(\mathbf{x})\) as input for a small-data property task (e.g., thermal conductivity). {.fragment}
  • The encoder has learned the “language” of crystal chemistry before seeing any property labels (Murphy 2012). {.fragment}
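The three-step workflow can be sketched with truncated SVD standing in for the pre-trained encoder; all data here is synthetic and the property is a toy linear function of the latent factors:

```python
import numpy as np

rng = np.random.default_rng(5)

# Shared generative basis: both pools come from the same "chemistry".
d, k = 20, 3
B = rng.normal(size=(k, d))

# Step 1: "pre-train" an encoder on a large unlabeled pool (SVD stands in for the AE).
X_big = rng.normal(size=(5000, k)) @ B
_, _, Vt = np.linalg.svd(X_big - X_big.mean(axis=0), full_matrices=False)
W_frozen = Vt[:k].T                          # Step 2: freeze the encoder weights

def encode(X):
    return X @ W_frozen                      # frozen E(x), no further training

# Step 3: small labeled task on top of the frozen embedding.
latent_small = rng.normal(size=(40, k))
X_small = latent_small @ B
y = latent_small @ np.array([1.0, -2.0, 0.5])    # toy target property

A = np.c_[encode(X_small), np.ones(len(X_small))]
head, *_ = np.linalg.lstsq(A, y, rcond=None)
r2 = 1 - np.sum((y - A @ head) ** 2) / np.sum((y - y.mean()) ** 2)
```

Because the frozen encoder already captured the generating factors, forty labeled samples suffice for an accurate head.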

15. Probing embeddings: The Linear Readout Test

  • How “ready” is the representation for prediction? {.fragment}
  • Linear Probe: Train a simple linear regressor on \(\mathbf{z}\) to predict property \(y\). {.fragment}
  • If \(\mathbf{z}\) can predict \(y\) linearly, the autoencoder has successfully “linearized” the complex physics. {.fragment}
  • This is a standard diagnostic for representation quality. {.fragment}
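A minimal linear probe, assuming a toy "embedding" of quadratic features that linearizes a target which is nonlinear in the raw inputs:

```python
import numpy as np

rng = np.random.default_rng(6)

X = rng.normal(size=(200, 2))
# Stand-in "learned embedding": quadratic features a good encoder might discover.
Z = np.c_[X[:, 0] ** 2, X[:, 1] ** 2, X[:, 0] * X[:, 1]]
y = X[:, 0] ** 2 + 2 * X[:, 0] * X[:, 1]     # nonlinear in x, linear in z

def linear_probe_r2(F, y):
    """R^2 of an ordinary-least-squares fit from features F (plus bias) to y."""
    A = np.c_[F, np.ones(len(F))]
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ w
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

r2_raw = linear_probe_r2(X, y)    # probe on raw inputs: fails
r2_emb = linear_probe_r2(Z, y)    # probe on the embedding: succeeds
```

A high probe score on \(\mathbf{z}\) but not on \(\mathbf{x}\) is exactly the signature of a representation that has "linearized" the physics.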

16. The Latent Space of Microstructures (Sandfeld 15.6)

  • Input: Microscopy images (TEM, SEM). {.fragment}
  • AE learns latent factors corresponding to: {.fragment}
    • Phase fraction {.fragment}
    • Grain size distribution {.fragment}
    • Orientation motifs {.fragment}
  • Bridges the “pixel world” and “parameter world” of constitutive models (Sandfeld et al. 2024). {.fragment}

17. Visualization Pitfalls: t-SNE and UMAP

  • Warning: Projections to 2D can create “hallucinated” clusters or hide real distances. {.fragment}
  • t-SNE does not preserve global topology; UMAP is better but still subject to artifacts. {.fragment}
  • Rule: Pretty pictures are for hypothesis generation, not scientific proof (Neuer et al. 2024; Sandfeld et al. 2024). {.fragment}

18. Failure Mode: Shortcut Learning

  • The AE might “cheat” by memorizing dataset indices or artifacts of the simulation pipeline. {.fragment}
  • Example: Reconstructing a crystal by memorizing the order in the file, not the chemistry. {.fragment}
  • Detection: Test on an out-of-distribution (OOD) chemical family. {.fragment}

19. Failure Mode: Over-compression

  • If the bottleneck is too narrow, critical physical nuances are lost. {.fragment}
  • A stable and an unstable polymorph might map to the same point \(\mathbf{z}\). {.fragment}
  • Tradeoff: Compression vs. Fidelity. {.fragment}
  • Choose \(\dim(\mathbf{z})\) by monitoring the elbow in the reconstruction loss. {.fragment}
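The elbow criterion can be sketched by scanning bottleneck sizes, with truncated SVD standing in for retraining the AE at each candidate dimension (synthetic data with three true factors):

```python
import numpy as np

rng = np.random.default_rng(7)

# Data with 3 dominant factors plus noise: the elbow should sit at dim(z) = 3.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 12)) + 0.05 * rng.normal(size=(200, 12))

def recon_error(X, k):
    """Mean squared reconstruction error of a rank-k (PCA-style) compression."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    X_hat = U[:, :k] * s[:k] @ Vt[:k] + mu
    return np.mean((X - X_hat) ** 2)

errors = {k: recon_error(X, k) for k in range(1, 7)}
# Steep drop up to k = 3, near-flat afterwards: that kink is the elbow.
drop_to_3 = errors[2] - errors[3]
drop_to_4 = errors[3] - errors[4]
```

Beyond the elbow, extra latent dimensions only chase noise; below it, signal (e.g., the polymorph distinction) is discarded.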

20. Variational Autoencoders (VAE) (Preview)

  • VAEs add a probabilistic constraint: the latent codes are pushed toward a standard normal prior, \(p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})\).
  • Benefit: Smooths the latent space, making it better for interpolation and generation.
  • Prevents “holes” in the materials manifold where the decoder fails (Bishop 2006).
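The two VAE-specific ingredients can be sketched directly: the closed-form KL penalty toward \(\mathcal{N}(\mathbf{0}, \mathbf{I})\) for a diagonal Gaussian, and the reparameterization trick (NumPy, no training loop):

```python
import numpy as np

rng = np.random.default_rng(8)

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps keeps sampling differentiable w.r.t. mu and log_var."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Two latent codes: one already standard normal, one displaced from the prior.
mu = np.array([[0.0, 0.0], [2.0, -1.0]])
log_var = np.zeros_like(mu)

kl = kl_to_standard_normal(mu, log_var)      # [0.0, 2.5]
z = reparameterize(mu, log_var, rng)
```

The KL term is zero exactly when the code matches the prior, which is what pulls the latent space together and fills the "holes".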

21. Embedding drift across domains

  • Embeddings are sensitive to the “source statistics” (e.g., DFT functional, relaxation settings).
  • An encoder trained on OQMD might fail on experimental data.
  • Mitigation: Domain adaptation or training on multi-source data.

22. Hybrid pipelines: Grey-Box Modeling

  • Don’t throw away human knowledge.
  • Concatenate engineered descriptors (which we trust) with learned embeddings (which capture complexity).
  • This “Grey-Box” approach balances interpretability and power (Sandfeld et al. 2024).

23. Uncertainty in the Latent Space

  • Is the prediction uncertain because the regressor is weak, or because the latent representation is ambiguous?
  • Anomaly detection: If \(\mathbf{x}_{\text{new}}\) reconstructs poorly, the embedding is not valid for this region.
  • This is the foundation for active learning in materials discovery.
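Reconstruction-error anomaly detection can be sketched with a rank-2 PCA model standing in for the AE: synthetic in-distribution data on a plane, plus one deliberately off-manifold point:

```python
import numpy as np

rng = np.random.default_rng(9)

# In-distribution data lies on a 2-D plane in 8-D; rank-2 PCA stands in for the AE.
latent = rng.normal(size=(300, 2))
B = rng.normal(size=(2, 8))
X_train = latent @ B + 0.01 * rng.normal(size=(300, 8))

mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
W = Vt[:2].T                                 # "encoder": top-2 principal directions

def recon_error(x):
    """Norm of the reconstruction residual ||x - D(E(x))|| for a single point."""
    z = (x - mu) @ W
    return float(np.linalg.norm((x - mu) - z @ W.T))

x_in = rng.normal(size=(1, 2)) @ B           # new point from the training manifold
x_ood = 3.0 * rng.normal(size=(1, 8))        # off-manifold (OOD) point

err_in = recon_error(x_in)
err_ood = recon_error(x_ood)
```

Thresholding this residual flags inputs where the embedding cannot be trusted — precisely the candidates an active-learning loop should query next.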

24. Summary of Unit 9

  • Representation learning replaces manual featurization with data-driven discovery.
  • Autoencoders use a bottleneck to extract salient structural information.
  • Latent spaces provide a continuous coordinate system for discovery and transfer.
  • Validation must go beyond reconstruction to include transfer and probe tests.

25. Bridge to Unit 10: Generative Modeling

  • If we can represent materials in a latent space (Unit 9)…
  • …we can generate new materials by sampling from that space (Unit 10).
  • Rule: Representation is the prerequisite for generation.

26. Exercise: Comparing Descriptors vs. Embeddings

  • Task: Implement a 100-32-10-32-100 AE for crystal data.
  • Comparison: Evaluate test RMSE using matminer descriptors vs. learned embeddings.
  • Goal: Identify which chemistry families benefit from learned features.

27. Exam Checklist

28. References

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
McClarren, Ryan G. 2021. Machine Learning for Engineers: Using Data to Solve Problems for Physical Systems. Springer.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.
Sandfeld, Stefan et al. 2024. Materials Data Science. Springer.