Materials Genomics
Unit 9: Representation Learning and Feature Discovery

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


01. Title: Representation Learning and Feature Discovery

  • Core Goal: Move from engineered features to learned embeddings that discover hidden structure in materials.
  • Workflow Role: Replaces manual descriptor engineering with automated, data-driven feature extraction.
  • “Along the lips of the mystic portal he discovered writings which after a little study he was able to decipher.” — Nathanael West (McClarren 2021)
  • This unit explores how models “decipher” the language of crystal structures.

02. Learning outcomes for Unit 9

By the end of this unit, students can:

  • explain the bottleneck principle and the role of the latent space in autoencoders,
  • distinguish between linear (PCA) and nonlinear (autoencoder) dimensionality reduction,
  • evaluate embedding quality using separability, transferability, and probe tests,
  • identify failure modes such as shortcut learning and over-compression in materials tasks,
  • implement a representation-learning pipeline for spectral or structural data.

03. Recap: Where we are in the curriculum

  • Unit 7: Regression on fixed features (descriptors).
  • Unit 8: Neural surrogates (MLPs) on fixed features.
  • Unit 9 (Today): The representation itself is now learned from the data.
  • Dependency: Builds on neural networks (MFML) and crystal structure fundamentals (MG Unit 2).

04. The bottleneck of hand-crafted descriptors

  • Many structure-property relations are too complex for fixed fingerprints (e.g., Magpie, SOAP).
  • Engineered features often saturate in performance or miss subtle structural interactions.
  • Feature Discovery: Instead of telling the model what to look for, we let the model find the most informative features (Neuer et al. 2024; Sandfeld et al. 2024).

05. Representation learning as an unsupervised task

  • Most materials data is unlabeled (structure exists, but property \(y\) is unknown).
  • Unsupervised learning seeks to uncover structure within the data itself.
  • Common paradigms:
    • Principal Component Analysis (PCA): Linear transformation.
    • Autoencoders: Nonlinear neural network-based compression.
    • Manifold Learning: t-SNE, UMAP.

06. The Autoencoder (AE) Topology

  • An AE is a neural network trained to reproduce its own input: \(f(x) \approx x\).
  • Encoder \(\mathcal{E}(x) = z\): Compresses input to a low-dimensional “code” \(z\).
  • Decoder \(\mathcal{D}(z) = \hat{x}\): Reconstructs the input from the code.
  • The Bottleneck: A hidden layer with fewer neurons than the input, forcing information compression (Neuer et al. 2024; McClarren 2021).

07. Formalizing the Identity Mapping

  • Training objective: Minimize reconstruction error (loss \(J\)): \[ J = \sum_i ||x_i - \mathcal{D}(\mathcal{E}(x_i))||^2 \]
  • If reconstruction is successful, the latent vector \(z\) must contain all essential information about \(x\).
  • \(z\) is the learned representation or embedding (Bishop 2006).
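The objective above can be sketched numerically. A minimal toy example, assuming a linear encoder/decoder with arbitrary, untrained weights (`W_enc`, `W_dec` are illustrative placeholders, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 50 samples with 8 features, assumed standardized.
X = rng.normal(size=(50, 8))

# Hypothetical linear encoder/decoder with a bottleneck of size 2.
W_enc = rng.normal(size=(8, 2)) * 0.1
W_dec = rng.normal(size=(2, 8)) * 0.1

def encode(x):
    return x @ W_enc            # z = E(x)

def decode(z):
    return z @ W_dec            # x_hat = D(z)

def reconstruction_loss(X):
    # J = sum_i ||x_i - D(E(x_i))||^2
    X_hat = decode(encode(X))
    return np.sum((X - X_hat) ** 2)

J = reconstruction_loss(X)
print(J)   # a non-negative scalar; training adjusts the weights to minimize it
```

Training would update `W_enc` and `W_dec` by gradient descent on \(J\); here only the loss evaluation is shown.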

08. PCA vs. Autoencoders: Linear vs. Nonlinear

  • PCA: A linear projection onto the leading eigenvectors of the covariance matrix: \[ \mathbf{z}_i = \mathbf{x}_i \mathbf{W}_k, \qquad \hat{\mathbf{x}}_i = \mathbf{z}_i \mathbf{W}_k^{\top} \] where the columns of \(\mathbf{W}_k\) are the top-\(k\) eigenvectors.
  • Autoencoder: Uses nonlinear activation functions (ReLU, Sigmoid) to “unwrap” complex manifolds.
  • An AE with a single hidden layer and purely linear activations learns the same subspace as PCA (Bishop 2006; McClarren 2021).
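The PCA side of this comparison fits in a few lines of numpy. A sketch (the eigendecomposition route via `numpy.linalg.eigh` is one of several equivalent implementations; the toy data is constructed to be exactly rank-2):

```python
import numpy as np

rng = np.random.default_rng(1)

# Data lying exactly on a 2-D linear subspace of R^6.
Z_true = rng.normal(size=(200, 2))
A = rng.normal(size=(2, 6))
X = Z_true @ A

def pca_reconstruct(X, k):
    """Project onto the top-k eigenvectors of the covariance matrix."""
    mu = X.mean(axis=0)
    Xc = X - mu
    C = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
    W_k = eigvecs[:, -k:]                   # top-k principal directions
    return (Xc @ W_k) @ W_k.T + mu          # z = x W_k, x_hat = z W_k^T

X_hat = pca_reconstruct(X, k=2)
err = np.mean((X - X_hat) ** 2)
print(err)   # ~0, because the data is exactly rank-2
```

A nonlinear AE is needed precisely when the data manifold is *not* such a flat subspace.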

09. The Latent Space as a Coordinate System

  • The values in the bottleneck layer form the latent space \(\mathbb{L}\).
  • We treat \(\mathbb{L}\) as a continuous, searchable coordinate system for materials.
  • Intuition: Similar materials should lie close together in \(\mathbb{L}\), forming a “materials manifold” (Sandfeld et al. 2024; Murphy 2012).

10. Feature Discovery: Interpreting Latent Dimensions

  • What does \(z_1\) or \(z_2\) actually mean physically?
  • Latent Traversal: Vary one dimension of \(z\) while keeping others fixed and observe the decoded output \(\hat{x}\).
  • Example: Dimension \(z_1\) might align with atomic volume, while \(z_2\) captures octahedral tilting, discovered without explicit labels (McClarren 2021).
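A latent traversal can be sketched with a toy, hand-made decoder; the mapping `decode` below is purely hypothetical, standing in for a trained network with a 2-D latent space:

```python
import numpy as np

# Hypothetical "trained" decoder: a fixed nonlinear map from a
# 2-D latent space to a 5-point output "spectrum".
W = np.array([[1.0, 0.5, 0.0, -0.5, -1.0],
              [0.0, 1.0, 2.0, 1.0, 0.0]])

def decode(z):
    return np.tanh(z @ W)

# Latent traversal: sweep z1 while holding z2 fixed at 0,
# and observe how the decoded output changes.
z2 = 0.0
for z1 in [-1.0, 0.0, 1.0]:
    x_hat = decode(np.array([z1, z2]))
    print(z1, x_hat.round(3))
```

In practice one plots the decoded outputs side by side and looks for a physically interpretable trend along each latent axis.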

11. Case Study: Compressing Leaf Spectra (McClarren 8.2)

  • Data: Reflectance/transmittance spectra (2051 wavelengths).
  • Goal: Reduce 4102 features to a latent space of size \(\ell=2\) or \(\ell=4\).
  • Result: \(\ell=4\) captures shape and intensity accurately (MAE 0.005).
  • Observation: Changing \(z_1\) changes the overall spectrum level nonlinearly (not just scaling).

12. Self-supervised learning in Materials Genomics

  • “Self-supervised” means the data provides the label.
  • Leverages massive unlabeled databases (e.g., Materials Project).
  • Masked Atom Modeling: Predict missing atoms in a crystal structure.
  • Contrastive Learning: Learn that a rotated or translated crystal is the same material.
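Masked-input training pairs can be generated in a few lines; the zero-masking convention and the `make_masked_pair` helper below are illustrative assumptions, not a fixed standard:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_masked_pair(x, mask_frac=0.25):
    """Self-supervised pair: the corrupted vector is the model input,
    the original x is the training target ("the data provides the label")."""
    x = np.asarray(x, dtype=float)
    n_mask = max(1, int(mask_frac * x.size))
    idx = rng.choice(x.size, size=n_mask, replace=False)
    x_masked = x.copy()
    x_masked[idx] = 0.0     # zero-masking convention (an assumption)
    return x_masked, x, idx

# Toy "site occupancy" vector for an 8-site cell.
x = np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=float)
x_in, target, masked_idx = make_masked_pair(x)
print(masked_idx, x_in)
```

The model is then trained to predict `target` from `x_in`; no property labels are involved.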

13. Embedding quality: separability and structure

  • A good embedding is more than just low reconstruction error.
  • Separability Test: Do chemistry families or prototypes cluster naturally?
  • Neighborhood Consistency: Do physical neighbors in “property space” remain neighbors in “latent space”?
  • Visualization tools: t-SNE and UMAP help diagnose these properties (Sandfeld et al. 2024).
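Neighborhood consistency can be quantified as the average overlap of the k-nearest-neighbor sets in the two spaces. A small numpy sketch (the helper names and the chance-level baseline are ours):

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row (self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def neighborhood_consistency(Z, Y, k=5):
    """Mean fraction of shared neighbors between space Z and space Y."""
    nz, ny = knn_indices(Z, k), knn_indices(Y, k)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nz, ny)]))

rng = np.random.default_rng(3)
Z = rng.normal(size=(60, 4))        # latent codes
Y_good = 2.0 * Z + 1.0              # "property space": distance-preserving rescaling
Y_bad = rng.normal(size=(60, 4))    # "property space" unrelated to Z

good, bad = neighborhood_consistency(Z, Y_good), neighborhood_consistency(Z, Y_bad)
print(good, bad)   # good = 1.0; bad sits near chance level
```

Values near 1 mean neighbors in property space stay neighbors in latent space; values near k/(n−1) indicate no relationship.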

14. Transferability: Embeddings as pre-trained featurizers

  • Workflow:
    1. Train AE on 1,000,000 unlabeled structures (e.g., MP).
    2. Freeze the Encoder \(\mathcal{E}\).
    3. Use \(\mathcal{E}(x)\) as input for a small-data property task (e.g., thermal conductivity).
  • The encoder has learned the “language” of crystal chemistry before seeing any property labels (Murphy 2012).

15. Probing embeddings: The Linear Readout Test

  • How “ready” is the representation for prediction?
  • Linear Probe: Train a simple linear regressor on \(z\) to predict property \(y\).
  • If \(z\) can predict \(y\) linearly, the autoencoder has successfully “linearized” the complex physics.
  • This is a standard diagnostic for representation quality.
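A linear probe is a few lines of least squares. The sketch below evaluates on the training data for brevity, whereas a real probe would use a held-out split; the toy property is linear in \(z\) by construction:

```python
import numpy as np

rng = np.random.default_rng(4)

# Pretend embeddings z (n x d) and a property y that, in this toy case,
# really is a linear function of z plus noise.
Z = rng.normal(size=(200, 6))
w_true = rng.normal(size=6)
y = Z @ w_true + 0.1 * rng.normal(size=200)

def linear_probe_r2(Z, y):
    """Fit y ~ Z w + b by least squares and report R^2."""
    Z1 = np.column_stack([Z, np.ones(len(Z))])   # append a bias column
    w, *_ = np.linalg.lstsq(Z1, y, rcond=None)
    resid = y - Z1 @ w
    return 1.0 - resid.var() / y.var()

r2 = linear_probe_r2(Z, y)
print(r2)   # close to 1: the representation "linearizes" the target
```

A low probe \(R^2\) does not prove the embedding is bad, only that the property is not linearly decodable from it.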

16. The Latent Space of Microstructures (Sandfeld 15.6)

  • Input: Microscopy images (TEM, SEM).
  • AE learns latent factors corresponding to:
    • Phase fraction
    • Grain size distribution
    • Orientation motifs
  • Bridges the “pixel world” and “parameter world” of constitutive models (Sandfeld et al. 2024).

17. Visualization Pitfalls: t-SNE and UMAP

  • Warning: Projections to 2D can create “hallucinated” clusters or hide real distances.
  • t-SNE preserves local neighborhoods but not global distances; UMAP retains more global structure but is still subject to artifacts.
  • Rule: Pretty pictures are for hypothesis generation, not scientific proof (Neuer et al. 2024; Sandfeld et al. 2024).

18. Failure Mode: Shortcut Learning

  • The AE might “cheat” by memorizing dataset indices or artifacts of the simulation pipeline.
  • Example: Reconstructing a crystal by memorizing the order in the file, not the chemistry.
  • Detection: Test on an out-of-distribution (OOD) chemical family.

19. Failure Mode: Over-compression

  • If the bottleneck is too narrow, critical physical nuances are lost.
  • A stable and an unstable polymorph might map to the same point \(z\).
  • Tradeoff: Compression vs. Fidelity.
  • Choose \(\text{dim}(z)\) by monitoring the elbow in reconstruction loss.
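The elbow heuristic can be illustrated with a linear proxy: scan the bottleneck size \(k\) and watch the reconstruction error flatten once \(k\) reaches the intrinsic dimensionality (here 3 by construction; a nonlinear AE would be scanned the same way, just at higher cost):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data with intrinsic dimensionality 3, embedded in R^10 plus noise.
Z = rng.normal(size=(300, 3))
X = Z @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(300, 10))

def recon_error(X, k):
    """Mean squared reconstruction error of a rank-k linear bottleneck."""
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T
    return float(np.mean((Xc - (Xc @ W) @ W.T) ** 2))

errors = [recon_error(X, k) for k in range(1, 7)]
for k, e in enumerate(errors, start=1):
    print(k, e)
# The error drops sharply up to k=3 and then flattens: the "elbow".
```

The chosen \(\text{dim}(z)\) is the smallest \(k\) past which the curve stops improving appreciably.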

20. Variational Autoencoders (VAE) (Preview)

  • VAEs add a probabilistic constraint: the latent codes are regularized toward a standard normal prior, \(p(z) = \mathcal{N}(0, I)\).
  • Benefit: Smooths the latent space, making it better for interpolation and generation.
  • Prevents “holes” in the materials manifold where the decoder fails (Bishop 2006).
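For a diagonal Gaussian posterior, the VAE's regularization term has a closed form, \(\mathrm{KL}\big(\mathcal{N}(\mu, \sigma^2 I)\,\|\,\mathcal{N}(0, I)\big) = \tfrac{1}{2}\sum_j \big(\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2\big)\). A quick numerical check:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), the VAE regularizer:
    0.5 * sum( exp(log_var) + mu^2 - 1 - log_var )."""
    mu, log_var = np.asarray(mu), np.asarray(log_var)
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# A latent code already matching the prior incurs zero penalty...
print(kl_to_standard_normal(mu=[0.0, 0.0], log_var=[0.0, 0.0]))   # 0.0
# ...while codes far from the prior are penalized.
print(kl_to_standard_normal(mu=[3.0, 0.0], log_var=[0.0, 0.0]))   # 4.5
```

This penalty is what pulls the codes together and closes the "holes" in the latent space.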

21. Embedding drift across domains

  • Embeddings are sensitive to the “source statistics” (e.g., DFT functional, relaxation settings).
  • An encoder trained on OQMD might fail on experimental data.
  • Mitigation: Domain adaptation or training on multi-source data.

22. Hybrid pipelines: Grey-Box Modeling

  • Don’t throw away human knowledge.
  • Concatenate engineered descriptors (which we trust) with learned embeddings (which capture complexity).
  • This “Grey-Box” approach balances interpretability and power (Sandfeld et al. 2024).

23. Uncertainty in the Latent Space

  • Is the prediction uncertain because the regressor is weak, or because the latent representation is ambiguous?
  • Anomaly detection: If \(x_{new}\) reconstructs poorly, the embedding is not valid for this region.
  • This is the foundation for active learning in materials discovery.
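Reconstruction-error anomaly scoring can be sketched with a linear (PCA) "autoencoder" standing in for a trained model; the setup below, including the two test samples, is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

# "Training" distribution: data on a 2-D subspace of R^8.
A = rng.normal(size=(2, 8))
X_train = rng.normal(size=(500, 2)) @ A

# Fit a rank-2 linear reconstruction model (PCA) as the stand-in AE.
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
W = Vt[:2].T

def recon_error(x):
    xc = x - mu
    return float(np.sum((xc - (xc @ W) @ W.T) ** 2))

x_in = rng.normal(size=(1, 2)) @ A       # in-distribution: on the manifold
x_ood = rng.normal(size=(1, 8)) * 3.0    # off-manifold sample

err_in, err_ood = recon_error(x_in), recon_error(x_ood)
print(err_in, err_ood)   # the off-manifold sample reconstructs far worse
```

A threshold on this error flags regions where the embedding, and hence any downstream prediction, should not be trusted.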

24. Summary of Unit 9

  • Representation learning replaces manual featurization with data-driven discovery.
  • Autoencoders use a bottleneck to extract salient structural information.
  • Latent spaces provide a continuous coordinate system for discovery and transfer.
  • Validation must go beyond reconstruction to include transfer and probe tests.

25. Bridge to Unit 10: Generative Modeling

  • If we can represent materials in a latent space (Unit 9)…
  • …we can generate new materials by sampling from that space (Unit 10).
  • Rule: Representation is the prerequisite for generation.

26. Exercise: Comparing Descriptors vs. Embeddings

  • Task: Implement a 100-32-10-32-100 AE for crystal data.
  • Comparison: Evaluate test RMSE using matminer descriptors vs. learned embeddings.
  • Goal: Identify which chemistry families benefit from learned features.
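A forward-pass sketch of the 100-32-10-32-100 architecture, with untrained random weights (a full solution would also implement the training loop and the matminer baseline):

```python
import numpy as np

rng = np.random.default_rng(7)

# Layer widths of the exercise architecture: 100-32-10-32-100.
sizes = [100, 32, 10, 32, 100]

# Hypothetical untrained parameters; training would fit them by
# minimizing the reconstruction error.
Ws = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

def forward(x):
    """Return the reconstruction x_hat and the bottleneck code z."""
    h = x
    z = None
    for i, (W, b) in enumerate(zip(Ws, bs)):
        h = h @ W + b
        if i < len(Ws) - 1:      # ReLU on hidden layers, linear output layer
            h = np.maximum(h, 0.0)
        if i == 1:               # the width-10 layer is the bottleneck
            z = h
    return h, z

x = rng.normal(size=(1, 100))
x_hat, z = forward(x)
print(x_hat.shape, z.shape)   # (1, 100) (1, 10)
```

The embeddings `z` extracted this way (after training) are what gets compared against the matminer descriptors in the exercise.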

27. Exam Checklist

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
McClarren, Ryan G. 2021. Machine Learning for Engineers: Using Data to Solve Problems for Physical Systems. Springer.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.
Sandfeld, Stefan et al. 2024. Materials Data Science. Springer.