Materials Genomics
Unit 10: Latent Spaces of Materials

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


01. Title: Latent Spaces of Materials

  • Goal: Interpret latent spaces as structured materials manifolds, not just black-box compression vectors.
  • Utility: Navigation, interpolation, and anomaly detection in high-dimensional materials datasets.
  • Scientific Role: Provides a “Genomic” coordinate system for identifying prototypes and discovery candidates.
  • Book anchor: [Neuer 5.5.1–5.5.3].

02. Learning outcomes for Unit 10

By the end of this unit, students can:

  • define the latent space as a low-rank manifold of materials structure,
  • explain how a decoder acts as a materials generator,
  • use latent traversals and interpolation to identify structural trends,
  • apply reconstruction error as a quantitative metric for materials novelty,
  • diagnose failure modes such as latent collapse and source bias.

03. Recap: From Unit 9 (Representation Learning)

  • Unit 9 showed how an autoencoder learns the encoder \(\mathcal{E}\) and decoder \(\mathcal{D}\).
  • We established the bottleneck principle: \(z = \mathcal{E}(x)\) where \(\text{dim}(z) \ll \text{dim}(x)\).
  • Today: We focus on the geometry and interpretation of the space \(\mathbb{L}\) where \(z\) lives.
  • Book anchor: [McClarren Ch8].

04. What is a Latent Space? (Murphy 12.1)

  • The latent space \(\mathbb{L}\) is the space of variables \(z \in \mathbb{R}^L\) that “explain” the correlations in the observed data.
  • It represents the intrinsic dimensionality of the materials problem.
  • Probabilistic View: We assume a generative process \(p(x|z)\) where a few latent factors drive the observed structural complexity (Murphy 2012).

05. Latent Models as Low-Rank Parameterization

  • Factor Analysis (FA) models the data covariance as: \[ \mathbf{\Sigma} = \mathbf{W}\mathbf{W}^T + \mathbf{\Psi} \]
  • \(\mathbf{W}\) is the factor loading matrix (\(D \times L\)).
  • \(\mathbf{\Psi}\) is a diagonal matrix representing independent noise.
  • Discovery: We find a parsimonious description of complex materials trends using \(O(LD)\) parameters instead of \(O(D^2)\) (Murphy 2012).
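
The parameter savings can be sketched numerically (synthetic \(\mathbf{W}\) and \(\mathbf{\Psi}\); the dimensions are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

D, L = 100, 3                        # observed vs. latent dimension
W = rng.normal(size=(D, L))          # factor loading matrix (D x L)
psi = rng.uniform(0.1, 1.0, size=D)  # independent per-feature noise variances

# Low-rank-plus-diagonal covariance: Sigma = W W^T + Psi
Sigma = W @ W.T + np.diag(psi)

n_fa = D * L + D                     # O(LD) parameters for the FA model
n_full = D * (D + 1) // 2            # free parameters of a full covariance
print(n_fa, n_full)                  # 400 vs. 5050
```

The full \(D \times D\) covariance is recovered from an order of magnitude fewer parameters.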

06. The Autoencoder as a Nonlinear Manifold Learner

  • PCA is restricted to linear “pancake” projections.
  • Materials structure transitions (e.g., phase changes) are often nonlinearly “curved” in coordinate space.
  • Autoencoders use nonlinear activations (ReLU, Sigmoid) to “unwrap” these curved manifolds into a flat, searchable latent space (Sandfeld et al. 2024).
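
A minimal forward-pass sketch of this architecture, with random (untrained) weights: only the encoder/decoder shape and the ReLU nonlinearity are the point here, not the values.

```python
import numpy as np

rng = np.random.default_rng(1)

D, H, L = 8, 16, 2                       # input, hidden, latent widths (illustrative)
relu = lambda a: np.maximum(a, 0.0)

# Random, untrained weights and biases -- only the architecture matters here
W1, b1 = 0.3 * rng.normal(size=(H, D)), 0.1 * rng.normal(size=H)
W2, b2 = 0.3 * rng.normal(size=(L, H)), 0.1 * rng.normal(size=L)
W3, b3 = 0.3 * rng.normal(size=(H, L)), 0.1 * rng.normal(size=H)
W4, b4 = 0.3 * rng.normal(size=(D, H)), 0.1 * rng.normal(size=D)

def encode(x):                            # E: R^D -> R^L, nonlinear via ReLU
    return W2 @ relu(W1 @ x + b1) + b2

def decode(z):                            # D: R^L -> R^D
    return W4 @ relu(W3 @ z + b3) + b4

x = rng.normal(size=D)
z = encode(x)
x_hat = decode(z)
print(z.shape, x_hat.shape)               # (2,) (8,)
```

Unlike PCA, \(\mathcal{E}\) is not a linear map: scaling the input does not simply scale the latent code.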

07. Latent Scores and materials visualization

  • The value \(z_i\) for a material \(x_i\) is its latent score.
  • By plotting \(z_1\) vs. \(z_2\), we visualize the relationships among thousands of compounds.
  • Biplots: Project original feature vectors (e.g., density, bulk modulus) into the latent space to see which directions align with physical properties (Murphy 2012).
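
A biplot arrow can be computed as the correlation of a physical feature with each latent axis. The sketch below uses synthetic scores and features; the names `density` and `bulk_modulus` are illustrative stand-ins, not real data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-ins: latent scores Z (N x 2) plus two "physical" features
N = 500
Z = rng.normal(size=(N, 2))
density = 3.0 * Z[:, 0] + 0.1 * rng.normal(size=N)        # rides along z1
bulk_modulus = -2.0 * Z[:, 1] + 0.1 * rng.normal(size=N)  # rides along -z2

def loading(feature, Z):
    """Biplot arrow: Pearson correlation of a feature with each latent axis."""
    Zc = Z - Z.mean(axis=0)
    f = feature - feature.mean()
    return (Zc * f[:, None]).sum(axis=0) / (
        np.linalg.norm(Zc, axis=0) * np.linalg.norm(f))

print(loading(density, Z))       # ~ [ 1,  0]: density aligns with axis 1
print(loading(bulk_modulus, Z))  # ~ [ 0, -1]: bulk modulus anti-aligns with axis 2
```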

08. The Problem of Unidentifiability (Murphy 12.1.3)

  • Latent spaces are often “unidentifiable” up to a rotation.
  • If \(\mathbf{R}\) is an orthogonal matrix, then \(\mathbf{W}\) and \(\mathbf{WR}\) provide the same likelihood.
  • Warning: Don’t blindly trust that “Axis 1 is atomic radius.” The model discovers the subspace, but the axes themselves might be rotated (Murphy 2012).
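
The rotation ambiguity is easy to verify numerically for the FA covariance:

```python
import numpy as np

rng = np.random.default_rng(3)

D, L = 6, 2
W = rng.normal(size=(D, L))
psi = rng.uniform(0.1, 1.0, size=D)

theta = 0.7                                # any rotation angle works
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

Sigma_W  = W @ W.T + np.diag(psi)
Sigma_WR = (W @ R) @ (W @ R).T + np.diag(psi)

# Same covariance => same Gaussian likelihood, yet different "axes"
print(np.allclose(Sigma_W, Sigma_WR), np.allclose(W, W @ R))  # True False
```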

09. Latent Traversals: Decoding the Manifold

  • How to test if an axis has physical meaning? Traversals.
    1. Select a point \(z^*\) in latent space.
    2. Move along a direction \(\vec{v}\): \(z(\alpha) = z^* + \alpha \vec{v}\).
    3. Decode: \(\hat{x}(\alpha) = \mathcal{D}(z(\alpha))\).
  • Materials Insight: Observe how the crystal structure “morphs” (e.g., bonds stretch, octahedra tilt) as you move (McClarren 2021).
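
The three-step recipe above can be sketched with a hypothetical stand-in decoder; the two output "structure descriptors" (a bond length and a tilt angle) are invented for illustration.

```python
import numpy as np

# Hypothetical stand-in decoder: maps a 2-d latent code to two invented
# structure descriptors; a trained D would take its place
def decode(z):
    return np.array([1.5 + 0.1 * z[0],       # "bond length" (made up)
                     15.0 * np.tanh(z[1])])  # "octahedral tilt" (made up)

z_star = np.array([0.0, 0.5])                # 1. pick a point in latent space
v = np.array([1.0, 0.0])                     # 2. pick a direction to traverse

alphas = np.linspace(-2.0, 2.0, 9)
frames = np.stack([decode(z_star + a * v) for a in alphas])  # 3. decode the sweep

print(frames[:, 0])  # the "bond length" stretches monotonically along v
print(frames[:, 1])  # the "tilt" is untouched by this direction
```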

10. Interpolation: Navigating between Known Materials

  • Interpolation in raw coordinate space leads to nonphysical atom overlaps.
  • Latent Interpolation: \(z(t) = (1-t)z_A + t z_B\).
  • The path between material \(A\) and \(B\) in latent space follows the learned manifold.
  • Decoded intermediate structures are much more likely to be physically plausible (Sandfeld et al. 2024).
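
A minimal sketch of the interpolation itself (the decoder is omitted; in practice each path point would be passed through \(\mathcal{D}\) to obtain a structure):

```python
import numpy as np

def lerp(z_a, z_b, t):
    """Latent interpolation: z(t) = (1 - t) z_A + t z_B."""
    return (1.0 - t) * z_a + t * z_b

z_A = np.array([0.0, 1.0])   # latent code of known material A
z_B = np.array([2.0, -1.0])  # latent code of known material B

path = np.stack([lerp(z_A, z_B, t) for t in np.linspace(0.0, 1.0, 5)])
print(path)  # endpoints reproduce A and B; decode interior points with D
```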

11. Anomaly Detection via Reconstruction Error (Neuer 5.5.3)

  • The Autoencoder acts as a “physics validator” for the data it was trained on.
  • Detection: \(J = \|x - \mathcal{A}(x)\|^2\), where \(\mathcal{A} = \mathcal{D} \circ \mathcal{E}\) is the full autoencoder.
  • If \(J\) is high, the input \(x\) is “exotic” or “anomalous” relative to the training distribution.
  • Use case: Identifying genuine breakthroughs or simulation errors (Neuer et al. 2024).
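
A sketch of reconstruction-error scoring. As a stand-in for a trained \(\mathcal{A}\) it uses a linear autoencoder (projection onto the top principal subspace of synthetic training data); any trained \(\mathcal{A}(x) = \mathcal{D}(\mathcal{E}(x))\) slots in the same way.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in for a trained autoencoder: a linear AE, i.e. projection onto the
# top principal subspace of the training data
X_train = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 10))  # rank-5 "normal" data
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
P = Vt[:5].T @ Vt[:5]                        # projector onto the learned manifold

def reconstruct(x):
    return mu + (x - mu) @ P

def score(x):
    return float(np.sum((x - reconstruct(x)) ** 2))  # J = ||x - A(x)||^2

x_normal = X_train[0]                        # on-manifold sample
x_exotic = 5.0 * rng.normal(size=10)         # off-manifold input
print(score(x_normal), score(x_exotic))      # near zero vs. large
```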

12. Anomaly Detection in the Latent Space

  • Normal materials form clusters in \(\mathbb{L}\).
  • Anomalies move away from these clusters.
  • Decision Logic: An unusual material should stand out both in reconstruction error (reconstruction failure) and in latent position (structural outlier) (Neuer et al. 2024).
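
Latent-position outliers can be scored with, for example, the Mahalanobis distance to the training cluster (synthetic codes below; this complements, not replaces, the reconstruction-error check):

```python
import numpy as np

rng = np.random.default_rng(5)

Z_train = rng.normal(size=(500, 2))       # latent codes of "normal" materials
mu = Z_train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(Z_train.T))

def latent_distance(z):
    """Mahalanobis distance from z to the normal cluster in latent space."""
    d = z - mu
    return float(np.sqrt(d @ cov_inv @ d))

z_outlier = np.array([6.0, -6.0])
print(latent_distance(mu), latent_distance(z_outlier))  # 0 vs. large
# Flag an anomaly only when BOTH this distance and the reconstruction
# error J are high: structural outlier plus reconstruction failure.
```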

13. Conditional Probability in Latent Space (Neuer 5.5.7)

  • We can use histograms to estimate \(P(z_2 \mid z_1)\).
  • If \(z_1\) and \(z_2\) are highly correlated for “normal” materials, an entry with high \(z_1\) and low \(z_2\) is a high-confidence anomaly.
  • This allows for multidimensional anomaly scoring (Neuer et al. 2024).
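
A histogram-style conditional check on synthetic, correlated latent scores (the scoring rule below, std-devs from the bin mean, is one simple choice among several):

```python
import numpy as np

rng = np.random.default_rng(6)

# "Normal" materials: z2 tracks z1 closely
z1 = rng.normal(size=2000)
z2 = 0.9 * z1 + 0.3 * rng.normal(size=2000)

bins = np.linspace(-3.0, 3.0, 13)        # histogram bins along z1
idx = np.digitize(z1, bins)

def conditional_score(q1, q2):
    """How many std-devs q2 sits from the mean of z2 within q1's z1-bin."""
    sel = idx == np.digitize(q1, bins)
    m, s = z2[sel].mean(), z2[sel].std()
    return abs(q2 - m) / s

print(conditional_score(2.0, 1.8))       # follows the trend: small
print(conditional_score(2.0, -2.0))      # high z1, low z2: large -> anomaly
```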

14. Denoising: Latent space as a physical prior

  • In microscopy, convolutional autoencoders are used for denoising.
  • By training to reconstruct clean images from noisy ones, the model learns a latent space of “physically valid” microstructural motifs.
  • The latent space effectively acts as a filter that discards non-physical noise (Sandfeld et al. 2024).

15. Quantifying Latent Utility: The Readout Test

  • Is the latent space actually “better” than raw descriptors?
  • The Probe: Fix the Encoder \(\mathcal{E}\) and train a linear model on \(z\) to predict property \(y\).
  • If a linear probe on \(z\) beats a linear model on raw descriptors, the representation has successfully “linearized” the physics of the problem.
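
A sketch of the probe on synthetic data, constructed so the property is linear in \(z\) but nonlinear in the raw descriptors (the descriptor construction is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

N = 1000
Z = rng.normal(size=(N, 2))                                   # latent codes
y = 1.5 * Z[:, 0] - 2.0 * Z[:, 1] + 0.1 * rng.normal(size=N)  # property, linear in z

# Raw descriptors: a nonlinear mixing of the same underlying factors
X = np.column_stack([np.tanh(2.0 * Z[:, 0]),
                     Z[:, 1] ** 3,
                     Z[:, 0] * Z[:, 1]])

def r2_linear(F, y):
    """R^2 of an ordinary least-squares fit of y on features F (with intercept)."""
    A = np.column_stack([F, np.ones(len(F))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return 1.0 - (y - A @ coef).var() / y.var()

print(r2_linear(Z, y))   # near 1: the latent code has linearized the property
print(r2_linear(X, y))   # noticeably lower on the raw descriptors
```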

16. Failure Mode: Latent Collapse

  • Occurs when the decoder ignores the latent space (often because the decoder is too powerful).
  • Symptoms: Multiple input materials map to the same constant latent code.
  • Solution: Adjust the bottleneck width or use regularization (e.g., Kullback-Leibler divergence in VAEs).
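
One simple collapse diagnostic is the per-dimension variance of the latent codes over a batch: a (near-)constant dimension carries no information. A minimal sketch on synthetic codes:

```python
import numpy as np

def collapsed_dims(Z, rel_thresh=1e-3):
    """Flag latent dimensions whose variance over a batch is negligible."""
    var = Z.var(axis=0)
    return var < rel_thresh * var.max()

rng = np.random.default_rng(8)
Z_healthy = rng.normal(size=(300, 4))   # all four dimensions carry information
Z_partial = Z_healthy.copy()
Z_partial[:, 2] = 0.7                   # dimension 2 has collapsed to a constant

print(collapsed_dims(Z_healthy))        # nothing flagged
print(collapsed_dims(Z_partial))        # dimension 2 flagged
```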

17. Failure Mode: Source Bias and Manifold Domination

  • If the dataset mixes DFT-calculated and experimental structures, the latent space might just cluster by “Source.”
  • The model discovers the “simulation vs experiment” gap instead of chemistry.
  • Defense: Use domain-adversarial training to force the encoder to be “source-blind” (Sandfeld et al. 2024).

18. Failure Mode: Visually clean but scientifically useless

  • t-SNE and UMAP can produce beautiful clusters that correlate with nothing physical.
  • Lesson: Visually clean projections are not proof of discovery.
  • Always validate latent structure against known chemical trends or downstream predictive tasks (Sandfeld et al. 2024).

19. Case Study: Trajectories across composition series

  • Mapping a solid solution series (e.g., \(\mathrm{Ba}_{1-x}\mathrm{Sr}_x\mathrm{TiO}_3\)) in latent space.
  • Does the latent trajectory follow a smooth path?
  • Any “kinks” in the latent path might signal a hidden phase transition or symmetry change detected by the model.
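
Kinks can be quantified as the turning angle between consecutive latent steps; the trajectory below is hypothetical, built to contain one abrupt change of direction.

```python
import numpy as np

def turning_angles(path):
    """Angle (radians) between consecutive steps of a latent trajectory."""
    steps = np.diff(path, axis=0)
    u = steps / np.linalg.norm(steps, axis=1, keepdims=True)
    cosines = np.clip((u[:-1] * u[1:]).sum(axis=1), -1.0, 1.0)
    return np.arccos(cosines)

# Hypothetical latent trajectory for x = 0 ... 1: smooth drift, then an
# abrupt change of direction mid-series -- the candidate phase transition
path = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.2], [3.0, 0.3],
                 [3.1, 1.3], [3.2, 2.3]])

angles = np.degrees(turning_angles(path))
print(angles.round(1))   # one large turning angle flags the kink
```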

20. Bridge to Unit 11 (Clustering vs. Discovery)

  • Unit 10 Outcome: A continuous coordinate system \(\mathbb{L}\).
  • Unit 11 Goal: Partitioning \(\mathbb{L}\) into clusters and prototypes.
  • Connection: Clustering is only as good as the metric space (\(z\)) it lives in.

21. Summary of Unit 10

  • Latent spaces are low-rank parameterizations of materials manifolds.
  • Nonlinear autoencoders “unwrap” structural trends that PCA misses.
  • Traversals, interpolation, and reconstruction error are the primary tools for scientific interrogation.
  • Beware of unidentifiability, source bias, and latent collapse.

22. Exercise: Latent Traversal and Anomaly Scoring

  • Task: Train a simple AE on crystal structures.
  • 1. Visual: Plot the 2D latent space and color by chemistry.
  • 2. Traversal: Sweep one latent dimension and decode to see the structural “morphing.”
  • 3. Scoring: Use reconstruction error to identify the most “exotic” structure in the test set.

23. Exam Checklist

McClarren, Ryan G. 2021. Machine Learning for Engineers: Using Data to Solve Problems for Physical Systems. Springer.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.
Sandfeld, Stefan et al. 2024. Materials Data Science. Springer.