Materials Genomics
Unit 10: Latent Spaces of Materials

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


01. Title: Latent Spaces of Materials

  • Goal: Interpret latent spaces as structured materials manifolds, not just black-box compression vectors.
  • Utility: Navigation, interpolation, and anomaly detection in high-dimensional materials datasets.
  • Scientific Role: Provides a “Genomic” coordinate system for identifying prototypes and discovery candidates.
  • Book anchor: [Neuer 5.5.1–5.5.3].

02. Learning outcomes for Unit 10

By the end of this unit, students can:

  • define the latent space as a low-rank manifold of materials structure,
  • explain how a decoder acts as a materials generator,
  • use latent traversals and interpolation to identify structural trends,
  • apply reconstruction error as a quantitative metric for materials novelty,
  • diagnose failure modes such as latent collapse and source bias.

03. Recap: From Unit 9 (Representation Learning)

  • Unit 9 showed how an autoencoder learns the encoder \(\mathcal{E}\) and decoder \(\mathcal{D}\).
  • We established the bottleneck principle: \(z = \mathcal{E}(x)\) where \(\text{dim}(z) \ll \text{dim}(x)\).
  • Today: We focus on the geometry and interpretation of the space \(\mathbb{L}\) where \(z\) lives.
  • Book anchor: [McClarren Ch8].

04. What is a Latent Space? (Murphy 12.1)

  • The latent space \(\mathbb{L}\) is the space of variables \(z \in \mathbb{R}^L\) that “explain” the correlations in the observed data.
  • It represents the intrinsic dimensionality of the materials problem.
  • Probabilistic View: We assume a generative process \(p(x|z)\) where a few latent factors drive the observed structural complexity (Murphy 2012).

05. Latent Models as Low-Rank Parameterization

  • Factor Analysis (FA) models the data covariance as: \[ \mathbf{\Sigma} = \mathbf{W}\mathbf{W}^T + \mathbf{\Psi} \]
  • \(\mathbf{W}\) is the factor loading matrix (\(D \times L\)).
  • \(\mathbf{\Psi}\) is a diagonal matrix representing independent noise.
  • Discovery: We find a parsimonious description of complex materials trends using \(O(LD)\) parameters instead of \(O(D^2)\) (Murphy 2012).
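
The parameter savings can be sketched numerically (synthetic \(\mathbf{W}\) and \(\mathbf{\Psi}\); the dimensions are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

D, L = 100, 3                        # observed vs. latent dimension
W = rng.normal(size=(D, L))          # factor loading matrix (D x L)
psi = rng.uniform(0.1, 1.0, size=D)  # independent per-feature noise variances

# Low-rank-plus-diagonal covariance: Sigma = W W^T + Psi
Sigma = W @ W.T + np.diag(psi)

n_fa = D * L + D                     # O(LD) parameters for the FA model
n_full = D * (D + 1) // 2            # free parameters of a full covariance
print(n_fa, n_full)                  # 400 vs. 5050
```

The full \(D \times D\) covariance is recovered from an order of magnitude fewer parameters.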

06. The Autoencoder as a Nonlinear Manifold Learner

  • PCA is restricted to linear “pancake” projections.
  • Materials structure transitions (e.g., phase changes) are often nonlinearly “curved” in coordinate space.
  • Autoencoders use nonlinear activations (ReLU, Sigmoid) to “unwrap” these curved manifolds into a flat, searchable latent space (Sandfeld et al. 2024).
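
A minimal forward-pass sketch of this architecture, with random (untrained) weights: only the encoder/decoder shape and the ReLU nonlinearity are the point here, not the values.

```python
import numpy as np

rng = np.random.default_rng(1)

D, H, L = 8, 16, 2                       # input, hidden, latent widths (illustrative)
relu = lambda a: np.maximum(a, 0.0)

# Random, untrained weights and biases -- only the architecture matters here
W1, b1 = 0.3 * rng.normal(size=(H, D)), 0.1 * rng.normal(size=H)
W2, b2 = 0.3 * rng.normal(size=(L, H)), 0.1 * rng.normal(size=L)
W3, b3 = 0.3 * rng.normal(size=(H, L)), 0.1 * rng.normal(size=H)
W4, b4 = 0.3 * rng.normal(size=(D, H)), 0.1 * rng.normal(size=D)

def encode(x):                            # E: R^D -> R^L, nonlinear via ReLU
    return W2 @ relu(W1 @ x + b1) + b2

def decode(z):                            # D: R^L -> R^D
    return W4 @ relu(W3 @ z + b3) + b4

x = rng.normal(size=D)
z = encode(x)
x_hat = decode(z)
print(z.shape, x_hat.shape)               # (2,) (8,)
```

Unlike PCA, \(\mathcal{E}\) is not a linear map: scaling the input does not simply scale the latent code.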

07. Latent Scores and materials visualization

  • The value \(z_i\) for a material \(x_i\) is its latent score.
  • By plotting \(z_1\) vs. \(z_2\), we visualize the relationships among thousands of compounds.
  • Biplots: Project original feature vectors (e.g., density, bulk modulus) into the latent space to see which directions align with physical properties (Murphy 2012).
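
A biplot arrow can be computed as the correlation of a physical feature with each latent axis. The sketch below uses synthetic scores and features; the names `density` and `bulk_modulus` are illustrative stand-ins, not real data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-ins: latent scores Z (N x 2) plus two "physical" features
N = 500
Z = rng.normal(size=(N, 2))
density = 3.0 * Z[:, 0] + 0.1 * rng.normal(size=N)        # rides along z1
bulk_modulus = -2.0 * Z[:, 1] + 0.1 * rng.normal(size=N)  # rides along -z2

def loading(feature, Z):
    """Biplot arrow: Pearson correlation of a feature with each latent axis."""
    Zc = Z - Z.mean(axis=0)
    f = feature - feature.mean()
    return (Zc * f[:, None]).sum(axis=0) / (
        np.linalg.norm(Zc, axis=0) * np.linalg.norm(f))

print(loading(density, Z))       # ~ [ 1,  0]: density aligns with axis 1
print(loading(bulk_modulus, Z))  # ~ [ 0, -1]: bulk modulus anti-aligns with axis 2
```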

08. The Problem of Unidentifiability (Murphy 12.1.3)

  • Latent spaces are often “unidentifiable” up to a rotation.
  • If \(\mathbf{R}\) is an orthogonal matrix, then \(\mathbf{W}\) and \(\mathbf{WR}\) provide the same likelihood.
  • Warning: Don’t blindly trust that “Axis 1 is atomic radius.” The model discovers the subspace, but the axes themselves might be rotated (Murphy 2012).
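
The rotation ambiguity is easy to verify numerically for the FA covariance:

```python
import numpy as np

rng = np.random.default_rng(3)

D, L = 6, 2
W = rng.normal(size=(D, L))
psi = rng.uniform(0.1, 1.0, size=D)

theta = 0.7                                # any rotation angle works
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

Sigma_W  = W @ W.T + np.diag(psi)
Sigma_WR = (W @ R) @ (W @ R).T + np.diag(psi)

# Same covariance => same Gaussian likelihood, yet different "axes"
print(np.allclose(Sigma_W, Sigma_WR), np.allclose(W, W @ R))  # True False
```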

09. Latent Traversals: Decoding the Manifold

  • How to test if an axis has physical meaning? Traversals.
    1. Select a point \(z^*\) in latent space.
    2. Move along a direction \(\vec{v}\): \(z(\alpha) = z^* + \alpha \vec{v}\).
    3. Decode: \(\hat{x}(\alpha) = \mathcal{D}(z(\alpha))\).
  • Materials Insight: Observe how the crystal structure “morphs” (e.g., bonds stretch, octahedra tilt) as you move (McClarren 2021).
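
The three-step recipe above can be sketched with a hypothetical stand-in decoder; the two output "structure descriptors" (a bond length and a tilt angle) are invented for illustration.

```python
import numpy as np

# Hypothetical stand-in decoder: maps a 2-d latent code to two invented
# structure descriptors; a trained D would take its place
def decode(z):
    return np.array([1.5 + 0.1 * z[0],       # "bond length" (made up)
                     15.0 * np.tanh(z[1])])  # "octahedral tilt" (made up)

z_star = np.array([0.0, 0.5])                # 1. pick a point in latent space
v = np.array([1.0, 0.0])                     # 2. pick a direction to traverse

alphas = np.linspace(-2.0, 2.0, 9)
frames = np.stack([decode(z_star + a * v) for a in alphas])  # 3. decode the sweep

print(frames[:, 0])  # the "bond length" stretches monotonically along v
print(frames[:, 1])  # the "tilt" is untouched by this direction
```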

10. Interpolation: Navigating between Known Materials

  • Interpolation in raw coordinate space leads to nonphysical atom overlaps.
  • Latent Interpolation: \(z(t) = (1-t)z_A + t z_B\).
  • The path between material \(A\) and \(B\) in latent space follows the learned manifold.
  • Decoded intermediate structures are much more likely to be physically plausible (Sandfeld et al. 2024).
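
A minimal sketch of the interpolation itself (the decoder is omitted; in practice each path point would be passed through \(\mathcal{D}\) to obtain a structure):

```python
import numpy as np

def lerp(z_a, z_b, t):
    """Latent interpolation: z(t) = (1 - t) z_A + t z_B."""
    return (1.0 - t) * z_a + t * z_b

z_A = np.array([0.0, 1.0])   # latent code of known material A
z_B = np.array([2.0, -1.0])  # latent code of known material B

path = np.stack([lerp(z_A, z_B, t) for t in np.linspace(0.0, 1.0, 5)])
print(path)  # endpoints reproduce A and B; decode interior points with D
```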

11. Anomaly Detection via Reconstruction Error (Neuer 5.5.3)

  • The Autoencoder acts as a “physics validator” for the data it was trained on.
  • Detection: \(J = \|x - \mathcal{A}(x)\|^2\), where \(\mathcal{A} = \mathcal{D} \circ \mathcal{E}\) is the full autoencoder.
  • If \(J\) is high, the input \(x\) is “exotic” or “anomalous” relative to the training distribution.
  • Use case: Identifying genuine breakthroughs or simulation errors (Neuer et al. 2024).
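
A sketch of reconstruction-error scoring. As a stand-in for a trained \(\mathcal{A}\) it uses a linear autoencoder (projection onto the top principal subspace of synthetic training data); any trained \(\mathcal{A}(x) = \mathcal{D}(\mathcal{E}(x))\) slots in the same way.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in for a trained autoencoder: a linear AE, i.e. projection onto the
# top principal subspace of the training data
X_train = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 10))  # rank-5 "normal" data
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
P = Vt[:5].T @ Vt[:5]                        # projector onto the learned manifold

def reconstruct(x):
    return mu + (x - mu) @ P

def score(x):
    return float(np.sum((x - reconstruct(x)) ** 2))  # J = ||x - A(x)||^2

x_normal = X_train[0]                        # on-manifold sample
x_exotic = 5.0 * rng.normal(size=10)         # off-manifold input
print(score(x_normal), score(x_exotic))      # near zero vs. large
```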

12. Anomaly Detection in the Latent Space

  • Normal materials form clusters in \(\mathbb{L}\).
  • Anomalies move away from these clusters.
  • Decision Logic: An unusual material should stand out both in reconstruction error (reconstruction failure) and in latent position (structural outlier) (Neuer et al. 2024).
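
Latent-position outliers can be scored with, for example, the Mahalanobis distance to the training cluster (synthetic codes below; this complements, not replaces, the reconstruction-error check):

```python
import numpy as np

rng = np.random.default_rng(5)

Z_train = rng.normal(size=(500, 2))       # latent codes of "normal" materials
mu = Z_train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(Z_train.T))

def latent_distance(z):
    """Mahalanobis distance from z to the normal cluster in latent space."""
    d = z - mu
    return float(np.sqrt(d @ cov_inv @ d))

z_outlier = np.array([6.0, -6.0])
print(latent_distance(mu), latent_distance(z_outlier))  # 0 vs. large
# Flag an anomaly only when BOTH this distance and the reconstruction
# error J are high: structural outlier plus reconstruction failure.
```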

13. Conditional Probability in Latent Space (Neuer 5.5.7)

  • We can use histograms to estimate \(P(z_2 \mid z_1)\).
  • If \(z_1\) and \(z_2\) are highly correlated for “normal” materials, an entry with high \(z_1\) and low \(z_2\) is a high-confidence anomaly.
  • This allows for multidimensional anomaly scoring (Neuer et al. 2024).
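
A histogram-style conditional check on synthetic, correlated latent scores (the scoring rule below, std-devs from the bin mean, is one simple choice among several):

```python
import numpy as np

rng = np.random.default_rng(6)

# "Normal" materials: z2 tracks z1 closely
z1 = rng.normal(size=2000)
z2 = 0.9 * z1 + 0.3 * rng.normal(size=2000)

bins = np.linspace(-3.0, 3.0, 13)        # histogram bins along z1
idx = np.digitize(z1, bins)

def conditional_score(q1, q2):
    """How many std-devs q2 sits from the mean of z2 within q1's z1-bin."""
    sel = idx == np.digitize(q1, bins)
    m, s = z2[sel].mean(), z2[sel].std()
    return abs(q2 - m) / s

print(conditional_score(2.0, 1.8))       # follows the trend: small
print(conditional_score(2.0, -2.0))      # high z1, low z2: large -> anomaly
```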

14. Denoising: Latent space as a physical prior

  • In microscopy, convolutional autoencoders are used for denoising.
  • By training to reconstruct clean images from noisy ones, the model learns a latent space of “physically valid” microstructural motifs.
  • The latent space effectively acts as a filter that discards non-physical noise (Sandfeld et al. 2024).

15. Quantifying Latent Utility: The Readout Test

  • Is the latent space actually “better” than raw descriptors?
  • The Probe: Fix the Encoder \(\mathcal{E}\) and train a linear model on \(z\) to predict property \(y\).
  • If a linear probe on \(z\) beats a linear model on raw descriptors, the representation has successfully “linearized” the physics of the problem.
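
A sketch of the probe on synthetic data, constructed so the property is linear in \(z\) but nonlinear in the raw descriptors (the descriptor construction is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

N = 1000
Z = rng.normal(size=(N, 2))                                   # latent codes
y = 1.5 * Z[:, 0] - 2.0 * Z[:, 1] + 0.1 * rng.normal(size=N)  # property, linear in z

# Raw descriptors: a nonlinear mixing of the same underlying factors
X = np.column_stack([np.tanh(2.0 * Z[:, 0]),
                     Z[:, 1] ** 3,
                     Z[:, 0] * Z[:, 1]])

def r2_linear(F, y):
    """R^2 of an ordinary least-squares fit of y on features F (with intercept)."""
    A = np.column_stack([F, np.ones(len(F))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return 1.0 - (y - A @ coef).var() / y.var()

print(r2_linear(Z, y))   # near 1: the latent code has linearized the property
print(r2_linear(X, y))   # noticeably lower on the raw descriptors
```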

16. Failure Mode: Latent Collapse

  • Occurs when the decoder ignores the latent space (often because the decoder is too powerful).
  • Symptoms: Multiple input materials map to the same constant latent code.
  • Solution: Adjust the bottleneck width or use regularization (e.g., Kullback-Leibler divergence in VAEs).
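
One simple collapse diagnostic is the per-dimension variance of the latent codes over a batch: a (near-)constant dimension carries no information. A minimal sketch on synthetic codes:

```python
import numpy as np

def collapsed_dims(Z, rel_thresh=1e-3):
    """Flag latent dimensions whose variance over a batch is negligible."""
    var = Z.var(axis=0)
    return var < rel_thresh * var.max()

rng = np.random.default_rng(8)
Z_healthy = rng.normal(size=(300, 4))   # all four dimensions carry information
Z_partial = Z_healthy.copy()
Z_partial[:, 2] = 0.7                   # dimension 2 has collapsed to a constant

print(collapsed_dims(Z_healthy))        # nothing flagged
print(collapsed_dims(Z_partial))        # dimension 2 flagged
```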

17. Failure Mode: Source Bias and Manifold Domination

  • If the dataset mixes DFT-calculated and experimental structures, the latent space might just cluster by “Source.”
  • The model discovers the “simulation vs experiment” gap instead of chemistry.
  • Defense: Use domain-adversarial training to force the encoder to be “source-blind” (Sandfeld et al. 2024).

18. Failure Mode: Visually clean but scientifically useless

  • t-SNE and UMAP can produce beautiful clusters that correlate with nothing physical.
  • Lesson: Visually clean projections are not proof of discovery.
  • Always validate latent structure against known chemical trends or downstream predictive tasks (Sandfeld et al. 2024).

19. Case Study: Trajectories across composition series

  • Mapping a solid solution series (e.g., \(\mathrm{Ba}_{1-x}\mathrm{Sr}_x\mathrm{TiO}_3\)) in latent space.
  • Does the latent trajectory follow a smooth path?
  • Any “kinks” in the latent path might signal a hidden phase transition or symmetry change detected by the model.
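
Kinks can be quantified as the turning angle between consecutive latent steps; the trajectory below is hypothetical, built to contain one abrupt change of direction.

```python
import numpy as np

def turning_angles(path):
    """Angle (radians) between consecutive steps of a latent trajectory."""
    steps = np.diff(path, axis=0)
    u = steps / np.linalg.norm(steps, axis=1, keepdims=True)
    cosines = np.clip((u[:-1] * u[1:]).sum(axis=1), -1.0, 1.0)
    return np.arccos(cosines)

# Hypothetical latent trajectory for x = 0 ... 1: smooth drift, then an
# abrupt change of direction mid-series -- the candidate phase transition
path = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.2], [3.0, 0.3],
                 [3.1, 1.3], [3.2, 2.3]])

angles = np.degrees(turning_angles(path))
print(angles.round(1))   # one large turning angle flags the kink
```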

20. Bridge to Unit 11 (Clustering vs. Discovery)

  • Unit 10 Outcome: A continuous coordinate system \(\mathbb{L}\).
  • Unit 11 Goal: Partitioning \(\mathbb{L}\) into clusters and prototypes.
  • Connection: Clustering is only as good as the metric space (\(z\)) it lives in.

21. Summary of Unit 10

  • Latent spaces are low-rank parameterizations of materials manifolds.
  • Nonlinear autoencoders “unwrap” structural trends that PCA misses.
  • Traversals, interpolation, and reconstruction error are the primary tools for scientific interrogation.
  • Beware of unidentifiability, source bias, and latent collapse.

22. Exercise: Latent Traversal and Anomaly Scoring

  • Task: Train a simple AE on crystal structures.
  • 1. Visual: Plot the 2D latent space and color by chemistry.
  • 2. Traversal: Sweep one latent dimension and decode to see the structural “morphing.”
  • 3. Scoring: Use reconstruction error to identify the most “exotic” structure in the test set.

23. Exam Checklist

McClarren, Ryan G. 2021. Machine Learning for Engineers: Using Data to Solve Problems for Physical Systems. Springer.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.
Sandfeld, Stefan et al. 2024. Materials Data Science. Springer.