Materials Genomics
Unit 10: Representation Learning and Feature Discovery

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


§0 · Frame

01. Today’s Question

Which pretrained crystal representation do we trust, on which downstream task?

  • The 2018–2022 question was “should we learn a crystal representation at all?”
  • The 2024–2026 question is “which one?”: CGCNN, M3GNet, MACE-MP, OMat24-class, MatBERT, or ChemFormer.
  • This unit answers it for materials, applying what MFML and ML-PC already taught.

What this unit is not.

  • Not a re-derivation of the autoencoder — that is MFML W5 (Bishop 2006; Murphy 2012).
  • Not a re-introduction to t-SNE / UMAP / contrastive learning — that is MFML W9.
  • Not a tour of conv-AEs on micrographs — that is ML-PC W5.
  • Today’s job: stand on those three and apply representation learning to crystals and materials data.

02. Where We Are

Recap — what we already have

  • MFML W5: autoencoder objective, encoder/decoder, bottleneck, K-means/GMM on embeddings.
  • MFML W9: latent-space geometry, t-SNE/UMAP, contrastive learning, the foundation-embedding concept.
  • ML-PC W5: convolutional AE on micrographs, frozen-CNN embeddings, AE anomaly detection.
  • MG U6: Magpie / matminer / SOAP / ACSF — the engineered-descriptor baseline we have to beat.

Today — Unit 10 in one line

  • Replace hand-crafted crystal descriptors with pretrained, materials-specific embeddings — and learn how to verify they are doing real work.
  • Five strands: chemistry priors, SSL pretraining on MP/OQMD, contrastive crystal embeddings, foundation models for materials, embedding diagnostics.

03. Learning Outcomes

By the end of 90 minutes, you can:

  1. Articulate why a crystal embedding is fundamentally different from an image or text embedding (chemistry, periodicity, equivariance).
  2. Describe self-supervised pretext tasks for crystal data — atom masking, edge masking, denoising, contrastive pairs — and the database substrate (MP / OQMD / AFLOW / NOMAD).
  3. Construct a SimCLR-style contrastive setup for crystals: positive pairs, hard negatives, InfoNCE loss.
  4. Identify the 2024–2026 foundation-model families for materials (M3GNet, MACE-MP, OMat24-class, MatBERT, ChemFormer) and pick the right one for a downstream task.
  5. Diagnose a learned embedding using linear probes and nearest-neighbour retrieval — and recognise the “pretty t-SNE, dead downstream” failure mode.
  6. Decide when a learned crystal embedding is justified over a Magpie / SOAP baseline and when it is not.

§A · MFML W9 Recap

04. Latent space, in one slide

Restated from MFML W9

  • \(z = \mathcal{E}(x) \in \mathbb{R}^d\), \(d \ll \dim(x)\).
  • \(z\) is low-dimensional, continuous, hopefully smooth in property-relevant directions (Bishop 2006; Murphy 2012).
  • An AE trained on materials data gives one such \(z\). So does a contrastively trained encoder. So does a foundation model.

The materials-specific question for today

  • What changes when the input \(x\) is a crystal rather than an image?
  • Answer (preview): chemistry, periodicity, equivariance, and supercell invariance must be respected — by the architecture or by the data augmentation.
  • Hold this question; we answer it in §B.

05. t-SNE / UMAP, in one slide

Restated from MFML W9

  • 2D / 3D projections of high-dimensional \(z\) for visualisation.
  • t-SNE preserves local neighbourhoods, distorts global topology.
  • UMAP preserves more global structure, still has artefacts.
  • Both are exploration tools, not metrics (Neuer et al. 2024; Sandfeld et al. 2024).

Why this matters for §F

  • Materials students disproportionately publish a “pretty t-SNE” as evidence the embedding is good.
  • It is necessary (a hairball is bad news) but not sufficient (a clean t-SNE on cell-size metadata is worse than useless).
  • §F builds the honest diagnostic stack on top of probes, not projections.

06. Contrastive learning, in one slide

Restated from MFML W9

  • Pull positives together, push negatives apart, no labels.
  • InfoNCE: \(\mathcal{L} = -\log \dfrac{\exp(\text{sim}(z_i, z_i^+)/\tau)}{\sum_j \exp(\text{sim}(z_i, z_j)/\tau)}\).
  • Image SimCLR: positives = augmented views of the same image.

The materials-specific question for §D

  • What is a “positive pair” of crystals?
  • What augmentations leave the material invariant while changing the input tensor?
  • Wrong answer: change an element. Right answer: rotate, duplicate the cell, perturb thermally.
  • Hold this; we develop it in §D.

07. Foundation embeddings, in one slide

Restated from MFML W9

  • A single large model, pretrained once on a broad unlabelled corpus.
  • Reused frozen — or fine-tuned — across many downstream tasks.
  • Vision: ImageNet-pretrained ResNets, ViTs. NLP: BERT, T5, GPT-class.
  • The pretraining cost is amortised across all downstream users.

The materials-specific question for §E

  • What is the materials equivalent of an ImageNet-pretrained ResNet?
  • The 2024–2026 candidates: M3GNet, MACE-MP, OMat24-class universal MLIPs, MatBERT, ChemFormer.
  • None of them subsumes the others. §E walks through which is good for what.

§B · Why Materials Representations Need Their Own Treatment

08. A crystal is not an image

An image

  • Fixed-size pixel grid, \(H \times W \times C\).
  • Flat 2D Euclidean grid topology.
  • Pixels are untyped intensity values (R, G, B channels).
  • CNN inductive bias (translation equivariance, local kernels) is correct.

A crystal

  • Variable atom count, periodic 3D unit cell.
  • Atoms are typed (element, oxidation state).
  • Topology is graph-like (bonds, neighbours), not grid-like.
  • A CNN on voxelised electron density loses chemistry; an image-style ViT loses periodicity.

Consequence: copy-pasting a vision encoder onto crystals discards the inductive biases that make crystals tractable. Every crystal-specific architecture (CGCNN, SchNet, M3GNet, MACE) is built around the right priors instead.

09. Chemistry priors

What a learned crystal embedding should know

  • Element identity (not just \(Z\) as a real number).
  • Valence-electron count.
  • Electronegativity, oxidation-state plausibility.
  • Ionic vs covalent vs metallic character.

Where this knowledge enters

  • Initial atom features (CGCNN-style one-hot or learned element embeddings).
  • Discovered from data during pretraining on millions of structures.
  • Hybrid — prior + learned residual; this is the dominant pattern in 2024–2026.

10. Structure priors

What a learned crystal embedding should know

  • Bond lengths and bond angles (continuous, smooth).
  • Coordination polyhedra (tetrahedral, octahedral, etc.).
  • Space-group symmetry.
  • Dimensionality (0D / 1D / 2D / 3D motifs).

Connection to MG U6

  • SOAP and ACSF encoded these by hand with a fixed basis.
  • Learned crystal encoders (CGCNN, SchNet, M3GNet, MACE) compute message passes that induce equivalent local geometric features.
  • Same physics, two routes; the learned route scales (Neuer et al. 2024).

11. Periodic boundary conditions

The PBC requirement

  • The same material in a \(1\times1\times1\) cell and a \(2\times2\times2\) supercell must produce the same embedding (up to a known scaling).
  • Naive GNNs fail this: more atoms → more messages → different aggregated embedding.
  • Architectural fixes: per-atom embedding then size-invariant pooling (mean, attention).
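A minimal sketch of the size-invariance argument, using dummy per-atom embeddings (the 64-dim vectors and atom counts below are illustrative, not from any specific encoder):

import torch

z_atoms = torch.randn(4, 64)        # per-atom embeddings, 1x1x1 cell with 4 atoms
z_super = z_atoms.repeat(8, 1)      # 2x2x2 supercell: every local environment repeated 8x

# mean pooling is size-invariant: the cell and the supercell embed identically
assert torch.allclose(z_atoms.mean(dim=0), z_super.mean(dim=0))
# sum pooling would grow by a factor of 8 and break supercell invariance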

The free positive pair

  • Supercell duplication is therefore a free augmentation for contrastive learning.
  • Same material, different input tensor → guaranteed positive pair.
  • Architectural invariance + augmentation invariance reinforce each other.
  • Foreshadow §D2.

12. Equivariance baked into the latent space

Symmetries to respect

  • Translation: shift all atoms by the same vector → same embedding.
  • Rotation: rotate the cell → same embedding (or rotated, for tensor outputs).
  • Atom permutation: reorder the atom list → same embedding.
  • Inversion: invert through origin → same embedding (where physics demands).

Where it comes from in 2026

  • Architectural: equivariant message passing — NequIP, MACE, e3nn (math in MFML W9 and specialist courses, not here).
  • Data-augmentation: rotate / translate during training.
  • The architectural route generalises better with less data; the augmentation route is cheaper to implement.

Default in 2026: equivariant architecture for the encoder backbone; augmentations on top for contrastive pretraining. Both routes, in the same model.

§C · Self-Supervised Pretraining on Materials Databases

13. The unlabelled-data substrate

Database snapshot, SS26

| Database | Structures (approx.) | Labels |
|---|---|---|
| Materials Project (Jain et al. 2013) | 1.5 M | DFT energies; some band gaps, elastic constants |
| OQMD (Saal et al. 2013) | 1.0 M | Formation energies |
| AFLOW (Curtarolo et al. 2012) | 3.5 M | DFT energies, mostly intermetallics |
| NOMAD (Draxl and Scheffler 2018) | 19 M (entries) | Heterogeneous, multi-source |

The asymmetry that matters

  • Structures — abundant.
  • Property labels of interest (band gap, conductivity, \(T_c\), etc.) — scarce.
  • This asymmetry is exactly what self-supervised pretraining exploits.
  • We pretrain on structure; we fine-tune / probe on labels.

14. The pretraining recipe in one slide

Three knobs

  1. Corruption \(T\) — what we hide / perturb in the input.
  2. Architecture \(\mathcal{E}\) — the encoder.
  3. Substrate \(\mathcal{D}\) — the unlabelled database.

Pseudocode

for batch in loader(D):
    x = batch
    x_corrupt = T(x)              # mask, perturb, augment
    z = encoder(x_corrupt)        # the encoder we want to learn
    loss = pretext_loss(z, x)     # reconstruct, contrast, predict
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Choosing \(T\) defines the pretext task. §C3–C6 walk through the four standard choices for crystal data.

15. Pretext task 1 — Atom Masking

The task

  • Hide a random subset (~15%) of atom identities.
  • Encoder sees: positions, neighbourhoods, masked tokens.
  • Predict the masked elements from context.
  • Direct analogue: BERT’s masked language modelling.

What it teaches the encoder

  • Local chemical context: which elements coexist, which substitutions are plausible.
  • Implicit valence and electronegativity rules.
  • Crystal Twins, Crystal-BERT-style models (2022–2024).
  • Strong on chemistry-OOD generalisation; weaker on geometry.
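A minimal sketch of the corruption step, assuming atom types are stored as integer element indices (mask_atoms is a hypothetical helper, not from any library):

import torch

def mask_atoms(atom_types: torch.Tensor, mask_token: int, p: float = 0.15):
    """Hide a random ~15% of atom identities; the encoder must predict them back."""
    mask = torch.rand(atom_types.shape) < p        # which atoms to hide
    corrupted = atom_types.clone()
    corrupted[mask] = mask_token                   # reserved "unknown element" index
    return corrupted, mask

# pretext loss: cross-entropy between predicted and true elements at the masked positions only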

16. Pretext task 2 — Edge / Bond Masking

The task

  • Mask a random subset of edges in the crystal graph.
  • Predict bond presence and bond length from the rest of the structure.
  • Forces the encoder to internalise local geometry.

When this is the right pretext

  • Element vocabulary is small (atom masking is too easy).
  • Geometry is the actual signal (e.g., polymorph discrimination).
  • Common in materials sub-fields with restricted chemistry: alloys, allotropes, intercalation compounds.

17. Pretext task 3 — Denoising

The task

  • Perturb atomic positions by small Gaussian noise.
  • Predict the displacement field \(\Delta r\) from the noisy structure.
  • Encoder learns the gradient of the energy landscape — implicitly.

The pleasant surprise

  • M3GNet, MACE-MP, NequIP-Pretrained are already trained this way — for energies and forces.
  • Their internal embeddings are usable as crystal representations for free.
  • One pretraining run, two products: a force field and a foundation embedding.

18. Pretext task 4 — Contrastive Pairs

The task

  • Build positive pairs: two views of the same material.
  • Negative pairs: views of different materials.
  • InfoNCE / triplet loss pulls positives together, pushes negatives apart (Goodfellow et al. 2016).

Where this gets interesting

  • “Two views” is a materials physics decision, not an ML decision.
  • Rotation, supercell duplication, thermal perturbation — all valid views.
  • Element substitution — not a valid view.
  • §D develops this in detail.

19. CGCNN as a Featurizer

The 2018 baseline

  • Crystal Graph Convolutional Neural Network (Xie and Grossman 2018).
  • Trained for formation-energy prediction on Materials Project.
  • 92-dim per-element initial features + graph convolutions + pooling → per-material vector.

Why it matters in 2026

  • The first widely adopted learned crystal representation.
  • Even a property-supervised CGCNN gives a usable frozen embedding.
  • Set this as the minimum bar — any 2024–2026 foundation embedding should beat a frozen CGCNN on equal footing (Sandfeld et al. 2024).

20. M3GNet and MACE Pretrained on Millions of Structures

M3GNet (Chen and Ong 2022)

  • Graph-based, three-body terms.
  • Trained on the MPF.2021 dataset (\(\sim 10^{6}\) relaxation snapshots).
  • Predicts energies, forces, stresses across the periodic table.
  • Frozen node embeddings → crystal representation.

MACE-MP-0 (Batatia et al. 2024)

  • Higher-order equivariant message passing.
  • Trained on Materials Project relaxations.
  • Stronger out-of-distribution behaviour than M3GNet.
  • Now widely used as a “universal MLIP” — and its embeddings are usable too.

21. Frozen-Embedding Downstream Property Prediction

The standard 2026 recipe

encoder = load_pretrained("MACE-MP-0")
encoder.eval()                     # inference mode; gradients blocked via detach below
for batch in train_loader:
    z = encoder(batch.x).detach()  # frozen features
    y_hat = head(z)                # small MLP / linear / GP
    loss = mse(y_hat, batch.y)
    optimizer.zero_grad()
    loss.backward()                # updates the head only
    optimizer.step()

What to compare against

  • Magpie + matminer + tree model (MG U6 baseline).
  • SOAP + GP (MG U6 baseline).
  • Same architecture trained from scratch on the small set.
  • The third comparison catches “the architecture is doing the work, the pretraining did nothing”.

22. What Pretraining Actually Buys You

Pretraining helps most when

  • Downstream labels are scarce (< 1k examples).
  • Downstream chemistry overlaps the pretraining distribution.
  • Target is correlated with the pretraining objective (energetic, geometric).

Pretraining helps least when

  • Downstream chemistry is outside the pretraining distribution.
  • Target is decoupled from pretraining (e.g., synthesis yield from a structural encoder).
  • Downstream dataset is large enough to train from scratch.

Empirical pattern (2023–2026 literature): the small-data, in-distribution regime is where foundation embeddings dominate. Outside that regime, Magpie or SOAP often catch up — and they cost orders of magnitude less.

§D · Contrastive Learning of Crystal Embeddings

23. The Materials Version of SimCLR

The pattern, restated

  • Two augmentations \(T_1, T_2\) of the same crystal \(x\).
  • Pass through encoder + projection head: \(z_1, z_2\).
  • InfoNCE: \(\mathcal{L} = -\log \dfrac{\exp(\text{sim}(z_1, z_2)/\tau)}{\sum_{j} \exp(\text{sim}(z_1, z_j)/\tau)}\).
  • \(\tau\): temperature; the sum runs over the positive and all in-batch negatives.

What is materials-specific

  • Architecture: equivariant graph net (not a CNN).
  • Augmentations: physical operations that leave the material invariant.
  • Negatives: other crystals in the batch — possibly hard-mined by composition / prototype similarity.
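A minimal in-batch InfoNCE sketch (one-directional, batch negatives only); z1 and z2 are the projected embeddings of two views of the same N crystals:

import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Row i of z1 and row i of z2 form the positive pair; all other rows are negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau                  # (N, N) cosine similarities over temperature
    labels = torch.arange(z1.size(0))         # the positive for anchor i sits in column i
    return F.cross_entropy(logits, labels)    # log-softmax over (positive, negatives)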

24. Positive-Pair Construction for Crystals

Four standard moves

  1. Random rotation of the unit cell.
  2. Supercell duplication (\(1\times1\times1 \leftrightarrow 2\times2\times2\)).
  3. Atomic-position perturbation within thermal-fluctuation envelope.
  4. Origin shift (translation of the cell origin).

Why each is valid

  • Rotation: physics is rotation-invariant.
  • Supercell: physics is independent of the choice of cell.
  • Perturbation: at finite \(T\), the real material is jittered — small noise is on-distribution.
  • Origin shift: physics is translation-invariant.
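A minimal sketch of the four moves, assuming pymatgen's Structure API (the CsCl toy cell and the perturbation scale are illustrative):

import numpy as np
from pymatgen.core import Lattice, Structure
from pymatgen.core.operations import SymmOp

cell = Structure(Lattice.cubic(4.11), ["Cs", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])

view = cell.copy()
rot = SymmOp.from_rotation_and_translation(
    np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]]), [0, 0, 0])
view.apply_operation(rot)                                       # 1. rotate the cell
view.make_supercell([2, 2, 2])                                  # 2. supercell duplication
view.perturb(0.05)                                              # 3. thermal-scale jitter (Å)
view.translate_sites(range(len(view)), [0.1, 0.1, 0.1],
                     frac_coords=True)                          # 4. origin shift

# (cell, view) is a valid positive pair; substituting Cl for Br would not be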

25. What Does NOT Count as a Positive Pair

Three ways to get this wrong

  • Element substitution. NaCl and KCl are different materials.
  • Space-group change. Rutile and anatase TiO\(_2\) are different polymorphs.
  • Atom removal / addition. A vacancy is a different material with different properties.

Why this matters

  • A “bad” augmentation that crosses the line trains the encoder to erase the discriminating signal.
  • The bug shows up as a great pretraining loss and bad downstream performance.
  • This is a silent labelling bug; it has no symptom in the pretraining metrics.

26. Negative Pairs and the In-Batch Trick

In-batch negatives

  • Every other crystal in the same training batch is a negative.
  • \(N-1\) negatives per anchor at batch size \(N\).
  • Cheap; no extra forward passes.
  • Standard since SimCLR (2020).

Batch size matters

  • Larger batches → more diverse negatives → better embeddings.
  • For crystal SimCLR variants, \(N \in [256, 4096]\) is typical.
  • Memory-bound for large encoders; gradient accumulation or memory banks help.

27. Hard-Negative Mining

The signal asymmetry

  • An anchor crystal vs a totally unrelated crystal: easy negative, low gradient.
  • An anchor crystal vs a similar but different crystal (same composition, different polymorph): hard negative, high gradient.
  • Most of the learning signal lives in the hard negatives.

Mining strategies

  • Nearest-neighbour mining: pick negatives closest to the anchor in the current embedding.
  • Prototype-based mining: deliberately include same-prototype, different-composition pairs.
  • Polymorph-aware mining: ensure each composition has multiple polymorphs in the batch.
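A minimal sketch of nearest-neighbour mining in the current embedding (hardest_negatives is a hypothetical helper):

import torch
import torch.nn.functional as F

def hardest_negatives(z_anchor: torch.Tensor, z_others: torch.Tensor, k: int = 8):
    """Return the indices of the k in-batch negatives closest to the anchor."""
    sims = F.cosine_similarity(z_anchor.unsqueeze(0), z_others, dim=1)   # (N,)
    return sims.topk(k).indices              # these carry most of the learning signal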

28. InfoNCE vs Triplet Loss

InfoNCE

\[\mathcal{L} = -\log \dfrac{\exp(\text{sim}(z_a, z_p)/\tau)}{\sum_{j} \exp(\text{sim}(z_a, z_j)/\tau)}\]

  • Log-softmax over (positive, negatives).
  • Scales with batch size.
  • The 2024–2026 default.

Triplet

\[\mathcal{L} = \max\bigl(0, \, d(z_a, z_p) - d(z_a, z_n) + m\bigr)\]

  • Margin \(m\).
  • Single positive, single negative per anchor.
  • Heavier reliance on hard-negative mining.
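The triplet variant is available as a PyTorch built-in; a minimal sketch with random stand-in embeddings:

import torch
import torch.nn.functional as F

z_a, z_p, z_n = (torch.randn(32, 128) for _ in range(3))    # anchor, positive, mined negative
triplet = F.triplet_margin_loss(z_a, z_p, z_n, margin=1.0)  # max(0, d(a,p) - d(a,n) + m), batch-averaged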

29. Published Crystal-Contrastive Results

Notable systems (2022–2025)

  • Crystal Twins-style frameworks (2022 onward): contrastive pairs built from the §D augmentations on CGCNN-class graph encoders (cf. §C3).

Headline results

  • Match or beat property-supervised CGCNN baselines on small downstream tasks.
  • Strongest gains on small-label, in-distribution tasks — consistent with §C22.
  • Modest or no gains on large-label, well-covered targets.

30. Embedding Similarity as Retrieval

The retrieval task

  • Given a query crystal \(q\), return its \(k\) nearest neighbours in the database.
  • \(k\)-NN in the learned embedding space, not in the engineered-descriptor space.
  • Returns: same prototype? same composition family? similar properties?

Why retrieval beats t-SNE for diagnostics

  • Operates in full embedding dimension — no 2D projection artefacts.
  • Operates per-query — local quality, not global.
  • Generalises directly to the U13 active-discovery loop: “find me a candidate similar to this one”.

§E · Foundation Models for Materials

31. What “Foundation Model” Means in Materials

Borrowed from NLP/Vision

  • A single large model.
  • Pretrained once on a broad unlabelled corpus.
  • Reused frozen — or fine-tuned — across many downstream tasks.
  • Pretraining cost amortised across all downstream users.

The materials version, in 2026

  • Multiple credible candidates across structure, chemistry, text.
  • None dominates; the choice depends on modality and downstream task.
  • We are roughly where vision was in 2014–2015.
  • The benchmarks are emerging, the consensus is not.

32. Modalities of Materials Foundation Models

Three lanes

  1. Text — MatBERT, SciBERT, materials-text-LLMs.
  2. Chemistry strings — ChemFormer, MoLFormer (SMILES / SELFIES).
  3. Structure — CGCNN, M3GNet, MACE-MP, OMat24-class (crystal graphs / atomistic configs).

No lane subsumes the others

  • A text model knows synthesis recipes; not a structure file.
  • A SMILES model knows organic chemistry; not periodic solids.
  • A structure model knows crystals; not the literature.
  • Multimodal materials models exist but are early-stage in 2026.

33. MatBERT and the Text Lane

MatBERT in one slide

  • BERT pretrained on materials-science abstracts (Trewartha et al. 2022).
  • Use cases:
    • embed an abstract → retrieve similar papers,
    • extract synthesis conditions from text,
    • classify paper topic.

Strengths and weaknesses

  • Strength: taps implicit human knowledge in the literature.
  • Weakness: knows nothing about a structure file you didn’t text-describe.
  • Right downstream: literature mining, not property regression.

34. ChemFormer, MoLFormer, and the SMILES Lane

Models

Use cases

  • Molecular property prediction with frozen encoder + small head.
  • Reaction prediction (ChemFormer’s seq2seq strength).
  • Limitation: SMILES poorly describes periodic solids; these do not transfer to crystals.

35. M3GNet Embeddings

M3GNet (Chen and Ong 2022)

  • Materials Graph Network with three-body interactions.
  • Trained on Materials Project relaxation trajectories.
  • Predicts energies, forces, stresses across the periodic table.
  • Frozen node embeddings → crystal representation.

Strengths and weaknesses

  • Strength: full periodic-table coverage at MP-DFT level.
  • Weakness: inherits MP’s chemistry biases (oxide-heavy, organics-light).
  • Right downstream: structure–property regression on inorganic crystals.

36. MACE-MP and the Universal-MLIP Family

MACE-MP-0 (Batatia et al. 2024)

  • Higher-order equivariant message passing.
  • Pretrained on Materials Project relaxations.
  • Stronger OOD behaviour than M3GNet at comparable parameter count.

The universal-MLIP era

  • The closest analogue materials has to ImageNet-pretrained ResNets.
  • Used as: (i) MD potential, (ii) frozen featurizer, (iii) starting point for fine-tuning.
  • Successors: MACE-MP-1, MACE-OFF, NequIP-Pretrained.

37. The 2024 OMat24 / Meta Release

OMat24 (Barroso-Luque et al. 2024)

  • Meta AI Research, October 2024.
  • ~118 million inorganic structures + DFT labels.
  • Equiformer-V2-class models trained on it.
  • Released as open weights and dataset.

Why this matters

  • ~100x larger pretraining substrate than MP-only.
  • Pretrained encoder available for downstream use.
  • The 2024–2026 frontier in scale; will be displaced by something larger by 2027–2028.

38. GNoME and Large-Scale Discovery Models

GNoME (merchant2023gnome?)

  • DeepMind’s Graph Networks for Materials Exploration.
  • Large GNN pretrained on Materials Project + active learning.
  • Claimed ~2.2 M new crystal candidates, ~380 k of them predicted stable.

The cautious read

  • Whatever fraction of the 2.2 M is experimentally real, the pretrained encoder is usable.
  • The discovery claims are contested in 2024–2025 follow-up.
  • The methodology — active learning + pretrained encoder — is the durable contribution.

39. Few-Shot Regression with a Frozen Foundation Embedding

The workflow

encoder = load("MACE-MP-0")
encoder.eval()
z_train = encoder(X_train).detach()   # 50–500 examples
z_test  = encoder(X_test).detach()
head = LinearRegression().fit(z_train, y_train)
y_hat = head.predict(z_test)

Why this works at small N

  • The embedding has already internalised chemistry and geometry.
  • The head only needs to learn the property-specific direction in \(z\)-space.
  • Few parameters in the head → low overfitting risk.
  • Robustly beats from-scratch training in the small-N regime.

40. Choosing a Foundation Model for a Downstream Task

| Downstream task | First choice | Backup |
|---|---|---|
| Crystal structure → property (in-distribution) | M3GNet / MACE-MP frozen + linear probe | SOAP + GP |
| Crystal structure → property (OOD chemistry) | MACE-MP fine-tuned | SOAP + GP |
| Molecular property (organic) | MoLFormer / ChemFormer | Morgan fingerprint + RF |
| Literature mining / abstract classification | MatBERT | SciBERT |
| Discovery candidate ranking | OMat24-class encoder + active learning | GNoME-style pipeline |

No model dominates all rows. The decision is task-driven, not hype-driven (Sandfeld et al. 2024).

§F · Diagnosing Learned Representations

41. The Fundamental Diagnostic Question

The question that matters

Does this embedding contain information about the property I care about?

  • Answered by probes, not by t-SNE.
  • A probe is a small predictor on top of the frozen embedding.
  • If the probe predicts the property, the information is there.

The question that does NOT matter (alone)

Does this t-SNE plot look pretty?

  • Pretty t-SNE is necessary (a hairball is bad news) but not sufficient.
  • See §F5 for the canonical failure mode.

42. Linear Probe Protocol

Protocol

  1. Freeze encoder \(\mathcal{E}\).
  2. Compute \(z_i = \mathcal{E}(x_i)\) for the labelled set.
  3. Train a single linear layer \(W\): \(\hat{y}_i = W^\top z_i + b\).
  4. Report \(R^2\), MAE on a held-out chemistry.

The four comparisons that matter

| Probe input | What it tests |
|---|---|
| Pretrained \(\mathcal{E}\) | The embedding itself |
| Random-init \(\mathcal{E}\) | Did pretraining help? |
| Magpie / matminer | Engineered compositional baseline |
| SOAP | Engineered structural baseline |

Without the random-init comparison, you cannot tell what the pretraining contributed. This is the most-omitted comparison in published work (Sandfeld et al. 2024).
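A minimal probe sketch; embed_pretrained, embed_random_init, featurize_magpie and featurize_soap stand for hypothetical featurizer functions, the train/test split is assumed to be held out by chemistry, and a Ridge regressor plays the role of the single linear layer:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score

def linear_probe(featurize, structs_train, y_train, structs_test, y_test):
    """Fit one linear layer on frozen features; report held-out R² and MAE."""
    Z_tr = np.stack([featurize(s) for s in structs_train])
    Z_te = np.stack([featurize(s) for s in structs_test])
    probe = Ridge(alpha=1.0).fit(Z_tr, y_train)
    y_hat = probe.predict(Z_te)
    return r2_score(y_test, y_hat), mean_absolute_error(y_test, y_hat)

# run the identical probe on all four feature sets; only the full set of
# comparisons makes the pretraining contribution identifiable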

43. Nearest-Neighbour Retrieval Check

Protocol

  1. Pick 20 query crystals across the periodic table.
  2. For each, retrieve the 10 nearest neighbours in embedding space.
  3. Inspect: same prototype? same chemistry family? property values clustered?
  4. Score qualitatively or with a retrieval metric (precision@k).

Why retrieval beats t-SNE

  • Operates in full embedding dimension.
  • Operates per-query — local quality.
  • Generalises directly to the U13 discovery loop.
  • Manually inspectable — a human scientist can look at 20×10 = 200 crystals.
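A minimal retrieval check, assuming the frozen embeddings and matching database identifiers have been saved to the hypothetical files below:

import numpy as np
from sklearn.neighbors import NearestNeighbors

Z = np.load("embeddings.npy")                     # (N, d) frozen crystal embeddings
ids = np.load("ids.npy", allow_pickle=True)       # matching database identifiers

nn = NearestNeighbors(n_neighbors=11, metric="cosine").fit(Z)
queries = np.random.default_rng(0).choice(len(Z), size=20, replace=False)
_, neigh = nn.kneighbors(Z[queries])              # column 0 is the query itself

for q, row in zip(queries, neigh):
    print(ids[q], "->", [ids[j] for j in row[1:]])  # same prototype? same chemistry family?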

44. Cluster-Structure Check

Protocol

  • Run K-means or HDBSCAN on the embedding.
  • Choose \(k\) via silhouette / BIC (MFML W5 / ML-PC W5).
  • Inspect cluster centroids: composition family? prototype?
  • Cross-tabulate clusters against known labels (where they exist).

What “good” looks like

  • Clusters correspond to physically interpretable groupings.
  • Clusters do not have to be perfect.
  • Clusters that align with space group or prototype without being told are a strong signal.
  • This is the diagnostic step for latent-space interpretation (covered later in this unit; see also the supplementary U11 deck) and for the discovery loop in MG U12 (generative models & inverse design).
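A minimal sketch of the cluster check, assuming the embeddings and a known space-group label (used only for inspection, never for fitting) are available in the hypothetical files below:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Z = np.load("embeddings.npy")                       # (N, d) frozen crystal embeddings
space_group = np.load("space_groups.npy")           # known labels, for inspection only

# choose k by silhouette, then cross-tabulate clusters against the known label
scores = {k: silhouette_score(Z, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z))
          for k in range(2, 12)}
k_best = max(scores, key=scores.get)
clusters = KMeans(n_clusters=k_best, n_init=10, random_state=0).fit_predict(Z)
print(pd.crosstab(clusters, space_group))           # do clusters track physics they were never told?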

45. The “Pretty t-SNE, Dead Downstream” Failure Mode

The anti-pattern

  • t-SNE plot: beautifully separated clusters.
  • Linear probe: \(R^2 \approx 0\).
  • Nearest-neighbour retrieval: random-looking.
  • The embedding is broken; the t-SNE was lying.

Diagnosis

  • t-SNE often picks up low-dimensional artefacts: cell size, atom count, calculator-version metadata.
  • The artefact is real; it is also physically irrelevant to the property.
  • t-SNE shows whatever the largest variance direction is — which need not be the chemistry.

46. The “Good Downstream, Bad t-SNE” Success Mode

The symmetric pattern

  • t-SNE plot: hairball, no obvious clusters.
  • Linear probe: high \(R^2\).
  • Nearest-neighbour retrieval: physically reasonable.
  • The embedding is fine; the t-SNE was misleading.

Diagnosis

  • The property is a smooth function across embedding space without sharp cluster boundaries.
  • t-SNE punishes smooth structure (it likes sharp clusters).
  • This is what we want for regression: a continuous embedding manifold.

§G · Wrap-Up and Bridges

47. When to Use a Learned Representation vs Magpie / SOAP

Foundation embedding wins when

  • \(N_\text{label} < 1000\), in-distribution chemistry.
  • Pretraining objective aligned with target.
  • You can verify with probes + retrieval.

Engineered baseline wins when

  • \(N_\text{label} > 10\,000\), fast iteration desired.
  • Chemistry is wholly novel.
  • Calibrated uncertainty is required (SOAP + GP).
  • Cost matters (Magpie + tree is seconds; foundation embedding is GPU minutes).

2026 honesty: “always use the foundation model” is wrong. The right answer is task-driven (Sandfeld et al. 2024; Neuer et al. 2024).

48. Interpreting the Latent Space (integrated content)

Beyond the encoder: what do the axes mean?

  • The encoder we built in §A–§E produces an embedding.
  • We now ask: what does each axis mean physically?
  • Tools: latent traversal, attribute regression, disentanglement metrics.

The interpretation question, in one example

If the embedding’s first principal axis correlates with mean atomic mass, what does the second axis correlate with?

  • Compute correlations between latent dimensions and known descriptors (atomic mass, electronegativity, formation energy, …).
  • See which axes have which alignments.
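A minimal sketch of the attribute-regression step: correlate every latent axis with every known descriptor (the file names and descriptor layout are assumptions):

import numpy as np

Z = np.load("embeddings.npy")      # (N, d) latent vectors
D = np.load("descriptors.npy")     # (N, m) descriptors: mean atomic mass, electronegativity, ...

d = Z.shape[1]
corr = np.corrcoef(Z.T, D.T)[:d, d:]               # (d, m) Pearson correlations
best = np.abs(corr).argmax(axis=1)                 # strongest-aligned descriptor per axis
for j in range(d):
    print(f"latent axis {j}: descriptor {best[j]}, r = {corr[j, best[j]]:+.2f}")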

Note

After the 2026-05-13 realignment, this is the natural home for latent-space interpretation: the standalone U11 latent-spaces deck is now optional supplementary reading. The diagnostic discipline of §F applies here — interpretation needs a known-good embedding to begin.

49. Bridge to Unit 12 — Generative Models for Discovery

U10 → U12

  • U10’s diagnostic cluster check (§F4) was a sanity test of the embedding.
  • U12 (Generative Models & Inverse Design) builds on top of this embedding to generate new candidate crystals.
  • MatterGen, DiffCSP, CrystaLLM, FlowMM all operate on a learned latent representation of crystals.

The discovery loop, in one diagram

  1. Embed (this unit).
  2. Generate candidates in latent space (U12).
  3. Decode candidates back to crystals.
  4. Filter by predicted property + symmetry + stability.
  5. Synthesise / DFT-validate the most promising.

50. Exam Checklist for Unit 10

You should be able to

  • Name two ways a crystal embedding differs from an image embedding.
  • Describe two pretext tasks for SSL on crystal data.
  • State what counts (and what does not count) as a positive pair of crystals.

You should also be able to

  • Name two foundation-model families for materials and what each is good for.
  • Sketch the linear-probe protocol and explain why random-init is a required comparison.
  • State one regime where SOAP + GP beats every foundation embedding.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.
Sandfeld, Stefan et al. 2024. Materials Data Science. Springer.
