FAU Erlangen-Nürnberg
Recap — Units 1–7 (MG) and the parallel tracks
Today — MG Unit 8
By the end of 90 minutes, you can:
Seven sections, ~90 min
The single sentence to leave with
In materials ML, the test set’s relationship to the training set is the scientific claim. A model is trustworthy only when its split design matches the claim its predictions are meant to support.
5-minute break after §C (around minute 47).
The empirical-risk picture
\[\hat{\boldsymbol{\theta}} \;=\; \arg\min_{\boldsymbol{\theta}} \;\underbrace{\frac{1}{N}\sum_{i=1}^{N} \ell(f_{\boldsymbol{\theta}}(\mathbf{x}_i), y_i)}_{\text{empirical risk}} \;+\; \underbrace{\lambda\,\Omega(\boldsymbol{\theta})}_{\text{regularizer}}\]
The materials twist
Standard decomposition (Bishop 2006; Murphy 2012)
\[\mathbb{E}[\,(\hat{f}(x) - y)^2\,] \;=\; \underbrace{\text{Bias}^2}_{\text{model too simple}} + \underbrace{\text{Var}}_{\text{too sensitive to data}} + \underbrace{\sigma^2}_{\text{irreducible}}\]
The fourth term materials ML adds
\[+\; \underbrace{\Delta_{\text{shift}}}_{\text{distribution shift error}}\]
Ridge \(\Omega(\mathbf{w}) = \|\mathbf{w}\|_2^2\) — Gaussian prior on weights.
Lasso \(\Omega(\mathbf{w}) = \|\mathbf{w}\|_1\) — Laplace prior, drives weights exactly to zero.
Elastic net — both, weighted (Bishop 2006).
These are MFML W7 content. We use them, we do not re-derive them.
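As a minimal sketch (using NumPy and toy data, not course code), the ridge penalty shows up directly in the closed-form solution: the \(\lambda I\) term shrinks weights toward zero as \(\lambda\) grows.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy data: y depends only on the first feature; the second is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=50)

w_small = ridge_fit(X, y, lam=0.01)   # near the ordinary least-squares solution
w_large = ridge_fit(X, y, lam=100.0)  # heavily shrunk weights
print(w_small, w_large)
```

Comparing `w_small` and `w_large` shows the bias–variance trade the regularizer buys: larger \(\lambda\) pulls all weights toward zero, trading fit for stability.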
Why ridge is the materials default
\(k\)-fold CV recap (Bishop 2006)
The mechanics are MFML W7 content.
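The index bookkeeping behind \(k\)-fold CV can be sketched in a few lines of plain Python (scikit-learn's `KFold` additionally supports shuffling, but the partition logic is the same):

```python
def kfold_indices(n_samples, k):
    """Partition range(n_samples) into k folds and yield
    (train_idx, val_idx) pairs — the mechanics behind k-fold CV."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        val = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, val

splits = list(kfold_indices(10, 5))
# Every sample appears exactly once across the validation folds.
all_val = sorted(idx for _, val in splits for idx in val)
print(all_val)
```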
The materials issue
The discipline (MFML W7)
If the test set influences any choice, the test estimate is no longer honest (Bishop 2006; Murphy 2012).
Materials adaptation
Reporting “test MAE” without specifying which axis the test set was held out along is uninterpretable. Always declare the split design.
Six axes of “new” for a materials regression task
The diagnostic question
Which axis or axes does your test set probe?
The setup
The hidden structure
The reveal
The 60-meV number was a measure of interpolation. The 250+ meV number is a measure of transfer. The model is the same; the question changed (cf. the Matbench / Matbench-Discovery community benchmarks (Dunn et al. 2020; Riebesell et al. 2025)).
The setup
The aliasing
Polymorph aliasing puts a noise floor on composition-only models that more data cannot remove. The fix is structural information, not statistical effort.
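The noise floor is easy to make concrete: a composition-only model can emit at most one value per composition, so the per-composition median is the best constant it can achieve under MAE. A sketch with hypothetical energies (the numbers below are illustrative, not real DFT values):

```python
from statistics import median

# Hypothetical (composition, formation_energy) records where the same
# composition appears as several polymorphs with different targets.
records = [
    ("TiO2", -9.74),   # e.g. rutile
    ("TiO2", -9.70),   # e.g. anatase
    ("TiO2", -9.66),   # e.g. brookite
    ("SiO2", -10.85),  # e.g. quartz
    ("SiO2", -10.80),  # e.g. cristobalite
]

by_comp = {}
for comp, e in records:
    by_comp.setdefault(comp, []).append(e)

# Best achievable composition-only MAE: per-composition median prediction.
floor = sum(abs(e - median(by_comp[comp])) for comp, e in records) / len(records)
print(f"composition-only MAE floor: {floor:.4f} eV/atom")
```

No amount of extra data or model capacity moves this floor; only structural input does.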
Materials Project, OQMD, AFLOW, NOMAD
The consequence
Two operationally different questions

The reported-vs-relevant gap
The same target, different numbers
Cross-database evaluation reveals it
DFT formation energies are not experimental enthalpies
The cascade
Stability bias
The deployment problem
The materials we want to discover are often the materials we have not yet calculated. Training distributions reflect what the field has done, not what discovery requires.
A small set of prototypes dominates
The reporting consequence
Most public datasets assume idealised periodic crystals
Where this bites
Variance of MAE estimates
The chemistry-coverage problem on top of that
Random splits are honest when…
…and lie when
Mechanics
Implementation
- sklearn.model_selection.GroupKFold with groups = anion family or composition cluster.
- MaterialsProject.composition.alphabetical_formula reduces to the per-composition group; aggregate to family with a chemistry rule.
- groups = [classify_family(formula) for formula in train_formulae]; pass groups to GroupKFold.

The headline number this produces is the right one for cross-family transfer claims.
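A minimal sketch of the family-grouped split. `classify_family` here is a toy stand-in (a substring check), not the lecture's chemistry rule; the guarantee to notice is that `GroupKFold` never lets a family appear on both sides of a split:

```python
from sklearn.model_selection import GroupKFold

def classify_family(formula):
    """Toy anion-family rule (assumption, for illustration only)."""
    for anion, family in (("O", "oxide"), ("S", "sulfide"), ("N", "nitride")):
        if anion in formula:
            return family
    return "other"

formulae = ["TiO2", "ZnO", "MgO", "ZnS", "MoS2", "GaN", "AlN", "BN"]
groups = [classify_family(f) for f in formulae]

gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(formulae, groups=groups):
    train_families = {groups[i] for i in train_idx}
    test_families = {groups[i] for i in test_idx}
    # No family ever appears on both sides of the split.
    assert train_families.isdisjoint(test_families)
```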
Mechanics
Tools:
- pymatgen.symmetry.analyzer.SpacegroupAnalyzer for space group and prototype matching.
- AFLOW prototype labels for ICSD-derived structures.
- matminer’s structure-similarity matchers for fuzzier grouping.
What this tests
Mechanics
Why LOCO is informative
The setup
Why time-based is the most realistic discovery benchmark
The skew problem
Stratified splitting
- sklearn.model_selection.StratifiedKFold for the regression case requires manual binning, then group-split.
- StratifiedGroupKFold combines stratification with group integrity.

The choice of split is a scientific choice
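The bin-then-stratify mechanics for a skewed regression target can be sketched as follows (synthetic data; in practice the bin edges are a design choice):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)
# Skewed target: most values near zero, a small stable tail.
y = np.concatenate([rng.normal(0.0, 0.1, 90), rng.normal(-2.0, 0.3, 10)])

# Manual binning turns the continuous target into strata.
bins = np.quantile(y, [0.25, 0.5, 0.75])
strata = np.digitize(y, bins)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(np.zeros_like(y), strata):
    # Each fold keeps roughly the same mix of strata as the full data.
    counts = np.bincount(strata[test_idx], minlength=4)
    assert counts.min() >= 1
```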
The protocol
The split is part of the hypothesis, not the postprocessing.
The strongest materials-regression papers report
The gap is the signal
Properties that make it canonical
Order-of-magnitude reference points
Tier 0: Constant predictor
Tier 1: Composition-only linear
Tier 2: Composition-only nonlinear
Tier 3: Structure-aware kernel
Tier 4: Pretrained GNN
Tier 5: Your method
Must clearly beat tier 4 under the split that matches your claim. Anything less has not earned its complexity.
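Tier 0 costs three lines and anchors every other number. A sketch with hypothetical targets (the median of the *training* targets minimises MAE among constants; fitting the constant on the test set would itself be leakage):

```python
from statistics import mean, median

# Hypothetical formation energies in eV/atom (illustrative values).
y_train = [-1.0, -0.9, -2.1, -0.4, -1.5]
y_test = [-1.2, -0.8, -2.5, -0.3]

const = median(y_train)  # train-set median: the MAE-optimal constant
mae = mean(abs(y - const) for y in y_test)
print(f"Tier-0 MAE: {mae:.3f} eV/atom")
```

If a learned model's MAE is not well below this number, the model has learned nothing about the chemistry.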
Random split benchmark
Chemistry-held-out benchmark
The honest report
“MAE 30 meV/atom (random split, IID-comparable to prior literature) and 150 meV/atom (chemistry-held-out, reflecting the model’s transfer-to-new-anion-family performance).”
Random split — surprising parity
Polymorph-rich subsets — the real GNN advantage
Structure-held-out splits
Documented failures
Inflation factors
The bare-minimum check
What Matbench is (Dunn et al. 2020)
matbench.materialsproject.org
Why use Matbench
Random splits — the easy trap
Chemistry-aware splits — the hard truth
The honest choice
Match the split to the abstract’s claim. If the abstract says “discovery”, the split must enforce extrapolation. If the abstract says “interpolation”, a random split is fine.
The five-criterion checklist
The harsh reality
Closing of §D
Formation-energy regression is the canonical case because every failure mode shows up here clearly. Mastering this case generalises to every other materials-regression task you will encounter.
The plot
Reading the plot
The recipe
Reading the table
The recipe
Group structures by prototype (pymatgen.StructureMatcher or AFLOW prototype labels).
What this reveals
The composition-only diagnostic
A composition-only model has identical per-prototype MAE under random splits as overall (modulo sample-size noise). A structure-aware model that genuinely uses structure should show prototype variation.
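The diagnostic above amounts to grouping absolute errors by prototype and comparing per-group MAE to the overall MAE. A sketch with hypothetical error records (values illustrative):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (prototype, abs_error) pairs from a structure-aware model.
results = [
    ("rocksalt", 0.03), ("rocksalt", 0.05),
    ("perovskite", 0.12), ("perovskite", 0.10),
    ("spinel", 0.30), ("spinel", 0.26),
]

per_proto = defaultdict(list)
for proto, err in results:
    per_proto[proto].append(err)

mae_by_proto = {p: mean(errs) for p, errs in per_proto.items()}
overall = mean(err for _, err in results)
# Large spread across prototypes: accuracy is prototype-dependent.
# Flat spread: the model is behaving composition-only-like.
print(mae_by_proto, overall)
```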
The recipe
Reading the pattern
The plot
Three regimes
The materials adaptation
Not all chemistry has equal target noise
The reporting consequence
Calibration metrics
Ranking metrics
A model can be well-ranked but poorly calibrated — useful for screening, not for property reporting. Choose the metric to match the downstream use.
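The ranked-but-miscalibrated case is easy to construct: a constant offset leaves the ordering of predictions untouched while inflating MAE. A minimal sketch (plain Python; in practice one would report a rank correlation such as Spearman's \(\rho\)):

```python
from statistics import mean

def ranks(xs):
    """Rank positions of each element (no tie handling; toy helper)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

y_true = [-2.0, -1.5, -1.0, -0.5, 0.0]
# Hypothetical model: perfect ordering but a constant +1 eV/atom offset.
y_pred = [v + 1.0 for v in y_true]

mae = mean(abs(p - t) for p, t in zip(y_pred, y_true))  # large: miscalibrated
rank_match = ranks(y_pred) == ranks(y_true)             # True: perfectly ranked
print(mae, rank_match)
```

This model is fine for screening (which candidates to compute next) and useless for reporting absolute formation energies.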
For any materials-regression paper you are reading
If they’re not there
Diagnostic plots are not an aesthetic preference. They are how the field’s headline numbers acquire scientific meaning.
The ablation
Why this matters
Variants of the ablation
Every materials-regression paper should include
Why each tier
The “scientific cost” of skipping baselines
Document everything
Which tools, with which settings (e.g. StructureMatcher with what tolerances?).
Why this matters
If your reader cannot reconstruct your test set from the methods section, your headline number is unreplicable.
Before claiming a regression result
Self-assessment for your own work
The cohort exercise targets 5/7. Your thesis work should target 7/7.
The five most-common reporting failures
The cultural fix
Trustworthy reporting is a cultural practice, not just a technical one. Build it into your habits this semester.
The seven exam-ready statements
The single sentence for the exam
In materials ML, the test set’s relationship to the training set is the scientific claim. A model is trustworthy only when its split design matches the claim its predictions are meant to support.
Unit 9 (next week)
What carries forward unchanged from today
Better architecture does not fix bad benchmarking. The MG U8 discipline carries into U9 and every subsequent unit of MG.

© Philipp Pelz - Materials Genomics