Materials Genomics
Unit 14: Constraints, Trust, and Integration Outlook

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


§A · Where We Are at the End of MG

01. Today’s Mission

The closing unit, in one line

  • Take everything from U2–U13 and make it run as a closed-loop discovery system without lying to itself.
  • Three knobs: physical constraints, distribution-shift-aware trust, experimental closure.
  • One centrepiece: the autonomous-lab loop.

What U14 is not.

  • Not a re-derivation of PINNs — that is MFML W13 (Neuer et al. 2024).
  • Not a generic explainability lecture — that is MFML W14.
  • Not the imaging-side autonomous-pipeline talk — that is ML-PC W14.
  • Today’s job: the integration story that ties MG together.

02. The MG Arc in One Slide

What we built

  • U2–U4: QM/QC postulates, electronic structure, thermo, classical atomistic simulation. The physics substrate.
  • U6–U7: local atomic environments, descriptors, and crystal graphs. The representation substrate.
  • U8–U10: regression, NN models, learned representations. The predictive substrate.

Where we ended

  • U11–U12: latent spaces and clustering. Discovery vs labelling, manifold geometry.
  • U13: Materials Project + OQMD + AFLOW; convex hull; Gaussian processes; Bayesian optimisation.
  • U14 today: the integration that makes U8–U13 a system rather than a pile of notebooks.

03. What U2–U13 Left Unfinished

Three honest gaps

  • U13 candidates can violate stoichiometry / symmetry / charge — the BO loop happily proposes Na₂Cl₃.
  • U13 confidence intervals are GP / ensemble posteriors, not finite-sample coverage guarantees.
  • U13 ends at “propose a candidate.” Nobody synthesises it.

The three gaps map to U14’s three knobs

  • Gap 1 → §B physical constraints (hard projection, soft penalty, architectural prior).
  • Gap 2 → §D conformal prediction + OOD detection.
  • Gap 3 → §E the autonomous-lab loop.

04. Learning Outcomes for Unit 14

By the end of 90 minutes, you can:

  1. Enforce physical constraints (stoichiometry, charge, symmetry, conservation) in regression heads, generative models, and acquisition functions.
  2. Recognise when to choose soft penalty vs hard projection vs architectural prior for a given constraint.
  3. Recall the PINN loss in one line and identify two materials problems where PINNs help (and two where they do not).
  4. Wrap a surrogate with conformal prediction and an OOD score to obtain finite-sample coverage and a refusal mechanism.
  5. Sketch an autonomous-lab loop architecture and name three failure modes for the synthesis side and three for the measurement side.
  6. Articulate the 2026 honest assessment: what works, what is marginal, what does not work yet.

§B · Physical Constraint Enforcement

05. Why Constraints Are Not an Afterthought

The naïve generative-model failure

  • Train a VAE on the Materials Project formula list.
  • Sample 1000 candidates.
  • Inspect: ≈30% violate stoichiometry (non-integer ratios, broken cation/anion balance) (Goodfellow et al. 2016).
  • Top-\(k\) acquisition list is dominated by garbage before the surrogate even runs.

Constraints are correctness

  • A surrogate that emits “Cu with 7-fold rotational symmetry” is not “noisy” — it is wrong.
  • Regularisation makes a valid model better; constraints make an invalid model valid.
  • Treat constraints with the same rigour as a unit-test, not as a hyperparameter.

06. Four Families of Materials Constraints

Composition-side

  • Stoichiometry / charge balance: integer (or rational) site occupancies; sum of oxidation states = 0.
  • Composition simplex: \(\sum_i x_i = 1\), \(x_i \geq 0\) for fractional alloys.

Structure-side

  • Symmetry: space-group consistency, site multiplicity, Wyckoff positions (Sandfeld et al. 2024).
  • Conservation: mass / energy / momentum (where the system is closed).
  • Thermodynamic feasibility: \(E_{\text{hull}} \leq \Delta_{\text{tol}}\) from U13.

Mnemonic: Composition is what the formula says; structure is what the lattice says; thermodynamics is whether nature lets it exist.

07. Three Enforcement Mechanisms

Architectural prior — the constraint is built into the model

  • Equivariant heads (NequIP, MACE) for symmetry.
  • Softmax decoder for simplex.
  • E(3)-equivariant message passing for rotational invariance.
  • Pro: guaranteed by construction.
  • Con: design effort; expressivity loss if applied wrong.

Hard projection / filter vs soft penalty

  • Hard projection: \(\hat{x} = \Pi_{\mathcal{F}}(x)\). Guaranteed; non-differentiable.
  • Soft penalty: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \mathcal{L}_{\text{phys}}\). Differentiable; no feasibility guarantee.
  • Hybrid (most common in 2026): soft during training, hard at inference / acquisition.

08. Constraints in the Regression Head

Composition simplex via softmax

  • Last layer: \(\mathbf{z} \in \mathbb{R}^{|\text{elements}|}\).
  • Output: \(x_i = \mathrm{softmax}(\mathbf{z})_i\).
  • Guarantees \(x_i \geq 0\) and \(\sum_i x_i = 1\): the output lies exactly on the simplex.
  • Cost: zero. Use it everywhere (Goodfellow et al. 2016).

Charge-balance head for ionic compounds

  • Two output heads: cation fractions \(\mathbf{c}\), anion fractions \(\mathbf{a}\), both simplex.
  • Charge-balance constraint: \(\sum_i c_i z_i^{+} + \sum_j a_j z_j^{-} = 0\).
  • Enforce by projecting the joint output onto the constraint hyperplane.
  • Or: parameterise only the unconstrained degrees of freedom (see the sketch after this list).
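
A minimal PyTorch sketch of both heads. `SimplexHead` and `project_charge_balance` are illustrative names; the projection is the plain orthogonal projection onto the charge-balance hyperplane, and it can leave the simplex, which is why the reparameterisation in the last bullet is often preferable in practice:

```python
import torch
import torch.nn as nn

class SimplexHead(nn.Module):
    """Regression head whose output lies exactly on the composition simplex."""

    def __init__(self, hidden_dim: int, n_elements: int):
        super().__init__()
        self.logits = nn.Linear(hidden_dim, n_elements)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # softmax guarantees x_i >= 0 and sum_i x_i = 1 by construction
        return torch.softmax(self.logits(h), dim=-1)

def project_charge_balance(q: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Orthogonal projection of joint (cation, anion) fractions q onto the
    hyperplane {q : z . q = 0}, where z holds the signed oxidation states."""
    imbalance = q @ z                                # net charge per sample, shape (batch,)
    return q - (imbalance / (z @ z)).unsqueeze(-1) * z
```

Dropping `SimplexHead` onto any U8–U10 backbone costs one linear layer; the constraint then holds for every forward pass, not just on average.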

09. Constraints in the Generative Model

Latent-space projection

  • Sample \(z \sim p(z)\).
  • Project \(z\) onto the feasible-decoded manifold before decoding.
  • Equivalently: train decoder with feasibility-aware reconstruction loss; samples land on \(\mathcal{F}\) by construction.

Discriminator / score-based filter

  • Train a feasibility classifier \(f_\phi : \text{candidate} \to [0, 1]\) on stable-vs-unstable Materials Project entries.
  • Reject samples with \(f_\phi(x) < \tau\).
  • 2024–2026: diffusion guidance with \(f_\phi\) as the gradient signal during sampling — fast, modular (Goodfellow et al. 2016).

10. Constraints in the Acquisition Function

Constrained acquisition

\[x^* = \arg\max_{x \in \mathcal{F}} \alpha(x)\]

  • Filter the candidate pool \(\to \mathcal{F}\) before ranking.
  • Then maximise the U13 acquisition function \(\alpha(x)\) (EI, UCB, TS) only on \(\mathcal{F}\).
  • Filter first, rank second — order matters.

Cost-aware soft variant

  • \(\tilde{\alpha}(x) = \alpha(x) - \beta \, d(x, \mathcal{F})\).
  • \(d(x, \mathcal{F})\) = distance to the feasible set.
  • Smooth gradients survive; near-feasible candidates can still propagate.
  • Tune \(\beta\) to balance exploration and feasibility-margin tolerance.

Filter ordering matters: ranking 1000 candidates and then filtering to feasible ≠ filtering to feasible and then ranking. The two top-10 lists are different. Filter first.
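
A minimal sketch of the filter-first rule; `alpha` (the U13 acquisition function) and `is_feasible` (the §B constraint checks) are assumed to exist as callables:

```python
import numpy as np

def propose_topk(pool, alpha, is_feasible, k=10):
    """Filter first, rank second: maximise alpha(x) over the feasible set F only."""
    feasible = [x for x in pool if is_feasible(x)]       # restrict the pool to F
    if not feasible:
        raise RuntimeError("no feasible candidates: widen the pool or relax tolerances")
    scores = np.array([alpha(x) for x in feasible])      # rank only inside F
    top = np.argsort(scores)[::-1][:k]                   # descending acquisition value
    return [feasible[i] for i in top]
```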

11. Soft vs Hard: When to Choose Which

Soft penalty

\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \mathcal{L}_{\text{phys}}\]

  • Differentiable; integrates with autograd.
  • Trades off data fit and feasibility — no guarantee.
  • Right tool during training (Neuer et al. 2024).
  • \(\lambda\)-tuning is a black art; cross-validate.

Hard projection

\[\hat{x} = \Pi_{\mathcal{F}}(x)\]

  • Guaranteed feasible.
  • Non-differentiable on \(\partial \mathcal{F}\).
  • Right tool at inference / acquisition.
  • Combine: train soft, deploy hard (sketched below).
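
The hybrid in code, as a sketch: `phys_residual` is whatever differentiable constraint violation applies (for example the charge imbalance \(z \cdot q\)), and `project` is a hard projection such as `project_charge_balance` above. Both names are assumptions, not a fixed API:

```python
import torch

def training_loss(pred, target, phys_residual, lam=0.1):
    """Soft penalty during training: differentiable, no feasibility guarantee."""
    l_data = torch.mean((pred - target) ** 2)
    l_phys = torch.mean(phys_residual(pred) ** 2)
    return l_data + lam * l_phys          # L_total = L_data + lambda * L_phys

def deploy(pred, project):
    """Hard projection at inference: guaranteed feasible, gradients not needed."""
    with torch.no_grad():
        return project(pred)
```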

12. Case Study 1 — MoS₂ Stoichiometry in a Generative Model

Setup

  • VAE trained on transition-metal-dichalcogenide (TMD) compositions.
  • Latent space \(z \in \mathbb{R}^{16}\).
  • Decoder outputs (M-fraction, S-fraction, structure features).
  • Sample 1000 candidates.

Without vs with the simplex head

  • Vanilla: ≈30% off-stoichiometry (M:S not 1:2 or close).
  • Two-head + softmax: M-fraction and S-fraction each on simplex, joint constraint \(2c_M = c_S\) enforced via reparameterisation.
  • ≈95% physically valid; expressivity preserved.
  • No measurable degradation on reconstruction error.

13. Case Study 2 — Alloy Composition in a BO Loop

Setup

  • U13 BO loop on ternary Ni-Co-Cr alloy hardness.
  • Acquisition: EI on a GP surrogate.
  • Decision variable: composition \((x_{\text{Ni}}, x_{\text{Co}}, x_{\text{Cr}})\).

Unconstrained vs simplex acquisition

  • Unconstrained box \([0, 1]^3\): returns recipes summing to 0.94 or 1.07. Hand-normalisation introduces bias.
  • Simplex via reparameterisation \((x_1, x_2)\), \(x_3 = 1 - x_1 - x_2\), with \(x_i \geq 0\): returns valid recipes; regret curve unchanged.
  • No acquisition cost; large correctness gain.

Generalisable lesson: parameterise the constraint into the search space, do not impose it via post-hoc rescaling.
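
One caveat on the reparameterisation above: searching \((x_1, x_2)\) directly still needs the side constraint \(x_1 + x_2 \leq 1\). A stick-breaking map avoids even that, letting the BO driver search a plain box. A sketch, with hypothetical variable names:

```python
def box_to_simplex(u1: float, u2: float) -> tuple:
    """Map the unconstrained box [0, 1]^2 onto the ternary composition simplex.

    Every (u1, u2) yields a valid recipe: all fractions are non-negative and
    sum to exactly 1, so no post-hoc rescaling is ever needed."""
    x_ni = u1
    x_co = (1.0 - u1) * u2
    x_cr = 1.0 - x_ni - x_co    # equals (1 - u1) * (1 - u2) >= 0
    return x_ni, x_co, x_cr
```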

§C · MFML W13 PINN Recap and Materials Applications

14. PINN in One Slide (MFML W13 reminder)

The PINN loss

\[\mathcal{L}_{\text{PINN}} = \mathcal{L}_{\text{data}} + \lambda_r \|\mathcal{N}[u_\theta]\|^2 + \lambda_b \|\mathcal{B}[u_\theta]\|^2\]

  • \(u_\theta\): neural network approximating the field \(u(x, t)\).
  • \(\mathcal{N}[\cdot]\): PDE residual operator.
  • \(\mathcal{B}[\cdot]\): boundary / initial condition operator.
  • Backprop through \(\mathcal{N}[u_\theta]\) via autograd (Neuer et al. 2024); residual sketched below.

What PINN gives you

  • A mesh-free, differentiable representation of \(u(x, t)\).
  • A natural framework for inverse problems (infer parameters of \(\mathcal{N}\)).
  • Pointer: soft-constraint balancing, training stability, NTK reweighting — all MFML W13. Not re-taught here.
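
A minimal sketch of the residual term for one concrete PDE, the 1-D diffusion equation \(\partial_t u = D\, \partial_x^2 u\); making \(D\) an `nn.Parameter` is exactly the inverse-problem framing of the next slide. Architecture and names are illustrative, not prescriptive:

```python
import torch
import torch.nn as nn

class PINN(nn.Module):
    """u_theta(x, t) plus a learnable physical parameter (here: diffusivity D)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1)
        )
        self.log_D = nn.Parameter(torch.zeros(()))   # log-parameterised so D > 0

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

def pde_residual(model, x, t):
    """N[u_theta] = u_t - D u_xx, differentiated with autograd (mesh-free)."""
    x = x.requires_grad_(True)
    t = t.requires_grad_(True)
    u = model(x, t)
    ones = torch.ones_like(u)
    u_t = torch.autograd.grad(u, t, ones, create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, ones, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t - model.log_D.exp() * u_xx   # squared and weighted by lambda_r in L_PINN
```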

15. Inverse-Problem Framing for Materials

Forward problem

  • Given parameters \(\theta_{\text{phys}}\) (diffusivity, conductivity, viscosity, mobility).
  • Solve the PDE \(\mathcal{N}[u; \theta_{\text{phys}}] = 0\).
  • Return \(u(x, t)\).
  • Classical PDE solvers do this fast and well.

Inverse problem (where PINNs shine)

  • Given measurements \(\{u(x_i, t_i)\}\).
  • Infer the unknown \(\theta_{\text{phys}}\).
  • PINN parameterises both \(u_\theta(x, t)\) and \(\theta_{\text{phys}}\) simultaneously; one optimisation.
  • Output: a consistent field and parameters.

16. Two Materials Uses Worth Knowing

Microstructure homogenisation

  • Heterogeneous strain field measured by digital image correlation (DIC) on a polycrystal.
  • Forward: elasticity PDE with grain-resolved stiffness tensor \(C(x)\).
  • Inverse: infer effective \(C^{\text{eff}}\) such that the PDE residual is small and matches measured strain.
  • PINN naturally enforces compatibility and equilibrium.

Phase-field parameter inference

  • Time-resolved microstructure data (in-situ TEM, 4D-STEM, optical).
  • Phase-field PDE: \(\partial_t \phi = -M \delta F / \delta \phi\).
  • Unknowns: mobility \(M\), interface energy \(\sigma\), double-well height.
  • PINN infers \(\{M, \sigma, \dots\}\) that reproduce observed phase-boundary motion.

17. Why PINNs Are Not the Universal Hammer

Where PINNs fit

  • A clean PDE with unknown parameters.
  • Sparse, noisy field measurements.
  • Mesh-free representation desirable.
  • Inverse problems with consistency constraints.

Where PINNs do not fit

  • Most static crystal-property prediction (no PDE) — U8–U10 plus §B.
  • Multi-step synthesis (no closed PDE) — phenomenological surrogate.
  • Catalysis, multi-phase synthesis, alloys with phase changes — PDE either unknown or unreliable.
  • High-stiffness PDEs — PINN training is brittle (Neuer et al. 2024).

Closing rule: use a PINN where you have a PDE you trust and parameters you do not. Otherwise use a §B-constrained surrogate.

§D · Trust Under Distribution Shift

18. The OOD Problem in Materials

The setup that breaks naïve trust

  • Train a surrogate on Materials-Project oxides.
  • Query a candidate from the nitride family.
  • The GP returns a confident posterior — small \(\sigma\).
  • The candidate is out-of-distribution; the small \(\sigma\) is meaningless.

Operational OOD signals

  • Latent-space distance: how far is the candidate from the U11 / U12 latent manifold of the training set?
  • Feature-space Mahalanobis distance, deep-ensemble disagreement, density estimation in latent space.
  • Did our latent space cover this candidate? — the right question to ask before trusting the posterior (Bishop 2006).

19. The Simulation–Experiment Gap

Three sources of sim–exp gap

  • DFT functional bias: PBE underestimates band gaps by 30–50%; SCAN closer; r²SCAN now standard.
  • Geometry mismatch: DFT-relaxed lattice parameters differ from as-synthesised by 1–3%.
  • Property-definition mismatch: “stability” in DFT = \(E_{\text{hull}} \leq 0\) at 0 K, no entropy. “Stability” in synthesis = “we made it last week.”

Operational consequence

  • A model trained on DFT does not predict measurement.
  • The “MAE 30 meV/atom on Materials Project” headline is not the error you see at the bench.
  • Calibrate against experimental ground truth or accept large drift.

20. Calibration Drift

The phenomenon

  • Surrogate calibrated on chemistry family A.
  • Reliability diagram (cross-link ML-PC W8): nominal 90% intervals cover 88% — well calibrated.
  • Apply to chemistry family B without re-calibration.
  • 90% nominal intervals now cover 70%. Over-confident. Decisions made on this surrogate are wrong.

Why it happens

  • The aleatoric / epistemic split is family-specific.
  • Different chemistries have different intrinsic noise.
  • The kernel / network capacity allocated to family A may not generalise to family B (Murphy 2012).

Operational rule: re-calibrate the surrogate on every newly-entered chemistry family, before using its uncertainty for screening.

21. Conformal Prediction in One Slide

The construction

  • Train any surrogate \(\hat{f}\) on \(\mathcal{D}_{\text{train}}\).
  • Hold out a calibration set \(\mathcal{D}_{\text{cal}}\).
  • Compute residuals \(r_i = |y_i - \hat{f}(x_i)|\) on \(\mathcal{D}_{\text{cal}}\).
  • Take the \((1-\alpha)\)-quantile \(q_\alpha\) of \(\{r_i\}\).

The guarantee

\[\hat{C}_\alpha(x) = [\hat{f}(x) - q_\alpha,\ \hat{f}(x) + q_\alpha]\]

  • \(\Pr(y \in \hat{C}_\alpha(x)) \geq 1 - \alpha\).
  • Distribution-free: no Gaussianity, no kernel assumption.
  • Finite-sample: holds for any \(|\mathcal{D}_{\text{cal}}|\).
  • Model-agnostic: wraps any \(\hat{f}\) (sketched below).
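
The whole construction is a few lines. A sketch in NumPy; the \(\lceil (n+1)(1-\alpha) \rceil / n\) correction is what makes the coverage guarantee hold at finite \(n\):

```python
import numpy as np

def conformal_quantile(residuals: np.ndarray, alpha: float = 0.1) -> float:
    """q_alpha from calibration residuals r_i = |y_i - f_hat(x_i)|."""
    n = len(residuals)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)   # finite-sample correction
    return float(np.quantile(residuals, level, method="higher"))

def conformal_interval(f_hat, x, q_alpha):
    """C_alpha(x) = [f_hat(x) - q_alpha, f_hat(x) + q_alpha]."""
    y = f_hat(x)
    return y - q_alpha, y + q_alpha
```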

22. Conformal Prediction for Materials Surrogates

Per-family calibration

  • Calibrate per chemistry slice — coverage is heterogeneous across families.
  • Mondrian conformal: split \(\mathcal{D}_{\text{cal}}\) by chemistry-family group, compute \(q_\alpha\) per group.
  • Per-family intervals reflect per-family epistemic content (Murphy 2012).

As an acquisition gate

  • Decision: “synthesise candidate \(x\) if interval width \(|\hat{C}_\alpha(x)| < \delta\).”
  • Wide interval = “we don’t know enough; do not commit synthesis budget.”
  • Combine with OOD score (next slide) for refusal: wide interval AND high OOD score → escalate to human. A per-family gate is sketched below.
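
A sketch of the Mondrian variant and the gate, reusing `conformal_quantile` from the previous slide; family labels and the width threshold `delta` are campaign-specific choices:

```python
from collections import defaultdict
import numpy as np

def mondrian_quantiles(residuals, families, alpha=0.1):
    """One q_alpha per chemistry family; coverage then holds within each group."""
    groups = defaultdict(list)
    for r, fam in zip(residuals, families):
        groups[fam].append(r)
    return {fam: conformal_quantile(np.array(rs), alpha) for fam, rs in groups.items()}

def synthesis_gate(x_family, q_by_family, delta):
    """Commit synthesis budget only if the conformal interval is narrow enough."""
    width = 2.0 * q_by_family[x_family]   # |C_alpha(x)| for the symmetric interval
    return width < delta
```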

23. OOD Detection — “Did We Cover This Candidate?”

Three usable OOD scores

  • Latent-space nearest-neighbour distance (U11/U12 representation): \(d_{\text{NN}}(x) = \min_{i} \|\phi(x) - \phi(x_i)\|\).
  • Mahalanobis in feature space: \(d_M(x) = \sqrt{(x - \mu)^\top \Sigma^{-1} (x - \mu)}\).
  • Deep-ensemble disagreement: \(\sigma_{\text{ens}}(x) = \mathrm{std}_k \hat{f}_k(x)\).

Use as a refusal gate

  • Threshold each score on validation OOD examples.
  • Reject acquisition candidate if any score exceeds threshold.
  • Two of three exceeding is a stronger refusal than one of three — combine (Goodfellow et al. 2016); voting sketch below.
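
The three scores and the combined gate, as a minimal NumPy sketch; `phi_x` and `phi_train` are the U11/U12 embeddings, and all thresholds are assumed to have been set on held-out OOD examples:

```python
import numpy as np

def knn_distance(phi_x, phi_train):
    """Latent-space nearest-neighbour distance to the training manifold."""
    return float(np.min(np.linalg.norm(phi_train - phi_x, axis=1)))

def mahalanobis(x, mu, cov_inv):
    """Feature-space Mahalanobis distance to the training distribution."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

def ensemble_disagreement(preds):
    """Std of deep-ensemble predictions f_k(x) for one candidate."""
    return float(np.std(preds))

def refuse(scores, thresholds, min_votes=2):
    """Refusal gate: two of three signals firing is a stronger refusal than one."""
    votes = sum(s > t for s, t in zip(scores, thresholds))
    return votes >= min_votes
```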

24. Failure Mode — Silent Extrapolation

The trap

  • Surrogate emits low variance on a novel chemistry.
  • Reason: the kernel does not represent the chemistry-family difference; “looks like training” in the kernel metric.
  • Conformal interval is also tight (calibration set was in-distribution).
  • All trust signals say “go.” All trust signals are wrong.

Mitigation

  • Independent OOD score not derived from the surrogate.
  • Conservative refusal: low surrogate variance + high OOD score = refuse.
  • Periodic blind audit: synthesise one or two flagged-OK-but-unusual candidates per month, measure the actual error.

The lesson: trust is a system property, not a model property. Combine signals.

25. Trust Budget — The Operational Summary

The audit trail per decision

  • Surrogate: model, training set, version.
  • Conformal calibration: calibration set, \(\alpha\), per-family \(q_\alpha\).
  • OOD score: which score(s), threshold, value.
  • Feasibility filter: which constraints were checked.
  • Decision: rank, candidate, refusal flag, human-review status.

The materials-specific MFML W14 instantiation

  • MFML W14 framed trust abstractly.
  • §D narrows to: distribution shift, chemistry-family OOD, sim-vs-exp gap, conformal coverage, audit trail.
  • The audit trail is the model card (§F slide 38) — written once per loop run.

§E · The Autonomous-Lab Loop

26. The Closing-the-Loop Ambition

Discovery is a decision problem, not a prediction problem

  • A surrogate that predicts \(E_{\text{hull}}\) for 200 candidates is not discovery.
  • Discovery = one of those 200 ended up in a vial, and we know which one and what it became.
  • The loop is what turns proposals into measured outcomes.

Six steps, repeated (sketched in code after the list)

  1. Predict (U8–U10, U13 surrogate).
  2. Propose (acquisition + §B feasibility + §D refusal).
  3. Schedule (workflow engine, instrument time).
  4. Run (synthesis robot).
  5. Measure (characterisation pipeline).
  6. Update (parser, database, surrogate retrain).
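
A skeleton of one loop pass. Every object here (`surrogate`, `scheduler`, `robot`, and so on) is a stand-in for real infrastructure named on the next slides; the point is the control flow, and that every step writes to the database for the audit trail:

```python
def run_loop(pool, surrogate, propose, scheduler, robot, instruments, db, budget):
    """Predict -> propose -> schedule -> run -> measure -> update, until budget ends."""
    while budget.remaining() and pool:
        preds = surrogate.predict(pool)                       # 1. predict
        batch = propose(pool, preds)                          # 2. propose (§B + §D gates)
        jobs = scheduler.schedule(batch)                      # 3. schedule instrument time
        samples = [robot.synthesise(j) for j in jobs]         # 4. run synthesis
        results = [instruments.measure(s) for s in samples]   # 5. measure
        db.record(jobs, samples, results)                     # 6. update: persist, then...
        surrogate.retrain(db.training_snapshot())             #    ...retrain the surrogate
        pool = [x for x in pool if x not in batch]
```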

27. Loop Architecture

Components, named

  • Surrogate stack: U8–U10 NN + U13 GP/ensemble.
  • Proposal layer: U13 acquisition + U14 §B feasibility + §D conformal/OOD gate.
  • Scheduler: workflow engine. Picks order; manages parallelism.
  • Execution layer: synthesis hardware (powder dispenser, furnace, glovebox).
  • Measurement layer: characterisation hardware (XRD, mass-spec, electrochem).
  • Feedback layer: parser → database → retrain trigger.

Interfaces, the painful part

  • Surrogate ↔︎ proposal: API call.
  • Proposal ↔︎ scheduler: structured candidate (composition, recipe, target).
  • Scheduler ↔︎ hardware: instrument SDK, vendor API, SiLA-2 / OPC-UA for cross-platform.
  • Measurement ↔︎ database: parser per instrument, schema-versioned.
  • Each interface is a real engineering effort.

28. The Orchestration Stack

Workflow engines (pick one)

  • AiiDA: materials-native, provenance-tracked.
  • FireWorks: simpler, materials community.
  • Prefect / Airflow: general-purpose, large community.
  • Argo Workflows: Kubernetes-native, scale-out.
  • Pick one, do not roll your own.

BO drivers (pick one)

  • BoTorch: PyTorch-native, modern, multi-fidelity ready.
  • Ax: BoTorch + experiment management.
  • Dragonfly: works at scale, classical roots.
  • GPyOpt: light-weight, easy onboarding.
  • Pick one, plug into the workflow engine.

The 80/20 rule for autonomous labs: 80% of the work is orchestration; 20% is the surrogate. The community has good tools for both halves now — use them.
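
A minimal BoTorch sketch of one proposal step on the ternary-alloy problem of slide 13, searching the stick-broken \([0,1]^2\) box so every query decodes to a valid recipe; the random toy data stand in for the campaign database:

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

# toy campaign data: u in [0,1]^2 (stick-broken box), y = measured hardness
train_U = torch.rand(20, 2, dtype=torch.double)
train_y = torch.rand(20, 1, dtype=torch.double)

model = SingleTaskGP(train_U, train_y)
fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))

acq = ExpectedImprovement(model, best_f=train_y.max())
u_next, _ = optimize_acqf(
    acq,
    bounds=torch.tensor([[0.0, 0.0], [1.0, 1.0]], dtype=torch.double),
    q=1, num_restarts=10, raw_samples=256,
)
# decode u_next through box_to_simplex (slide 13) before handing it to the scheduler
```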

29. A-Lab — The 2023 Case

What A-Lab claimed

  • Berkeley’s autonomous lab for inorganic synthesis (Szymanski et al. 2023).
  • Goal: synthesise candidates predicted stable by Materials Project DFT screens.
  • Pipeline: powder dispenser → furnace → XRD characterisation → automated phase identification.
  • Reported high-throughput synthesis of many candidate compounds, with phase-match confirmation by XRD.

What A-Lab demonstrated

  • The workflow works: hardware integration, scheduling, parsing.
  • A non-trivial synthesis success rate on novel-stoichiometry candidates.
  • The integration story is real — not a slide-deck.
  • A landmark for the field (Sandfeld et al. 2024).

30. The A-Lab Debate, Honestly

The follow-up critique (Leeman et al. 2024)

  • Independent re-analysis of A-Lab’s reported novel phases.
  • Many candidates re-assignable to known structures with different stoichiometric labelling.
  • The automated XRD phase-identification pipeline mis-assigned phases that an expert crystallographer would have flagged.

What we learn

  • Lesson 1: autonomous synthesis works; autonomous novelty verification does not yet.
  • Lesson 2: the workflow result and the science result need separate evaluation.
  • Lesson 3: human structural review is still required for novelty claims in mid-2026.
  • Lesson 4: the field updated honestly. That is health.

31. Other 2023–2026 Self-Driving Labs

Photochemistry / catalysis

  • Aspuru-Guzik group (Toronto, Vector Institute, then Harvard): self-driving labs for photocatalysts, organic reactions.
  • ChemOS / ChemBO software stack — open source.
  • Successful single-domain demos; cross-domain generalisation open.

Energy materials, polymers, electrolytes

  • MIT polymer-electrolyte loops.
  • IBM RXN for chemistry → battery and catalyst loops.
  • LBNL battery cycling automation.
  • Each is single-domain, single-platform; recipe portability across labs is not yet demonstrated.

32. Failure Modes — Synthesis Side

Recipe ambiguity

  • “Heat at 600 °C for 12 h” — what ramp rate? Crucible material? Atmosphere?
  • The same nominal recipe on two platforms produces different products.
  • Mitigation: recipe representation includes ramp profile, atmosphere, vessel, contact materials, not just nominal temperature and time.

Hardware bottlenecks and sample-handling errors

  • Weighing, mixing, thermal cycles dominate cycle time, not the surrogate.
  • Dropped vials, contaminated crucibles, mis-loaded samples — invisible to the model, fatal to the data.
  • Mitigation: instrumented hardware (vibration, mass, atmosphere logs); per-step success flags routed to the database.

33. Failure Modes — Measurement Side

Characterisation-pipeline failures

  • Automated XRD phase ID misidentifies (slide 30).
  • Spectral fitting fails on overlapping peaks; pipeline returns “best match” with no warning.
  • Drift in instrument calibration over weeks of campaign.

The operator-time bottleneck

  • “Autonomous” pipelines often produce 100–500 spectra / day for human review.
  • Reviewing 200 spectra / day is not autonomous — it is a person staring at a screen.
  • Mitigation: triage by uncertainty (review only flagged), automate the obvious calls, human-in-the-loop for ambiguous.

34. What Works, What Does Not — Mid-2026

Works (productive use)

  • Workflow orchestration on a single platform.
  • BO over single composition axes with one fast measurement endpoint.
  • Synthesis-then-XRD on inorganic powders with curated phase library.
  • Photochemistry with HPLC readout.
  • Closed-loop within a curated chemistry family.

Marginal / does not yet work

  • Multi-step synthesis with on-line correction.
  • Multi-property optimisation under conflicting objectives.
  • Cross-platform recipe portability.
  • Open-ended novelty discovery without curated candidate pools.
  • Self-debugging instruments.
  • Cross-domain transfer (catalysis ↔︎ batteries).

The honest 2026 verdict: autonomous labs are real research infrastructure, within their domain. They are not yet a general-purpose discovery engine.

35. The Minimum Viable Autonomous Loop in 2026

What you need

  • One synthesis platform you control end-to-end.
  • One measurement endpoint with a parser you trust.
  • A constrained, calibrated surrogate (§B + §D wrapping a U13 GP).
  • A workflow engine of choice (slide 28).
  • Audit trail per decision (slide 25).
  • A model card + dataset card per loop run (§F).

What that buys you

  • A loop that runs nights and weekends.
  • 3–5× throughput over manual screening within its chemistry domain.
  • Reproducible artefacts (logs, model card, run record) for publication.
  • Real research infrastructure — within the constraints of slide 34.
  • A platform to grow on as new chemistry / measurement modalities come online.

§F · Reproducibility and FAIR for Materials ML

36. FAIR for Materials ML Artefacts

FAIR principles, applied to ML

  • Findable: DOI for code, weights, dataset, run logs.
  • Accessible: public artefact registry (Zenodo, HuggingFace, MaterialsCloud).
  • Interoperable: standard formats (ASE, OPTIMADE, structured JSON for runs).
  • Reusable: licence, environment lock, deterministic seed, version-pinned dependencies.

ML artefacts that need FAIR-ification

  • Training dataset (immutable snapshot).
  • Model weights (versioned).
  • Training script + environment lock.
  • Calibration / conformal artefacts.
  • Run logs from each loop iteration.
  • Model card + dataset card (next slides) (Sandfeld et al. 2024).

37. Dataset Cards for Materials

What a dataset card answers

  • Provenance: which database, which DFT functional, which relaxation status, which version.
  • Coverage: which chemistry families, which composition ranges, which property ranges.
  • Known biases: selection bias toward stable phases, experimental-vs-computed mixture, duplication.

Splits and distribution

  • Random split numbers — the headline.
  • Chemistry-family LOCO numbers — the operational reality.
  • Time-stratified splits — drift signal.
  • OOD-slice numbers — distribution-shift stress test (Bishop 2006).

38. Model Cards for Materials Surrogates

Intended use vs out-of-scope

  • Intended: which property, which composition range, which chemistry families.
  • Out-of-scope: chemistry families absent from training, properties not predicted, environmental conditions not represented.
  • Performance metrics: random-split, LOCO, OOD slice — all three.

Trust artefacts

  • Calibration / conformal-coverage diagnostics per family.
  • Known failure modes (one or two named, with mitigation).
  • Reproduction artefact bundle (script + environment + data DOI).
  • Audit-trail format for downstream use (Neuer et al. 2024); skeletal card sketched below.
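
A skeletal model card as a plain Python dict (serialise to JSON or YAML for the artefact registry); every value here is an illustrative placeholder, not a recommendation:

```python
model_card = {
    "model": {"name": "hardness-gp-v3", "weights_doi": "<DOI>", "data_doi": "<DOI>"},
    "intended_use": {
        "property": "Vickers hardness",
        "chemistry_families": ["Ni-Co-Cr ternary alloys"],
    },
    "out_of_scope": ["oxides", "halides", "unseen processing routes"],
    "metrics_mae": {"random_split": "<x>", "loco": "<x>", "ood_slice": "<x>"},
    "conformal": {"alpha": 0.1, "coverage_by_family": {"Ni-Co-Cr": "<x>"}},
    "failure_modes": [
        {"mode": "silent extrapolation to the high-Cr corner",
         "mitigation": "OOD gate on latent kNN distance"},
    ],
}
```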

39. Benchmark Hygiene and the Materials Project Debates

Recent shortcut-learning findings

  • Several 2023–2025 papers documented shortcut learning on common Materials Project benchmarks.
  • Composition leakage: train and test sets share compositions through prototype duplication.
  • “MAE 30 meV/atom” headline numbers degrade markedly under chemistry-family LOCO splits.

Recommended 2026 practice

  • Chemistry-family LOCO as the primary evaluation.
  • Time-stratified holdouts to test drift.
  • At least one OOD slice (held-out chemistry the model was not trained on).
  • Report all three; do not report random-split alone. A LOCO sketch follows.
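
Chemistry-family LOCO is a few lines with scikit-learn’s `LeaveOneGroupOut`, assuming NumPy arrays `X`, `y`, a per-sample `groups` label (the chemistry family), and any `model` refitted from scratch per fold:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loco_mae(model, X, y, groups):
    """Leave-one-chemistry-family-out MAE, the operational number to report."""
    maes = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        model.fit(X[train_idx], y[train_idx])
        held_out = groups[test_idx][0]                 # the family left out of training
        err = np.abs(model.predict(X[test_idx]) - y[test_idx]).mean()
        maes[held_out] = err
    return maes                                        # report per family, plus the mean
```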

40. What a Reviewable Materials-ML Submission Looks Like in 2026

The minimum bundle

  1. Dataset card (slide 37).
  2. Model card (slide 38).
  3. Training script + environment lock (slide 36).
  4. Random-split + LOCO + OOD numbers (slide 39).
  5. Conformal-coverage table per chemistry family (slide 22).

The story bundle

  1. One named failure mode + one mitigation, in the paper text.
  2. (If autonomous-lab work) audit-trail summary; loop iterations; one named loop failure with post-mortem.
  3. Code + data DOIs in the paper, not just on a website that may rot.

The reviewer’s checklist: can I reproduce the model from the artefacts? Can I reproduce the numbers? Can I trust the OOD claim? If any answer is “no,” the paper needs revision.

§G · 2026 Outlook

41. Foundation MLIPs as the New Substrate

The 2024–2026 emergence

  • MACE-MP-0 (Batatia et al. 2024): equivariant GNN trained on Materials Project relaxation trajectories.
  • CHGNet (Deng et al. 2023): charge-informed GNN for transition-metal chemistry.
  • ORB (Orbital Materials, 2024): general-purpose ML potential.
  • GNoME (Merchant et al. 2023) and follow-ups: large-scale MLIP-driven discovery.

What they share

  • Trained on \(10^6\)–\(10^8\) DFT structures.
  • Transferable across most of the periodic table.
  • Replace DFT for relaxation / dynamics in the chemistry space they cover.
  • Open weights; reproducible. The first open-source materials substrate at this scale.

42. What Foundation MLIPs Change

Cheap energy evaluation

  • \(10^4\)–\(10^6\) structures per GPU-hour with a foundation MLIP.
  • DFT: 1–10 structures per GPU-hour.
  • A 1000× speedup. Real.

Bottleneck shift

  • The bottleneck used to be DFT energy evaluation.
  • Now it is synthesis + measurement.
  • Discovery loop economics: spend the saved compute on more measurement, not more compute.
  • Multi-fidelity AL (slide 43) becomes the natural framing.

43. Multi-Fidelity Active Learning

Three fidelities in 2026 MG

  • Cheap: foundation MLIP inference. ~1 ms / structure.
  • Medium: DFT (PBE / SCAN / r²SCAN). ~1 hour / structure.
  • Expensive: experiment. Days / measurement.

Routing the query

  • Multi-fidelity GP / BO: model the joint over fidelities.
  • For each candidate, compute expected information gain per cost at each fidelity.
  • Query the cheapest fidelity that pays.
  • Save expensive measurement for candidates the cheap fidelities cannot resolve.

The right framing is not “DFT or experiment” — it is “spend each budget where it pays” (Murphy 2012).
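
A sketch of the router; `info_gain` is whatever per-candidate expected-information-gain estimate the multi-fidelity model exposes (an assumption, not a library API), and the costs echo slide 44:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Fidelity:
    name: str
    cost_eur: float                       # per query: MLIP ~0.001, DFT ~1, experiment ~100+
    info_gain: Callable[[object], float]  # expected information gain for a candidate

def route(candidate, fidelities, min_gain_per_euro=0.0):
    """Send the query to the fidelity with the best information gain per euro."""
    best = max(fidelities, key=lambda f: f.info_gain(candidate) / f.cost_eur)
    if best.info_gain(candidate) / best.cost_eur <= min_gain_per_euro:
        return None                       # nothing pays: skip this candidate
    return best.name
```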

44. Closing-the-Loop Economics

Cost per query in 2026 (order of magnitude)

  • Foundation MLIP: \(\sim\) €0.001 per structure (a fraction of a cent).
  • DFT calculation: \(\sim\) €1 per structure (compute + storage).
  • Experiment: \(\sim\) €100–10000 per measurement (consumables, beam time, instrument time, person time).

The economic logic

  • Each experiment saved by a calibrated surrogate is worth \(\sim\) €100–10000.
  • Each MLIP / DFT calculation costs \(\sim\) €0.001–€1.
  • Spending €1000 of compute to save one €1000 experiment breaks even.
  • Spending €1000 of compute to save ten €1000 experiments is the loop’s economic basis.

45. Open Problems

Methodological open problems

  • Cross-domain transfer (catalysis ↔︎ batteries ↔︎ structural alloys).
  • Honest uncertainty under distribution shift at scale.
  • Recipe portability across platforms.
  • Novelty verification without human review.
  • Reproducibility of multi-month autonomous campaigns.

Infrastructural open problems

  • Standardised recipe representation across inorganic / organic / polymer.
  • Cross-vendor instrument APIs.
  • Data-format interoperability (parsers, schemas, units).
  • Long-term stewardship of campaign databases (5+ year horizon).
  • Cost of the integration effort itself (the social problem).

46. Where MG Is, Mid-2026

Maturity ladder, by component

  • Substrate (data + foundation MLIPs): mature.
  • Modelling (surrogates + UQ): mature in-domain, fragile OOD.
  • Loop infrastructure: working in single domains; portability open.
  • End-to-end autonomous discovery of genuinely novel materials: open. The next five years.

Reading the field honestly

  • The hype is real for the substrate; over-stated for the discovery claims.
  • A working materials-ML stack in 2026 is real research infrastructure — within slide 34’s caveats.
  • The next five years are the integration / cross-domain era. Your career.

§H · Course Wrap

47. The MG Syllabus Arc in One Slide

The arc, walked once more

  • Physics (U2–U4): QM postulates, electronic structure, thermo, atomistic simulation.
  • Representations (U6–U7): graphs, local atomic environments, descriptors.
  • Models (U8–U10): regression, neural networks, learned representations.
  • Geometry (U11–U12): latent spaces, clustering, discovery vs labelling.

The arc’s destination

  • Decision (U13–U14): UQ, BO, constraints, trust, autonomous loops.
  • Each unit served the next.
  • This unit served all of them.
  • The integration story is the test of whether the rest taught anything operational.

48. The Four Big Skills

Choose and Train

  1. Choose a representation with the right invariances for the property (U6–U7, U10, §B).
  2. Train a surrogate with calibrated uncertainty and a defensible split protocol (U8, U13, §F).

Plan and Close

  1. Plan an acquisition that respects feasibility, OOD coverage, and budget (U13, §B, §D).
  2. Close the loop with reviewable artefacts (model card, dataset card, run log) (§E, §F).

If you can do all four end-to-end on a chemistry domain you care about, you are an MG practitioner. That is what this course taught.

49. Reading List for Going Further

The course textbooks

  • (Sandfeld et al. 2024) — materials data science from the engineering perspective. The most practically useful single book.
  • (Neuer et al. 2024) Ch6–Ch7 — physics-informed and explainable methods at engineering depth.
  • (Bishop 2006) §9 — clustering, EM, foundation for U12.
  • (Murphy 2012) — for probabilistic depth on UQ and BO.
  • (Goodfellow et al. 2016) Ch14 — autoencoders, the basis of U10–U12.

Beyond MG itself

  • MFML W13 — full PINN tutorial.
  • MFML W14 — generic explainability and trust.
  • ML-PC W14 — autonomous characterisation reflection.
  • 2024–2026 review papers on autonomous labs (Aspuru-Guzik, A-Lab follow-ups, GNoME), MatBench reform, foundation MLIPs.
  • Open-source stacks: AiiDA, BoTorch, MACE-MP, HuggingFace Hub for materials.

50. Exam, Questions, and End of MG

Exam scope

  • The four big skills (slide 48), each via one operational scenario.
  • Worked examples drawn from U8, U10, U13, U14.
  • Vocabulary: representation, invariance, kernel, posterior, conformal, OOD, feasibility, audit trail.
  • Open-book in spirit: bring one A4 cheat sheet (handwritten only).

The exam rubric

  • Bring to every answer: a feasibility filter, a conformal wrapper, a defensible split (LOCO or time), a named failure mode, a mitigation.
  • That is the rubric. State the constraint. State the trust signal. State the failure mode. State the mitigation.
  • Thank you. Questions.
Batatia, Ilyes et al. 2024. “A Foundation Model for Atomistic Materials Chemistry.” arXiv preprint.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
Deng, Bowen et al. 2023. “CHGNet as a Pretrained Universal Neural Network Potential for Charge-Informed Atomistic Modelling.” Nature Machine Intelligence 5: 1031–41.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Leeman, J. et al. 2024. “Challenges in High-Throughput Inorganic Materials Prediction and Autonomous Synthesis.” PRX Energy 3: 011002.
Merchant, Amil et al. 2023. “Scaling Deep Learning for Materials Discovery.” Nature 624: 80–85.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.
Sandfeld, Stefan et al. 2024. Materials Data Science. Springer.
Szymanski, Nathan J. et al. 2023. “An Autonomous Laboratory for the Accelerated Synthesis of Novel Materials.” Nature 624: 86–91.
