Materials Genomics
Unit 14: Constraints, Trust, and Integration Outlook

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


§A · Where We Are at the End of MG

01. Today’s Mission

The closing unit, in one line

  • Take everything from U2–U13 and make it run as a closed-loop discovery system without lying to itself.
  • Three knobs: physical constraints, distribution-shift-aware trust, experimental closure.
  • One centrepiece: the autonomous-lab loop.

What U14 is not.

  • Not a re-derivation of PINNs — that is MFML W13 (Neuer et al. 2024).
  • Not a generic explainability lecture — that is MFML W14.
  • Not the imaging-side autonomous-pipeline talk — that is ML-PC W14.
  • Today’s job: the integration story that ties MG together.

02. The MG Arc in One Slide

What we built

  • U2–U4: QM/QC postulates, electronic structure, thermo, classical atomistic simulation. The physics substrate.
  • U6–U7: local atomic environments, descriptors, and crystal graphs. The representation substrate.
  • U8–U10: regression, NN models, learned representations. The predictive substrate.

Where we ended

  • U11–U12: latent spaces and clustering. Discovery vs labelling, manifold geometry.
  • U13: Materials Project + OQMD + AFLOW; convex hull; Gaussian processes; Bayesian optimisation.
  • U14 today: the integration that makes U8–U13 a system rather than a pile of notebooks.

03. What U2–U13 Left Unfinished

Three honest gaps

  • U13 candidates can violate stoichiometry / symmetry / charge — the BO loop happily proposes Na₂Cl₃.
  • U13 confidence intervals are GP / ensemble posteriors, not finite-sample coverage guarantees.
  • U13 ends at “propose a candidate.” Nobody synthesises it.

The three gaps map to U14’s three knobs

  • Gap 1 → §B physical constraints (hard projection, soft penalty, architectural prior).
  • Gap 2 → §D conformal prediction + OOD detection.
  • Gap 3 → §E the autonomous-lab loop.

04. Learning Outcomes for Unit 14

By the end of 90 minutes, you can:

  1. Enforce physical constraints (stoichiometry, charge, symmetry, conservation) in regression heads, generative models, and acquisition functions.
  2. Recognise when to choose soft penalty vs hard projection vs architectural prior for a given constraint.
  3. Recall the PINN loss in one line and identify two materials problems where PINNs help (and two where they do not).
  4. Wrap a surrogate with conformal prediction and an OOD score to obtain finite-sample coverage and a refusal mechanism.
  5. Sketch an autonomous-lab loop architecture and name three failure modes for the synthesis side and three for the measurement side.
  6. Articulate the 2026 honest assessment: what works, what is marginal, what does not work yet.

§B · Physical Constraint Enforcement

05. Why Constraints Are Not an Afterthought

The naïve generative-model failure

  • Train a VAE on the Materials Project formula list.
  • Sample 1000 candidates.
  • Inspect: ≈30% violate stoichiometry (non-integer ratios, broken cation/anion balance) (Goodfellow et al. 2016).
  • Top-\(k\) acquisition list is dominated by garbage before the surrogate even runs.

Constraints are correctness

  • A surrogate that emits “Cu with 7-fold rotational symmetry” is not “noisy” — it is wrong.
  • Regularisation makes a valid model better; constraints make an invalid model valid.
  • Treat constraints with the same rigour as a unit-test, not as a hyperparameter.

06. Four Families of Materials Constraints

Composition-side

  • Stoichiometry / charge balance: integer (or rational) site occupancies; sum of oxidation states = 0.
  • Composition simplex: \(\sum_i x_i = 1\), \(x_i \geq 0\) for fractional alloys.

Structure-side

  • Symmetry: space-group consistency, site multiplicity, Wyckoff positions (Sandfeld et al. 2024).
  • Conservation: mass / energy / momentum (where the system is closed).
  • Thermodynamic feasibility: \(E_{\text{hull}} \leq \Delta_{\text{tol}}\) from U13.

Mnemonic: Composition is what the formula says; structure is what the lattice says; thermodynamics is whether nature lets it exist.

07. Three Enforcement Mechanisms

Architectural prior — the constraint is built into the model

  • Equivariant heads (NequIP, MACE) for symmetry.
  • Softmax decoder for simplex.
  • E(3)-equivariant message passing for rotational invariance.
  • Pro: guaranteed by construction.
  • Con: design effort; expressivity loss if applied wrong.

Hard projection / filter vs soft penalty

  • Hard projection: \(\hat{x} = \Pi_{\mathcal{F}}(x)\). Guaranteed; non-differentiable.
  • Soft penalty: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \mathcal{L}_{\text{phys}}\). Differentiable; no feasibility guarantee.
  • Hybrid (most common in 2026): soft during training, hard at inference / acquisition.

08. Constraints in the Regression Head

Composition simplex via softmax

  • Last layer: \(\mathbf{z} \in \mathbb{R}^{|\text{elements}|}\).
  • Output: \(x_i = \mathrm{softmax}(\mathbf{z})_i\).
  • Guarantees \(x_i \geq 0\) and \(\sum_i x_i = 1\): the output lies exactly on the simplex.
  • Cost: zero. Use it everywhere (Goodfellow et al. 2016).

Charge-balance head for ionic compounds

  • Two output heads: cation fractions \(\mathbf{c}\), anion fractions \(\mathbf{a}\), both simplex.
  • Charge-balance constraint: \(\sum_i c_i z_i^{+} + \sum_j a_j z_j^{-} = 0\).
  • Enforce by projecting the joint output onto the constraint hyperplane.
  • Or: parameterise only the unconstrained degrees of freedom (see the sketch after this list).
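
A minimal PyTorch sketch of both heads. `SimplexHead` and `project_charge_balance` are illustrative names; the projection is the plain orthogonal projection onto the charge-balance hyperplane, and it can leave the simplex, which is why the reparameterisation in the last bullet is often preferable in practice:

```python
import torch
import torch.nn as nn

class SimplexHead(nn.Module):
    """Regression head whose output lies exactly on the composition simplex."""

    def __init__(self, hidden_dim: int, n_elements: int):
        super().__init__()
        self.logits = nn.Linear(hidden_dim, n_elements)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # softmax guarantees x_i >= 0 and sum_i x_i = 1 by construction
        return torch.softmax(self.logits(h), dim=-1)

def project_charge_balance(q: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Orthogonal projection of joint (cation, anion) fractions q onto the
    hyperplane {q : z . q = 0}, where z holds the signed oxidation states."""
    imbalance = q @ z                                # net charge per sample, shape (batch,)
    return q - (imbalance / (z @ z)).unsqueeze(-1) * z
```

Dropping `SimplexHead` onto any U8–U10 backbone costs one linear layer; the constraint then holds for every forward pass, not just on average.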

09. Constraints in the Generative Model

Latent-space projection

  • Sample \(z \sim p(z)\).
  • Project \(z\) onto the feasible-decoded manifold before decoding.
  • Equivalently: train decoder with feasibility-aware reconstruction loss; samples land on \(\mathcal{F}\) by construction.

Discriminator / score-based filter

  • Train a feasibility classifier \(f_\phi : \text{candidate} \to [0, 1]\) on stable-vs-unstable Materials Project entries.
  • Reject samples with \(f_\phi(x) < \tau\).
  • 2024–2026: diffusion guidance with \(f_\phi\) as the gradient signal during sampling — fast, modular (Goodfellow et al. 2016).

10. Constraints in the Acquisition Function

Constrained acquisition

\[x^* = \arg\max_{x \in \mathcal{F}} \alpha(x)\]

  • Filter the candidate pool \(\to \mathcal{F}\) before ranking.
  • Then maximise the U13 acquisition function \(\alpha(x)\) (EI, UCB, TS) only on \(\mathcal{F}\).
  • Filter first, rank second — order matters.

Cost-aware soft variant

  • \(\tilde{\alpha}(x) = \alpha(x) - \beta \, d(x, \mathcal{F})\).
  • \(d(x, \mathcal{F})\) = distance to the feasible set.
  • Smooth gradients survive; near-feasible candidates can still propagate.
  • Tune \(\beta\) to balance exploration and feasibility-margin tolerance.

Filter ordering matters: ranking 1000 candidates and then filtering to feasible ≠ filtering to feasible and then ranking. The two top-10 lists are different. Filter first.
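
A minimal sketch of the filter-first rule; `alpha` (the U13 acquisition function) and `is_feasible` (the §B constraint checks) are assumed to exist as callables:

```python
import numpy as np

def propose_topk(pool, alpha, is_feasible, k=10):
    """Filter first, rank second: maximise alpha(x) over the feasible set F only."""
    feasible = [x for x in pool if is_feasible(x)]       # restrict the pool to F
    if not feasible:
        raise RuntimeError("no feasible candidates: widen the pool or relax tolerances")
    scores = np.array([alpha(x) for x in feasible])      # rank only inside F
    top = np.argsort(scores)[::-1][:k]                   # descending acquisition value
    return [feasible[i] for i in top]
```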

11. Soft vs Hard: When to Choose Which

Soft penalty

\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \mathcal{L}_{\text{phys}}\]

  • Differentiable; integrates with autograd.
  • Trades off data fit and feasibility — no guarantee.
  • Right tool during training (Neuer et al. 2024).
  • \(\lambda\)-tuning is a black art; cross-validate.

Hard projection

\[\hat{x} = \Pi_{\mathcal{F}}(x)\]

  • Guaranteed feasible.
  • Non-differentiable on \(\partial \mathcal{F}\).
  • Right tool at inference / acquisition.
  • Combine: train soft, deploy hard (sketched below).
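
The hybrid in code, as a sketch: `phys_residual` is whatever differentiable constraint violation applies (for example the charge imbalance \(z \cdot q\)), and `project` is a hard projection such as `project_charge_balance` above. Both names are assumptions, not a fixed API:

```python
import torch

def training_loss(pred, target, phys_residual, lam=0.1):
    """Soft penalty during training: differentiable, no feasibility guarantee."""
    l_data = torch.mean((pred - target) ** 2)
    l_phys = torch.mean(phys_residual(pred) ** 2)
    return l_data + lam * l_phys          # L_total = L_data + lambda * L_phys

def deploy(pred, project):
    """Hard projection at inference: guaranteed feasible, gradients not needed."""
    with torch.no_grad():
        return project(pred)
```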

12. Case Study 1 — MoS₂ Stoichiometry in a Generative Model

Setup

  • VAE trained on transition-metal-dichalcogenide (TMD) compositions.
  • Latent space \(z \in \mathbb{R}^{16}\).
  • Decoder outputs (M-fraction, S-fraction, structure features).
  • Sample 1000 candidates.

Without vs with the simplex head

  • Vanilla: ≈30% off-stoichiometry (M:S not 1:2 or close).
  • Two-head + softmax: M-fraction and S-fraction each on simplex, joint constraint \(2c_M = c_S\) enforced via reparameterisation.
  • ≈95% physically valid; expressivity preserved.
  • No measurable degradation on reconstruction error.

13. Case Study 2 — Alloy Composition in a BO Loop

Setup

  • U13 BO loop on ternary Ni-Co-Cr alloy hardness.
  • Acquisition: EI on a GP surrogate.
  • Decision variable: composition \((x_{\text{Ni}}, x_{\text{Co}}, x_{\text{Cr}})\).

Unconstrained vs simplex acquisition

  • Unconstrained box \([0, 1]^3\): returns recipes summing to 0.94 or 1.07. Hand-normalisation introduces bias.
  • Simplex via reparameterisation \((x_1, x_2)\), \(x_3 = 1 - x_1 - x_2\), with \(x_i \geq 0\): returns valid recipes; regret curve unchanged.
  • No acquisition cost; large correctness gain.

Generalisable lesson: parameterise the constraint into the search space, do not impose it via post-hoc rescaling.
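
One caveat on the reparameterisation above: searching \((x_1, x_2)\) directly still needs the side constraint \(x_1 + x_2 \leq 1\). A stick-breaking map avoids even that, letting the BO driver search a plain box. A sketch, with hypothetical variable names:

```python
def box_to_simplex(u1: float, u2: float) -> tuple:
    """Map the unconstrained box [0, 1]^2 onto the ternary composition simplex.

    Every (u1, u2) yields a valid recipe: all fractions are non-negative and
    sum to exactly 1, so no post-hoc rescaling is ever needed."""
    x_ni = u1
    x_co = (1.0 - u1) * u2
    x_cr = 1.0 - x_ni - x_co    # equals (1 - u1) * (1 - u2) >= 0
    return x_ni, x_co, x_cr
```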

§C · MFML W13 PINN Recap and Materials Applications

14. PINN in One Slide (MFML W13 reminder)

The PINN loss

\[\mathcal{L}_{\text{PINN}} = \mathcal{L}_{\text{data}} + \lambda_r \|\mathcal{N}[u_\theta]\|^2 + \lambda_b \|\mathcal{B}[u_\theta]\|^2\]

  • \(u_\theta\): neural network approximating the field \(u(x, t)\).
  • \(\mathcal{N}[\cdot]\): PDE residual operator.
  • \(\mathcal{B}[\cdot]\): boundary / initial condition operator.
  • Backprop through \(\mathcal{N}[u_\theta]\) via autograd (Neuer et al. 2024); residual sketched below.

What PINN gives you

  • A mesh-free, differentiable representation of \(u(x, t)\).
  • A natural framework for inverse problems (infer parameters of \(\mathcal{N}\)).
  • Pointer: soft-constraint balancing, training stability, NTK reweighting — all MFML W13. Not re-taught here.
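
A minimal sketch of the residual term for one concrete PDE, the 1-D diffusion equation \(\partial_t u = D\, \partial_x^2 u\); making \(D\) an `nn.Parameter` is exactly the inverse-problem framing of the next slide. Architecture and names are illustrative, not prescriptive:

```python
import torch
import torch.nn as nn

class PINN(nn.Module):
    """u_theta(x, t) plus a learnable physical parameter (here: diffusivity D)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1)
        )
        self.log_D = nn.Parameter(torch.zeros(()))   # log-parameterised so D > 0

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

def pde_residual(model, x, t):
    """N[u_theta] = u_t - D u_xx, differentiated with autograd (mesh-free)."""
    x = x.requires_grad_(True)
    t = t.requires_grad_(True)
    u = model(x, t)
    ones = torch.ones_like(u)
    u_t = torch.autograd.grad(u, t, ones, create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, ones, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t - model.log_D.exp() * u_xx   # squared and weighted by lambda_r in L_PINN
```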

15. Inverse-Problem Framing for Materials

Forward problem

  • Given parameters \(\theta_{\text{phys}}\) (diffusivity, conductivity, viscosity, mobility).
  • Solve the PDE \(\mathcal{N}[u; \theta_{\text{phys}}] = 0\).
  • Return \(u(x, t)\).
  • Classical PDE solvers do this fast and well.

Inverse problem (where PINNs shine)

  • Given measurements \(\{u(x_i, t_i)\}\).
  • Infer the unknown \(\theta_{\text{phys}}\).
  • PINN parameterises both \(u_\theta(x, t)\) and \(\theta_{\text{phys}}\) simultaneously; one optimisation.
  • Output: a consistent field and parameters.

16. Two Materials Uses Worth Knowing

Microstructure homogenisation

  • Heterogeneous strain field measured by digital image correlation (DIC) on a polycrystal.
  • Forward: elasticity PDE with grain-resolved stiffness tensor \(C(x)\).
  • Inverse: infer effective \(C^{\text{eff}}\) such that the PDE residual is small and matches measured strain.
  • PINN naturally enforces compatibility and equilibrium.

Phase-field parameter inference

  • Time-resolved microstructure data (in-situ TEM, 4D-STEM, optical).
  • Phase-field PDE: \(\partial_t \phi = -M \delta F / \delta \phi\).
  • Unknowns: mobility \(M\), interface energy \(\sigma\), double-well height.
  • PINN infers \(\{M, \sigma, \dots\}\) that reproduce observed phase-boundary motion.

17. Why PINNs Are Not the Universal Hammer

Where PINNs fit

  • A clean PDE with unknown parameters.
  • Sparse, noisy field measurements.
  • Mesh-free representation desirable.
  • Inverse problems with consistency constraints.

Where PINNs do not fit

  • Most static crystal-property prediction (no PDE) — U8–U10 plus §B.
  • Multi-step synthesis (no closed PDE) — phenomenological surrogate.
  • Catalysis, multi-phase synthesis, alloys with phase changes — PDE either unknown or unreliable.
  • High-stiffness PDEs — PINN training is brittle (Neuer et al. 2024).

Closing rule: use a PINN where you have a PDE you trust and parameters you do not. Otherwise use a §B-constrained surrogate.

§D · Trust Under Distribution Shift

18. The OOD Problem in Materials

The setup that breaks naïve trust

  • Train a surrogate on Materials-Project oxides.
  • Query a candidate from the nitride family.
  • The GP returns a confident posterior — small \(\sigma\).
  • The candidate is out-of-distribution; the small \(\sigma\) is meaningless.

Operational OOD signals

  • Latent-space distance: how far is the candidate from the U11 / U12 latent manifold of the training set?
  • Feature-space Mahalanobis distance, deep-ensemble disagreement, density estimation in latent space.
  • Did our latent space cover this candidate? — the right question to ask before trusting the posterior (Bishop 2006).

19. The Simulation–Experiment Gap

Three sources of sim–exp gap

  • DFT functional bias: PBE underestimates band gaps by 30–50%; SCAN closer; r²SCAN now standard.
  • Geometry mismatch: DFT-relaxed lattice parameters differ from as-synthesised by 1–3%.
  • Property-definition mismatch: “stability” in DFT = \(E_{\text{hull}} \leq 0\) at 0 K, no entropy. “Stability” in synthesis = “we made it last week.”

Operational consequence

  • A model trained on DFT does not predict measurement.
  • The “MAE 30 meV/atom on Materials Project” headline is not the error you see at the bench.
  • Calibrate against experimental ground truth or accept large drift.

20. Calibration Drift

The phenomenon

  • Surrogate calibrated on chemistry family A.
  • Reliability diagram (cross-link ML-PC W8): nominal 90% intervals cover 88% — well calibrated.
  • Apply to chemistry family B without re-calibration.
  • 90% nominal intervals now cover 70%. Over-confident. Decisions made on this surrogate are wrong.

Why it happens

  • The aleatoric / epistemic split is family-specific.
  • Different chemistries have different intrinsic noise.
  • The kernel / network capacity allocated to family A may not generalise to family B (Murphy 2012).

Operational rule: re-calibrate the surrogate on every newly-entered chemistry family, before using its uncertainty for screening.

21. Conformal Prediction in One Slide

The construction

  • Train any surrogate \(\hat{f}\) on \(\mathcal{D}_{\text{train}}\).
  • Hold out a calibration set \(\mathcal{D}_{\text{cal}}\).
  • Compute residuals \(r_i = |y_i - \hat{f}(x_i)|\) on \(\mathcal{D}_{\text{cal}}\).
  • Take the \((1-\alpha)\)-quantile \(q_\alpha\) of \(\{r_i\}\).

The guarantee

\[\hat{C}_\alpha(x) = [\hat{f}(x) - q_\alpha,\ \hat{f}(x) + q_\alpha]\]

  • \(\Pr(y \in \hat{C}_\alpha(x)) \geq 1 - \alpha\).
  • Distribution-free: no Gaussianity, no kernel assumption.
  • Finite-sample: holds for any \(|\mathcal{D}_{\text{cal}}|\).
  • Model-agnostic: wraps any \(\hat{f}\) (sketched below).
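
The whole construction is a few lines. A sketch in NumPy; the \(\lceil (n+1)(1-\alpha) \rceil / n\) correction is what makes the coverage guarantee hold at finite \(n\):

```python
import numpy as np

def conformal_quantile(residuals: np.ndarray, alpha: float = 0.1) -> float:
    """q_alpha from calibration residuals r_i = |y_i - f_hat(x_i)|."""
    n = len(residuals)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)   # finite-sample correction
    return float(np.quantile(residuals, level, method="higher"))

def conformal_interval(f_hat, x, q_alpha):
    """C_alpha(x) = [f_hat(x) - q_alpha, f_hat(x) + q_alpha]."""
    y = f_hat(x)
    return y - q_alpha, y + q_alpha
```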

22. Conformal Prediction for Materials Surrogates

Per-family calibration

  • Calibrate per chemistry slice — coverage is heterogeneous across families.
  • Mondrian conformal: split \(\mathcal{D}_{\text{cal}}\) by chemistry-family group, compute \(q_\alpha\) per group.
  • Per-family intervals reflect per-family epistemic content (Murphy 2012).

As an acquisition gate

  • Decision: “synthesise candidate \(x\) if interval width \(|\hat{C}_\alpha(x)| < \delta\).”
  • Wide interval = “we don’t know enough; do not commit synthesis budget.”
  • Combine with OOD score (next slide) for refusal: wide interval AND high OOD score → escalate to human. A per-family gate is sketched below.
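
A sketch of the Mondrian variant and the gate, reusing `conformal_quantile` from the previous slide; family labels and the width threshold `delta` are campaign-specific choices:

```python
from collections import defaultdict
import numpy as np

def mondrian_quantiles(residuals, families, alpha=0.1):
    """One q_alpha per chemistry family; coverage then holds within each group."""
    groups = defaultdict(list)
    for r, fam in zip(residuals, families):
        groups[fam].append(r)
    return {fam: conformal_quantile(np.array(rs), alpha) for fam, rs in groups.items()}

def synthesis_gate(x_family, q_by_family, delta):
    """Commit synthesis budget only if the conformal interval is narrow enough."""
    width = 2.0 * q_by_family[x_family]   # |C_alpha(x)| for the symmetric interval
    return width < delta
```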

23. OOD Detection — “Did We Cover This Candidate?”

Three usable OOD scores

  • Latent-space nearest-neighbour distance (U11/U12 representation): \(d_{\text{NN}}(x) = \min_{i} \|\phi(x) - \phi(x_i)\|\).
  • Mahalanobis in feature space: \(d_M(x) = \sqrt{(x - \mu)^\top \Sigma^{-1} (x - \mu)}\).
  • Deep-ensemble disagreement: \(\sigma_{\text{ens}}(x) = \mathrm{std}_k \hat{f}_k(x)\).

Use as a refusal gate

  • Threshold each score on validation OOD examples.
  • Reject acquisition candidate if any score exceeds threshold.
  • Two of three exceeding is a stronger refusal than one of three — combine (Goodfellow et al. 2016); voting sketch below.
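
The three scores and the combined gate, as a minimal NumPy sketch; `phi_x` and `phi_train` are the U11/U12 embeddings, and all thresholds are assumed to have been set on held-out OOD examples:

```python
import numpy as np

def knn_distance(phi_x, phi_train):
    """Latent-space nearest-neighbour distance to the training manifold."""
    return float(np.min(np.linalg.norm(phi_train - phi_x, axis=1)))

def mahalanobis(x, mu, cov_inv):
    """Feature-space Mahalanobis distance to the training distribution."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

def ensemble_disagreement(preds):
    """Std of deep-ensemble predictions f_k(x) for one candidate."""
    return float(np.std(preds))

def refuse(scores, thresholds, min_votes=2):
    """Refusal gate: two of three signals firing is a stronger refusal than one."""
    votes = sum(s > t for s, t in zip(scores, thresholds))
    return votes >= min_votes
```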

24. Failure Mode — Silent Extrapolation

The trap

  • Surrogate emits low variance on a novel chemistry.
  • Reason: the kernel does not represent the chemistry-family difference; “looks like training” in the kernel metric.
  • Conformal interval is also tight (calibration set was in-distribution).
  • All trust signals say “go.” All trust signals are wrong.

Mitigation

  • Independent OOD score not derived from the surrogate.
  • Conservative refusal: low surrogate variance + high OOD score = refuse.
  • Periodic blind audit: synthesise one or two flagged-OK-but-unusual candidates per month, measure the actual error.

The lesson: trust is a system property, not a model property. Combine signals.

25. Trust Budget — The Operational Summary

The audit trail per decision

  • Surrogate: model, training set, version.
  • Conformal calibration: calibration set, \(\alpha\), per-family \(q_\alpha\).
  • OOD score: which score(s), threshold, value.
  • Feasibility filter: which constraints were checked.
  • Decision: rank, candidate, refusal flag, human-review status.

The materials-specific MFML W14 instantiation

  • MFML W14 framed trust abstractly.
  • §D narrows to: distribution shift, chemistry-family OOD, sim-vs-exp gap, conformal coverage, audit trail.
  • The audit trail is the model card (§F slide 38) — written once per loop run.

§E · The Autonomous-Lab Loop

26. The Closing-the-Loop Ambition

Discovery is a decision problem, not a prediction problem

  • A surrogate that predicts \(E_{\text{hull}}\) for 200 candidates is not discovery.
  • Discovery = one of those 200 ended up in a vial, and we know which one and what it became.
  • The loop is what turns proposals into measured outcomes.

Six steps, repeated (sketched in code after the list)

  1. Predict (U8–U10, U13 surrogate).
  2. Propose (acquisition + §B feasibility + §D refusal).
  3. Schedule (workflow engine, instrument time).
  4. Run (synthesis robot).
  5. Measure (characterisation pipeline).
  6. Update (parser, database, surrogate retrain).
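
A skeleton of one loop pass. Every object here (`surrogate`, `scheduler`, `robot`, and so on) is a stand-in for real infrastructure named on the next slides; the point is the control flow, and that every step writes to the database for the audit trail:

```python
def run_loop(pool, surrogate, propose, scheduler, robot, instruments, db, budget):
    """Predict -> propose -> schedule -> run -> measure -> update, until budget ends."""
    while budget.remaining() and pool:
        preds = surrogate.predict(pool)                       # 1. predict
        batch = propose(pool, preds)                          # 2. propose (§B + §D gates)
        jobs = scheduler.schedule(batch)                      # 3. schedule instrument time
        samples = [robot.synthesise(j) for j in jobs]         # 4. run synthesis
        results = [instruments.measure(s) for s in samples]   # 5. measure
        db.record(jobs, samples, results)                     # 6. update: persist, then...
        surrogate.retrain(db.training_snapshot())             #    ...retrain the surrogate
        pool = [x for x in pool if x not in batch]
```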

27. Loop Architecture

Components, named

  • Surrogate stack: U8–U10 NN + U13 GP/ensemble.
  • Proposal layer: U13 acquisition + U14 §B feasibility + §D conformal/OOD gate.
  • Scheduler: workflow engine. Picks order; manages parallelism.
  • Execution layer: synthesis hardware (powder dispenser, furnace, glovebox).
  • Measurement layer: characterisation hardware (XRD, mass-spec, electrochem).
  • Feedback layer: parser → database → retrain trigger.

Interfaces, the painful part

  • Surrogate ↔︎ proposal: API call.
  • Proposal ↔︎ scheduler: structured candidate (composition, recipe, target).
  • Scheduler ↔︎ hardware: instrument SDK, vendor API, SiLA-2 / OPC-UA for cross-platform.
  • Measurement ↔︎ database: parser per instrument, schema-versioned.
  • Each interface is a real engineering effort.

28. The Orchestration Stack

Workflow engines (pick one)

  • AiiDA: materials-native, provenance-tracked.
  • FireWorks: simpler, materials community.
  • Prefect / Airflow: general-purpose, large community.
  • Argo Workflows: Kubernetes-native, scale-out.
  • Pick one, do not roll your own.

BO drivers (pick one)

  • BoTorch: PyTorch-native, modern, multi-fidelity ready.
  • Ax: BoTorch + experiment management.
  • Dragonfly: works at scale, classical roots.
  • GPyOpt: light-weight, easy onboarding.
  • Pick one, plug into the workflow engine.

The 80/20 rule for autonomous labs: 80% of the work is orchestration; 20% is the surrogate. The community has good tools for both halves now — use them.
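
A minimal BoTorch sketch of one proposal step on the ternary-alloy problem of slide 13, searching the stick-broken \([0,1]^2\) box so every query decodes to a valid recipe; the random toy data stand in for the campaign database:

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

# toy campaign data: u in [0,1]^2 (stick-broken box), y = measured hardness
train_U = torch.rand(20, 2, dtype=torch.double)
train_y = torch.rand(20, 1, dtype=torch.double)

model = SingleTaskGP(train_U, train_y)
fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))

acq = ExpectedImprovement(model, best_f=train_y.max())
u_next, _ = optimize_acqf(
    acq,
    bounds=torch.tensor([[0.0, 0.0], [1.0, 1.0]], dtype=torch.double),
    q=1, num_restarts=10, raw_samples=256,
)
# decode u_next through box_to_simplex (slide 13) before handing it to the scheduler
```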

29. A-Lab — The 2023 Case

What A-Lab claimed

  • Berkeley’s autonomous lab for inorganic synthesis (Szymanski et al. 2023).
  • Goal: synthesise candidates predicted stable by Materials Project DFT screens.
  • Pipeline: powder dispenser → furnace → XRD characterisation → automated phase identification.
  • Reported high-throughput synthesis of many candidate compounds, with phase-match confirmation by XRD.

What A-Lab demonstrated

  • The workflow works: hardware integration, scheduling, parsing.
  • A non-trivial synthesis success rate on novel-stoichiometry candidates.
  • The integration story is real — not a slide-deck.
  • A landmark for the field (Sandfeld et al. 2024).

30. The A-Lab Debate, Honestly

The follow-up critique (Leeman et al. 2024)

  • Independent re-analysis of A-Lab’s reported novel phases.
  • Many candidates re-assignable to known structures with different stoichiometric labelling.
  • The automated XRD phase-identification pipeline mis-assigned phases that an expert crystallographer would have flagged.

What we learn

  • Lesson 1: autonomous synthesis works; autonomous novelty verification does not yet.
  • Lesson 2: the workflow result and the science result need separate evaluation.
  • Lesson 3: human structural review is still required for novelty claims in mid-2026.
  • Lesson 4: the field updated honestly. That is health.

31. Other 2023–2026 Self-Driving Labs

Photochemistry / catalysis

  • Aspuru-Guzik group (Toronto, Vector Institute, then Harvard): self-driving labs for photocatalysts, organic reactions.
  • ChemOS / ChemBO software stack — open source.
  • Successful single-domain demos; cross-domain generalisation open.

Energy materials, polymers, electrolytes

  • MIT polymer-electrolyte loops.
  • IBM RXN for chemistry → battery and catalyst loops.
  • LBNL battery cycling automation.
  • Each is single-domain, single-platform; recipe portability across labs is not yet demonstrated.

32. Failure Modes — Synthesis Side

Recipe ambiguity

  • “Heat at 600 °C for 12 h” — what ramp rate? Crucible material? Atmosphere?
  • The same nominal recipe on two platforms produces different products.
  • Mitigation: recipe representation includes ramp profile, atmosphere, vessel, contact materials, not just nominal temperature and time.

Hardware bottlenecks and sample-handling errors

  • Weighing, mixing, thermal cycles dominate cycle time, not the surrogate.
  • Dropped vials, contaminated crucibles, mis-loaded samples — invisible to the model, fatal to the data.
  • Mitigation: instrumented hardware (vibration, mass, atmosphere logs); per-step success flags routed to the database.

33. Failure Modes — Measurement Side

Characterisation-pipeline failures

  • Automated XRD phase ID misidentifies (slide 30).
  • Spectral fitting fails on overlapping peaks; pipeline returns “best match” with no warning.
  • Drift in instrument calibration over weeks of campaign.

The operator-time bottleneck

  • “Autonomous” pipelines often produce 100–500 spectra / day for human review.
  • Reviewing 200 spectra / day is not autonomous — it is a person staring at a screen.
  • Mitigation: triage by uncertainty (review only flagged), automate the obvious calls, human-in-the-loop for ambiguous.

34. What Works, What Does Not — Mid-2026

Works (productive use)

  • Workflow orchestration on a single platform.
  • BO over single composition axes with one fast measurement endpoint.
  • Synthesis-then-XRD on inorganic powders with curated phase library.
  • Photochemistry with HPLC readout.
  • Closed-loop within a curated chemistry family.

Marginal / does not yet work

  • Multi-step synthesis with on-line correction.
  • Multi-property optimisation under conflicting objectives.
  • Cross-platform recipe portability.
  • Open-ended novelty discovery without curated candidate pools.
  • Self-debugging instruments.
  • Cross-domain transfer (catalysis ↔︎ batteries).

The honest 2026 verdict: autonomous labs are real research infrastructure, within their domain. They are not yet a general-purpose discovery engine.

35. The Minimum Viable Autonomous Loop in 2026

What you need

  • One synthesis platform you control end-to-end.
  • One measurement endpoint with a parser you trust.
  • A constrained, calibrated surrogate (§B + §D wrapping a U13 GP).
  • A workflow engine of choice (slide 28).
  • Audit trail per decision (slide 25).
  • A model card + dataset card per loop run (§F).

What that buys you

  • A loop that runs nights and weekends.
  • 3–5× throughput over manual screening within its chemistry domain.
  • Reproducible artefacts (logs, model card, run record) for publication.
  • Real research infrastructure — within the constraints of slide 34.
  • A platform to grow on as new chemistry / measurement modalities come online.

§F · Reproducibility and FAIR for Materials ML

36. FAIR for Materials ML Artefacts

FAIR principles, applied to ML

  • Findable: DOI for code, weights, dataset, run logs.
  • Accessible: public artefact registry (Zenodo, HuggingFace, MaterialsCloud).
  • Interoperable: standard formats (ASE, OPTIMADE, structured JSON for runs).
  • Reusable: licence, environment lock, deterministic seed, version-pinned dependencies.

ML artefacts that need FAIR-ification

  • Training dataset (immutable snapshot).
  • Model weights (versioned).
  • Training script + environment lock.
  • Calibration / conformal artefacts.
  • Run logs from each loop iteration.
  • Model card + dataset card (next slides) (Sandfeld et al. 2024).

37. Dataset Cards for Materials

What a dataset card answers

  • Provenance: which database, which DFT functional, which relaxation status, which version.
  • Coverage: which chemistry families, which composition ranges, which property ranges.
  • Known biases: selection bias toward stable phases, experimental-vs-computed mixture, duplication.

Splits and distribution

  • Random split numbers — the headline.
  • Chemistry-family LOCO numbers — the operational reality.
  • Time-stratified splits — drift signal.
  • OOD-slice numbers — distribution-shift stress test (Bishop 2006).

38. Model Cards for Materials Surrogates

Intended use vs out-of-scope

  • Intended: which property, which composition range, which chemistry families.
  • Out-of-scope: chemistry families absent from training, properties not predicted, environmental conditions not represented.
  • Performance metrics: random-split, LOCO, OOD slice — all three.

Trust artefacts

  • Calibration / conformal-coverage diagnostics per family.
  • Known failure modes (one or two named, with mitigation).
  • Reproduction artefact bundle (script + environment + data DOI).
  • Audit-trail format for downstream use (Neuer et al. 2024); skeletal card sketched below.
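
A skeletal model card as a plain Python dict (serialise to JSON or YAML for the artefact registry); every value here is an illustrative placeholder, not a recommendation:

```python
model_card = {
    "model": {"name": "hardness-gp-v3", "weights_doi": "<DOI>", "data_doi": "<DOI>"},
    "intended_use": {
        "property": "Vickers hardness",
        "chemistry_families": ["Ni-Co-Cr ternary alloys"],
    },
    "out_of_scope": ["oxides", "halides", "unseen processing routes"],
    "metrics_mae": {"random_split": "<x>", "loco": "<x>", "ood_slice": "<x>"},
    "conformal": {"alpha": 0.1, "coverage_by_family": {"Ni-Co-Cr": "<x>"}},
    "failure_modes": [
        {"mode": "silent extrapolation to the high-Cr corner",
         "mitigation": "OOD gate on latent kNN distance"},
    ],
}
```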

39. Benchmark Hygiene and the Materials Project Debates

Recent shortcut-learning findings

  • Several 2023–2025 papers documented shortcut learning on common Materials Project benchmarks.
  • Composition leakage: train and test sets share compositions through prototype duplication.
  • “MAE 30 meV/atom” headline numbers degrade markedly under chemistry-family LOCO splits.

Recommended 2026 practice

  • Chemistry-family LOCO as the primary evaluation.
  • Time-stratified holdouts to test drift.
  • At least one OOD slice (held-out chemistry the model was not trained on).
  • Report all three; do not report random-split alone. A LOCO sketch follows.
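
Chemistry-family LOCO is a few lines with scikit-learn’s `LeaveOneGroupOut`, assuming NumPy arrays `X`, `y`, a per-sample `groups` label (the chemistry family), and any `model` refitted from scratch per fold:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loco_mae(model, X, y, groups):
    """Leave-one-chemistry-family-out MAE, the operational number to report."""
    maes = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        model.fit(X[train_idx], y[train_idx])
        held_out = groups[test_idx][0]                 # the family left out of training
        err = np.abs(model.predict(X[test_idx]) - y[test_idx]).mean()
        maes[held_out] = err
    return maes                                        # report per family, plus the mean
```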

40. What a Reviewable Materials-ML Submission Looks Like in 2026

The minimum bundle

  1. Dataset card (slide 37).
  2. Model card (slide 38).
  3. Training script + environment lock (slide 36).
  4. Random-split + LOCO + OOD numbers (slide 39).
  5. Conformal-coverage table per chemistry family (slide 22).

The story bundle

  1. One named failure mode + one mitigation, in the paper text.
  2. (If autonomous-lab work) audit-trail summary; loop iterations; one named loop failure with post-mortem.
  3. Code + data DOIs in the paper, not just on a website that may rot.

The reviewer’s checklist: can I reproduce the model from the artefacts? Can I reproduce the numbers? Can I trust the OOD claim? If any answer is “no,” the paper needs revision.

§G · 2026 Outlook

41. Foundation MLIPs as the New Substrate

The 2024–2026 emergence

  • MACE-MP-0 (Batatia et al. 2024): equivariant GNN trained on Materials Project relaxation trajectories.
  • CHGNet (Deng et al. 2023): charge-informed GNN for transition-metal chemistry.
  • ORB (Orbital Materials, 2024): general-purpose ML potential.
  • GNoME (Merchant et al. 2023) and follow-ups: large-scale MLIP-driven discovery.

What they share

  • Trained on \(10^6\)–\(10^8\) DFT structures.
  • Transferable across most of the periodic table.
  • Replace DFT for relaxation / dynamics in the chemistry space they cover.
  • Open weights; reproducible. The first open-source materials substrate at this scale.

42. What Foundation MLIPs Change

Cheap energy evaluation

  • \(10^4\)–\(10^6\) structures per GPU-hour with a foundation MLIP.
  • DFT: 1–10 structures per GPU-hour.
  • A 1000× speedup. Real.

Bottleneck shift

  • The bottleneck used to be DFT energy evaluation.
  • Now it is synthesis + measurement.
  • Discovery loop economics: spend the saved compute on more measurement, not more compute.
  • Multi-fidelity AL (slide 43) becomes the natural framing.

43. Multi-Fidelity Active Learning

Three fidelities in 2026 MG

  • Cheap: foundation MLIP inference. ~1 ms / structure.
  • Medium: DFT (PBE / SCAN / r²SCAN). ~1 hour / structure.
  • Expensive: experiment. Days / measurement.

Routing the query

  • Multi-fidelity GP / BO: model the joint over fidelities.
  • For each candidate, compute expected information gain per cost at each fidelity.
  • Query the cheapest fidelity that pays.
  • Save expensive measurement for candidates the cheap fidelities cannot resolve.

The right framing is not “DFT or experiment” — it is “spend each budget where it pays” (Murphy 2012).
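
A sketch of the router; `info_gain` is whatever per-candidate expected-information-gain estimate the multi-fidelity model exposes (an assumption, not a library API), and the costs echo slide 44:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Fidelity:
    name: str
    cost_eur: float                       # per query: MLIP ~0.001, DFT ~1, experiment ~100+
    info_gain: Callable[[object], float]  # expected information gain for a candidate

def route(candidate, fidelities, min_gain_per_euro=0.0):
    """Send the query to the fidelity with the best information gain per euro."""
    best = max(fidelities, key=lambda f: f.info_gain(candidate) / f.cost_eur)
    if best.info_gain(candidate) / best.cost_eur <= min_gain_per_euro:
        return None                       # nothing pays: skip this candidate
    return best.name
```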

44. Closing-the-Loop Economics

Cost per query in 2026 (order of magnitude)

  • Foundation MLIP: \(\sim\) €0.001 per structure (a fraction of a cent).
  • DFT calculation: \(\sim\) €1 per structure (compute + storage).
  • Experiment: \(\sim\) €100–10000 per measurement (consumables, beam time, instrument time, person time).

The economic logic

  • Each experiment saved by a calibrated surrogate is worth \(\sim\) €100–10000.
  • Each MLIP / DFT calculation costs \(\sim\) €0.001–€1.
  • Spending €1000 of compute to save one €1000 experiment breaks even.
  • Spending €1000 of compute to save ten €1000 experiments is the loop’s economic basis.

45. Open Problems

Methodological open problems

  • Cross-domain transfer (catalysis ↔︎ batteries ↔︎ structural alloys).
  • Honest uncertainty under distribution shift at scale.
  • Recipe portability across platforms.
  • Novelty verification without human review.
  • Reproducibility of multi-month autonomous campaigns.

Infrastructural open problems

  • Standardised recipe representation across inorganic / organic / polymer.
  • Cross-vendor instrument APIs.
  • Data-format interoperability (parsers, schemas, units).
  • Long-term stewardship of campaign databases (5+ year horizon).
  • Cost of the integration effort itself (the social problem).

46. Where MG Is, Mid-2026

Maturity ladder, by component

  • Substrate (data + foundation MLIPs): mature.
  • Modelling (surrogates + UQ): mature in-domain, fragile OOD.
  • Loop infrastructure: working in single domains; portability open.
  • End-to-end autonomous discovery of genuinely novel materials: open. The next five years.

Reading the field honestly

  • The hype is real for the substrate; over-stated for the discovery claims.
  • A working materials-ML stack in 2026 is real research infrastructure — within slide 34’s caveats.
  • The next five years are the integration / cross-domain era. Your career.

§H · Course Wrap

47. The MG Syllabus Arc in One Slide

The arc, walked once more

  • Physics (U2–U4): QM postulates, electronic structure, thermo, atomistic simulation.
  • Representations (U6–U7): graphs, local atomic environments, descriptors.
  • Models (U8–U10): regression, neural networks, learned representations.
  • Geometry (U11–U12): latent spaces, clustering, discovery vs labelling.

The arc’s destination

  • Decision (U13–U14): UQ, BO, constraints, trust, autonomous loops.
  • Each unit served the next.
  • This unit served all of them.
  • The integration story is the test of whether the rest taught anything operational.

48. The Four Big Skills

Choose and Train

  1. Choose a representation with the right invariances for the property (U6–U7, U10, §B).
  2. Train a surrogate with calibrated uncertainty and a defensible split protocol (U8, U13, §F).

Plan and Close

  1. Plan an acquisition that respects feasibility, OOD coverage, and budget (U13, §B, §D).
  2. Close the loop with reviewable artefacts (model card, dataset card, run log) (§E, §F).

If you can do all four end-to-end on a chemistry domain you care about, you are an MG practitioner. That is what this course taught.

49. Reading List for Going Further

The course textbooks

  • (Sandfeld et al. 2024) — materials data science from the engineering perspective. The most practically useful single book.
  • (Neuer et al. 2024) Ch6–Ch7 — physics-informed and explainable methods at engineering depth.
  • (Bishop 2006) §9 — clustering, EM, foundation for U12.
  • (Murphy 2012) — for probabilistic depth on UQ and BO.
  • (Goodfellow et al. 2016) Ch14 — autoencoders, the basis of U10–U12.

Beyond MG itself

  • MFML W13 — full PINN tutorial.
  • MFML W14 — generic explainability and trust.
  • ML-PC W14 — autonomous characterisation reflection.
  • 2024–2026 review papers on autonomous labs (Aspuru-Guzik, A-Lab follow-ups, GNoME), MatBench reform, foundation MLIPs.
  • Open-source stacks: AiiDA, BoTorch, MACE-MP, HuggingFace Hub for materials.

50. Exam, Questions, and End of MG

Exam scope

  • The four big skills (slide 48), each via one operational scenario.
  • Worked examples drawn from U8, U10, U13, U14.
  • Vocabulary: representation, invariance, kernel, posterior, conformal, OOD, feasibility, audit trail.
  • Open-book in spirit: bring one A4 cheat sheet (handwritten only).

The exam rubric

  • Bring to every answer: a feasibility filter, a conformal wrapper, a defensible split (LOCO or time), a named failure mode, a mitigation.
  • That is the rubric. State the constraint. State the trust signal. State the failure mode. State the mitigation.
  • Thank you. Questions.
Batatia, Ilyes et al. 2024. “A Foundation Model for Atomistic Materials Chemistry.” arXiv preprint.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
Deng, Bowen et al. 2023. “CHGNet as a Pretrained Universal Neural Network Potential for Charge-Informed Atomistic Modelling.” Nature Machine Intelligence 5: 1031–41.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Leeman, J. et al. 2024. “Challenges in High-Throughput Inorganic Materials Prediction and Autonomous Synthesis.” PRX Energy 3: 011002.
Merchant, Amil et al. 2023. “Scaling Deep Learning for Materials Discovery.” Nature 624: 80–85.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.
Sandfeld, Stefan et al. 2024. Materials Data Science. Springer.
Szymanski, Nathan J. et al. 2023. “An Autonomous Laboratory for the Accelerated Synthesis of Novel Materials.” Nature 624: 86–91.
