Machine Learning for Characterization and Processing
Unit 11: Uncertainty-aware regression & Gaussian Processes

AI 4 Materials / KI-Materialtechnologie

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


01. From MFML theory to lab practice

MFML W12 recap — we use these, we do not re-derive them

| Method | One-line summary | When to reach for it |
|---|---|---|
| Gaussian Process | Closed-form Bayesian regression over functions | Small \(n\) (\(\lesssim 10^3\)), tabular, smooth response, need calibrated CI |
| MC Dropout | Keep dropout on at inference, sample \(T\) passes | Big NN already trained, cheap epistemic estimate per pixel/voxel |
| Deep ensembles | Train \(M\) independent NNs, use disagreement | Best-calibrated NN UQ; budget for \(M\times\) training |
| MDN | NN outputs \((\pi_k, \mu_k, \sigma_k)\) of a Gaussian mixture | Multi-modal output (phase A or phase B from the same input) |
| Calibration | Reliability diagram + temperature scaling | Mandatory before any deployed model |

Note

We are not re-deriving the math. See MFML W12 for posteriors, ELBO, marginal likelihood. Today: which tool, on which lab task, with which numbers.

Why UQ matters in characterization & processing labs

  • One wrong tensile-strength call on a structural part: recall, scrap, or worst case a failure in service.
  • One missed crack/pore in an SEM screen: a defective batch ships.
  • A point estimate without a confidence band is not a deliverable to a lab manager or a certifying body.
  • Engineering decisions are threshold decisions — accept/reject, retest/release, explore/exploit.
  • A threshold needs a distribution, not a number.

Note

Trust = prediction + calibrated confidence. "Calibrated" is the word most published materials-ML papers skip.

Two pain points unique to materials labs

(a) Tool / operator / coating domain shift

  • The training set was acquired on SEM #1 with operator A and a 5 nm Au coating.
  • Inference happens on SEM #2, operator B, no coating, on Tuesday after a chamber vent.
  • Your softmax does not know any of that. Your uncertainty had better grow.

(b) Experiments cost €1k+/h

  • Analytical S/TEM, AM build chambers, Gleeble dilatometry: each new label is hours of instrument time and days of prep.
  • UQ is not a publication ornament — it is the input to the next experiment: active learning only works if uncertainty is honest and spatially resolved.

02. Picking a UQ method per lab task

Decision table — task → method → cost

| Lab task | Recommended UQ | Rationale | Cost driver |
|---|---|---|---|
| Tabular regression, \(n \in [10, 300]\) (composition \(\to\) property) | GP, RBF or Matérn \(\nu{=}5/2\) | Closed-form CI, smooth response, interpretable hyperparameters | \(O(N^3)\) once — fine for \(N \lesssim 10^3\) |
| Pixel-wise segmentation of microscopy (CNN, U-Net) | MC Dropout, \(T \approx 30\) | Reuse trained net, get per-pixel variance map | \(T\times\) inference per image |
| High-stakes property regression with budget for retraining | Deep ensemble, \(M \in [5, 10]\) | Best calibration in literature (Lakshminarayanan et al. 2017) | \(M\times\) training |
| Multi-modal output (one input, two phases possible) | MDN, \(K \in \{2,3\}\) | Bimodal \(p(y \mid x)\) — mean is meaningless | One training, harder to fit |
| Any deployed model | Reliability diagram + temp scaling | Free, post-hoc, on a held-out cal set | Trivial |

Materials-specific computational budgets

The training-time vs inference-time tradeoff is more loaded in a lab than in web-scale ML:

  • Deep ensemble — cheap at inference (run \(M\) small NNs, average), expensive at training (\(M\times\) wall-clock + carbon). Acceptable when training is one-off and the model is shipped to many labs.
  • MC Dropout — cheap at training (one network), expensive at inference (\(T\times\) forward passes). Acceptable on a benchtop where you process tens of images per day; not acceptable for a real-time inline detector.
  • GP — both small at the scales we operate (\(N \lesssim 1000\)). Recompute per dataset; trivially cheap at inference.

Note

MC Dropout’s variance estimate degrades with very deep nets and very low dropout rates — it can collapse to near-zero variance and look overconfident. Always validate with a held-out reliability diagram.

What not to do

  • Do not quote raw softmax probability as “model confidence” — modern deep nets are systematically overconfident; a 0.97 softmax on an out-of-distribution micrograph is meaningless (Guo et al. 2017).
  • Do not report the variance of a single ensemble member’s Monte-Carlo dropout passes and call it ensemble uncertainty — you are mixing two distinct UQ techniques and double-counting.
  • Do not report a GP fit without showing the prior (kernel choice, length-scale prior). The kernel is the model. Hiding it makes the CI uninterpretable.
  • Do not report any UQ number without a reliability diagram on a held-out calibration set.

03. Case study A — GP for process\(\to\)property mapping

Setup: 21CrMoV5-7 quench-and-temper

  • Steel grade DIN 1.7709 / 21CrMoV5-7 (Mantzoukas et al. 2021).
  • Process: austenitize 960 °C, oil quench, temper 2 h at variable \(T_{\text{temper}}\).
  • Input: \(T_{\text{temper}} \in [200, 700]\) °C.
  • Output: hardness \(\text{HRC}(T_{\text{temper}})\).
  • Real published dataset, \(\sim 10\text{–}30\) specimens per condition with measurement scatter.

Why this is a textbook GP problem:

  • \(n\) small, \(d{=}1\) input.
  • Response monotonically softens with \(T\) but is non-linear (carbide coarsening kinetics).
  • We need a CI to skip half of the destructive tests.

Why GP fits this lab task

  • \(n \sim 30\): too small for a deep net to give honest uncertainty, plenty for a GP.
  • Output is smooth in input — RBF or Matérn \(\nu{=}5/2\) is the right inductive bias.
  • The GP delivers what the lab manager wants: \(\hat{y}(T) \pm 2\sigma(T)\), narrowing at sampled \(T\), widening between.
  • Hyperparameters \((\ell, \sigma_f, \sigma_n)\) are fit by maximizing the log marginal likelihood — see MFML W12 for the formula. We use the result.

Kernel choice — what the hyperparameters mean physically

\[ k_{\text{RBF}}(T, T') = \sigma_f^2 \exp\!\left(-\frac{(T - T')^2}{2\,\ell^2}\right) \]

  • Length scale \(\ell\): the “characteristic process variability” on the temperature axis.
    • \(\ell \approx 50\) °C → response varies sharply with temper temperature; carbide kinetics regime-change is captured.
    • \(\ell \approx 300\) °C → response is essentially a slow trend; we are oversmoothing.
  • Signal variance \(\sigma_f^2\): amplitude of the HRC variation across the explored window. Read off the data range.
  • Noise variance \(\sigma_n^2\): instrument + specimen scatter at fixed \(T\). Estimate it from replicates; do not let the optimizer absorb model misfit into noise. (Fitting sketch below.)
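
A minimal fitting sketch, assuming scikit-learn; the anchor tempers and HRC values are illustrative stand-ins, not the published dataset:

```python
# Hedged sketch: GP fit of temper temperature -> hardness with scikit-learn.
# Data values below are illustrative, not the Mantzoukas et al. dataset.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

T = np.array([[300.0], [450.0], [600.0]])   # sampled tempers (deg C)
y = np.array([52.0, 41.0, 29.0])            # mean HRC at each temper

# sigma_f^2 * Matern(nu=5/2) + sigma_n^2. Tight noise bounds around the
# replicate estimate (~0.8 HRC) stop the optimizer from absorbing model
# misfit into the noise term.
kernel = (
    ConstantKernel(1.0, (1e-2, 1e3))
    * Matern(length_scale=60.0, length_scale_bounds=(10.0, 300.0), nu=2.5)
    + WhiteKernel(noise_level=0.8**2, noise_level_bounds=(0.3**2, 1.5**2))
)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp.fit(T, y)   # hyperparameters by maximizing the log marginal likelihood

T_query = np.linspace(200, 700, 101).reshape(-1, 1)
mu, sigma = gp.predict(T_query, return_std=True)
# Report mu +/- 2*sigma: tight at 300/450/600 deg C, widening in the gaps.
```

The fixed noise bounds are the code-level version of "estimate \(\sigma_n\) from replicates, do not let the optimizer absorb misfit into noise."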

Note

For metallurgical responses with regime changes, prefer Matérn \(\nu{=}5/2\) over RBF — it is twice mean-square differentiable instead of \(C^\infty\), which matches the physics better.

The fit — described

  • Posterior mean: a smooth softening curve from \(\sim 55\) HRC at 200 °C to \(\sim 25\) HRC at 700 °C.
  • 95% CI ribbon: tight (\(\pm 1\) HRC) at sampled tempers (300, 450, 600 °C), widens to \(\pm 4\) HRC in the gaps.
  • The reliability of the CI is checked on a leave-one-out CV reliability diagram — and only then do we trust the ribbon.

Numerical example. With \(n{=}30\), \(\sigma_n \approx 0.8\) HRC (from replicates), \(\ell \approx 60\) °C: the GP posterior at \(T{=}500\) °C (a held-out point) gives \(\hat{\text{HRC}} = 38.2 \pm 1.6\) (2\(\sigma\)). Spec sheet says 36–40 — we just skipped a destructive test.

When to trust extrapolation — and when not to

  • The GP variance grows back toward \(\sigma_f^2\) as we move away from data.
  • For 21CrMoV5-7: ribbon balloons outside \([200, 700]\) °C — the model is honestly saying “I have not seen tempers below 200 or above 700, my prediction here is essentially the prior.”
  • This is the right answer. Do not clip the variance, do not force-fit a parametric extrapolation. If you need predictions at 750 °C, run the experiment.

Note

The CI growth is only honest if the kernel is correct. A too-long \(\ell\) will make the GP overconfident outside the data. Always cross-check with a held-out CV reliability diagram before you trust extrapolation.

What this enables — direct ROI

  • Pre-GP workflow: 6 destructive tensile + hardness tests per heat treatment recipe, 4 recipes, 24 tests at €200/test → €4 800.
  • Post-GP workflow: 3 anchor tests + GP interpolation. CI verified, sufficient for spec compliance on standard heats. 12 tests, €2 400.
  • Saved per heat-treatment campaign: ~€2 400. Across a year of campaigns: into the tens of thousands.
  • Cost of the GP: half a day of analyst time, no infrastructure.
  • Caveat: the GP does not replace verification at the spec extremes. It replaces redundant tests in the smooth interior of the process window.

Modern small-tabular alternative: TabPFN

  • TabPFN (Hollmann et al. 2025) is a transformer pre-trained on millions of synthetic tabular tasks to do in-context prediction — no per-task fitting.
  • Pass your \(\sim 30\) rows + new query → it returns a calibrated posterior predictive in one forward pass.
  • 2025 version (v2) handles up to \(\sim 10\,000\) rows and is competitive with tuned XGBoost on small-tabular benchmarks.

When to reach for TabPFN over a GP on the 21CrMoV5-7 task:

  • \(d > 1\): multiple input features (temper \(T\), hold time, prior austenite grain size). GP kernel design gets hard, TabPFN ingests it.
  • Mixed continuous + categorical inputs (alloy family, oil vs gas quench). Out of the box.
  • You want a baseline in 10 lines without picking a kernel or running marginal-likelihood optimisation (sketch below).
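
That baseline, as a hedged sketch: it assumes the public `tabpfn` package exposes a scikit-learn-style `TabPFNRegressor` (verify against the installed version), and the data below is a synthetic stand-in for the lab rows:

```python
# Hedged sketch of the TabPFN baseline. API assumption: the `tabpfn`
# package (v2) provides a scikit-learn-style TabPFNRegressor.
import numpy as np
from tabpfn import TabPFNRegressor  # pip install tabpfn

rng = np.random.default_rng(0)
# ~30 synthetic rows of (temper T in deg C, hold time in h, grain size in um)
X = rng.uniform([200.0, 1.0, 5.0], [700.0, 4.0, 50.0], size=(30, 3))
y = 62.0 - 0.05 * X[:, 0] + rng.normal(0.0, 0.8, size=30)  # fake HRC

model = TabPFNRegressor()   # no kernel choice, no marginal-likelihood fit
model.fit(X, y)             # in-context: the rows become the "prompt"
y_hat = model.predict(np.array([[500.0, 2.0, 20.0]]))
```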

When the GP still wins:

  • 1-D smooth process variable + interpretable hyperparameters (length scale = physical correlation length).
  • You need the posterior closed-form for downstream optimisation (BO acquisition functions, gradient w.r.t. inputs).
  • You need to encode prior physics (Matérn smoothness, periodic kernels for cycling processes).

Note

On the 21CrMoV5-7 task TabPFN matches the GP’s leave-one-out RMSE within \(\sim 0.3\) HRC; the GP wins on interpretability of \(\ell\) and \(\sigma_f\). Use whichever your stakeholder will sign off on.

04. Case study B — MC Dropout for SEM defect segmentation

Setup: U-Net on SEM micrographs

  • Task: per-pixel segmentation of SEM micrographs into {matrix, porosity, crack, inclusion}.
  • Architecture: U-Net with dropout layers (\(p \approx 0.2\)) in the bottleneck and decoder.
  • Training data: \(\mathcal{O}(10^3)\) labeled tiles, hand-segmented by an expert. Evaluation set drawn from a different SEM session to expose tool drift (Modarres et al. 2017).
  • The point estimate (argmax over softmax) is fine on the training distribution and bad on the deployment distribution. We need a per-pixel uncertainty map to flag the bad regions.

MC Dropout in practice

  • Training: standard, dropout active.
  • Inference: keep dropout on, run \(T = 30\) stochastic forward passes per image.
  • Per pixel \(i\): collect \(\{p_i^{(t)}\}_{t=1}^{T}\) — softmax distributions over classes.
  • Predictive mean: \(\bar{p}_i = \tfrac{1}{T}\sum_t p_i^{(t)}\) → argmax for the class label.
  • Per-pixel predictive entropy: \(H_i = -\sum_c \bar{p}_{i,c} \log \bar{p}_{i,c}\) — the uncertainty map.
  • \(T{=}30\) is the typical knee — below 10 the variance estimate is too noisy, above 50 you are paying for diminishing returns. Validate \(T\) on a held-out reliability diagram. (Inference sketch below.)
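
Assuming a trained PyTorch model with `nn.Dropout` layers, the loop looks like this (function name and shapes illustrative):

```python
# Hedged sketch of T=30 MC-dropout inference for a segmentation net.
import torch

def mc_dropout_predict(model, image, T=30):
    model.eval()
    # Re-enable only the dropout modules; BatchNorm stays in eval mode.
    for m in model.modules():
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d)):
            m.train()
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(image), dim=1) for _ in range(T)]
        )                                   # (T, B, C, H, W)
    p_bar = probs.mean(dim=0)               # predictive mean, (B, C, H, W)
    labels = p_bar.argmax(dim=1)            # per-pixel class, (B, H, W)
    entropy = -(p_bar * p_bar.clamp_min(1e-12).log()).sum(dim=1)
    return labels, entropy                  # entropy = the uncertainty map
```

Flipping only the dropout modules back to train mode, rather than calling `model.train()`, keeps the BatchNorm running statistics frozen at inference.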

Per-pixel uncertainty maps — what they look like

  • Low entropy in the bulk matrix and inside large, well-formed pores: easy classification.
  • High entropy where you would expect:
    • Grain-boundary triple junctions — class boundary on the image.
    • Edge artifacts of the field of view.
    • Charging zones — bright halos around insulators.
    • Tile edges where the U-Net’s receptive field is incomplete.
  • These are diagnostic: the model is honestly uncertain exactly where a human operator would also hesitate.

Reject-for-human-review threshold

  • Pick a per-pixel entropy threshold \(\tau\). Pixels with \(H_i > \tau\) are flagged for operator review; the rest are auto-classified.
  • Sweep \(\tau\) to draw the operating curve: (human-review rate) on the x-axis vs (defect recall) on the y-axis.
  • Typical knee on a benchtop SEM benchmark: at 5% review rate we recover >98% of defect pixels; at 1% review we drop to ~92%.
  • The threshold is a business decision, not an ML decision — it depends on the cost of a missed defect vs the cost of operator time. (Sweep sketch below.)
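
Assuming flattened per-pixel arrays from a labeled validation set, and the convention that flagged pixels are corrected by the reviewer, the sweep is a few lines (class indices illustrative):

```python
# Hedged sketch of the tau sweep: review-rate vs defect-recall curve.
import numpy as np

def operating_curve(entropy, pred, truth, taus, defect_classes=(1, 2, 3)):
    is_defect = np.isin(truth, defect_classes)
    points = []
    for tau in taus:
        flagged = entropy > tau                    # routed to the operator
        correct_auto = (~flagged) & (pred == truth)
        recovered = flagged | correct_auto         # defect pixels not missed
        recall = (recovered & is_defect).sum() / max(is_defect.sum(), 1)
        points.append((flagged.mean(), recall))    # (review rate, recall)
    return points
```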

Tool-shift calibration

  • The reliability diagram on SEM #1 (training tool) is well-calibrated.
  • The reliability diagram on SEM #2 (different detector, different bias) is not — the model is overconfident on a slightly different contrast distribution.
  • Practical fix: per-tool temperature scaling. Collect a small (~50 image) calibration set on the new tool, fit a single temperature scalar to recalibrate (sketch below). Cheap, post-hoc, no retraining.
  • Re-run calibration after any: detector swap, gun replacement, sample-coating change, large chamber vent.
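
A minimal sketch of the temperature fit, assuming tensors of logits and labels cached from the frozen model on the new-tool calibration set:

```python
# Hedged sketch of per-tool temperature scaling (Guo et al. 2017).
# logits: (N, C) tensor, labels: (N,) tensor, both from the frozen model.
import torch

def fit_temperature(logits, labels):
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so T > 0
    opt = torch.optim.LBFGS([log_t], max_iter=100)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        opt.zero_grad()
        loss = nll(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()  # divide future logits by this before softmax
```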

Note

“My segmentation accuracy dropped after the chamber vent” is a calibration failure as often as a model failure. Diagnose with a reliability diagram before retraining.

05. Case study C — Active learning loop for AM process windows

Setup: laser powder-bed fusion process map

  • Process: laser powder-bed fusion (L-PBF) on a benchtop printer, single material.
  • Two-axis design space: laser power \(P \in [80, 350]\) W and scan speed \(v \in [200, 1500]\) mm/s.
  • Goal: identify the process window — the region in \((P, v)\) where relative density \(\rho_{\text{rel}} > 0.995\) and no melt-pool collapse is observed.
  • Each experiment (print a \(\sim 5\) mm cube, cross-section it, measure \(\rho_{\text{rel}}\)) costs about €300 in materials, machine time, and metallography.

The active-learning loop

  • GP surrogate: \(\rho_{\text{rel}}(P, v)\) with a 2-D RBF or Matérn kernel, separate length scales \(\ell_P, \ell_v\).
  • Acquisition function:
    • Upper Confidence Bound: \(\alpha_{\text{UCB}}(P,v) = \mu(P,v) + \beta\,\sigma(P,v)\).
    • Expected Improvement: \(\alpha_{\text{EI}}\) favours points likely to beat the current best.
  • Choose the next experiment to maximize \(\alpha\), trading off mean (exploitation) against variance (exploration) (Hernández-Lobato et al. 2014); see the sketch below.
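
Assuming `gp` is a scikit-learn GaussianProcessRegressor already fitted on the densities measured so far, and `beta = 2.0` as an illustrative exploration weight:

```python
# Hedged UCB sketch on a candidate grid over the (P, v) design space.
import numpy as np

P_grid = np.linspace(80.0, 350.0, 50)      # laser power (W)
v_grid = np.linspace(200.0, 1500.0, 50)    # scan speed (mm/s)
PP, VV = np.meshgrid(P_grid, v_grid)
grid = np.column_stack([PP.ravel(), VV.ravel()])   # (2500, 2) candidates

mu, sigma = gp.predict(grid, return_std=True)
ucb = mu + 2.0 * sigma                     # exploitation + exploration
next_experiment = grid[np.argmax(ucb)]     # (P*, v*) to print next
```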

Closed loop with the printer + safety constraints

  • Hard constraints the acquisition cannot violate:
    • \(P/v\) ratio — prevent keyhole regime that damages optics.
    • Absolute caps — machine specs.
    • “No-go” zones from prior failures — flagged manually.
  • Constrained Bayesian optimization: \(\max_{(P,v)} \alpha(P,v)\) subject to \(g_k(P,v) \leq 0\).
  • Implementation: rejection sampling on the candidate set, or a second GP modeling the constraint probability.
  • Budget cap: stop the loop after \(N_{\max} = 30\) experiments or when the process-window area stabilizes — whichever comes first. (Constraint-masking sketch below.)
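
The rejection-sampling variant, reusing `grid` and `ucb` from the previous sketch; the \(P/v\) cap and no-go box below are placeholders, not real machine specs:

```python
# Hedged constraint-masking sketch: drop infeasible candidates, then argmax.
import numpy as np

def feasible_mask(grid, pv_ratio_max=0.4, no_go_boxes=()):
    P, v = grid[:, 0], grid[:, 1]
    ok = (P / v) <= pv_ratio_max                 # keyhole-regime guard
    for p_lo, p_hi, v_lo, v_hi in no_go_boxes:   # manually flagged failures
        ok &= ~((P >= p_lo) & (P <= p_hi) & (v >= v_lo) & (v <= v_hi))
    return ok

mask = feasible_mask(grid, no_go_boxes=[(320.0, 350.0, 200.0, 400.0)])
next_experiment = grid[np.argmax(np.where(mask, ucb, -np.inf))]
```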

Result: pareto frontier of effort vs window discovered

  • Plot: x-axis = number of experiments run, y-axis = area (in \((P, v)\) units) classified with \(P[\rho_{\text{rel}} > 0.995] > 0.9\).
  • Active-learning curve climbs steeply early — UCB / EI route experiments straight to the boundary of the process window.
  • Grid-search baseline: \(14 \times 14 = 196\) experiments to span the design space at the same resolution.
  • Typical result on this kind of L-PBF problem: ~30 active-learning experiments find the process-window area that grid-search needs ~200 for — a roughly 6–7× reduction in experimental cost (Hernández-Lobato et al. 2014).
  • At €300/experiment: €51 000 saved per material.

Materials-acceleration-platform framing

  • This loop — surrogate model + acquisition + closed-loop instrument + safety constraints — is the prototype of a self-driving lab.
  • Aspuru-Guzik and Berlinguette groups have built versions of this loop for catalysis and thin-film electrochemistry. Same pattern, different instrument (Hernández-Lobato et al. 2014).
  • The only ingredient that makes this work is honest UQ. With overconfident or miscalibrated \(\sigma\), the acquisition function picks the wrong next experiment and you waste your budget. UQ is not a slide at the end — it is the engine.

06. Calibration & deployment hygiene

Reliability diagrams on lab data

  • Bin predictions into 10 confidence bins. For each bin, compare predicted confidence vs observed accuracy (or coverage of the CI).
  • Perfect calibration: diagonal.
  • Above the diagonal: under-confident. Below: over-confident (the dangerous failure mode).
  • Where typical materials models break:
    • Out-of-distribution alloy family — model is highly confident, accuracy collapses.
    • Different microscope / coating / operator — softmax stays at 0.95+, accuracy drops 20 points.
    • Long-tail rare defect classes — confidence is high on the wrong class.
  • Always run a reliability diagram on a held-out, lab-realistic calibration set — not a random split of the training data. (Binning sketch below.)
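
A minimal binning sketch, producing the diagram points and the scalar expected calibration error (ECE); `conf` and `correct` are 1-D numpy arrays over the calibration set:

```python
# Hedged sketch: reliability-diagram bins plus ECE.
import numpy as np

def reliability_bins(conf, correct, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    points, ece = [], 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()       # observed accuracy
            avg_conf = conf[in_bin].mean()     # stated confidence
            ece += in_bin.mean() * abs(acc - avg_conf)
            points.append((avg_conf, acc))
    return points, ece   # plot acc vs conf; the diagonal is calibration
```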

OOD detection — when the model sees something new

  • Symptoms: confidence is high, prediction is wrong. Calibration alone cannot save you — you need to detect the OOD case and refuse to predict.

Practical detectors:

  • Mahalanobis distance in feature space (penultimate-layer activations vs the training-set Gaussian) — cheap, surprisingly effective on microscopy.
  • Ensemble disagreement — for a deep ensemble, high prediction variance across members \(\Rightarrow\) OOD. Free if you already have an ensemble.
  • GP variance — for GP surrogates, \(\sigma^2 \to \sigma_f^2\) flags inputs far from training. This is the same mechanism as the AL loop in §5.
  • Workflow: train your model, fit a Mahalanobis detector on training features, reject any inference with detector score above a threshold and route to a human.
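
A sketch of that workflow, assuming the penultimate-layer activations have already been extracted as numpy arrays (`train_feats` of shape (N, D), `feat` of shape (D,)):

```python
# Hedged sketch of the Mahalanobis OOD detector on feature activations.
import numpy as np

def fit_mahalanobis(train_feats):
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False)
    prec = np.linalg.pinv(cov + 1e-6 * np.eye(cov.shape[0]))  # regularized
    return mu, prec

def mahalanobis_score(feat, mu, prec):
    d = feat - mu
    return float(np.sqrt(d @ prec @ d))   # reject if above tau_OOD

# mu, prec = fit_mahalanobis(train_feats)
# if mahalanobis_score(feat, mu, prec) > tau_ood: route to a human
```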

Conformal prediction — distribution-free coverage in 5 lines

Split conformal (Angelopoulos and Bates 2023). Given any pre-trained predictor \(\hat{f}\):

  1. Hold out a fresh calibration set of size \(n_{\text{cal}}\).
  2. Compute non-conformity scores \(s_i = |y_i - \hat{f}(x_i)|\) on it.
  3. Take the empirical quantile \(\hat{q} = \text{Quantile}_{\lceil(n_{\text{cal}}+1)(1-\alpha)\rceil/n_{\text{cal}}}(s)\).
  4. At test time, emit the interval \([\hat{f}(x) - \hat{q},\ \hat{f}(x) + \hat{q}]\).
  5. Theorem. Marginal coverage \(\geq 1-\alpha\), finite sample, distribution-free — no Gaussianity, no calibration assumption.

\[ \mathbb{P}\big[\,Y_{\text{test}} \in C(X_{\text{test}})\,\big] \;\geq\; 1 - \alpha \]
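
The five steps as literal code, a minimal sketch that wraps any mean predictor `f_hat` (GP mean, ensemble mean, TabPFN):

```python
# Hedged split-conformal wrapper (Angelopoulos and Bates 2023).
import numpy as np

def split_conformal(f_hat, x_cal, y_cal, alpha=0.05):
    scores = np.abs(y_cal - f_hat(x_cal))               # non-conformity
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q_hat = np.quantile(scores, level, method="higher")  # conservative quantile

    def interval(x):
        pred = f_hat(x)
        return pred - q_hat, pred + q_hat   # marginal coverage >= 1 - alpha
    return interval
```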

Why this lands in materials labs

  • Works as a wrapper around the GP, MC-Dropout, ensemble, or TabPFN model you already trust for the mean prediction.
  • The coverage guarantee is finite-sample — even on the \(n \sim 30\) specimens of §3.
  • No assumption that the residuals are Gaussian — robust to the heavy tails real lab data has.

Note

The only assumption is exchangeability of calibration and test data. Under distribution shift (new alloy family, new microscope) this breaks — width must grow or coverage drops silently.

Adaptive widths — Conformalized Quantile Regression (CQR)

  • Problem with vanilla split conformal. The interval has constant width \(2\hat{q}\) everywhere, regardless of how noisy the local process is. In labs, noise is heteroscedastic: scatter grows in two-phase regions, near the keyhole boundary, at low temper.

Conformalized Quantile Regression (CQR) (Romano et al. 2019):

  1. Train a quantile regressor for \(\alpha/2\) and \(1-\alpha/2\) levels (e.g., a quantile NN with pinball loss, or quantile forests).
  2. Compute non-conformity \(s_i = \max\{\hat{q}_{\alpha/2}(x_i) - y_i,\, y_i - \hat{q}_{1-\alpha/2}(x_i)\}\).
  3. Wrap: \([\hat{q}_{\alpha/2}(x) - \hat{Q},\ \hat{q}_{1-\alpha/2}(x) + \hat{Q}]\) with \(\hat Q\) the empirical quantile of \(s_i\).
  4. Marginal coverage \(\geq 1 - \alpha\), and widths now adapt to local noise (sketch below).
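
A minimal CQR sketch, using scikit-learn gradient boosting with pinball (quantile) loss as the quantile regressor; any quantile model can stand in:

```python
# Hedged CQR sketch (Romano et al. 2019); alpha = 0.1 gives 90% intervals.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_cqr(X_tr, y_tr, X_cal, y_cal, alpha=0.1):
    lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2)
    hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2)
    lo.fit(X_tr, y_tr)
    hi.fit(X_tr, y_tr)
    # Non-conformity: how far y falls outside the quantile band
    s = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))
    n = len(s)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    Q = np.quantile(s, level, method="higher")

    def interval(X):
        return lo.predict(X) - Q, hi.predict(X) + Q   # width adapts locally
    return interval
```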

Materials picture. On an L-PBF process map (laser power vs scan velocity), CQR widens the predicted-hardness interval inside the keyhole-onset band and tightens it in the safe interior — automatically. Vanilla split conformal would put the same interval everywhere.

When to use which

| Setting | Use |
|---|---|
| Homoscedastic, in-control process | Split conformal (cheapest) |
| Heteroscedastic / regime-dependent noise | CQR |
| Online streaming with drift | Adaptive conformal (Gibbs & Candès 2021) |
| Safety-critical, regulator-facing | Any conformal + held-out coverage report |

Reproducibility hygiene

  • Random seed logged for: data split, model init, dropout sampling, ensemble members, BO acquisition.
  • Model card alongside the model artifact: training data version, test metrics, calibration plot, OOD detector ROC, list of known failure modes.
  • Dataset version pin: hash of the labeled set; never silently update.

The one-paragraph “uncertainty section” of a model card:

“Uncertainty is reported as 95% CIs from \(T{=}30\) MC-dropout passes. The model is calibrated by temperature scaling on a 200-image held-out set; expected calibration error 0.03. The model is in-distribution iff Mahalanobis score \(<\tau_{\text{OOD}} = 14.2\); outside that, predictions are not returned.”

Note

If you cannot write that paragraph for your model, you cannot deploy it.

What MFML W12 covers that we skipped

Pointer slide. If you want the math behind today’s tools, MFML W12 has it:

  • Bayesian predictive distribution and the variance decomposition (aleatory + epistemic).
  • Marginal likelihood / evidence framework as automatic Occam’s razor.
  • Closed-form GP posterior (mean + variance), kernel hyperparameter learning by log marginal likelihood.
  • ELBO and variational interpretation of MC Dropout.
  • Calibration formalism (reliability, ECE) and recalibration methods (temperature scaling, Platt, isotonic).
  • Full conformal-prediction proofs — finite-sample exchangeability, conditional-coverage limits, split / CQR / adaptive variants.
  • TabPFN’s prior-data fitted network — what the pre-training prior actually is.
  • In ML-PC: we use these results on real lab data. Pick the right tool, validate it on a reliability diagram, and trust the answer enough to skip a destructive test or pick the next experiment.

07. Wrap

Recap: Unit 11

  1. Pick the UQ tool per task: GP or TabPFN for small tabular data, MC dropout for trained CNNs, ensembles for high-stakes regression, MDN for multi-modal outputs.
  2. Calibrate before deploying — a reliability diagram on a held-out lab-realistic set is non-negotiable.
  3. Wrap with conformal — split conformal or CQR adds finite-sample, distribution-free coverage on top of any mean predictor. Mandatory for safety-critical or regulator-facing models.
  4. UQ enables active learning — and active learning is how labs scale beyond the brute-force grid search.
  5. OOD detection complements calibration — confidence is meaningless when the input is outside the training distribution.
  6. The math lives in MFML W12. ML-PC W12 is about using that math to save tests, find process windows, and ship trustworthy models.


References & further reading

  • Mantzoukas et al. (2021) — 21CrMoV5-7 quench-and-temper data used in §3.
  • Modarres et al. (2017) — SEM benchmark for the segmentation/classification setup of §4.
  • Gal & Ghahramani (2016) — MC Dropout as approximate Bayesian inference.
  • Lakshminarayanan et al. (2017) — Deep ensembles, the empirical gold standard for NN UQ.
  • Hernández-Lobato et al. (2014) — Predictive entropy search; foundational for BO/active learning in materials.
  • Guo et al. (2017) — Modern NNs are miscalibrated; temperature scaling.
  • Angelopoulos & Bates (2023) — Gentle introduction to split conformal prediction.
  • Romano, Patterson & Candès (2019) — Conformalized Quantile Regression.
  • Hollmann et al. (2025) — TabPFN v2, foundation model for tabular data.
  • Rasmussen & Williams (2006) — GP reference text.
  • Aspuru-Guzik & Berlinguette — self-driving lab perspective; materials-acceleration platforms.
  • MFML W12 — full mathematical treatment of all methods used here.
Angelopoulos, Anastasios N., and Stephen Bates. 2023. “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” Foundations and Trends in Machine Learning 16 (4): 494–591. https://doi.org/10.1561/2200000101.
Gal, Yarin, and Zoubin Ghahramani. 2016. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” International Conference on Machine Learning, 1050–59.
Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. “On Calibration of Modern Neural Networks.” International Conference on Machine Learning, 1321–30.
Hernández-Lobato, José Miguel, Matthew W. Hoffman, and Zoubin Ghahramani. 2014. “Predictive Entropy Search for Efficient Global Optimization of Black-Box Functions.” Advances in Neural Information Processing Systems 27.
Hollmann, Noah, Samuel Müller, Lennart Purucker, et al. 2025. “Accurate Predictions on Small Data with a Tabular Foundation Model.” Nature 637: 319–26. https://doi.org/10.1038/s41586-024-08328-6.
Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2017. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” Advances in Neural Information Processing Systems 30.
Mantzoukas, John, Dimitris G. Papageorgiou, Carmen Medrea, and Constantinos Stergiou. 2021. "Hardness Behavior of W. Nr. 1.7709 Steel, Oil Quenched and Tempered Between 475–575 °C." MATEC Web of Conferences 349: 02005. https://doi.org/10.1051/matecconf/202134902005.
Modarres, Mohammad Hadi, Rossella Aversa, Stefano Cozzini, Regina Ciancio, Angelo Leto, and Giuseppe Piero Brandino. 2017. “Neural Network for Nanoscience Scanning Electron Microscope Image Recognition.” Scientific Reports 7: 13282. https://doi.org/10.1038/s41598-017-13565-z.
Romano, Yaniv, Evan Patterson, and Emmanuel J. Candès. 2019. “Conformalized Quantile Regression.” Advances in Neural Information Processing Systems 32.