Materials Genomics
Unit 13: Uncertainty-Aware Discovery and Gaussian Processes

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


§0 · Frame

01. Today’s Question

What do you actually do with a materials database?

  • 150,000 entries in Materials Project. Roughly 1M in OQMD. Several million in AFLOW.
  • You cannot synthesise them all. You cannot even read them all.
  • You have a budget. You need a decision rule — which candidate next?

Today’s answer in one line.

  • Treat materials discovery as a sequential decision under uncertainty.
  • Use a probabilistic surrogate (Gaussian Process) plus an acquisition function to pick the next candidate.
  • Anchor everything in a concrete target: energy-above-hull.

02. Where We Are

Recap — what you already have

  • MG U6: local atomic environments and structure descriptors.
  • MG U10: learned representations and graph neural networks.
  • MFML W12: GP theory, uncertainty decomposition, marginal likelihood — full derivations.
  • ML-PC W7/W8: calibration, reliability diagrams, probabilistic forecasting.

Today — Unit 13 in one line

  • Reuse MFML W12 GP theory; reuse ML-PC W8 calibration; deploy both inside a materials-discovery loop with public databases as candidate sources.
  • New today: convex hull, E-hull as objective, acquisition over composition, closed-loop case studies.

03. Learning Outcomes

By the end of these 90 minutes, you will be able to:

  1. Use Materials Project / OQMD / AFLOW / NOMAD as candidate sources and explain when each is appropriate.
  2. Construct a convex hull from formation energies and read \(E_{\text{hull}}\) as a discoverability signal.
  3. Distinguish aleatoric from epistemic uncertainty in a screening setting and explain why point predictions are insufficient.
  4. Read a GP posterior (mean, variance) and choose a kernel for a materials descriptor.
  5. Apply EI, UCB, and Thompson sampling, including a hull-aware variant, on a candidate set.
  6. Evaluate alternatives (deep ensembles, MC-dropout, conformal) and pick the right tool for the data regime.

§A · Materials Databases and Discovery Targets

04. The Four Databases You Will Touch

Materials Project (MP) (Jain et al. 2013)

  • ~150k DFT-computed inorganic crystals (PBE / PBE+U).
  • Properties: \(E_f\), \(E_{\text{hull}}\), band gap, elastic moduli, magnetic moment.
  • API: mp-api, pymatgen. De-facto starting point.

OQMD (Saal et al. 2013)

  • ~1M entries; Northwestern.
  • Heavier on intermetallics and prototype enumeration.
  • Useful as a cross-check against MP.

AFLOW (Curtarolo et al. 2012)

  • ~3.5M entries; Duke.
  • Strong on prototype enumeration and high-throughput hulls.
  • Good for systematic alloy-composition sweeps.

NOMAD (Draxl and Scheffler 2018)

  • EU archive; aggregates raw DFT from many groups, many codes (VASP, QE, FHI-aims).
  • Heterogeneous — strong long-tail source, requires more provenance care.

Pedagogical message: no single database is canonical. They disagree because they use different functionals, different convergence criteria, different relaxation protocols. Cross-database disagreement is itself useful information.

05. What Is Stored (and What Is Not)

What every entry carries

  • Structure: lattice vectors, species, fractional sites.
  • Total energy and formation energy.
  • Energy-above-hull at the entry’s composition.
  • Often: band gap, magnetic moment, elastic moduli.
  • Always: computational provenance (functional, k-points, cutoffs).

What is not there

  • Synthesis route or precursor list.
  • Phase-diagram temperature dependence (entries are 0 K).
  • Defect chemistry beyond a few common point defects.
  • Kinetics, transport, catalytic activity.
  • Anything measured in a lab.

First reflex on every “predicted-stable” claim: predicted stable at 0 K, in vacuum, in an idealised periodic crystal, with one functional.

06. Formation Energy — Definition

Definition

\[E_f(C) = E(C) - \sum_i n_i \, \mu_i^{\text{ref}}\]

  • \(E(C)\): total DFT energy of compound \(C\).
  • \(n_i\): stoichiometric coefficient of element \(i\) in \(C\).
  • \(\mu_i^{\text{ref}}\): chemical potential of element \(i\) in its reference state (lowest-energy elemental phase).

Reads as

  • Energy released by forming \(C\) from its elements.
  • Negative \(E_f\): thermodynamically favoured against decomposition into elements.
  • Not yet sufficient to declare stability — need the convex hull.

Reference-state choice is not innocent. Allotropes (C, P, S) and magnetic ground states (Mn, Fe) shift \(E_f\) by tens of meV/atom across databases.
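
A minimal sketch of the bookkeeping behind the definition, using made-up total energies and reference chemical potentials (none of the numbers below are real DFT values):

```python
# Formation energy per atom from a total energy and elemental reference potentials.
# All numerical values are illustrative placeholders, not real DFT results.
mu_ref = {"Li": -1.90, "Co": -7.10, "O": -4.95}   # hypothetical reference energies, eV/atom

def formation_energy_per_atom(total_energy_eV, composition):
    """E_f per atom for a compound given its total energy (eV) and {element: count}."""
    n_atoms = sum(composition.values())
    e_f = total_energy_eV - sum(n * mu_ref[el] for el, n in composition.items())
    return e_f / n_atoms

# Example: one formula unit of LiCoO2 with a made-up total energy of -21.5 eV.
print(formation_energy_per_atom(-21.5, {"Li": 1, "Co": 1, "O": 2}))
```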

07. The Convex Hull

Construction

  • Plot \(E_f\) for every known phase in a chemical system vs composition.
  • The lower convex hull is the geometric envelope.
  • Phases on the hull: thermodynamically stable.
  • Phases above the hull: decompose into a linear combination of hull phases.

In Li–Co–O (ternary)

  • Hull is a triangulated 2D surface in formation-energy space over the (Li, Co, O) simplex.
  • Vertices: elements (Li, Co, O₂).
  • Stable phases (Li₂O, CoO, Co₃O₄, LiCoO₂, …): on the surface.
  • Metastable / unstable phases: above.

The convex-hull construction is the composition-space generalisation of “is this lower than the line connecting its neighbours?” In \(n\)-component systems, the hull is an \((n-1)\)-dimensional polytope. The geometric core is unchanged.
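
As a sketch of what this looks like in code, pymatgen’s phase-diagram module builds the hull and reads off stability directly; the formation energies below are illustrative placeholders, not database values:

```python
from pymatgen.core import Composition
from pymatgen.analysis.phase_diagram import PhaseDiagram, PDEntry

# Toy Li–Co–O hull: energies are illustrative formation energies per formula unit
# (elemental references set to zero), not values pulled from any database.
entries = [
    PDEntry(Composition("Li"), 0.0),
    PDEntry(Composition("Co"), 0.0),
    PDEntry(Composition("O2"), 0.0),
    PDEntry(Composition("Li2O"), -6.2),
    PDEntry(Composition("CoO"), -2.5),
    PDEntry(Composition("LiCoO2"), -7.0),
]
diagram = PhaseDiagram(entries)                     # lower convex hull over the simplex
candidate = PDEntry(Composition("LiCo2O4"), -9.0)   # hypothetical candidate phase
print(diagram.stable_entries)                       # phases on the hull
print(diagram.get_e_above_hull(candidate))          # vertical distance to the hull, eV/atom
```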

08. Energy-Above-Hull as a Discoverability Signal

Definition

\[E_{\text{hull}}(C) = E_f(C) - E_{\text{hull-line}}(x)\]

  • Vertical distance from candidate \(C\) to the hull at its composition \(x\).
  • \(E_{\text{hull}} = 0\): on the hull, stable.
  • \(E_{\text{hull}} > 0\): above the hull, metastable or unstable.
  • Always \(\geq 0\) by construction.

The 25–50 meV/atom rule of thumb

  • \(E_{\text{hull}} < 25\) meV/atom: routinely synthesisable.
  • 25–50: often kinetically accessible, polymorph-dependent.
  • 50–100: sometimes synthesisable under metastable routes.
  • \(> 100\): rarely synthesisable.
  • Soft ranking signal, not a hard yes/no filter.

Why it is not zero. Kinetic stabilisation, finite-temperature entropy, and DFT error all contribute. A 25 meV/atom phase at 0 K may be the global free-energy minimum at 1500 K.

09. Data Quality and Provenance

Three failure modes

  • Functional drift. PBE underestimates band gaps; PBE+U mis-shifts magnetic oxides; SCAN improves some classes, ruins others.
  • Relaxation status. Some entries are fully relaxed; some are static single-points. 10 meV-scale noise.
  • Duplication. Multiple polymorphs of the same composition, sometimes with convergence-failure copies. De-duplicate before training.

Standard hygiene checklist

  • Filter by nelements, nsites, nelements_max.
  • Filter by e_above_hull < threshold for stable subset.
  • Filter by is_stable for hull entries only.
  • Inspect distinct prototypes per composition before training.
  • Document the database snapshot date — entries change over time.
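
A hedged sketch of such a filtered pull with the mp-api client (parameter and field names follow the summary endpoint of a recent client version; an API key is assumed to be configured):

```python
from mp_api.client import MPRester

# Filtered Materials Project query: Li–Co–O entries within 50 meV/atom of the hull.
# Assumes an API key is configured (e.g. via the MP_API_KEY environment variable).
with MPRester() as mpr:
    docs = mpr.materials.summary.search(
        chemsys="Li-Co-O",
        energy_above_hull=(0, 0.05),   # eV/atom; near-hull subset
        fields=["material_id", "formula_pretty", "formation_energy_per_atom",
                "energy_above_hull", "is_stable", "nsites"],
    )
# Remaining hygiene: de-duplicate polymorphs per composition and record the snapshot date.
```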

10. The Discovery Loop

The loop the rest of the unit serves

┌─ database ─→ predict ─→ screen ─→ synthesise ─→ measure ─┐
│                                                          │
└────────────────────── refine ◄────────────────────────────┘
  • Database (§A): defines candidate pool and target (\(E_{\text{hull}}\)).
  • Predict (§C): probabilistic surrogate.
  • Screen (§D): acquisition function.
  • Synthesise / measure (§F): the lab.
  • Refine: re-fit surrogate; re-rank; iterate.

Why the loop motivates uncertainty

  • The loop is sequential: each iteration’s data informs the next.
  • The budget is finite: ~10–100 syntheses per campaign.
  • The right next candidate depends on what’s already known AND what’s unknown.
  • That is exactly what a posterior over predictions captures.

11. Closing §A: What We Have, What’s Next

What §A gave us

  • Four databases as candidate sources.
  • Formation energy and convex hull as the thermodynamic frame.
  • Energy-above-hull as the discoverability signal.
  • The discovery loop as the unit’s organising frame.

What §B asks

  • Inside the loop, why is a point predictor not enough?
  • What does “uncertainty” buy us in screening?
  • What is the right kind of uncertainty to optimise?

§B · Why Point Predictions Are Insufficient

12. Two Candidates, Same Mean, Different \(\sigma\)

Two candidates from the surrogate

  • Candidate A: predicted \(E_{\text{hull}} = 40 \pm 5\) meV/atom.
  • Candidate B: predicted \(E_{\text{hull}} = 40 \pm 80\) meV/atom.

Same mean. Not the same candidate.

Read the difference

  • A is a confident “mediocre but well-known” prediction. Limited upside, limited downside.
  • B is a guess. Could be a hit (\(E_{\text{hull}} = 0\)), could be a fiasco (\(E_{\text{hull}} = 200\)).
  • Which to synthesise depends on your budget and your risk appetite.

A point predictor returns only the mean. The information that decides the prioritisation is thrown away.
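
A back-of-envelope illustration of what that discarded information is worth, assuming the surrogate’s Gaussian posterior and the 25 meV/atom synthesisability threshold from §A:

```python
from scipy.stats import norm

# Probability that each candidate's true E_hull lies below 25 meV/atom,
# given the Gaussian posteriors quoted above (mean, std in meV/atom).
threshold = 25.0
for name, mu, sigma in [("A", 40.0, 5.0), ("B", 40.0, 80.0)]:
    p_hit = norm.cdf((threshold - mu) / sigma)
    print(f"Candidate {name}: P(E_hull < {threshold} meV/atom) = {p_hit:.2f}")
# Same mean, very different bets: A is essentially never a hit, B is roughly a 40% shot.
```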

13. Ranking Under Uncertainty

The wrong objective

\[\text{minimise} \quad \mathbb{E}[(\hat{y} - y)^2]\]

  • A regression score, averaged over all candidates.
  • Optimises the mean prediction quality.
  • Says nothing about which candidates to synthesise.

The right objective

\[\text{maximise} \quad \mathbb{E}\bigl[\text{payoff}(\text{top-}k)\bigr]\]

  • Expected payoff of the \(k\) candidates selected for synthesis.
  • Depends on the joint distribution \((\mu_i, \sigma_i)\).
  • Reduces to a regression score only if all \(\sigma_i\) are equal.

14. Screening-Decision Economics

Costs

  • \(c_{\text{syn}}\): cost of one synthesis (EUR, hours, kWh).
  • \(c_{\text{miss}}\): opportunity cost of a missed hit.
  • Ratio \(c_{\text{miss}} / c_{\text{syn}}\) sets the threshold.

Decision

Synthesise iff \(\Pr(\text{success} \mid \mathbf{x}) \cdot c_{\text{miss}} > c_{\text{syn}}\).

Why uncertainty is required

  • \(\Pr(\text{success} \mid \mathbf{x})\) is a calibrated probability.
  • A point predictor cannot produce one — at best, it produces an indicator \(\hat{y} > \tau\).
  • Uncalibrated \(\sigma\) produces wrong probabilities — and therefore wrong threshold decisions.

15. Aleatoric vs Epistemic — Recap from MFML W12

Aleatoric (data noise)

  • Inherent in the data — measurement noise, intrinsic disorder.
  • Same input, different outputs across repeats.
  • Cannot be reduced by more data.
  • Can be reduced by better instruments / cleaner protocol.

Epistemic (model ignorance)

  • Reflects sparsely sampled regions of input space.
  • Model says “I have not seen anything like this.”
  • Can be reduced by more (well-chosen) data.
  • Reducing it is exactly what an acquisition function does.

Discovery acquisition targets epistemic uncertainty. It picks the next candidate where the model is uncertain AND the expected payoff is high.

16. Closing §B: Uncertainty Is Not Optional

Recap of §B

  • Two candidates with the same mean can be very different bets.
  • Ranking-under-uncertainty is a different objective from regression accuracy.
  • Screening economics require calibrated probabilities, not point predictions.
  • Epistemic uncertainty is what acquisition reduces.

§C question

  • Now: what is a probabilistic surrogate that gives us \(\mu\) and \(\sigma\)?
  • Today’s answer: Gaussian Process.
  • Tight recap from MFML W12 — five slides, no kernel algebra.

§C · Gaussian Processes for Materials Discovery

17. GP — Intuition

A distribution over functions

\[f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))\]

  • \(m(\mathbf{x})\): prior mean (often zero after centring).
  • \(k(\mathbf{x}, \mathbf{x}')\): kernel — covariance between function values.
  • For any finite set \(\{\mathbf{x}_1, \dots, \mathbf{x}_n\}\), the values \(f(\mathbf{x}_i)\) are jointly Gaussian.

Two slogans

  • The kernel is the model. Kernel choice encodes “similar inputs produce similar outputs.”
  • The posterior variance is the uncertainty for free. No ensembling, no dropout, no separate variance head.

18. GP Posterior — Mean and Variance

At a test point \(\mathbf{x}_*\)

  • Posterior mean \(\mu(\mathbf{x}_*)\): best point prediction.
  • Posterior variance \(\sigma^2(\mathbf{x}_*)\): epistemic uncertainty.
  • Aleatoric noise enters through the likelihood (a separate \(\sigma^2_n\) added to the diagonal of the kernel matrix).

Shape of \(\sigma\) across the candidate set

  • Small near training data — high confidence in interpolation.
  • Growing outside the data envelope — honest about extrapolation.
  • Bounded by the prior variance far from data — does not blow up to infinity.

What the acquisition function consumes: the shape of \(\sigma\), not just its values. The contrast between low-\(\sigma\) regions (exploitable) and high-\(\sigma\) regions (explorable) drives the next-candidate decision.

19. Kernel Choice for Materials Descriptors

Workhorse kernels

  • RBF (squared-exp): smooth, infinitely differentiable. Good default for low-noise targets.
  • Matérn-5/2: smoother than 3/2, less smooth than RBF. The materials-ML default.
  • Tanimoto / fingerprint kernels: for discrete fingerprints (molecules, MOFs).
  • SOAP / smooth-overlap kernels: for structural descriptors directly.

Hyperparameters

  • Length scale \(\ell\): how far in descriptor space until correlation decays.
  • Signal variance \(\sigma_f^2\): amplitude of variation.
  • Noise variance \(\sigma_n^2\): aleatoric noise.
  • Optimised by maximising the marginal likelihood — robust on small data.

Length scale = materials-similarity assumption. Short \(\ell\): nearby compositions can have very different properties (rough surface). Long \(\ell\): smooth landscape, neighbours are informative.

20. The Small-Data Sweet Spot

Why GPs shine at small \(n\)

  • Well-behaved uncertainty without large ensembles.
  • No held-out validation set required for calibration.
  • Hyperparameter optimisation by marginal likelihood is robust on small data; cross-validation is not.
  • Handles 50–500 training points gracefully.

Why this matches materials reality

  • An active experimental campaign starts with 10–100 measurements.
  • Each new measurement costs days–weeks.
  • The surrogate must be re-fit after every batch.
  • Small-\(n\) + cheap re-fit + honest uncertainty = GP.

21. Reading a GP Posterior — Worked Example

Setting

  • Target: \(E_{\text{hull}}\) for ternary Li–Co–O candidates.
  • Descriptor: composition vector \((x_{\text{Li}}, x_{\text{Co}}, x_{\text{O}})\).
  • Training: 30 known phases from MP.
  • Test: 200 hypothetical compositions on a fine grid.

Posterior reads

  • Near training compositions: \(\sigma \approx 5\) meV/atom (interpolation).
  • At unexplored corners: \(\sigma \approx 60\) meV/atom (extrapolation).
  • Posterior mean \(\mu\) dips toward zero at known stable phases and rises between them.
  • Both \(\mu\) and \(\sigma\) are needed to rank candidates.
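
A compact sketch of this setup with scikit-learn; the training targets here are synthetic placeholders, whereas the exercise pulls the real 30 phases from MP:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

# Li–Co–O worked example, sketched with synthetic E_hull targets (meV/atom).
rng = np.random.default_rng(0)
X_train = rng.dirichlet(np.ones(3), size=30)      # 30 "known" compositions on the simplex
y_train = rng.normal(60.0, 30.0, size=30)         # placeholder targets, not MP data

kernel = (ConstantKernel(1.0) * Matern(length_scale=0.2, nu=2.5)
          + WhiteKernel(noise_level=1e-2))        # signal * Matern-5/2 + aleatoric noise
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_train, y_train)                          # hyperparameters by marginal likelihood

X_test = rng.dirichlet(np.ones(3), size=200)      # 200 hypothetical compositions
mu, sigma = gp.predict(X_test, return_std=True)   # posterior mean and std per candidate
# Both arrays feed the acquisition functions of §D; sigma shrinks near the training rows.
```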

22. Scaling Limits and Sparse GPs

Exact GP cost

  • Time: \(O(n^3)\) — kernel-matrix Cholesky.
  • Memory: \(O(n^2)\) — kernel matrix.
  • \(n \sim 5{,}000\): painful on a laptop.
  • \(n \sim 50{,}000\): infeasible without approximations.

Conceptual escape routes

  • Inducing points (FITC, VFE, SVGP): summarise data with \(m \ll n\) pseudo-points; cost \(O(nm^2)\).
  • Local GPs: one GP per region of input space.
  • Kernel approximations: random Fourier features and friends.

For typical campaigns of \(10^2\)–\(10^3\) measurements, exact GPs are fine. Scaling matters when bolting GPs onto massive precomputed databases as a screening surrogate.

23. Closing §C: GPs as Tool, Not Gospel

§C summary

  • GP = distribution over functions, defined by a kernel.
  • Posterior gives \(\mu\) and \(\sigma\) everywhere — calibrated by construction in-distribution.
  • Kernel choice encodes the materials-similarity assumption.
  • \(O(n^3)\) scaling; \(n \lesssim 5000\) is the comfortable regime.

§D question

  • Now we have \(\mu\) and \(\sigma\) at every candidate. Which one do we synthesise next?
  • The acquisition function is the answer.
  • Three workhorses: EI, UCB, Thompson.

§D · Acquisition Functions and Bayesian Optimisation

24. Exploration vs Exploitation

Exploitation

  • Pick the candidate with the best predicted mean.
  • Greedy: trust what the model says.
  • Risk: get stuck in a local optimum.

Exploration

  • Pick the candidate with the highest uncertainty.
  • Maximally informative: reduce model ignorance.
  • Risk: waste budget on candidates with no chance of being good.

Acquisition functions trade off the two. A scalar score \(\alpha(\mathbf{x})\) over the candidate set; we maximise it to pick the next point.

25. Expected Improvement (EI)

Definition

\[\alpha_{\text{EI}}(\mathbf{x}) = \mathbb{E}\bigl[\max(f^* - f(\mathbf{x}), 0)\bigr]\]

  • \(f^*\): current best observed value.
  • Closed form for Gaussian posteriors: \(\alpha_{\text{EI}} = \sigma\bigl[z\,\Phi(z) + \phi(z)\bigr]\) with \(z = (f^* - \mu)/\sigma\).

Reads as

  • “How much better than today’s best do I expect to do?”
  • Zero when \(\mu \gg f^*\) (worse than best, deterministically).
  • Large when \(\mu < f^*\) (likely better) or \(\sigma\) large (could be much better).
  • The default choice in 90% of BO campaigns.
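
A sketch of the closed form above in the minimisation convention used here (lower \(E_{\text{hull}}\) is better); the same function serves the hull-aware variant later in §D once \(E_{\text{hull}}\) posteriors are plugged in:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for minimisation: sigma * (z*Phi(z) + phi(z)), z = (f_best - mu)/sigma."""
    sigma = np.maximum(sigma, 1e-12)               # guard against zero posterior variance
    z = (f_best - mu) / sigma
    return sigma * (z * norm.cdf(z) + norm.pdf(z))

# Toy check with the §B candidates: equal means, very different EI once sigma differs.
mu = np.array([40.0, 40.0])
sigma = np.array([5.0, 80.0])
print(expected_improvement(mu, sigma, f_best=30.0))   # the uncertain candidate scores higher
```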

26. Upper Confidence Bound (UCB)

Definition

\[\alpha_{\text{UCB}}(\mathbf{x}) = \mu(\mathbf{x}) + \beta \, \sigma(\mathbf{x})\]

  • \(\beta\): aggressiveness.
  • \(\beta \to 0\): pure exploitation.
  • \(\beta \to \infty\): pure exploration.
  • Typical: \(\beta \in [1, 3]\), sometimes annealed over iterations.

Reads as

  • “Optimistic estimate of \(f\).”
  • Pick the candidate that could be best in a \(\beta\)-sigma sense.
  • Strong theoretical guarantees (Srinivas et al. 2010).
  • Choose UCB when EI gets stuck; use high \(\beta\) to force exploration.

27. Thompson Sampling

Procedure

  1. Sample one function \(\tilde{f}\) from the GP posterior.
  2. Optimise: \(\mathbf{x}^* = \arg\min_\mathbf{x} \tilde{f}(\mathbf{x})\).
  3. Query \(\mathbf{x}^*\).

That’s it.

Why it is useful

  • Naturally batches: draw \(b\) samples, get \(b\) candidates.
  • Diversity in the batch is automatic (different posterior samples disagree most where \(\sigma\) is large).
  • No closed form needed; works with approximate GPs.
  • Bayesian-optimal under certain assumptions.
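
A minimal batch-Thompson sketch, assuming an already-fitted scikit-learn-style GP (`gp`) and a finite candidate matrix `X_pool`; both names are placeholders for whatever surrogate and pool the campaign uses:

```python
import numpy as np

def thompson_batch(gp, X_pool, b=4, seed=0):
    """Pick b distinct candidates: each posterior sample nominates its own argmin."""
    draws = gp.sample_y(X_pool, n_samples=b, random_state=seed)   # shape (n_pool, b)
    picks = []
    for j in range(b):
        for i in np.argsort(draws[:, j]):      # best-first for this posterior sample
            if int(i) not in picks:            # keep the batch free of duplicates
                picks.append(int(i))
                break
    return picks
```

Because each column of `draws` is a different posterior sample, the samples disagree most where \(\sigma\) is large, which is what makes the batch diverse without any explicit penalty term.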

28. Acquisition Over What?

Composition space

  • Candidates: discrete formulas \(A_x B_y C_z\), or continuous fractions on a simplex.
  • Smaller candidate sets, faster argmax.
  • Most published BO-for-materials work lives here.

Structure space

  • Candidates: polymorphs / prototypes at fixed composition.
  • Larger, more discrete; symmetry constraints.
  • Active research; harder to define a kernel.

Default choice: start in composition space. Only graduate to structure space when the chemistry is fixed and polymorph selection is the bottleneck.

29. Convex-Hull-Aware Acquisition

The naïve trap

  • Optimise raw \(E_f\) as the BO objective.
  • Result: BO chases the most negative \(E_f\), i.e. deeply bound compounds that are already known stable phases.
  • You re-discover what’s already in the database.

The fix: hull-aware EI

\[\alpha_{\text{hull-EI}}(\mathbf{x}) = \mathbb{E}\bigl[\max(0, E_{\text{hull, current}} - E_{\text{hull}}(\mathbf{x}))\bigr]\]

  • Reward candidates that lower the hull at their composition.
  • Aligns acquisition with the actual discovery objective.
  • Standard practice in modern materials BO.

30. Cost-Aware Acquisition

Definition

\[\alpha_{\text{cost}}(\mathbf{x}) = \frac{\alpha(\mathbf{x})}{c(\mathbf{x})}\]

  • \(c(\mathbf{x})\): synthesis difficulty (precursor cost, furnace time, atmospheric control).
  • Cheap candidates with moderate acquisition score beat expensive candidates with marginally better score.

When it matters

  • Heterogeneous synthesis costs: solid-state vs sol-gel vs single-crystal growth.
  • Toxic / radioactive precursors.
  • Limited reagent stocks.
  • Real labs always have cost asymmetry.

31. Multi-Fidelity Hooks

The setup

  • Cheap, biased fidelity: DFT prediction.
  • Expensive, unbiased fidelity: lab measurement.
  • Multi-fidelity GP models both, with a learned correlation between them.

The acquisition picks twice

  • Which candidate?
  • Which fidelity (DFT? lab?)?
  • Optimal split: cheap fidelity to localise; expensive fidelity to confirm.
  • Active research; mention as a hook, do not derive.

32. Batch Acquisition

The problem

  • Lab runs \(b > 1\) syntheses in parallel.
  • Top-\(b\) by single-point acquisition gives near-duplicate batches.
  • All clustered in one high-acquisition region.

Diversity strategies

  • Kriging believer: pick top-1, hallucinate \(\mu\) as the observed value, repeat.
  • Local penalisation: subtract a penalty for proximity to already-picked candidates.
  • Batch-EI / qEI: joint optimisation over \(b\) candidates.
  • Thompson sampling: \(b\) posterior samples → \(b\) argmins. Easiest.

33. Closing §D: Acquisition Is the Decision Layer

§D summary

  • Acquisition functions convert \((\mu, \sigma)\) into a next-pick decision.
  • EI is the default; UCB is the explicit-knob alternative; Thompson batches naturally.
  • Hull-aware EI is the materials-specific reformulation.
  • Cost-aware variants are non-negotiable in real labs.

§E question

  • We’ve made the GP central. When is the GP wrong?
  • When data scales beyond GP-friendly regimes, what replaces it?
  • How do you check that any uncertainty is calibrated?

§E · Alternatives, Ensembles, and Calibration

34. Deep Ensembles

Procedure

  • Train \(M \sim 5\)–\(10\) NN regressors with different random seeds.
  • Ensemble mean = predictive mean.
  • Ensemble std = epistemic uncertainty proxy.
  • (Optional: each NN predicts mean and aleatoric variance separately.)

Strengths and weaknesses

  • + Scales to \(n \gg 10^4\).
  • + No kernel choice; works with any architecture.
  • − Ensembles can be jointly overconfident.
  • − Uncertainty uncalibrated by default.
  • − Expensive at training time.

35. MC-Dropout

Procedure

  • Apply dropout at inference time (not just training).
  • Run \(T \sim 20\) stochastic forward passes per input.
  • Mean and std across passes = predictive mean and uncertainty.

Strengths and weaknesses

  • + Trivial to implement (1 line of PyTorch).
  • + Works on any pre-trained NN.
  • − Notoriously miscalibrated.
  • − Calibration depends on the dropout rate, a hyperparameter chosen for regularisation during training, not for UQ.
  • Useful as a quick proxy, not as a primary UQ source.
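
A minimal PyTorch sketch of the trick, with a hypothetical toy regressor standing in for whatever pre-trained network is available; only the dropout layers are switched back to training mode at inference:

```python
import torch
import torch.nn as nn

# Hypothetical toy regressor; only the nn.Dropout layers matter for MC-dropout.
class TinyRegressor(nn.Module):
    def __init__(self, d_in: int, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 64), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, T: int = 20):
    """Predictive mean and std from T stochastic forward passes."""
    model.eval()                          # freeze everything else (e.g. batch norm)
    for m in model.modules():             # ...but keep dropout stochastic at inference
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(T)], dim=0)
    return samples.mean(dim=0), samples.std(dim=0)

# Usage on untrained weights, purely for illustration:
mu, sigma = mc_dropout_predict(TinyRegressor(d_in=3), torch.rand(8, 3), T=20)
```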

36. Conformal Prediction

Procedure

  • Train any point predictor.
  • Compute residuals on a held-out calibration set.
  • Quantile of residuals → prediction interval that covers \(1-\alpha\) of test points (under exchangeability).

Properties

  • + Finite-sample coverage guarantee.
  • + Distribution-free, model-agnostic.
  • + Trivially simple.
  • − Basic split-conformal intervals have constant width, with no kernel-aware \(\sigma(\mathbf{x})\).
  • Better suited to screening filters than to acquisition.
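
A split-conformal sketch on synthetic numbers (stand-ins for a real predictor’s calibration and test outputs); the quantile uses the standard finite-sample correction:

```python
import numpy as np

# Synthetic data standing in for a real point predictor's outputs.
rng = np.random.default_rng(0)
y_cal = rng.normal(50.0, 30.0, size=200)               # true values, calibration set
y_cal_pred = y_cal + rng.normal(0.0, 10.0, size=200)    # point predictions on the same set
y_test_pred = rng.normal(50.0, 30.0, size=20)           # predictions on new candidates

alpha = 0.1                                             # target miscoverage (90% intervals)
resid = np.abs(y_cal - y_cal_pred)
n = len(resid)
q = np.quantile(resid, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
lower, upper = y_test_pred - q, y_test_pred + q         # constant-width intervals
```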

37. When Each UQ Method Wins

Decision table

  • \(n < 500\), small descriptor: GP.
  • \(n \in [500, 10{,}000]\): GP with inducing points or deep ensemble.
  • \(n > 10{,}000\), NN already trained: deep ensemble or MC-dropout.
  • Black-box predictor, need coverage: conformal.

Caveats

  • These are starting points, not laws.
  • Calibration is always mandatory regardless of method.
  • For a discovery loop, kernel-aware \(\sigma(\mathbf{x})\) matters more than coverage guarantees.
  • For a one-shot screening filter, coverage matters more than \(\sigma\)-contrast.

38. Reading a Reliability Diagram

Construction

  • Bin held-out points by predicted \(\sigma\).
  • For each bin, compute the fraction of true values within \(\pm \sigma\).
  • Plot: predicted coverage (x-axis) vs empirical coverage (y-axis).
  • Diagonal: perfectly calibrated.

Reads

  • Below the diagonal: overconfident — claimed 80% coverage, actual 65%.
  • Above the diagonal: underconfident — claimed 80% coverage, actual 92%.
  • GPs typically: well-calibrated in-distribution, overconfident OOD.
  • Deep ensembles typically: jointly overconfident across the board.

Recalibrate after every batch. Discovery campaigns shift the input distribution; calibration drifts; bad decisions follow.
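
A sketch of the diagnostic; this variant sweeps nominal central-coverage levels rather than binning by \(\sigma\), but reads the same way (below the diagonal = overconfident). All arrays are synthetic stand-ins for held-out surrogate outputs:

```python
import numpy as np
from scipy.stats import norm

# Synthetic held-out set where the surrogate is deliberately overconfident:
# the true spread is 1.3x the predicted sigma.
rng = np.random.default_rng(0)
mu = rng.normal(50.0, 20.0, size=500)
sigma = rng.uniform(5.0, 40.0, size=500)
y_true = mu + 1.3 * sigma * rng.standard_normal(500)

levels = np.linspace(0.1, 0.9, 9)                  # nominal central-coverage levels
empirical = [
    float(np.mean(np.abs(y_true - mu) <= norm.ppf(0.5 + p / 2.0) * sigma))
    for p in levels
]
# Plot levels (x) vs empirical (y): points fall below the diagonal here, i.e. overconfident.
```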

39. Closing §E: UQ Is a Toolbox

§E summary

  • GP: small data, kernel-aware \(\sigma\), exact-inference \(O(n^3)\).
  • Deep ensemble: large data, NN architectures, mandatory calibration.
  • MC-dropout: quick proxy; not for primary UQ.
  • Conformal: model-agnostic coverage; better for filtering than acquisition.
  • Always check calibration on held-out data.

§F question

  • We have the toolbox. What does the full closed loop actually look like in published work?
  • Three case studies: perovskite stability; alloy BO; autonomous labs.

§F · Discovery-Loop Case Studies

40. Case 1 — Closed-Loop Perovskite Stability

Setup

  • Target: halide-perovskite stability under environmental stress.
  • Surrogate: GP over composition (mixed A-site cations, mixed halides).
  • Acquisition: hull-aware EI.
  • “Experiment”: automated stability measurement on a deposition platform.

Outcome and lesson

  • ~50 acquisitions; identified several previously unreported stable mixtures.
  • Failure mode: the GP became overconfident on extreme compositions (corners of the simplex).
  • Mitigation: periodic recalibration on a held-out slice; trust-region restriction on acquisition.

41. Case 2 — BO for Alloy Composition

Setup

  • Target: hardness or yield strength of a 4-element alloy.
  • Surrogate: GP over composition (4D simplex).
  • Acquisition: EI, with cost weighting (synthesis at \(10^4\) EUR/sample).
  • ~30 acquisitions to near-optimum.

Outcome and lesson

  • BO found a near-optimal composition using ~30 syntheses out of an effective candidate space of \(10^4\).
  • Failure mode: ignoring constraints — the “BO optimum” used a precursor combination that no real lab would actually attempt (toxicity, processing window).
  • Mitigation: constraint-aware acquisition (zero out infeasible regions before optimising \(\alpha\)).

42. Case 3 — Autonomous Labs

Setup

  • Berkeley A-Lab (Szymanski et al. 2023); Toronto self-driving labs (MacLeod et al. 2020); Argonne Polybot.
  • Closed loop: synthesis robot + characterisation robot + BO planner.
  • No human in the inner loop; humans on safety + post-mortem.

Lessons learned

  • Bottleneck is rarely the GP — it’s the synthesis success rate.
  • Calibration drift is the most common silent failure.
  • Hard guardrails (feasibility filters, human review on high-\(\sigma\) picks) are non-negotiable.
  • Sets up Unit 14 directly.

§G · Wrap-Up

43. When Not to Use a GP

Three honest “don’t”s

  • Plentiful data (\(n \gg 10^4\)). At that scale an exact GP does an ensemble’s job badly. Use a deep ensemble + conformal.
  • Discontinuous response surfaces (phase transitions, structural transitions). A stationary kernel will smear over them. Use a non-stationary kernel or segment the input space.
  • Bad descriptor. GP variance reflects descriptor-space density, not science. A bad descriptor gives confidently wrong uncertainty.

Diagnostics that flag these regimes

  • Marginal likelihood plateaus → too much data for current kernel.
  • Calibration deteriorates after every batch → distribution shift, kernel mismatch, or bad descriptor.
  • Length scale \(\ell\) stuck at minimum → response is rougher than kernel allows.

44. Bridge to Unit 14

What Unit 13 leaves you with

  • A calibrated probabilistic surrogate.
  • An acquisition function that picks the next candidate.
  • A loop that closes between database, prediction, and lab.
  • Diagnostics for failure modes (calibration drift, OOD, bad kernel).

What Unit 14 adds

  • Hard physics constraints in acquisition: thermodynamic, kinetic, electrochemical.
  • Trust layers: human review, hazard filters, novelty detection.
  • Integration with autonomous labs: guardrails, monitoring, post-mortem discipline.
  • The path from “uncertainty-aware” to “trustworthy.”

45. Exercise + Reading Assignment

Exercise (90 min, this afternoon)

  1. Pull a Materials Project ternary subset (Li–Co–O or similar). Reconstruct the convex hull. Identify hull and near-hull entries.
  2. Train a GP surrogate (Matérn-5/2, composition kernel) on a held-out subset. Plot \(\mu\) and \(\sigma\) over the simplex.
  3. Run BO loops with EI and with hull-aware EI; compare regret curves over 20 acquisitions.
  4. Run a calibration diagnostic; report whether the GP is over- or under-confident on the OOD slice.
  5. One-page report: claims, evidence, one named failure mode.

Reading for next week (Unit 14)

  • Murphy (2012) Ch 15 (GPs) — for any GP loose ends from today.
  • Bishop (2006) §6 (kernel methods) — kernel deep-dive.
  • Neuer et al. (2024) §6.4 — uncertainty in engineering workflows.
  • Optional: a recent autonomous-lab review paper (will be linked on the course site).

Next week (Unit 14): physics-informed constraints, trust, and discovery governance.

46. Unit 13 — One-Slide Summary

The discovery loop

database → predict → screen → synthesise → measure → refine
  • Database (§A): MP / OQMD / AFLOW / NOMAD; formation energy; convex hull; energy-above-hull as the discoverability signal.
  • Predict (§C): Gaussian Process; kernel encodes similarity; \(\mu\) and \(\sigma\) for free.
  • Screen (§D): acquisition functions (EI / UCB / Thompson); hull-aware and cost-aware variants; batch strategies.

The disciplines

  • Calibration (§E): check after every batch; recalibrate; reliability diagrams.
  • Alternatives (§E): deep ensembles at large \(n\); conformal for filtering; MC-dropout as a quick proxy.
  • Lessons (§F): corners of the simplex; constraints matter; calibration drift kills loops.
  • Bridge (§G): today’s calibrated surrogate is U14’s prerequisite.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.
