Machine Learning for Characterization and Processing
Unit 11: Uncertainty-aware regression & Gaussian Processes

AI 4 Materials / KI-Materialtechnologie

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


01. From MFML theory to lab practice

MFML W12 recap — we use these, we do not re-derive them

| Method | One-line summary | When to reach for it |
|---|---|---|
| Gaussian Process | Closed-form Bayesian regression over functions | Small \(n\) (\(\lesssim 10^3\)), tabular, smooth response, need calibrated CI |
| MC Dropout | Keep dropout on at inference, sample \(T\) passes | Big NN already trained, cheap epistemic estimate per pixel/voxel |
| Deep ensembles | Train \(M\) independent NNs, use disagreement | Best-calibrated NN UQ; budget for \(M\times\) training |
| MDN | NN outputs \((\pi_k, \mu_k, \sigma_k)\) of a Gaussian mixture | Multi-modal output (phase A or phase B from the same input) |
| Calibration | Reliability diagram + temperature scaling | Mandatory before any deployed model |

Note

We are not re-deriving the math. See MFML W12 for posteriors, ELBO, marginal likelihood. Today: which tool, on which lab task, with which numbers.

Why UQ matters in characterization & processing labs

  • One wrong tensile-strength call on a structural part: recall, scrap, or worst case a failure in service.
  • One missed crack/pore in an SEM screen: a defective batch ships.
  • A point estimate without a confidence band is not a deliverable to a lab manager or a certifying body.
  • Engineering decisions are threshold decisions — accept/reject, retest/release, explore/exploit.
  • A threshold needs a distribution, not a number.

Note

Trust = prediction + calibrated confidence. "Calibrated" is the word most published materials-ML papers skip.

Two pain points unique to materials labs

(a) Tool / operator / coating domain shift

  • The training set was acquired on SEM #1 with operator A and a 5 nm Au coating.
  • Inference happens on SEM #2, operator B, no coating, on Tuesday after a chamber vent.
  • Your softmax does not know any of that. Your uncertainty had better grow.

(b) Experiments cost €1k+/h

  • Analytical S/TEM, AM build chambers, Gleeble dilatometry: each new label is hours of instrument time and days of prep.
  • UQ is not a publication ornament — it is the input to the next experiment: active learning only works if uncertainty is honest and spatially resolved.

02. Picking a UQ method per lab task

Decision table — task → method → cost

| Lab task | Recommended UQ | Rationale | Cost driver |
|---|---|---|---|
| Tabular regression, \(n \in [10, 300]\) (composition \(\to\) property) | GP, RBF or Matérn \(\nu{=}5/2\) | Closed-form CI, smooth response, interpretable hyperparameters | \(O(N^3)\) once — fine for \(N \lesssim 10^3\) |
| Pixel-wise segmentation of microscopy (CNN, U-Net) | MC Dropout, \(T \approx 30\) | Reuse trained net, get per-pixel variance map | \(T\times\) inference per image |
| High-stakes property regression with budget for retraining | Deep ensemble, \(M \in [5, 10]\) | Best calibration in literature (Lakshminarayanan et al. 2017) | \(M\times\) training |
| Multi-modal output (one input, two phases possible) | MDN, \(K \in \{2,3\}\) | Bimodal \(p(y \mid x)\) — mean is meaningless | One training, harder to fit |
| Any deployed model | Reliability diagram + temp scaling | Free, post-hoc, on a held-out cal set | Trivial |

Materials-specific computational budgets

The training-time vs inference-time tradeoff is more loaded in a lab than in web-scale ML:

  • Deep ensemble — cheap at inference (run \(M\) small NNs, average), expensive at training (\(M\times\) wall-clock + carbon). Acceptable when training is one-off and the model is shipped to many labs.
  • MC Dropout — cheap at training (one network), expensive at inference (\(T\times\) forward passes). Acceptable on a benchtop where you process tens of images per day; not acceptable for a real-time inline detector.
  • GP — both small at the scales we operate (\(N \lesssim 1000\)). Recompute per dataset; trivially cheap at inference.

Note

MC Dropout’s variance estimate degrades with very deep nets and very low dropout rates — it can collapse to near-zero variance and look overconfident. Always validate with a held-out reliability diagram.

What not to do

  • Do not quote raw softmax probability as “model confidence” — modern deep nets are systematically overconfident; a 0.97 softmax on an out-of-distribution micrograph is meaningless (Guo et al. 2017).
  • Do not report the variance of a single ensemble member’s Monte-Carlo dropout passes and call it ensemble uncertainty — you are mixing two distinct UQ techniques and double-counting.
  • Do not report a GP fit without showing the prior (kernel choice, length-scale prior). The kernel is the model. Hiding it makes the CI uninterpretable.
  • Do not report any UQ number without a reliability diagram on a held-out calibration set.

03. Case study A — GP for process\(\to\)property mapping

Setup: 21CrMoV5-7 quench-and-temper

  • Steel grade DIN 1.7709 / 21CrMoV5-7 (Mantzoukas et al. 2021).
  • Process: austenitize 960 °C, oil quench, temper 2 h at variable \(T_{\text{temper}}\).
  • Input: \(T_{\text{temper}} \in [200, 700]\) °C.
  • Output: hardness \(\text{HRC}(T_{\text{temper}})\).
  • Real published dataset, \(\sim 10\text{–}30\) specimens per condition with measurement scatter.

Why this is a textbook GP problem:

  • \(n\) small, \(d{=}1\) input.
  • Response monotonically softens with \(T\) but is non-linear (carbide coarsening kinetics).
  • We need a CI to skip half of the destructive tests.

Why GP fits this lab task

  • \(n \sim 30\): too small for a deep net to give honest uncertainty, plenty for a GP.
  • Output is smooth in input — RBF or Matérn \(\nu{=}5/2\) is the right inductive bias.
  • The GP delivers what the lab manager wants: \(\hat{y}(T) \pm 2\sigma(T)\), narrowing at sampled \(T\), widening between.
  • Hyperparameters \((\ell, \sigma_f, \sigma_n)\) are fit by maximizing the log marginal likelihood — see MFML W12 for the formula. We use the result.

Kernel choice — what the hyperparameters mean physically

\[ k_{\text{RBF}}(T, T') = \sigma_f^2 \exp\!\left(-\frac{(T - T')^2}{2\,\ell^2}\right) \]

  • Length scale \(\ell\): the “characteristic process variability” on the temperature axis.
    • \(\ell \approx 50\) °C → response varies sharply with temper temperature; carbide kinetics regime-change is captured.
    • \(\ell \approx 300\) °C → response is essentially a slow trend; we are oversmoothing.
  • Signal variance \(\sigma_f^2\): amplitude of the HRC variation across the explored window. Read off the data range.
  • Noise variance \(\sigma_n^2\): instrument + specimen scatter at fixed \(T\). Estimate it from replicates; do not let the optimizer absorb model misfit into noise. (Fitting sketch below.)
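
A minimal fitting sketch, assuming scikit-learn; the anchor tempers and HRC values are illustrative stand-ins, not the published dataset:

```python
# Hedged sketch: GP fit of temper temperature -> hardness with scikit-learn.
# Data values below are illustrative, not the Mantzoukas et al. dataset.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

T = np.array([[300.0], [450.0], [600.0]])   # sampled tempers (deg C)
y = np.array([52.0, 41.0, 29.0])            # mean HRC at each temper

# sigma_f^2 * Matern(nu=5/2) + sigma_n^2. Tight noise bounds around the
# replicate estimate (~0.8 HRC) stop the optimizer from absorbing model
# misfit into the noise term.
kernel = (
    ConstantKernel(1.0, (1e-2, 1e3))
    * Matern(length_scale=60.0, length_scale_bounds=(10.0, 300.0), nu=2.5)
    + WhiteKernel(noise_level=0.8**2, noise_level_bounds=(0.3**2, 1.5**2))
)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp.fit(T, y)   # hyperparameters by maximizing the log marginal likelihood

T_query = np.linspace(200, 700, 101).reshape(-1, 1)
mu, sigma = gp.predict(T_query, return_std=True)
# Report mu +/- 2*sigma: tight at 300/450/600 deg C, widening in the gaps.
```

The fixed noise bounds are the code-level version of "estimate \(\sigma_n\) from replicates, do not let the optimizer absorb misfit into noise."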

Note

For metallurgical responses with regime changes, prefer Matérn \(\nu{=}5/2\) over RBF — it is twice mean-square differentiable instead of \(C^\infty\), which matches the physics better.

The fit — described

  • Posterior mean: a smooth softening curve from \(\sim 55\) HRC at 200 °C to \(\sim 25\) HRC at 700 °C.
  • 95% CI ribbon: tight (\(\pm 1\) HRC) at sampled tempers (300, 450, 600 °C), widens to \(\pm 4\) HRC in the gaps.
  • The reliability of the CI is checked on a leave-one-out CV reliability diagram — and only then do we trust the ribbon.

Numerical example. With \(n{=}30\), \(\sigma_n \approx 0.8\) HRC (from replicates), \(\ell \approx 60\) °C: the GP posterior at \(T{=}500\) °C (a held-out point) gives \(\hat{\text{HRC}} = 38.2 \pm 1.6\) (2\(\sigma\)). Spec sheet says 36–40 — we just skipped a destructive test.

When to trust extrapolation — and when not to

  • The GP variance grows back toward \(\sigma_f^2\) as we move away from data.
  • For 21CrMoV5-7: ribbon balloons outside \([200, 700]\) °C — the model is honestly saying “I have not seen tempers below 200 or above 700, my prediction here is essentially the prior.”
  • This is the right answer. Do not clip the variance, do not force-fit a parametric extrapolation. If you need predictions at 750 °C, run the experiment.

Note

The CI growth is only honest if the kernel is correct. A too-long \(\ell\) will make the GP overconfident outside the data. Always cross-check with a held-out CV reliability diagram before you trust extrapolation.

What this enables — direct ROI

  • Pre-GP workflow: 6 destructive tensile + hardness tests per heat treatment recipe, 4 recipes, 24 tests at €200/test → €4 800.
  • Post-GP workflow: 3 anchor tests + GP interpolation. CI verified, sufficient for spec compliance on standard heats. 12 tests, €2 400.
  • Saved per heat-treatment campaign: ~€2 400. Across a year of campaigns: into the tens of thousands.
  • Cost of the GP: half a day of analyst time, no infrastructure.
  • Caveat: the GP does not replace verification at the spec extremes. It replaces redundant tests in the smooth interior of the process window.

Modern small-tabular alternative: TabPFN

  • TabPFN (Hollmann et al. 2025) is a transformer pre-trained on millions of synthetic tabular tasks to do in-context prediction — no per-task fitting.
  • Pass your \(\sim 30\) rows + new query → it returns a calibrated posterior predictive in one forward pass.
  • 2025 version (v2) handles up to \(\sim 10\,000\) rows and is competitive with tuned XGBoost on small-tabular benchmarks.

When to reach for TabPFN over a GP on the 21CrMoV5-7 task:

  • \(d > 1\): multiple input features (temper \(T\), hold time, prior austenite grain size). GP kernel design gets hard, TabPFN ingests it.
  • Mixed continuous + categorical inputs (alloy family, oil vs gas quench). Out of the box.
  • You want a baseline in 10 lines without picking a kernel or running marginal-likelihood optimisation (sketch below).
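
That baseline, as a hedged sketch: it assumes the public `tabpfn` package exposes a scikit-learn-style `TabPFNRegressor` (verify against the installed version), and the data below is a synthetic stand-in for the lab rows:

```python
# Hedged sketch of the TabPFN baseline. API assumption: the `tabpfn`
# package (v2) provides a scikit-learn-style TabPFNRegressor.
import numpy as np
from tabpfn import TabPFNRegressor  # pip install tabpfn

rng = np.random.default_rng(0)
# ~30 synthetic rows of (temper T in deg C, hold time in h, grain size in um)
X = rng.uniform([200.0, 1.0, 5.0], [700.0, 4.0, 50.0], size=(30, 3))
y = 62.0 - 0.05 * X[:, 0] + rng.normal(0.0, 0.8, size=30)  # fake HRC

model = TabPFNRegressor()   # no kernel choice, no marginal-likelihood fit
model.fit(X, y)             # in-context: the rows become the "prompt"
y_hat = model.predict(np.array([[500.0, 2.0, 20.0]]))
```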

When the GP still wins:

  • 1-D smooth process variable + interpretable hyperparameters (length scale = physical correlation length).
  • You need the posterior closed-form for downstream optimisation (BO acquisition functions, gradient w.r.t. inputs).
  • You need to encode prior physics (Matérn smoothness, periodic kernels for cycling processes).

Note

On the 21CrMoV5-7 task TabPFN matches the GP’s leave-one-out RMSE within \(\sim 0.3\) HRC; the GP wins on interpretability of \(\ell\) and \(\sigma_f\). Use whichever your stakeholder will sign off on.

04. Case study B — MC Dropout for SEM defect segmentation

Setup: U-Net on SEM micrographs

  • Task: per-pixel segmentation of SEM micrographs into {matrix, porosity, crack, inclusion}.
  • Architecture: U-Net with dropout layers (\(p \approx 0.2\)) in the bottleneck and decoder.
  • Training data: \(\mathcal{O}(10^3)\) labeled tiles, hand-segmented by an expert. Evaluation set drawn from a different SEM session to expose tool drift (Modarres et al. 2017).
  • The point estimate (argmax over softmax) is fine on the training distribution and bad on the deployment distribution. We need a per-pixel uncertainty map to flag the bad regions.

MC Dropout in practice

  • Training: standard, dropout active.
  • Inference: keep dropout on, run \(T = 30\) stochastic forward passes per image.
  • Per pixel \(i\): collect \(\{p_i^{(t)}\}_{t=1}^{T}\) — softmax distributions over classes.
  • Predictive mean: \(\bar{p}_i = \tfrac{1}{T}\sum_t p_i^{(t)}\) → argmax for the class label.
  • Per-pixel predictive entropy: \(H_i = -\sum_c \bar{p}_{i,c} \log \bar{p}_{i,c}\) — the uncertainty map.
  • \(T{=}30\) is the typical knee — below 10 the variance estimate is too noisy, above 50 you are paying for diminishing returns. Validate \(T\) on a held-out reliability diagram. (Inference sketch below.)
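
Assuming a trained PyTorch model with `nn.Dropout` layers, the loop looks like this (function name and shapes illustrative):

```python
# Hedged sketch of T=30 MC-dropout inference for a segmentation net.
import torch

def mc_dropout_predict(model, image, T=30):
    model.eval()
    # Re-enable only the dropout modules; BatchNorm stays in eval mode.
    for m in model.modules():
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d)):
            m.train()
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(image), dim=1) for _ in range(T)]
        )                                   # (T, B, C, H, W)
    p_bar = probs.mean(dim=0)               # predictive mean, (B, C, H, W)
    labels = p_bar.argmax(dim=1)            # per-pixel class, (B, H, W)
    entropy = -(p_bar * p_bar.clamp_min(1e-12).log()).sum(dim=1)
    return labels, entropy                  # entropy = the uncertainty map
```

Flipping only the dropout modules back to train mode, rather than calling `model.train()`, keeps the BatchNorm running statistics frozen at inference.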

Per-pixel uncertainty maps — what they look like

  • Low entropy in the bulk matrix and inside large, well-formed pores: easy classification.
  • High entropy where you would expect:
    • Grain-boundary triple junctions — class boundary on the image.
    • Edge artifacts of the field of view.
    • Charging zones — bright halos around insulators.
    • Tile edges where the U-Net’s receptive field is incomplete.
  • These are diagnostic: the model is honestly uncertain exactly where a human operator would also hesitate.

Reject-for-human-review threshold

  • Pick a per-pixel entropy threshold \(\tau\). Pixels with \(H_i > \tau\) are flagged for operator review; the rest are auto-classified.
  • Sweep \(\tau\) to draw the operating curve: (human-review rate) on the x-axis vs (defect recall) on the y-axis.
  • Typical knee on a benchtop SEM benchmark: at 5% review rate we recover >98% of defect pixels; at 1% review we drop to ~92%.
  • The threshold is a business decision, not an ML decision — it depends on the cost of a missed defect vs the cost of operator time. (Sweep sketch below.)
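
Assuming flattened per-pixel arrays from a labeled validation set, and the convention that flagged pixels are corrected by the reviewer, the sweep is a few lines (class indices illustrative):

```python
# Hedged sketch of the tau sweep: review-rate vs defect-recall curve.
import numpy as np

def operating_curve(entropy, pred, truth, taus, defect_classes=(1, 2, 3)):
    is_defect = np.isin(truth, defect_classes)
    points = []
    for tau in taus:
        flagged = entropy > tau                    # routed to the operator
        correct_auto = (~flagged) & (pred == truth)
        recovered = flagged | correct_auto         # defect pixels not missed
        recall = (recovered & is_defect).sum() / max(is_defect.sum(), 1)
        points.append((flagged.mean(), recall))    # (review rate, recall)
    return points
```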

Tool-shift calibration

  • The reliability diagram on SEM #1 (training tool) is well-calibrated.
  • The reliability diagram on SEM #2 (different detector, different bias) is not — the model is overconfident on a slightly different contrast distribution.
  • Practical fix: per-tool temperature scaling. Collect a small (~50 image) calibration set on the new tool, fit a single temperature scalar to recalibrate (sketch below). Cheap, post-hoc, no retraining.
  • Re-run calibration after any: detector swap, gun replacement, sample-coating change, large chamber vent.
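
A minimal sketch of the temperature fit, assuming tensors of logits and labels cached from the frozen model on the new-tool calibration set:

```python
# Hedged sketch of per-tool temperature scaling (Guo et al. 2017).
# logits: (N, C) tensor, labels: (N,) tensor, both from the frozen model.
import torch

def fit_temperature(logits, labels):
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so T > 0
    opt = torch.optim.LBFGS([log_t], max_iter=100)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        opt.zero_grad()
        loss = nll(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()  # divide future logits by this before softmax
```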

Note

“My segmentation accuracy dropped after the chamber vent” is a calibration failure as often as a model failure. Diagnose with a reliability diagram before retraining.

05. Case study C — Active learning loop for AM process windows

Setup: laser powder-bed fusion process map

  • Process: laser powder-bed fusion (L-PBF) on a benchtop printer, single material.
  • Two-axis design space: laser power \(P \in [80, 350]\) W and scan speed \(v \in [200, 1500]\) mm/s.
  • Goal: identify the process window — the region in \((P, v)\) where relative density \(\rho_{\text{rel}} > 0.995\) and no melt-pool collapse is observed.
  • Each experiment (print a \(\sim 5\) mm cube, cross-section it, measure \(\rho_{\text{rel}}\)) costs about €300 in materials, machine time, and metallography.

The active-learning loop

  • GP surrogate: \(\rho_{\text{rel}}(P, v)\) with a 2-D RBF or Matérn kernel, separate length scales \(\ell_P, \ell_v\).
  • Acquisition function:
    • Upper Confidence Bound: \(\alpha_{\text{UCB}}(P,v) = \mu(P,v) + \beta\,\sigma(P,v)\).
    • Expected Improvement: \(\alpha_{\text{EI}}\) favours points likely to beat the current best.
  • Choose the next experiment to maximize \(\alpha\), trading off mean (exploitation) against variance (exploration) (Hernández-Lobato et al. 2014); see the sketch below.
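
Assuming `gp` is a scikit-learn GaussianProcessRegressor already fitted on the densities measured so far, and `beta = 2.0` as an illustrative exploration weight:

```python
# Hedged UCB sketch on a candidate grid over the (P, v) design space.
import numpy as np

P_grid = np.linspace(80.0, 350.0, 50)      # laser power (W)
v_grid = np.linspace(200.0, 1500.0, 50)    # scan speed (mm/s)
PP, VV = np.meshgrid(P_grid, v_grid)
grid = np.column_stack([PP.ravel(), VV.ravel()])   # (2500, 2) candidates

mu, sigma = gp.predict(grid, return_std=True)
ucb = mu + 2.0 * sigma                     # exploitation + exploration
next_experiment = grid[np.argmax(ucb)]     # (P*, v*) to print next
```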

Closed loop with the printer + safety constraints

  • Hard constraints the acquisition cannot violate:
    • \(P/v\) ratio — prevent keyhole regime that damages optics.
    • Absolute caps — machine specs.
    • “No-go” zones from prior failures — flagged manually.
  • Constrained Bayesian optimization: \(\max_{(P,v)} \alpha(P,v)\) subject to \(g_k(P,v) \leq 0\).
  • Implementation: rejection sampling on the candidate set, or a second GP modeling the constraint probability.
  • Budget cap: stop the loop after \(N_{\max} = 30\) experiments or when the process-window area stabilizes — whichever comes first. (Constraint-masking sketch below.)
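
The rejection-sampling variant, reusing `grid` and `ucb` from the previous sketch; the \(P/v\) cap and no-go box below are placeholders, not real machine specs:

```python
# Hedged constraint-masking sketch: drop infeasible candidates, then argmax.
import numpy as np

def feasible_mask(grid, pv_ratio_max=0.4, no_go_boxes=()):
    P, v = grid[:, 0], grid[:, 1]
    ok = (P / v) <= pv_ratio_max                 # keyhole-regime guard
    for p_lo, p_hi, v_lo, v_hi in no_go_boxes:   # manually flagged failures
        ok &= ~((P >= p_lo) & (P <= p_hi) & (v >= v_lo) & (v <= v_hi))
    return ok

mask = feasible_mask(grid, no_go_boxes=[(320.0, 350.0, 200.0, 400.0)])
next_experiment = grid[np.argmax(np.where(mask, ucb, -np.inf))]
```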

Result: pareto frontier of effort vs window discovered

  • Plot: x-axis = number of experiments run, y-axis = area (in \((P, v)\) units) classified with \(P[\rho_{\text{rel}} > 0.995] > 0.9\).
  • Active-learning curve climbs steeply early — UCB / EI route experiments straight to the boundary of the process window.
  • Grid-search baseline: \(14 \times 14 = 196\) experiments to span the design space at the same resolution.
  • Typical result on this kind of L-PBF problem: ~30 active-learning experiments find the process-window area that grid-search needs ~200 for — a roughly 6–7× reduction in experimental cost (Hernández-Lobato et al. 2014).
  • At €300/experiment: €51 000 saved per material.

Materials-acceleration-platform framing

  • This loop — surrogate model + acquisition + closed-loop instrument + safety constraints — is the prototype of a self-driving lab.
  • Aspuru-Guzik and Berlinguette groups have built versions of this loop for catalysis and thin-film electrochemistry. Same pattern, different instrument (Hernández-Lobato et al. 2014).
  • The only ingredient that makes this work is honest UQ. With overconfident or miscalibrated \(\sigma\), the acquisition function picks the wrong next experiment and you waste your budget. UQ is not a slide at the end — it is the engine.

06. Calibration & deployment hygiene

Reliability diagrams on lab data

  • Bin predictions into 10 confidence bins. For each bin, compare predicted confidence vs observed accuracy (or coverage of the CI).
  • Perfect calibration: diagonal.
  • Above the diagonal: under-confident. Below: over-confident (the dangerous failure mode).
  • Where typical materials models break:
    • Out-of-distribution alloy family — model is highly confident, accuracy collapses.
    • Different microscope / coating / operator — softmax stays at 0.95+, accuracy drops 20 points.
    • Long-tail rare defect classes — confidence is high on the wrong class.
  • Always run a reliability diagram on a held-out, lab-realistic calibration set — not a random split of the training data. (Binning sketch below.)
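
A minimal binning sketch, producing the diagram points and the scalar expected calibration error (ECE); `conf` and `correct` are 1-D numpy arrays over the calibration set:

```python
# Hedged sketch: reliability-diagram bins plus ECE.
import numpy as np

def reliability_bins(conf, correct, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    points, ece = [], 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()       # observed accuracy
            avg_conf = conf[in_bin].mean()     # stated confidence
            ece += in_bin.mean() * abs(acc - avg_conf)
            points.append((avg_conf, acc))
    return points, ece   # plot acc vs conf; the diagonal is calibration
```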

OOD detection — when the model sees something new

  • Symptoms: confidence is high, prediction is wrong. Calibration alone cannot save you — you need to detect the OOD case and refuse to predict.

Practical detectors:

  • Mahalanobis distance in feature space (penultimate-layer activations vs the training-set Gaussian) — cheap, surprisingly effective on microscopy.
  • Ensemble disagreement — for a deep ensemble, high prediction variance across members \(\Rightarrow\) OOD. Free if you already have an ensemble.
  • GP variance — for GP surrogates, \(\sigma^2 \to \sigma_f^2\) flags inputs far from training. This is the same mechanism as the AL loop in §5.
  • Workflow: train your model, fit a Mahalanobis detector on training features, reject any inference with detector score above a threshold and route to a human.
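
A sketch of that workflow, assuming the penultimate-layer activations have already been extracted as numpy arrays (`train_feats` of shape (N, D), `feat` of shape (D,)):

```python
# Hedged sketch of the Mahalanobis OOD detector on feature activations.
import numpy as np

def fit_mahalanobis(train_feats):
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False)
    prec = np.linalg.pinv(cov + 1e-6 * np.eye(cov.shape[0]))  # regularized
    return mu, prec

def mahalanobis_score(feat, mu, prec):
    d = feat - mu
    return float(np.sqrt(d @ prec @ d))   # reject if above tau_OOD

# mu, prec = fit_mahalanobis(train_feats)
# if mahalanobis_score(feat, mu, prec) > tau_ood: route to a human
```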

Conformal prediction — distribution-free coverage in 5 lines

Split conformal (Angelopoulos and Bates 2023). Given any pre-trained predictor \(\hat{f}\):

  1. Hold out a fresh calibration set of size \(n_{\text{cal}}\).
  2. Compute non-conformity scores \(s_i = |y_i - \hat{f}(x_i)|\) on it.
  3. Take the empirical quantile \(\hat{q} = \text{Quantile}_{\lceil(n_{\text{cal}}+1)(1-\alpha)\rceil/n_{\text{cal}}}(s)\).
  4. At test time, emit the interval \([\hat{f}(x) - \hat{q},\ \hat{f}(x) + \hat{q}]\).
  5. Theorem. Marginal coverage \(\geq 1-\alpha\), finite sample, distribution-free — no Gaussianity, no calibration assumption.

\[ \mathbb{P}\big[\,Y_{\text{test}} \in C(X_{\text{test}})\,\big] \;\geq\; 1 - \alpha \]
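
The five steps as literal code, a minimal sketch that wraps any mean predictor `f_hat` (GP mean, ensemble mean, TabPFN):

```python
# Hedged split-conformal wrapper (Angelopoulos and Bates 2023).
import numpy as np

def split_conformal(f_hat, x_cal, y_cal, alpha=0.05):
    scores = np.abs(y_cal - f_hat(x_cal))               # non-conformity
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q_hat = np.quantile(scores, level, method="higher")  # conservative quantile

    def interval(x):
        pred = f_hat(x)
        return pred - q_hat, pred + q_hat   # marginal coverage >= 1 - alpha
    return interval
```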

Why this lands in materials labs

  • Works as a wrapper around the GP, MC-Dropout, ensemble, or TabPFN model you already trust for the mean prediction.
  • The coverage guarantee is finite-sample — even on the \(n \sim 30\) specimens of §3.
  • No assumption that the residuals are Gaussian — robust to the heavy tails real lab data has.

Note

The only assumption is exchangeability of calibration and test data. Under distribution shift (new alloy family, new microscope) this breaks — width must grow or coverage drops silently.

Adaptive widths — Conformalized Quantile Regression (CQR)

  • Problem with vanilla split conformal. The interval has constant width \(2\hat{q}\) everywhere, regardless of how noisy the local process is. In labs, noise is heteroscedastic: scatter grows in two-phase regions, near the keyhole boundary, at low temper.

Conformalized Quantile Regression (CQR) (Romano et al. 2019):

  1. Train a quantile regressor for \(\alpha/2\) and \(1-\alpha/2\) levels (e.g., a quantile NN with pinball loss, or quantile forests).
  2. Compute non-conformity \(s_i = \max\{\hat{q}_{\alpha/2}(x_i) - y_i,\, y_i - \hat{q}_{1-\alpha/2}(x_i)\}\).
  3. Wrap: \([\hat{q}_{\alpha/2}(x) - \hat{Q},\ \hat{q}_{1-\alpha/2}(x) + \hat{Q}]\) with \(\hat Q\) the empirical quantile of \(s_i\).
  4. Marginal coverage \(\geq 1 - \alpha\), and widths now adapt to local noise (sketch below).
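
A minimal CQR sketch, using scikit-learn gradient boosting with pinball (quantile) loss as the quantile regressor; any quantile model can stand in:

```python
# Hedged CQR sketch (Romano et al. 2019); alpha = 0.1 gives 90% intervals.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_cqr(X_tr, y_tr, X_cal, y_cal, alpha=0.1):
    lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2)
    hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2)
    lo.fit(X_tr, y_tr)
    hi.fit(X_tr, y_tr)
    # Non-conformity: how far y falls outside the quantile band
    s = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))
    n = len(s)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    Q = np.quantile(s, level, method="higher")

    def interval(X):
        return lo.predict(X) - Q, hi.predict(X) + Q   # width adapts locally
    return interval
```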

Materials picture. On an L-PBF process map (laser power vs scan velocity), CQR widens the predicted-hardness interval inside the keyhole-onset band and tightens it in the safe interior — automatically. Vanilla split conformal would put the same interval everywhere.

When to use which

| Setting | Use |
|---|---|
| Homoscedastic, in-control process | Split conformal (cheapest) |
| Heteroscedastic / regime-dependent noise | CQR |
| Online streaming with drift | Adaptive conformal (Gibbs & Candès 2021) |
| Safety-critical, regulator-facing | Any conformal + held-out coverage report |

Reproducibility hygiene

  • Random seed logged for: data split, model init, dropout sampling, ensemble members, BO acquisition.
  • Model card alongside the model artifact: training data version, test metrics, calibration plot, OOD detector ROC, list of known failure modes.
  • Dataset version pin: hash of the labeled set; never silently update.

The one-paragraph “uncertainty section” of a model card:

“Uncertainty is reported as 95% CIs from \(T{=}30\) MC-dropout passes. The model is calibrated by temperature scaling on a 200-image held-out set; expected calibration error 0.03. The model is in-distribution iff Mahalanobis score \(<\tau_{\text{OOD}} = 14.2\); outside that, predictions are not returned.”

Note

If you cannot write that paragraph for your model, you cannot deploy it.

What MFML W12 covers that we skipped

Pointer slide. If you want the math behind today’s tools, MFML W12 has it:

  • Bayesian predictive distribution and the variance decomposition (aleatory + epistemic).
  • Marginal likelihood / evidence framework as automatic Occam’s razor.
  • Closed-form GP posterior (mean + variance), kernel hyperparameter learning by log marginal likelihood.
  • ELBO and variational interpretation of MC Dropout.
  • Calibration formalism (reliability, ECE) and recalibration methods (temperature scaling, Platt, isotonic).
  • Full conformal-prediction proofs — finite-sample exchangeability, conditional-coverage limits, split / CQR / adaptive variants.
  • TabPFN’s prior-data fitted network — what the pre-training prior actually is.
  • In ML-PC: we use these results on real lab data. Pick the right tool, validate it on a reliability diagram, and trust the answer enough to skip a destructive test or pick the next experiment.

07. Wrap

Recap: Unit 11

  1. Pick the UQ tool per task: GP or TabPFN for small tabular data, MC dropout for trained CNNs, ensembles for high-stakes regression, MDN for multi-modal outputs.
  2. Calibrate before deploying — a reliability diagram on a held-out lab-realistic set is non-negotiable.
  3. Wrap with conformal — split conformal or CQR adds finite-sample, distribution-free coverage on top of any mean predictor. Mandatory for safety-critical or regulator-facing models.
  4. UQ enables active learning — and active learning is how labs scale beyond the brute-force grid search.
  5. OOD detection complements calibration — confidence is meaningless when the input is outside the training distribution.
  6. The math lives in MFML W12. ML-PC W12 is about using that math to save tests, find process windows, and ship trustworthy models.


References & further reading

  • Mantzoukas et al. (2021) — 21CrMoV5-7 quench-and-temper data used in §3.
  • Modarres et al. (2017) — SEM benchmark for the segmentation/classification setup of §4.
  • Gal & Ghahramani (2016) — MC Dropout as approximate Bayesian inference.
  • Lakshminarayanan et al. (2017) — Deep ensembles, the empirical gold standard for NN UQ.
  • Hernández-Lobato et al. (2014) — Predictive entropy search; foundational for BO/active learning in materials.
  • Guo et al. (2017) — Modern NNs are miscalibrated; temperature scaling.
  • Angelopoulos & Bates (2023) — Gentle introduction to split conformal prediction.
  • Romano, Patterson & Candès (2019) — Conformalized Quantile Regression.
  • Hollmann et al. (2025) — TabPFN v2, foundation model for tabular data.
  • Rasmussen & Williams (2006) — GP reference text.
  • Aspuru-Guzik & Berlinguette — self-driving lab perspective; materials-acceleration platforms.
  • MFML W12 — full mathematical treatment of all methods used here.
Angelopoulos, Anastasios N., and Stephen Bates. 2023. “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” Foundations and Trends in Machine Learning 16 (4): 494–591. https://doi.org/10.1561/2200000101.
Gal, Yarin, and Zoubin Ghahramani. 2016. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” International Conference on Machine Learning, 1050–59.
Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. “On Calibration of Modern Neural Networks.” International Conference on Machine Learning, 1321–30.
Hernández-Lobato, José Miguel, Matthew W. Hoffman, and Zoubin Ghahramani. 2014. “Predictive Entropy Search for Efficient Global Optimization of Black-Box Functions.” Advances in Neural Information Processing Systems 27.
Hollmann, Noah, Samuel Müller, Lennart Purucker, et al. 2025. “Accurate Predictions on Small Data with a Tabular Foundation Model.” Nature 637: 319–26. https://doi.org/10.1038/s41586-024-08328-6.
Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2017. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” Advances in Neural Information Processing Systems 30.
Mantzoukas, John, Dimitris G. Papageorgiou, Carmen Medrea, and Constantinos Stergiou. 2021. "Hardness Behavior of W. Nr. 1.7709 Steel, Oil Quenched and Tempered Between 475–575 °C." MATEC Web of Conferences 349: 02005. https://doi.org/10.1051/matecconf/202134902005.
Modarres, Mohammad Hadi, Rossella Aversa, Stefano Cozzini, Regina Ciancio, Angelo Leto, and Giuseppe Piero Brandino. 2017. “Neural Network for Nanoscience Scanning Electron Microscope Image Recognition.” Scientific Reports 7: 13282. https://doi.org/10.1038/s41598-017-13565-z.
Romano, Yaniv, Evan Patterson, and Emmanuel J. Candès. 2019. “Conformalized Quantile Regression.” Advances in Neural Information Processing Systems 32.