Machine Learning for Characterization and Processing
Unit 11: Uncertainty-aware regression & Gaussian Processes

AI 4 Materials / KI-Materialtechnologie

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


01. From MFML theory to lab practice

MFML W12 recap — we use these, we do not re-derive them

Method | One-line summary | When to reach for it
Gaussian Process | Closed-form Bayesian regression over functions | Small \(n\) (\(\lesssim 10^3\)), tabular, smooth response, need calibrated CI
MC Dropout | Keep dropout on at inference, sample \(T\) passes | Big NN already trained, cheap epistemic estimate per pixel/voxel
Deep ensembles | Train \(M\) independent NNs, use disagreement | Best-calibrated NN UQ; budget for \(M\times\) training
MDN | NN outputs \((\pi_k, \mu_k, \sigma_k)\) of a Gaussian mixture | Multi-modal output (phase A or phase B from the same input)
Calibration | Reliability diagram + temperature scaling | Mandatory before any deployed model

Note

We are not re-deriving the math. See MFML W12 for posteriors, ELBO, marginal likelihood. Today: which tool, on which lab task, with which numbers.

Why UQ matters in characterization & processing labs

  • One wrong tensile-strength call on a structural part: recall, scrap, or worst case a failure in service.
  • One missed crack/pore in an SEM screen: a defective batch ships.
  • A point estimate without a confidence band is not a deliverable to a lab manager or a certifying body.
  • Engineering decisions are threshold decisions — accept/reject, retest/release, explore/exploit.
  • A threshold needs a distribution, not a number.

Note

Trust = prediction + calibrated confidence. The “calibrated” word is what most published materials-ML papers skip.

Two pain points unique to materials labs

(a) Tool / operator / coating domain shift

  • The training set was acquired on SEM #1 with operator A and a 5 nm Au coating.
  • Inference happens on SEM #2, operator B, no coating, on Tuesday after a chamber vent.
  • Your softmax does not know any of that. Your uncertainty had better grow.

(b) Experiments cost €1k+/h

  • Analytical S/TEM, AM build chambers, Gleeble dilatometry: each new label is hours of instrument time and days of prep.
  • UQ is not a publication ornament — it is the input to the next experiment: active learning only works if uncertainty is honest and spatially resolved.

02. Picking a UQ method per lab task

Decision table — task → method → cost

Lab task | Recommended UQ | Rationale | Cost driver
Tabular regression, \(n \in [10, 300]\) (composition \(\to\) property) | GP, RBF or Matérn \(\nu{=}5/2\) | Closed-form CI, smooth response, hyperparams interpretable | \(O(N^3)\) once; fine for \(N \lesssim 10^3\)
Pixel-wise segmentation of microscopy (CNN, U-Net) | MC Dropout, \(T \approx 30\) | Reuse trained net, get per-pixel variance map | \(T\times\) inference per image
High-stakes property regression with budget for retraining | Deep ensemble, \(M \in [5, 10]\) | Best calibration in literature (Lakshminarayanan et al. 2017) | \(M\times\) training
Multi-modal output (one input, two phases possible) | MDN, \(K \in \{2,3\}\) | Bimodal \(p(y \mid x)\); the mean is meaningless | One training, harder to fit
Any deployed model | Reliability diagram + temp scaling | Free, post-hoc, on a held-out cal set | Trivial

Materials-specific computational budgets

The training-time vs inference-time trade-off weighs more heavily in a lab than in web-scale ML:

  • Deep ensemble — cheap at inference (run \(M\) small NNs, average), expensive at training (\(M\times\) wall-clock + carbon). Acceptable when training is one-off and the model is shipped to many labs.
  • MC Dropout — cheap at training (one network), expensive at inference (\(T\times\) forward passes). Acceptable on a benchtop where you process tens of images per day; not acceptable for a real-time inline detector.
  • GP — both small at the scales we operate (\(N \lesssim 1000\)). Recompute per dataset; trivially cheap at inference.

Note

MC Dropout’s variance estimate degrades with very deep nets and very low dropout rates — it can collapse to near-zero variance and look overconfident. Always validate with a held-out reliability diagram.

What not to do

  • Do not quote raw softmax probability as “model confidence” — modern deep nets are systematically overconfident; a 0.97 softmax on an out-of-distribution micrograph is meaningless (Guo et al. 2017).
  • Do not report the variance of a single ensemble member’s Monte-Carlo dropout passes and call it ensemble uncertainty: that number is the MC Dropout variance of one network, not the disagreement between independently trained members, and the two techniques are not interchangeable.
  • Do not report a GP fit without showing the prior (kernel choice, length-scale prior). The kernel is the model. Hiding it makes the CI uninterpretable.
  • Do not report any UQ number without a reliability diagram on a held-out calibration set.

03. Case study A — GP for process\(\to\)property mapping

Setup: 21CrMoV5-7 quench-and-temper

  • Steel grade DIN 1.7709 / 21CrMoV5-7 (Mantzoukas et al. 2021).
  • Process: austenitize 960 °C, oil quench, temper 2 h at variable \(T_{\text{temper}}\).
  • Input: \(T_{\text{temper}} \in [200, 700]\) °C.
  • Output: hardness \(\text{HRC}(T_{\text{temper}})\).
  • Real published dataset, \(\sim 10\text{–}30\) specimens per condition with measurement scatter.

Why this is a textbook GP problem:

  • \(n\) small, \(d{=}1\) input.
  • Response monotonically softens with \(T\) but is non-linear (carbide coarsening kinetics).
  • We need a CI to skip half of the destructive tests.

Why GP fits this lab task

  • \(n \sim 30\): too small for a deep net to give honest uncertainty, plenty for a GP.
  • Output is smooth in input — RBF or Matérn \(\nu{=}5/2\) is the right inductive bias.
  • The GP delivers what the lab manager wants: \(\hat{y}(T) \pm 2\sigma(T)\), narrowing at sampled \(T\), widening between.
  • Hyperparameters \((\ell, \sigma_f, \sigma_n)\) are fit by maximizing the log marginal likelihood — see MFML W12 for the formula. We use the result.

Kernel choice — what the hyperparameters mean physically

\[ k_{\text{RBF}}(T, T') = \sigma_f^2 \exp\!\left(-\frac{(T - T')^2}{2\,\ell^2}\right) \]

  • Length scale \(\ell\): the “characteristic process variability” on the temperature axis.
    • \(\ell \approx 50\) °C → response varies sharply with temper temperature; carbide kinetics regime-change is captured.
    • \(\ell \approx 300\) °C → response is essentially a slow trend; we are oversmoothing.
  • Signal variance \(\sigma_f^2\): amplitude of the HRC variation across the explored window. Read off the data range.
  • Noise variance \(\sigma_n^2\): instrument + specimen scatter at fixed \(T\). Estimate it from replicates; do not let the optimizer absorb model misfit into noise.
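
The roles of \(\ell\), \(\sigma_f\) and \(\sigma_n\) can be made concrete with a minimal NumPy sketch of the closed-form GP posterior under the RBF kernel above. The data are synthetic stand-ins (a made-up linear softening curve plus 0.8 HRC scatter), not the published measurements, and the hyperparameters are fixed by hand rather than fitted by marginal likelihood; the function names are illustrative.

```python
import numpy as np

def rbf_kernel(a, b, sigma_f, ell):
    """k(T, T') = sigma_f^2 * exp(-(T - T')^2 / (2 * ell^2)) for 1-D inputs."""
    d = a[:, None] - b[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(T_train, y_train, T_query, sigma_f, ell, sigma_n):
    """Posterior mean and std of the latent function, constant prior mean."""
    y_mean = y_train.mean()  # centre the data so the prior mean is the sample mean
    K = rbf_kernel(T_train, T_train, sigma_f, ell) + sigma_n**2 * np.eye(T_train.size)
    K_s = rbf_kernel(T_query, T_train, sigma_f, ell)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    mean = y_mean + K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = sigma_f**2 - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

def hrc_toy(T):
    """Made-up softening curve: 55 HRC at 200 C down to 25 HRC at 700 C."""
    return 55.0 - 30.0 * (T - 200.0) / 500.0

# Synthetic data: 6 tempers x 5 replicates = 30 specimens, 0.8 HRC scatter.
rng = np.random.default_rng(0)
T_train = np.repeat(np.linspace(250.0, 650.0, 6), 5)
y_train = hrc_toy(T_train) + rng.normal(0.0, 0.8, T_train.size)

# Query a sampled temper, a gap between tempers, and a far extrapolation.
T_query = np.array([410.0, 450.0, 900.0])
mean, std = gp_posterior(T_train, y_train, T_query,
                         sigma_f=10.0, ell=60.0, sigma_n=0.8)
for T, m, s in zip(T_query, mean, std):
    print(f"T = {T:4.0f} C: HRC = {m:5.1f} +/- {2 * s:4.1f} (2 sigma, latent)")
```

The standard deviation is smallest at the sampled temper, larger in the gap, and returns to \(\sigma_f\) far from the data, which is the behaviour the bullets describe.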

Note

For metallurgical responses with regime changes, prefer Matérn \(\nu{=}5/2\) over RBF: its sample paths are twice differentiable rather than \(C^\infty\), which matches the physics better.
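
For reference, the standard Matérn \(\nu{=}5/2\) kernel (see e.g. Rasmussen & Williams 2006), written in the same notation as the RBF kernel above with \(r = |T - T'|\):

\[ k_{\text{M5/2}}(T, T') = \sigma_f^2 \left(1 + \frac{\sqrt{5}\,r}{\ell} + \frac{5\,r^2}{3\,\ell^2}\right) \exp\!\left(-\frac{\sqrt{5}\,r}{\ell}\right) \]

The hyperparameters \(\ell\) and \(\sigma_f\) keep the same physical reading as for the RBF kernel.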

The fit — described

  • Posterior mean: a smooth softening curve from \(\sim 55\) HRC at 200 °C to \(\sim 25\) HRC at 700 °C.
  • 95% CI ribbon: tight (\(\pm 1\) HRC) at sampled tempers (300, 450, 600 °C), widens to \(\pm 4\) HRC in the gaps.
  • The reliability of the CI is checked on a leave-one-out CV reliability diagram — and only then do we trust the ribbon.

Numerical example. With \(n{=}30\), \(\sigma_n \approx 0.8\) HRC (from replicates), \(\ell \approx 60\) °C: the GP posterior at \(T{=}500\) °C (a held-out point) gives \(\hat{\text{HRC}} = 38.2 \pm 1.6\) (2\(\sigma\)). Spec sheet says 36–40 — we just skipped a destructive test.

When to trust extrapolation — and when not to

  • The GP variance grows back toward \(\sigma_f^2\) as we move away from data.
  • For 21CrMoV5-7: ribbon balloons outside \([200, 700]\) °C — the model is honestly saying “I have not seen tempers below 200 or above 700, my prediction here is essentially the prior.”
  • This is the right answer. Do not clip the variance, do not force-fit a parametric extrapolation. If you need predictions at 750 °C, run the experiment.

Note

The CI growth is only honest if the kernel is correct. A too-long \(\ell\) will make the GP overconfident outside the data. Always cross-check with a held-out CV reliability diagram before you trust extrapolation.

What this enables — direct ROI

  • Pre-GP workflow: 6 destructive tensile + hardness tests per heat treatment recipe, 4 recipes, 24 tests at €200/test → €4 800.
  • Post-GP workflow: 3 anchor tests + GP interpolation. CI verified, sufficient for spec compliance on standard heats. 12 tests, €2 400.
  • Saved per heat-treatment campaign: ~€2 400. Across a year of campaigns: into the tens of thousands.
  • Cost of the GP: half a day of analyst time, no infrastructure.
  • Caveat: the GP does not replace verification at the spec extremes. It replaces redundant tests in the smooth interior of the process window.

Modern small-tabular alternative: TabPFN

  • TabPFN (Hollmann et al. 2025) is a transformer pre-trained on millions of synthetic tabular tasks to do in-context prediction — no per-task fitting.
  • Pass your \(\sim 30\) rows + new query → it returns a calibrated posterior predictive in one forward pass.
  • 2025 version (v2) handles up to \(\sim 10\,000\) rows and is competitive with tuned XGBoost on small-tabular benchmarks.
  • Code: github.com/PriorLabs/TabPFN; install with pip install tabpfn; scikit-learn-compatible API.

When the GP still wins:

  • 1-D smooth process variable + interpretable hyperparameters (length scale = physical correlation length).
  • You need the posterior closed-form for downstream optimisation (BO acquisition functions, gradient w.r.t. inputs).
  • You need to encode prior physics (Matérn smoothness, periodic kernels for cycling processes).

Note

On the 21CrMoV5-7 task TabPFN matches the GP’s leave-one-out RMSE within \(\sim 0.3\) HRC; the GP wins on interpretability of \(\ell\) and \(\sigma_f\). Use whichever your stakeholder will sign off on.

04. Case study B — MC Dropout for SEM defect segmentation

Setup: U-Net on SEM micrographs

  • Task: per-pixel segmentation of SEM micrographs into {matrix, porosity, crack, inclusion}.
  • Architecture: U-Net with dropout layers (\(p \approx 0.2\)) in the bottleneck and decoder.
  • Training data: \(\mathcal{O}(10^3)\) labeled tiles, hand-segmented by an expert. Evaluation set drawn from a different SEM session to expose tool drift (Modarres et al. 2017).
  • The point estimate (argmax over softmax) is fine on the training distribution and bad on the deployment distribution. We need a per-pixel uncertainty map to flag the bad regions.

MC Dropout in practice

  • Training: standard, dropout active.
  • Inference: keep dropout on, run \(T = 30\) stochastic forward passes per image.
  • Per pixel \(i\): collect \(\{p_i^{(t)}\}_{t=1}^{T}\) — softmax distributions over classes.
  • Predictive mean: \(\bar{p}_i = \tfrac{1}{T}\sum_t p_i^{(t)}\) → argmax for the class label.
  • Per-pixel predictive entropy: \(H_i = -\sum_c \bar{p}_{i,c} \log \bar{p}_{i,c}\) — the uncertainty map.
  • \(T{=}30\) is the typical knee — below 10 the variance estimate is too noisy, above 50 you are paying for diminishing returns. Validate \(T\) on a held-out reliability diagram.
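
The predictive mean and entropy above reduce to a few array operations once the \(T\) softmax outputs are stacked. The sketch below assumes the stochastic forward passes have already been collected into an array of shape (T, H, W, C); the toy input replaces a real U-Net, and `mc_dropout_summary` is an illustrative name.

```python
import numpy as np

def mc_dropout_summary(probs):
    """probs: (T, H, W, C) softmax outputs from T stochastic forward passes.

    Returns the per-pixel class label and predictive entropy (in nats).
    """
    p_bar = probs.mean(axis=0)                      # predictive mean, (H, W, C)
    labels = p_bar.argmax(axis=-1)                  # class label per pixel
    entropy = -np.sum(p_bar * np.log(p_bar + 1e-12), axis=-1)
    return labels, entropy

# Toy stand-in for T = 30 passes over a 1 x 2 image with C = 4 classes.
T, C = 30, 4
# Pixel 0: every pass agrees on class 0.
agree = np.tile([0.97, 0.01, 0.01, 0.01], (T, 1))
# Pixel 1: each pass is confident, but in a different class from pass to pass.
disagree = np.eye(C)[np.arange(T) % C] * 0.96 + 0.01
probs = np.stack([agree, disagree], axis=1)[:, None, :, :]   # (T, 1, 2, C)

labels, entropy = mc_dropout_summary(probs)
print("labels :", labels)
print("entropy:", np.round(entropy, 3), "max possible:", round(np.log(C), 3))
```

Pixel 1 shows why the entropy is taken on the averaged distribution: each individual pass looks confident, and only the disagreement across passes reveals the uncertainty.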

Per-pixel uncertainty maps — what they look like

  • Low entropy in the bulk matrix and inside large, well-formed pores: easy classification.
  • High entropy where you would expect:
    • Grain-boundary triple junctions — class boundary on the image.
    • Edge artifacts of the field of view.
    • Charging zones — bright halos around insulators.
    • Tile edges where the U-Net’s receptive field is incomplete.
  • These are diagnostic: the model is honestly uncertain exactly where a human operator would also hesitate.

Reject-for-human-review threshold

  • Pick a per-pixel entropy threshold \(\tau\). Pixels with \(H_i > \tau\) are flagged for operator review; the rest are auto-classified.
  • Sweep \(\tau\) to draw the operating curve: (human-review rate) on the x-axis vs (defect recall) on the y-axis.
  • Typical knee on a benchtop SEM benchmark: at 5% review rate we recover >98% of defect pixels; at 1% review we drop to ~92%.
  • The threshold is a business decision, not an ML decision — it depends on the cost of a missed defect vs the cost of operator time.
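
The sweep over \(\tau\) can be sketched as follows. The recall definition is an assumption made for this example: a true defect pixel counts as recovered if it is either flagged for review or auto-classified as a defect. The pixel data are synthetic, with higher entropy where the toy model is wrong.

```python
import numpy as np

def operating_curve(entropy, pred_is_defect, true_is_defect, taus):
    """Return rows of (review rate, defect recall), one per threshold tau.

    Assumed definition: a true defect pixel is recovered if it is flagged
    for review (H > tau) or auto-classified as a defect.
    """
    rows = []
    for tau in taus:
        flagged = entropy > tau
        recovered = true_is_defect & (flagged | pred_is_defect)
        rows.append((flagged.mean(), recovered.sum() / true_is_defect.sum()))
    return np.array(rows)

# Synthetic pixels: 5% defects, the toy model errs on 10% of pixels,
# and its entropy tends to be higher where it errs.
rng = np.random.default_rng(1)
n = 10_000
true_is_defect = rng.random(n) < 0.05
wrong = rng.random(n) < 0.10
pred_is_defect = np.where(wrong, ~true_is_defect, true_is_defect)
entropy = np.where(wrong, rng.uniform(0.5, 1.4, n), rng.uniform(0.0, 0.7, n))

taus = np.array([-1.0, 0.3, 0.6, 0.9, 2.0])
curve = operating_curve(entropy, pred_is_defect, true_is_defect, taus)
for tau, (review, recall) in zip(taus, curve):
    print(f"tau = {tau:4.1f}: review rate = {review:5.1%}, defect recall = {recall:5.1%}")
```

Raising \(\tau\) lowers both the review rate and the recall; where to stop on that curve is the cost trade-off described above.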

Tool-shift calibration

  • The reliability diagram on SEM #1 (training tool) is well-calibrated.
  • The reliability diagram on SEM #2 (different detector, different bias) is not — the model is overconfident on a slightly different contrast distribution.
  • Practical fix: per-tool temperature scaling. Collect a small (~50 image) calibration set on the new tool and fit a single scalar softmax temperature (not to be confused with the number of MC passes \(T\)) to recalibrate. Cheap, post-hoc, no retraining.
  • Re-run calibration after any: detector swap, gun replacement, sample-coating change, large chamber vent.
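
A minimal sketch of per-tool temperature scaling, assuming the calibration logits have been flattened to shape (N, C) with integer labels. The scalar is found by a plain grid search on the negative log-likelihood; the synthetic logits are deliberately overconfident by a factor of 3, so the fitted temperature should land near 3. Names such as `fit_temperature` are illustrative.

```python
import numpy as np

def nll(logits, labels, temp):
    """Mean negative log-likelihood of softmax(logits / temp)."""
    z = logits / temp
    z = z - z.max(axis=1, keepdims=True)            # stabilise the softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(labels.size), labels].mean()

def fit_temperature(logits, labels, grid):
    """Pick the temperature on the grid that minimises calibration-set NLL."""
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic calibration set: labels follow softmax(clean), but the "model"
# reports logits that are 3x too large, i.e. it is overconfident.
rng = np.random.default_rng(2)
n, C = 20_000, 4
clean = rng.normal(0.0, 1.0, (n, C))
# Gumbel-max trick: argmax(logits + Gumbel noise) samples from softmax(logits).
labels = np.argmax(clean + rng.gumbel(size=(n, C)), axis=1)
overconfident = 3.0 * clean

grid = np.linspace(0.25, 10.0, 400)
temp_hat = fit_temperature(overconfident, labels, grid)
print(f"fitted temperature: {temp_hat:.2f}")
print(f"NLL before: {nll(overconfident, labels, 1.0):.3f}, "
      f"after: {nll(overconfident, labels, temp_hat):.3f}")
```

Dividing logits by a positive scalar does not change the argmax, so the segmentation itself is untouched; only the reported confidence moves.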

Note

“My segmentation accuracy dropped after the chamber vent” is a calibration failure as often as a model failure. Diagnose with a reliability diagram before retraining.

05. Case study C — Active learning loop for AM process windows

Setup: laser powder-bed fusion process map

  • Process: laser powder-bed fusion (L-PBF) on a benchtop printer, single material.
  • Two-axis design space: laser power \(P \in [80, 350]\) W and scan speed \(v \in [200, 1500]\) mm/s.
  • Goal: identify the process window — the region in \((P, v)\) where relative density \(\rho_{\text{rel}} > 0.995\) and no melt-pool collapse is observed.
  • Each experiment: print a \(\sim 5\) mm cube, cross-section it, and measure \(\rho_{\text{rel}}\). Cost: about €300 in materials, machine time, and metallography.

The active-learning loop

  • GP surrogate: \(\rho_{\text{rel}}(P, v)\) with a 2-D RBF or Matérn kernel, separate length scales \(\ell_P, \ell_v\).
  • Acquisition function:
    • Upper Confidence Bound: \(\alpha_{\text{UCB}}(P,v) = \mu(P,v) + \beta\,\sigma(P,v)\).
    • Expected Improvement: \(\alpha_{\text{EI}}\) favours points likely to beat the current best.
  • Choose the next experiment to maximize \(\alpha\): trade off mean (exploitation) vs variance (exploration) (Hernández-Lobato et al. 2014).
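
Both acquisition functions are one-liners once the GP posterior \(\mu, \sigma\) is available at the candidate points. The sketch below uses the standard closed form for maximization, \(\alpha_{\text{EI}} = (\mu - f^\ast)\,\Phi(z) + \sigma\,\phi(z)\) with \(z = (\mu - f^\ast)/\sigma\) and \(f^\ast\) the best value observed so far. The three candidate values and \(\beta = 2\) are made up for illustration.

```python
import math
import numpy as np

def ucb(mu, sigma, beta=2.0):
    """Upper Confidence Bound: mu + beta * sigma."""
    return mu + beta * sigma

def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximisation under a Gaussian posterior."""
    sigma = np.maximum(sigma, 1e-12)                # guard against sigma = 0
    z = (mu - best) / sigma
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf

# Hypothetical posterior on relative density at three candidate (P, v) points.
mu = np.array([0.990, 0.996, 0.993])
sigma = np.array([0.001, 0.001, 0.004])
best_so_far = 0.995

a_ucb = ucb(mu, sigma, beta=2.0)
a_ei = expected_improvement(mu, sigma, best_so_far)
print("UCB picks candidate", int(np.argmax(a_ucb)))
print("EI  picks candidate", int(np.argmax(a_ei)))
```

On these numbers the two rules disagree: UCB with \(\beta = 2\) goes for the uncertain candidate, while EI prefers the one whose mean already beats the current best. The choice of \(\beta\) sets how strongly UCB explores.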

Closed loop with the printer + safety constraints

  • Hard constraints the acquisition cannot violate:
    • \(P/v\) ratio — prevent keyhole regime that damages optics.
    • Absolute caps — machine specs.
    • “No-go” zones from prior failures — flagged manually.
  • Constrained Bayesian optimization: \(\max_{(P,v)} \alpha(P,v)\) subject to \(g_k(P,v) \leq 0\).
  • Implementation: rejection sampling on the candidate set, or a second GP modeling the constraint probability.
  • Budget cap: stop the loop after \(N_{\max} = 30\) experiments or when the process-window area stabilizes — whichever comes first.
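
The rejection variant of constrained acquisition can be sketched on a candidate grid: mask out everything that violates a constraint, then take the argmax over what is left. The \(P/v\) cap of 0.5 J/mm, the no-go rectangle, and the toy acquisition values are all hypothetical; the absolute machine caps are enforced here simply by the bounds of the grid.

```python
import numpy as np

def pick_next(P, v, alpha, energy_cap, no_go):
    """Index of the feasible candidate with the highest acquisition value.

    energy_cap: upper limit on P / v (hypothetical keyhole guard).
    no_go: list of (P_lo, P_hi, v_lo, v_hi) rectangles flagged manually.
    """
    feasible = P / v <= energy_cap
    for P_lo, P_hi, v_lo, v_hi in no_go:
        inside = (P >= P_lo) & (P <= P_hi) & (v >= v_lo) & (v <= v_hi)
        feasible &= ~inside
    if not feasible.any():
        raise ValueError("no feasible candidate left")
    return int(np.argmax(np.where(feasible, alpha, -np.inf)))

# 14 x 14 candidate grid over the design space of this case study.
P_grid, v_grid = np.meshgrid(np.linspace(80.0, 350.0, 14),
                             np.linspace(200.0, 1500.0, 14))
P, v = P_grid.ravel(), v_grid.ravel()

# Toy acquisition that favours high energy input, so the unconstrained
# argmax lands in the forbidden corner of the design space.
alpha = P / v

idx = pick_next(P, v, alpha, energy_cap=0.5, no_go=[(300.0, 350.0, 200.0, 400.0)])
print(f"next experiment: P = {P[idx]:.0f} W, v = {v[idx]:.0f} mm/s, P/v = {P[idx] / v[idx]:.2f}")
```

Rejection on a grid is simple and auditable; modeling the constraint with a second GP becomes worthwhile when the feasible region is itself unknown and has to be learned.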

Result: Pareto frontier of effort vs window discovered

  • Plot: x-axis = number of experiments run, y-axis = area (in \((P, v)\) units) classified with \(P[\rho_{\text{rel}} > 0.995] > 0.9\).
  • Active-learning curve climbs steeply early — UCB / EI route experiments straight to the boundary of the process window.
  • Grid-search baseline: \(14 \times 14 = 196\) experiments to span the design at the same resolution.
  • Typical result on this kind of L-PBF problem: ~30 active-learning experiments find the process-window area that grid-search needs ~200 for — a roughly 6–7× reduction in experimental cost (Hernández-Lobato et al. 2014).
  • At €300/experiment: \((196 - 30) \times 300 = 49\,800\), i.e. roughly €50 000 saved per material.

Materials-acceleration-platform framing

  • This loop — surrogate model + acquisition + closed-loop instrument + safety constraints — is the prototype of a self-driving lab.
  • Aspuru-Guzik and Berlinguette groups have built versions of this loop for catalysis and thin-film electrochemistry. Same pattern, different instrument (Hernández-Lobato et al. 2014).
  • The only ingredient that makes this work is honest UQ. With overconfident or miscalibrated \(\sigma\), the acquisition function picks the wrong next experiment and you waste your budget. UQ is not a slide at the end — it is the engine.

06. Calibration & deployment hygiene

Reliability diagrams on lab data

  • Bin predictions into 10 confidence bins. For each bin, compare predicted confidence vs observed accuracy (or coverage of the CI).
  • Perfect calibration: diagonal.
  • Above the diagonal: under-confident. Below: over-confident (the dangerous failure mode).
  • Where typical materials models break:
    • Out-of-distribution alloy family — model is highly confident, accuracy collapses.
    • Different microscope / coating / operator — softmax stays at 0.95+, accuracy drops 20 points.
    • Long-tail rare defect classes — confidence is high on the wrong class.
  • Always run a reliability diagram on a held-out, lab-realistic calibration set — not a random split of the training data.
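
The binning step can be sketched as below for a classifier: per-bin mean confidence vs accuracy, plus the expected calibration error (ECE) as the count-weighted gap. Inputs are the top-class confidence and a 0/1 correctness flag per prediction; the two toy inputs are constructed so that the answer is known by hand.

```python
import numpy as np

def reliability_table(conf, correct, n_bins=10):
    """Per-bin (mean confidence, accuracy, count) and the ECE."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0 .. n_bins - 1.
    bins = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    rows, ece = [], 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue                                # skip empty bins
        c, a = conf[mask].mean(), correct[mask].mean()
        rows.append((c, a, int(mask.sum())))
        ece += mask.mean() * abs(a - c)             # weight by bin population
    return rows, ece

# Calibrated toy case: 90% confidence, 9 of 10 correct.
rows_ok, ece_ok = reliability_table([0.9] * 10, [1] * 9 + [0])
# Overconfident toy case: 95% confidence, only 15 of 20 correct.
rows_bad, ece_bad = reliability_table([0.95] * 20, [1] * 15 + [0] * 5)
print(f"calibrated ECE = {ece_ok:.3f}, overconfident ECE = {ece_bad:.3f}")
```

In the overconfident case the single populated bin sits below the diagonal (accuracy 0.75 at confidence 0.95), which is the dangerous side described above.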

OOD detection — when the model sees something new

  • Symptoms: confidence is high, prediction is wrong. Calibration alone cannot save you — you need to detect the OOD case and refuse to predict.

Practical detectors:

  • Mahalanobis distance in feature space (penultimate-layer activations vs the training-set Gaussian) — cheap, surprisingly effective on microscopy.
  • Ensemble disagreement — for a deep ensemble, high prediction variance across members \(\Rightarrow\) OOD. Free if you already have an ensemble.
  • GP variance — for GP surrogates, \(\sigma^2 \to \sigma_f^2\) flags inputs far from training. This is the same mechanism as the AL loop in §5.
  • Workflow: train your model, fit a Mahalanobis detector on training features, reject any inference with detector score above a threshold and route to a human.
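
The workflow in the last bullet can be sketched with a small detector class. It assumes the penultimate-layer features have already been extracted into an array of shape (N, d); the random features, the 99% quantile used to set the threshold, and the class name are illustrative choices, not values from the lecture.

```python
import numpy as np

class MahalanobisDetector:
    """Fit a single Gaussian to training features; score by squared distance."""

    def fit(self, feats, quantile=0.99):
        self.mean_ = feats.mean(axis=0)
        # Small ridge keeps the covariance invertible for correlated features.
        cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
        self.prec_ = np.linalg.inv(cov)
        # Threshold: the chosen quantile of the training scores themselves.
        self.tau_ = float(np.quantile(self.score(feats), quantile))
        return self

    def score(self, feats):
        d = feats - self.mean_
        return np.einsum("ij,jk,ik->i", d, self.prec_, d)

    def is_ood(self, feats):
        return self.score(feats) > self.tau_

# Stand-in features: in-distribution cloud vs a shifted cloud ("new tool").
rng = np.random.default_rng(3)
train_feats = rng.normal(0.0, 1.0, (500, 8))
shifted_feats = rng.normal(6.0, 1.0, (5, 8))

det = MahalanobisDetector().fit(train_feats)
print("threshold:", round(det.tau_, 1))
print("shifted flagged:", det.is_ood(shifted_feats))
```

Setting the threshold from a training-score quantile fixes the false-rejection rate on in-distribution data up front; a held-out set from the deployment tool is a better basis when one is available.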

Recall: conformal coverage as the wrapper layer (MFML W7)

  • Split conformal (Angelopoulos and Bates 2023) and CQR (Romano et al. 2019) were derived in MFML W7: a finite-sample, distribution-free coverage guarantee around any mean predictor.
  • We do not re-derive them here. Today’s job is deployment: which lab tool gets which wrapper, and what coverage to report.

Materials deployment defaults:

  • GP (§3) → split conformal. GP CI is already smooth; wrap to convert the model-conditional ribbon into a frequentist coverage guarantee on the held-out lab batch.
  • MC Dropout / U-Net (§4) → split conformal per class on the held-out tile set; gives a calibrated per-pixel set-valued prediction.
  • Heteroscedastic regression near a regime boundary (§5 keyhole) → CQR. Constant-width CP is wasteful where noise is locally regime-dependent.

Why we promote it from “footnote” to “default”:

  • Wraps any mean predictor — GP, ensemble, MC-Dropout, TabPFN.
  • Finite-sample guarantee on the \(n \sim 30\) specimens of §3.
  • No Gaussianity assumption on residuals.
  • The only assumption is exchangeability of calibration and test data — and that fails the moment you change tool, operator, or alloy family.
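
As a reminder of how little machinery the wrapper needs, here is a minimal split-conformal sketch for regression with absolute-residual scores: sort the calibration residuals and take the \(\lceil (n+1)(1-\alpha) \rceil\)-th smallest as the half-width. The calibration values are toy numbers; the function name is illustrative.

```python
import math
import numpy as np

def conformal_halfwidth(y_cal, yhat_cal, alpha=0.1):
    """Half-width q such that yhat +/- q has >= 1 - alpha marginal coverage,
    assuming calibration and test points are exchangeable."""
    scores = np.sort(np.abs(np.asarray(y_cal, float) - np.asarray(yhat_cal, float)))
    n = scores.size
    k = math.ceil((n + 1) * (1.0 - alpha))          # rank of the conformal quantile
    # If k exceeds n, the calibration set is too small for this coverage level.
    return float("inf") if k > n else float(scores[k - 1])

# Toy calibration set: residuals 1, 2, ..., 10 around a zero prediction.
y_cal = np.arange(1.0, 11.0)
yhat_cal = np.zeros(10)
q = conformal_halfwidth(y_cal, yhat_cal, alpha=0.2)
print("half-width at 80% coverage:", q)

# With n = 30 calibration specimens, 95% uses the largest residual,
# and 99% is not attainable at all.
resid30 = np.linspace(0.1, 3.0, 30)
q95 = conformal_halfwidth(resid30, np.zeros(30), alpha=0.05)
q99 = conformal_halfwidth(resid30, np.zeros(30), alpha=0.01)
print("n = 30:", q95, q99)
```

The second example is the practical limit for the \(n \sim 30\) regime: the guarantee is finite-sample, but the attainable coverage level is bounded by the size of the calibration set.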

Warning

Materials-specific failure mode. Tool drift (§4 SEM #1 → #2) breaks exchangeability silently. Always re-run coverage on a per-tool calibration set, or accept that the guarantee is gone.

Reproducibility hygiene

  • Random seed logged for: data split, model init, dropout sampling, ensemble members, BO acquisition.
  • Model card alongside the model artifact: training data version, test metrics, calibration plot, OOD detector ROC, list of known failure modes.
  • Dataset version pin: hash of the labeled set; never silently update.

The one-paragraph “uncertainty section” of a model card:

“Uncertainty is reported as 95% CIs from \(T{=}30\) MC-dropout passes. The model is calibrated by temperature scaling on a 200-image held-out set; expected calibration error 0.03. The model is in-distribution iff Mahalanobis score \(<\tau_{\text{OOD}} = 14.2\); outside that, predictions are not returned.”

Note

If you cannot write that paragraph for your model, you cannot deploy it.

What MFML W7 + W12 cover that we skipped

Pointer slide. If you want the math behind today’s tools:

MFML W7 (probabilistic view of learning):

  • Full split conformal derivation, 5-line algorithm, and the exchangeability proof.
  • CQR with pinball loss, conformalisation step, and the failure-mode slide (drift, weighted CP, Jackknife+).

MFML W12 (uncertainty in predictions):

  • Bayesian predictive distribution and the variance decomposition (aleatory + epistemic).
  • Marginal likelihood / evidence framework as automatic Occam’s razor.
  • Closed-form GP posterior (mean + variance), kernel hyperparameter learning by log marginal likelihood.
  • ELBO and variational interpretation of MC Dropout.
  • Calibration formalism (reliability, ECE) and recalibration methods (temperature scaling, Platt, isotonic).
  • TabPFN’s prior-data fitted network — what the pre-training prior actually is.
  • In ML-PC: we use these results on real lab data. Pick the right tool, validate it on a reliability diagram, and trust the answer enough to skip a destructive test or pick the next experiment.

07. Wrap

Recap: Unit 11

  1. Pick the UQ tool per task: GP or TabPFN for small tabular data, MC dropout for trained CNNs, ensembles for high-stakes regression, MDN for multi-modal outputs.
  2. Calibrate before deploying — a reliability diagram on a held-out lab-realistic set is non-negotiable.
  3. Wrap with conformal (MFML W7) — split conformal or CQR adds finite-sample, distribution-free coverage on top of any mean predictor. Mandatory for safety-critical or regulator-facing models.
  4. UQ enables active learning — and active learning is how labs scale beyond the brute-force grid search.
  5. OOD detection complements calibration — confidence is meaningless when the input is outside the training distribution.
  6. The math lives in MFML W7 (conformal) + W12 (Bayes, GP, calibration). ML-PC Unit 11 is about using that math to save tests, find process windows, and ship trustworthy models.


References & further reading

  • Mantzoukas et al. (2021): 21CrMoV5-7 quench-and-temper data used in §3.
  • Modarres et al. (2017): SEM benchmark for the segmentation/classification setup of §4.
  • Gal & Ghahramani (2016): MC Dropout as approximate Bayesian inference.
  • Lakshminarayanan et al. (2017): Deep ensembles, the empirical gold standard for NN UQ.
  • Hernández-Lobato et al. (2014): Predictive entropy search; foundational for BO/active learning in materials.
  • Guo et al. (2017): Modern NNs are miscalibrated; temperature scaling.
  • Angelopoulos & Bates (2023): Gentle introduction to split conformal prediction.
  • Romano, Patterson & Candès (2019): Conformalized Quantile Regression.
  • Hollmann et al. (2025): TabPFN v2, foundation model for tabular data.
  • Rasmussen & Williams (2006): GP reference text.
  • Aspuru-Guzik & Berlinguette: self-driving lab perspective; materials-acceleration platforms.
  • MFML W12: full mathematical treatment of all methods used here.
Angelopoulos, Anastasios N., and Stephen Bates. 2023. “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” Foundations and Trends in Machine Learning 16 (4): 494–591. https://doi.org/10.1561/2200000101.
Gal, Yarin, and Zoubin Ghahramani. 2016. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” International Conference on Machine Learning, 1050–59. https://proceedings.mlr.press/v48/gal16.pdf.
Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. “On Calibration of Modern Neural Networks.” International Conference on Machine Learning, 1321–30.
Hernández-Lobato, José Miguel, Matthew W. Hoffman, and Zoubin Ghahramani. 2014. “Predictive Entropy Search for Efficient Global Optimization of Black-Box Functions.” Advances in Neural Information Processing Systems 27.
Hollmann, Noah, Samuel Müller, Lennart Purucker, et al. 2025. “Accurate Predictions on Small Data with a Tabular Foundation Model.” Nature 637: 319–26. https://doi.org/10.1038/s41586-024-08328-6.
Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2017. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” Advances in Neural Information Processing Systems 30. https://papers.nips.cc/paper_files/paper/2017/hash/9ef2ed4b7fd2c810847ffa5fa85bce38-Abstract.html.
Mantzoukas, John, Dimitris G. Papageorgiou, Carmen Medrea, and Constantinos Stergiou. 2021. “Hardness Behavior of W. Nr. 1.7709 Steel, Oil Quenched and Tempered Between 475–575 c.” MATEC Web of Conferences 349: 02005. https://doi.org/10.1051/matecconf/202134902005.
Modarres, Mohammad Hadi, Rossella Aversa, Stefano Cozzini, Regina Ciancio, Angelo Leto, and Giuseppe Piero Brandino. 2017. “Neural Network for Nanoscience Scanning Electron Microscope Image Recognition.” Scientific Reports 7: 13282. https://doi.org/10.1038/s41598-017-13565-z.
Romano, Yaniv, Evan Patterson, and Emmanuel J. Candès. 2019. “Conformalized Quantile Regression.” Advances in Neural Information Processing Systems 32.