Mathematical Foundations of AI & ML Unit 12: Uncertainty in Predictions
Prof. Dr. Philipp Pelz
FAU Erlangen-Nürnberg
Title + Unit 12 positioning
Unit 11 discovered structure in data. Unit 12 asks: how confident are our predictions?
Single-point predictions are insufficient for engineering decisions.
We need principled uncertainty quantification — from Bayesian inference to Gaussian Processes.
Learning outcomes for Unit 12
By the end of this lecture, students can:
derive the Bayesian predictive distribution and interpret the variance decomposition,
describe the evidence framework and the concept of effective parameters,
define a GP, derive its posterior, and interpret uncertainty bands,
compare practical UQ methods: MC Dropout, ensembles, and MDNs.
Why point predictions are not enough
A model predicting 450 MPa tensile strength is useless without knowing if the uncertainty is \(\pm 5\) or \(\pm 100\) MPa.
In safety-critical applications, the uncertainty drives the decision, not the prediction.
Overconfident models are more dangerous than inaccurate ones.
Recall: aleatory vs epistemic uncertainty (Unit 8)
Aleatory : inherent noise in the data-generating process — irreducible.
Epistemic : uncertainty from limited data or model capacity — reducible.
A complete UQ framework must quantify and distinguish both types.
As training data grows, epistemic uncertainty should shrink; aleatory uncertainty should not.
The Bayesian predictive distribution
Instead of predicting with a single \(\hat{\theta}\) , integrate over all plausible \(\theta\) :
\[
p(\mathbf{y}^* | \mathbf{x}^*, \mathcal{D}) = \int p(\mathbf{y}^* | \mathbf{x}^*, \theta) \, p(\theta | \mathcal{D}) \, d\theta
\]
This accounts for parameter uncertainty — the full posterior contributes to the prediction.
The result is a distribution over predictions, not a single point.
Variance decomposition
\[
\text{Var}[\mathbf{y}^*] = \underbrace{\mathbb{E}_\theta[\sigma^2(\theta)]}_{\text{aleatory}} + \underbrace{\text{Var}_\theta[\boldsymbol{\mu}(\theta)]}_{\text{epistemic}}
\]
Aleatory component: average noise variance across parameter values.
Epistemic component: how much the prediction mean varies across plausible parameters.
The epistemic term shrinks with more data; the aleatory term does not.
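To make the decomposition concrete, here is a minimal sketch (not part of the lecture materials) assuming we already have posterior samples of the predictive mean \(\mu(\theta_s)\) and noise variance \(\sigma^2(\theta_s)\) at one test input; all numbers are hypothetical.

```python
# Hypothetical posterior samples at a single test point x* (illustrative values only).
import numpy as np

mu_samples = np.array([448.0, 452.0, 455.0, 449.0])   # mu(theta_s), e.g. in MPa
sigma2_samples = np.array([25.0, 30.0, 27.0, 26.0])   # sigma^2(theta_s)

aleatory = sigma2_samples.mean()     # E_theta[sigma^2(theta)]
epistemic = mu_samples.var()         # Var_theta[mu(theta)]
total_variance = aleatory + epistemic
print(f"aleatory={aleatory:.1f}  epistemic={epistemic:.1f}  total={total_variance:.1f}")
```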
Point estimates vs full distributions
| Approach | Prediction | Uncertainty | Cost |
|---|---|---|---|
| MLE/MAP | Single \(\hat{\mathbf{y}}\) | None (or ad-hoc) | Low |
| Bayesian (exact) | Full \(p(\mathbf{y}^*\mid\mathbf{x}^*,\mathcal{D})\) | Principled | High |
| Bayesian (approx.) | Approximate distribution | Approximate | Moderate |
When uncertainty matters most
Safety-critical : structural components, medical devices — failure consequences are severe.
Expensive experiments : each new alloy costs $10K to synthesize — guide experiments with uncertainty.
Extrapolation : new compositions, extreme conditions — the model is outside its training domain.
Active learning : acquire data where uncertainty is highest for maximum information gain.
Practical UQ: a taxonomy
Exact Bayesian : Gaussian Processes — closed-form posterior, principled, but \(O(N^3)\) .
Approximate Bayesian : MC Dropout, variational inference — scale to large data, approximate.
Frequentist ensembles : deep ensembles — no Bayesian formalism, practical uncertainty from disagreement.
Direct prediction : MDNs — predict distribution parameters directly.
Roadmap of today’s 90 min
10–25 min : Bayesian predictive distribution and variance decomposition.
25–40 min : Evidence framework, marginal likelihood, effective parameters.
40–60 min : Gaussian Processes — definition, posterior, uncertainty bands.
60–75 min : Practical UQ — MC Dropout, ensembles, MDNs, stochastic enrichment.
75–85 min : Calibration and engineering applications.
The marginal likelihood (evidence)
\[
p(\mathcal{D} | \mathcal{M}) = \int p(\mathcal{D} | \theta, \mathcal{M}) \, p(\theta | \mathcal{M}) \, d\theta
\]
Measures how well model \(\mathcal{M}\) explains the data, averaging over all parameter values.
Automatically balances fit (likelihood) and complexity (prior spread) [@murphy2012machine] .
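For Bayesian linear regression the evidence is available in closed form (see Bishop 2006, Ch. 3). The sketch below is a minimal illustration, not the lecture's reference implementation, assuming a design matrix `Phi`, targets `y`, prior precision `alpha`, and noise precision `beta`.

```python
import numpy as np

def log_evidence(Phi, y, alpha, beta):
    """Closed-form log marginal likelihood for Bayesian linear regression."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi        # posterior precision matrix
    m_N = beta * np.linalg.solve(A, Phi.T @ y)        # posterior mean of the weights
    E_mN = 0.5 * beta * np.sum((y - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2 * np.pi))
```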
Evidence as automatic Occam’s razor
Simple model : prior concentrated on few parameter values → high evidence if data is simple.
Complex model : prior spread thinly over many parameters → lower evidence unless data demands complexity.
The evidence automatically penalizes unnecessary complexity — no need for explicit regularization.
Model comparison via evidence
Posterior odds: \(\frac{p(\mathcal{M}_1 | \mathcal{D})}{p(\mathcal{M}_2 | \mathcal{D})} = \frac{p(\mathcal{D} | \mathcal{M}_1)}{p(\mathcal{D} | \mathcal{M}_2)} \cdot \frac{p(\mathcal{M}_1)}{p(\mathcal{M}_2)}\) ; the evidence ratio \(\frac{p(\mathcal{D} | \mathcal{M}_1)}{p(\mathcal{D} | \mathcal{M}_2)}\) is the Bayes factor.
With equal model priors: the model with higher evidence is preferred.
Unlike cross-validation, this uses all the data for both fitting and evaluation.
Effective number of parameters
\[
\gamma = \sum_i \frac{\lambda_i}{\lambda_i + \alpha}
\]
\(\lambda_i\) : eigenvalues of the likelihood Hessian (e.g., \(\beta \boldsymbol{\Phi}^\top \boldsymbol{\Phi}\) in Bayesian linear regression). \(\alpha\) : prior precision.
\(\gamma \leq\) total number of parameters. Often \(\gamma \ll\) total parameters.
Interpretation: only \(\gamma\) parameters are effectively constrained by the data [@bishop2006pattern] .
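A minimal sketch of computing \(\gamma\), assuming the same Bayesian linear regression setting as above (design matrix `Phi`, prior precision `alpha`, noise precision `beta`):

```python
import numpy as np

def effective_parameters(Phi, alpha, beta):
    # lambda_i: eigenvalues of the likelihood Hessian beta * Phi^T Phi
    lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)
    return np.sum(lam / (lam + alpha))   # gamma <= M, often gamma << M
```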
Empirical Bayes
Instead of fixing hyperparameters (prior variance, noise level), optimize them by maximizing the evidence.
\(\hat{\alpha}, \hat{\sigma}^2 = \arg\max_{\alpha, \sigma^2} \log p(\mathcal{D} | \alpha, \sigma^2)\) .
This is a principled alternative to cross-validation for hyperparameter selection.
Also called type-II maximum likelihood .
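As one practical illustration (not from the lecture), scikit-learn's `BayesianRidge` performs exactly this type-II maximum likelihood, re-estimating the weight and noise precisions by maximizing the evidence on synthetic data:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = 0.5 * X[:, 0] + rng.normal(scale=0.3, size=50)   # synthetic noisy linear data

model = BayesianRidge(compute_score=True)   # hyperparameters set by evidence, no CV
model.fit(X, y)
print("noise precision alpha_:", model.alpha_)
print("weight precision lambda_:", model.lambda_)
print("final log marginal likelihood:", model.scores_[-1])
```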
Checkpoint: evidence interpretation
Question : Model A has 100 parameters and log-evidence −500. Model B has 10 parameters and log-evidence −480. Which is preferred?
Answer : Model B. Its log-evidence (−480) is higher than Model A's (−500), and the evidence already accounts for model complexity, so the simpler model is preferred.
What is a Gaussian Process?
A GP is a distribution over functions: \(f \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))\) .
Any finite collection of function values \([f(\mathbf{x}_1), \dots, f(\mathbf{x}_N)]\) is jointly Gaussian.
The GP is fully specified by its mean function \(m(\mathbf{x})\) and kernel function \(k(\mathbf{x}, \mathbf{x}')\) .
```mermaid
graph LR
  A[GP Prior] -- Kernel + Mean --> B{Condition on Data}
  B -- Training Data --> C[GP Posterior]
  C -- Prediction --> D[Uncertainty Bands]
```
GP as infinite-dimensional Gaussian
A multivariate Gaussian is a distribution over vectors .
A GP extends this to a distribution over functions (infinite-dimensional objects).
The kernel function \(k(\mathbf{x}, \mathbf{x}')\) plays the role of the covariance matrix.
This is a Bayesian nonparametric model — complexity grows with data [@murphy2012machine] .
Mean function m(x)
\(m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]\) : the expected function value at each input.
Common choice: \(m(\mathbf{x}) = 0\) (zero-mean prior).
Can encode prior knowledge: \(m(\mathbf{x}) = a\mathbf{x} + b\) for a linear trend.
The mean function is updated to the posterior mean after observing data.
Kernel (covariance) function k(x, x’)
\(k(\mathbf{x}, \mathbf{x}') = \text{Cov}[f(\mathbf{x}), f(\mathbf{x}')]\) : encodes the correlation between function values.
The kernel determines the properties of sampled functions:
Smoothness, periodicity, length scale, amplitude.
The kernel must be positive semi-definite (valid covariance matrix for any finite set of points).
The RBF (squared exponential) kernel
\[
k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right)
\]
Length scale \(\ell\) : controls how far apart inputs can be and still be correlated.
Signal variance \(\sigma_f^2\) : controls the amplitude of function variation.
Produces infinitely differentiable (very smooth) functions.
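A minimal NumPy sketch of this kernel (inputs as rows of `X1` and `X2`); this is the kernel the exercise asks you to implement:

```python
import numpy as np

def rbf_kernel(X1, X2, ell=1.0, sigma_f=1.0):
    """Squared-exponential kernel k(x, x') = sigma_f^2 exp(-||x - x'||^2 / (2 ell^2))."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return sigma_f**2 * np.exp(-0.5 * sq_dists / ell**2)
```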
Other kernels
Matérn : involves a modified Bessel function; smoothness is adjustable via the parameter \(\nu\) (e.g., \(\nu = 3/2, 5/2\)).
Periodic : captures repeating patterns.
Linear : \(k(\mathbf{x},\mathbf{x}') = \sigma^2 \mathbf{x}^\top \mathbf{x}'\) — equivalent to Bayesian linear regression.
Composite : sums and products of kernels combine properties (e.g., smooth + periodic).
GP prior: sampling functions
Before seeing data, sample functions from the prior: \(f \sim \mathcal{GP}(0, k)\) .
With RBF kernel: smooth, random functions with length scale \(\ell\) and amplitude \(\sigma_f\) .
Different kernel parameters produce visually different function families.
The prior encodes our beliefs about what functions are plausible.
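A minimal sketch of drawing prior samples, reusing the `rbf_kernel` sketch above; the small jitter keeps the covariance numerically positive definite:

```python
import numpy as np

X = np.linspace(-5, 5, 200)[:, None]                               # dense grid of inputs
K = rbf_kernel(X, X, ell=1.0, sigma_f=1.0) + 1e-8 * np.eye(len(X))
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(X)), K, size=3)
# each row of `samples` is one plausible function under the GP prior
```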
GP posterior: conditioning on data
Observe \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N\) with \(y_i = f(\mathbf{x}_i) + \epsilon\) , \(\epsilon \sim \mathcal{N}(0, \sigma_n^2)\) .
The posterior \(f | \mathcal{D}\) is also a GP with updated mean and covariance.
The posterior passes through (or near) the training points.
Away from data, the posterior reverts to the prior.
GP posterior: closed-form formulas
\[
\boldsymbol{\mu}^*(\mathbf{x}^*) = \mathbf{k}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}
\]
\[
\sigma^{*2}(\mathbf{x}^*) = k(\mathbf{x}^*, \mathbf{x}^*) - \mathbf{k}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_*
\]
\(\mathbf{K}\) : kernel matrix \([k(\mathbf{x}_i, \mathbf{x}_j)]_{N \times N}\) . \(\mathbf{k}_*\) : vector \([k(\mathbf{x}^*, \mathbf{x}_i)]\) .
The key operation is inverting \((\mathbf{K} + \sigma_n^2 \mathbf{I})\) — cost \(O(N^3)\) .
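A minimal sketch of these formulas (Cholesky factorization instead of an explicit inverse, for numerical stability), assuming training inputs `X`, targets `y`, test inputs `Xs`, a kernel function `k`, and noise variance `sn2`; the exercise builds essentially this function:

```python
import numpy as np

def gp_posterior(X, y, Xs, k, sn2):
    K = k(X, X) + sn2 * np.eye(len(X))     # K + sigma_n^2 I
    Ks = k(X, Xs)                          # cross-covariances k_* (N x N*)
    Kss = k(Xs, Xs)                        # test covariances
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sn2 I)^{-1} y
    mu = Ks.T @ alpha                      # posterior mean at the test inputs
    V = np.linalg.solve(L, Ks)
    var = np.diag(Kss - V.T @ V)           # posterior variance at the test inputs
    return mu, var
```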
GP posterior: interpretation
Mean \(\boldsymbol{\mu}^*(\mathbf{x}^*)\) : best prediction — a weighted combination of training outputs.
Variance \(\sigma^{*2}(\mathbf{x}^*)\) :
Small near training data (low epistemic uncertainty).
Large far from training data (high epistemic uncertainty).
Approaches prior variance \(\sigma_f^2\) as distance from data grows.
GP uncertainty bands
Plot \(\boldsymbol{\mu}(\mathbf{x}) \pm 2\sigma(\mathbf{x})\) : the 95% credible band .
Bands are narrow near observed data (confident predictions).
Bands widen away from data (uncertain predictions).
This is honest uncertainty — the GP admits what it does not know.
GP hyperparameter learning
Optimize kernel hyperparameters \(\ell, \sigma_f, \sigma_n\) by maximizing the log marginal likelihood :
\[
\log p(\mathbf{y} | \mathbf{X}) = -\frac{1}{2}\mathbf{y}^\top(\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1}\mathbf{y} - \frac{1}{2}\log|\mathbf{K} + \sigma_n^2 \mathbf{I}| - \frac{N}{2}\log 2\pi
\]
Three terms: data fit, complexity penalty, normalization.
Gradient-based optimization (L-BFGS is common).
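A minimal sketch of evaluating this objective via a Cholesky factorization; in practice its negative is minimized over \(\log \ell, \log \sigma_f, \log \sigma_n\) with an optimizer such as `scipy.optimize.minimize(method="L-BFGS-B")`:

```python
import numpy as np

def gp_log_marginal_likelihood(X, y, k, sn2):
    N = len(X)
    K = k(X, X) + sn2 * np.eye(N)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                      # -1/2 y^T (K + sn2 I)^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))         # -1/2 log|K + sn2 I|
    return data_fit + complexity - 0.5 * N * np.log(2.0 * np.pi)
```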
Length scale effect
Short \(\ell\) : wiggly functions, fits local patterns (and possibly noise).
Long \(\ell\) : smooth functions, captures global trends (may miss local structure).
Optimal \(\ell\) : balances data fit and smoothness — determined by marginal likelihood.
Visualize: same data, three length scales → very different posterior functions.
GP: computational cost
Training: \(O(N^3)\) for matrix inversion + \(O(N^2)\) storage.
Prediction: \(O(N)\) per test point for the mean and \(O(N^2)\) for the variance (after training).
Practical limit: \(N \approx 10^3 - 10^4\) for exact GPs.
Approximations exist for larger datasets: sparse GPs, inducing points, random features.
GP: strengths and limitations
Strengths
Principled uncertainty quantification
Automatic complexity control (evidence)
Interpretable hyperparameters
Works well with small data
Limitations
\(O(N^3)\) training cost
Kernel design requires domain knowledge
Gaussian assumption may be limiting
Scales poorly to high-dimensional inputs
Checkpoint: GP prediction
Question : A GP is trained on 10 data points. You query a point very far from all training data. What happens to the uncertainty?
Answer : The posterior variance grows toward the prior variance \(\sigma_f^2\) . The GP honestly reports high uncertainty in unexplored regions.
Mixture-Density Networks (MDNs)
A standard NN outputs a single \(\hat{\mathbf{y}}\) (or \(\hat{y}\) ). An MDN outputs parameters of a mixture of Gaussians :
\[
p(\mathbf{y}|\mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x}) \, \mathcal{N}(\mathbf{y} | \boldsymbol{\mu}_k(\mathbf{x}), \sigma_k^2(\mathbf{x}))
\]
The network predicts mixing coefficients, means, and variances — all functions of input \(\mathbf{x}\) [@neuer2024machine] .
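A minimal PyTorch sketch of an MDN head and its negative log-likelihood loss (illustrative; the architecture and names are assumptions, and the exercise bonus builds a similar two-component model):

```python
import torch
import torch.nn as nn

class MDN(nn.Module):
    def __init__(self, in_dim=1, hidden=64, K=2):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, K)         # mixing-coefficient logits
        self.mu = nn.Linear(hidden, K)         # component means
        self.log_sigma = nn.Linear(hidden, K)  # log std devs (exp keeps them positive)

    def forward(self, x):
        h = self.backbone(x)
        return torch.log_softmax(self.pi(h), dim=-1), self.mu(h), self.log_sigma(h)

def mdn_nll(log_pi, mu, log_sigma, y):
    # log p(y|x) = logsumexp_k [ log pi_k + log N(y | mu_k, sigma_k^2) ],  y: (batch,)
    comp = torch.distributions.Normal(mu, log_sigma.exp())
    log_prob = comp.log_prob(y.unsqueeze(-1))            # (batch, K)
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```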
MDN: capturing multi-modal uncertainty
Standard regression assumes unimodal output distribution.
MDNs can represent branching predictions: “this composition could yield phase A or phase B.”
The number of mixture components \(K\) is a design choice.
Particularly useful for inverse problems with multiple solutions.
MC Dropout for uncertainty estimation
Standard dropout: randomly zero neurons during training .
MC Dropout: keep dropout active at test time .
Run \(T\) stochastic forward passes → \(T\) predictions \(\{\hat{y}_1, \dots, \hat{y}_T\}\) .
Mean = prediction. Variance across samples ≈ epistemic uncertainty.
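A minimal sketch, assuming `model` contains `nn.Dropout` layers; calling `model.train()` keeps dropout stochastic during the \(T\) test-time passes:

```python
import torch

def mc_dropout_predict(model, x, T=100):
    model.train()                    # keep dropout layers active at test time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(T)])   # (T, batch, ...)
    return preds.mean(dim=0), preds.var(dim=0)   # prediction and epistemic variance
```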
MC Dropout: interpretation
Each forward pass uses a different randomly thinned network.
Equivalent to sampling from an approximate posterior over network architectures.
Theoretical connection to variational inference (Gal & Ghahramani, 2016).
Advantage: no extra training cost; the price is \(T\) stochastic forward passes at test time.
Deep ensembles
Train \(M\) independent networks (different random initializations, same architecture).
Each network produces a prediction \(\hat{y}_m\) .
Mean : \(\bar{y} = \frac{1}{M}\sum_m \hat{y}_m\) . Variance : \(\frac{1}{M}\sum_m (\hat{y}_m - \bar{y})^2\) .
Empirically produces well-calibrated uncertainties. Cost: \(M\times\) training.
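A minimal sketch, assuming `models` holds \(M\) networks already trained from different random initializations:

```python
import torch

def ensemble_predict(models, x):
    with torch.no_grad():
        preds = torch.stack([m(x) for m in models])   # (M, batch, ...)
    return preds.mean(dim=0), preds.var(dim=0)        # disagreement = uncertainty
```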
Stochastic enrichment
Add noise to inputs during prediction: \(\tilde{\mathbf{x}} = \mathbf{x} + \boldsymbol{\epsilon}\) , \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_\epsilon)\) .
Run multiple predictions with different noise realizations.
High variance across perturbed predictions = model is sensitive = high uncertainty.
Matches real-world measurement noise propagation [@neuer2024machine] .
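A minimal sketch, assuming a fitted deterministic `predict` function and an input-noise covariance `Sigma_eps` matching the measurement uncertainty:

```python
import numpy as np

def enriched_predict(predict, x, Sigma_eps, T=200, seed=0):
    rng = np.random.default_rng(seed)
    eps = rng.multivariate_normal(np.zeros(len(x)), Sigma_eps, size=T)
    preds = np.array([predict(x + e) for e in eps])   # T perturbed predictions
    return preds.mean(axis=0), preds.var(axis=0)      # variance = input-noise sensitivity
```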
Calibration: are uncertainties trustworthy?
A model is well-calibrated if predicted \(p\) % confidence intervals contain \(p\) % of test points.
Calibration plot : predicted confidence level vs observed coverage.
Perfect calibration = diagonal line.
Overconfident : intervals too narrow (common in NNs). Underconfident : intervals too wide.
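A minimal sketch of the coverage computation behind a calibration plot, assuming Gaussian predictive distributions with per-point means `mu` and standard deviations `sigma`:

```python
import numpy as np
from scipy.stats import norm

def empirical_coverage(mu, sigma, y_true, levels=np.linspace(0.1, 0.9, 9)):
    coverage = []
    for p in levels:
        z = norm.ppf(0.5 + p / 2.0)                 # half-width of the p-interval
        inside = np.abs(y_true - mu) <= z * sigma
        coverage.append(inside.mean())              # observed fraction inside
    return levels, np.array(coverage)               # plot coverage vs. levels
```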
Recalibration methods
Temperature scaling : divide logits by a learned temperature \(T\) before softmax.
Platt scaling : fit a logistic regression on validation predictions.
Isotonic regression : non-parametric calibration mapping.
Applied post-hoc on a held-out calibration set — does not change the model.
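A minimal sketch of temperature scaling for a classifier, assuming held-out validation `logits` and integer `labels`; the single parameter \(T\) is fit by minimizing the NLL:

```python
import torch

def fit_temperature(logits, labels):
    T = torch.ones(1, requires_grad=True)
    optimizer = torch.optim.LBFGS([T], lr=0.1, max_iter=100)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = nll(logits / T, labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return T.detach()   # recalibrated probabilities: softmax(logits / T)
```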
Comparison of UQ methods
| Method | Type | Cost | Uncertainty quality | Data regime |
|---|---|---|---|---|
| GP | Exact Bayesian | \(O(N^3)\) | Excellent | Small \(N\) |
| MC Dropout | Approx. Bayesian | \(T\times\) inference | Good | Any |
| Deep ensemble | Frequentist | \(M\times\) training | Very good | Any |
| MDN | Direct | \(1\times\) training | Requires tuning | Any |
Checkpoint: choosing a UQ method
Small dataset, need exact UQ : Gaussian Process.
Large dataset, budget for training : deep ensemble.
Large dataset, need cheap inference : MC Dropout.
Multi-modal outputs : Mixture-Density Network.
Materials example: GP for composition-property mapping
GP regression from alloy composition (5 features) to yield strength.
50 training samples from expensive tensile tests.
GP provides uncertainty bands → compositions with high uncertainty are targets for next experiments.
Active learning with GP uncertainty reduces required experiments by 40%.
[PLACEHOLDER: GP regression plot]
- x-axis: principal composition component
- y-axis: yield strength
- Show training points, GP mean, and shaded 95% confidence interval
Materials example: active learning with GP uncertainty
Goal : map the composition-property landscape with minimum experiments.
Strategy : train GP, identify input with highest uncertainty, synthesize and test it.
Iterate : retrain GP, select next experiment, repeat.
This is Bayesian optimization applied to materials discovery.
[PLACEHOLDER: Active Learning animation/sequence]
- Panel 1: Initial GP with high uncertainty
- Panel 2: Selection of point with max variance
- Panel 3: Updated GP with reduced uncertainty after adding point
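A minimal sketch of this loop, reusing the `gp_posterior` sketch from the GP slides and assuming a candidate pool `X_pool` plus an `experiment()` function standing in for the costly synthesis and test:

```python
import numpy as np

def active_learning(X, y, X_pool, k, sn2, experiment, n_rounds=10):
    for _ in range(n_rounds):
        mu, var = gp_posterior(X, y, X_pool, k, sn2)   # current GP beliefs on the pool
        i = int(np.argmax(var))                        # most uncertain candidate
        X = np.vstack([X, X_pool[i:i + 1]])            # add it to the training set
        y = np.append(y, experiment(X_pool[i]))        # run the expensive measurement
        X_pool = np.delete(X_pool, i, axis=0)
        # hyperparameter refitting could be repeated here after each new point
    return X, y
```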
Materials example: MDN for multi-phase prediction
Some alloy compositions can yield different crystallographic phases depending on processing.
A standard NN predicts the average — meaningless for bimodal distributions.
An MDN with 2 Gaussian components correctly captures both possible phases and their probabilities.
[PLACEHOLDER: MDN bimodal prediction plot]
- Show a bifurcation in the prediction where two distinct peaks exist for a single input
- Contrast with a single-Gaussian fit that fails to capture the modality
Lecture-essential vs exercise content split
Lecture : Bayesian prediction, evidence framework, GP derivation, practical UQ taxonomy, calibration.
Exercise : GP implementation from scratch, kernel hyperparameter exploration, ensemble comparison, MDN bonus.
Exercise setup summary
Implement GP regression (RBF kernel) in NumPy: compute posterior mean and variance.
Compare GP uncertainty bands with predictions from an NN ensemble (3 networks).
Vary length scale \(\ell\) and observe effect on fit and uncertainty.
Bonus: implement a simple MDN with 2 Gaussian components in PyTorch.
Exam-aligned summary: 10 must-know statements
The Bayesian predictive distribution integrates over parameter uncertainty.
Total prediction variance = aleatory variance + epistemic variance.
The marginal likelihood (evidence) measures model fit with automatic complexity penalty.
A GP is a distribution over functions specified by mean and kernel functions.
The GP posterior has closed-form mean and variance (for Gaussian likelihood).
GP uncertainty grows away from training data — honest epistemic uncertainty.
Kernel hyperparameters (length scale, signal variance) control GP behavior.
MC Dropout approximates Bayesian inference by sampling sub-networks at test time.
Deep ensembles provide uncertainty via disagreement among independently trained models.
Calibration plots verify that predicted confidence matches observed accuracy.