Mathematical Foundations of AI & ML
Unit 12: Uncertainty in Predictions
FAU Erlangen-Nürnberg
By the end of this lecture, students can:
- derive the posterior predictive distribution and decompose predictive variance into aleatoric and epistemic components
- compute Gaussian-process posterior means, variances, and the log marginal likelihood
- compare approximate UQ methods (MC Dropout, deep ensembles, mixture density networks) by cost, calibration, and scalability
- wrap any point predictor with split conformal prediction / CQR for distribution-free coverage

\[ p(\mathbf{y}^* | \mathbf{x}^*, \mathcal{D}) = \int p(\mathbf{y}^* | \mathbf{x}^*, \theta) \, p(\theta | \mathcal{D}) \, d\theta \]

\[ \text{Var}[\mathbf{y}^*] = \underbrace{\mathbb{E}_\theta[\sigma^2(\theta)]}_{\text{aleatoric}} + \underbrace{\text{Var}_\theta[\boldsymbol{\mu}(\theta)]}_{\text{epistemic}} \]
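The law-of-total-variance split above can be checked with Monte Carlo samples of \(\theta\). In this sketch each hypothetical posterior sample simply carries a made-up \((\mu(\theta), \sigma^2(\theta))\) pair at one fixed test input:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior samples theta_s: each defines a Gaussian predictive
# N(mu(theta), sigma^2(theta)) at a fixed x* (values are illustrative only).
mus = rng.normal(loc=2.0, scale=0.5, size=1000)   # mu(theta) per sample
sigma2s = rng.uniform(0.8, 1.2, size=1000)        # sigma^2(theta) per sample

aleatoric = sigma2s.mean()      # E_theta[sigma^2(theta)]
epistemic = mus.var()           # Var_theta[mu(theta)]
total = aleatoric + epistemic   # law of total variance

# Cross-check: sample y* ~ N(mu(theta), sigma^2(theta)) and compare variances;
# the empirical variance of y* should be close to the analytic total.
ys = rng.normal(mus, np.sqrt(sigma2s))
print(total, ys.var())
```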
| Approach | Output | Uncertainty | Cost |
|---|---|---|---|
| MLE/MAP | Single \(\hat{\mathbf{y}}\) | None (or ad-hoc) | Low |
| Bayesian (exact) | Full \(p(\mathbf{y}^*|\mathbf{x}^*,\mathcal{D})\) | Principled | High |
| Bayesian (approx.) | Approximate distribution | Approximate | Moderate |
\[ p(\mathcal{D} | \mathcal{M}) = \int p(\mathcal{D} | \theta, \mathcal{M}) \, p(\theta | \mathcal{M}) \, d\theta \]


\[ \gamma = \sum_i \frac{\lambda_i}{\lambda_i + \alpha} \]
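The effective number of well-determined parameters \(\gamma\) follows directly from the eigenvalues; a minimal numpy sketch (the eigenvalues and \(\alpha\) below are made-up illustrative values):

```python
import numpy as np

# Hypothetical eigenvalues lambda_i of the data-dependent (Hessian / Gram) matrix
lam = np.array([10.0, 5.0, 1.0, 0.1, 0.01])
alpha = 1.0  # prior precision / regularisation strength

# gamma = sum_i lambda_i / (lambda_i + alpha):
# directions with lambda_i >> alpha contribute ~1 (well determined by data),
# directions with lambda_i << alpha contribute ~0 (determined by the prior).
gamma = np.sum(lam / (lam + alpha))
```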
\[ k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right) \]


\[ \boldsymbol{\mu}^*(\mathbf{x}^*) = \mathbf{k}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y} \]
\[ \sigma^{*2}(\mathbf{x}^*) = k(\mathbf{x}^*, \mathbf{x}^*) - \mathbf{k}_*^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_* \]

\[ \log p(\mathbf{y} | \mathbf{X}) = -\frac{1}{2}\mathbf{y}^\top(\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1}\mathbf{y} - \frac{1}{2}\log|\mathbf{K} + \sigma_n^2 \mathbf{I}| - \frac{N}{2}\log 2\pi \]
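The three GP formulas above (posterior mean, posterior variance, log marginal likelihood) fit in a few lines of numpy. This is a sketch with made-up data and hyperparameters, using a Cholesky factorisation instead of an explicit inverse for numerical stability:

```python
import numpy as np

def rbf(X1, X2, sf2=1.0, ell=1.0):
    # k(x, x') = sf2 * exp(-||x - x'||^2 / (2 ell^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X, y, Xs, sn2=0.1, sf2=1.0, ell=1.0):
    K = rbf(X, X, sf2, ell) + sn2 * np.eye(len(X))   # K + sigma_n^2 I
    ks = rbf(X, Xs, sf2, ell)                        # columns are k_*
    L = np.linalg.cholesky(K)                        # stable alternative to inverting K
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + sn2 I)^{-1} y
    mu = ks.T @ alpha                                # posterior mean mu*(x*)
    v = np.linalg.solve(L, ks)                       # v^T v = k_*^T (K + sn2 I)^{-1} k_*
    var = rbf(Xs, Xs, sf2, ell).diagonal() - (v**2).sum(axis=0)  # sigma*^2(x*)
    lml = (-0.5 * y @ alpha                          # data-fit term
           - np.log(L.diagonal()).sum()              # complexity: -(1/2) log|K + sn2 I|
           - 0.5 * len(X) * np.log(2 * np.pi))       # normalisation
    return mu, var, lml

X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X).ravel()                                # toy targets
mu, var, lml = gp_posterior(X, y, np.array([[2.5], [10.0]]))
# var is small near the data (x* = 2.5) and reverts toward sf2 far away (x* = 10)
```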


Strengths: closed-form Bayesian posterior; well-calibrated uncertainty that grows away from the training data; the marginal likelihood gives a principled criterion for hyperparameter selection.
Limitations: exact inference costs \(O(N^3)\) time and \(O(N^2)\) memory, restricting it to small \(N\); predictions depend strongly on the choice of kernel.
\[ p(\mathbf{y}|\mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x}) \, \mathcal{N}(\mathbf{y} | \boldsymbol{\mu}_k(\mathbf{x}), \sigma_k^2(\mathbf{x})) \]
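A mixture density network's heads output \((\pi_k, \mu_k, \sigma_k)\) for each input, so evaluating the predictive density is a weighted sum of Gaussians. A minimal sketch with made-up head outputs for one input where the predictive is bimodal, which a single Gaussian head could not represent:

```python
import numpy as np

def mdn_density(y, pis, mus, sigmas):
    # p(y|x) = sum_k pi_k * N(y | mu_k, sigma_k^2), for one input x
    comp = np.exp(-0.5 * ((y - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    return float(np.sum(pis * comp))

# Hypothetical network outputs at some x: two well-separated modes.
pis = np.array([0.5, 0.5])
mus = np.array([-1.0, 1.0])
sigmas = np.array([0.3, 0.3])

print(mdn_density(0.0, pis, mus, sigmas))  # low density between the modes
print(mdn_density(1.0, pis, mus, sigmas))  # high density at a mode
```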

MC Dropout predictive uncertainty on the Mauna Loa CO\(_2\) dataset. Red = predictive mean; shaded = uncertainty band. Standard dropout (a) underestimates; MC Dropout with ReLU (c) grows uncertainty outside training range. (Gal and Ghahramani 2016, fig. 2)
Results on a toy regression task. The blue line is the ground-truth curve, the red dots are noisy training observations, and the gray lines show the predicted mean with three standard deviations. Leftmost: empirical variance of 5 networks trained with MSE; second: a single network trained with NLL; third: the additional effect of adversarial training; rightmost: an ensemble of 5 networks. (Lakshminarayanan et al. 2017, fig. 1)
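The ensemble in that figure combines its members as an equally weighted Gaussian mixture; the combined mean and variance follow the moment-matching rule of Lakshminarayanan et al. 2017. A sketch with made-up per-member head outputs at one test input:

```python
import numpy as np

# Hypothetical Gaussian heads (mu_m, sigma^2_m) of an M = 5 member deep
# ensemble at one test input (values are illustrative only).
mu_m = np.array([0.9, 1.1, 1.0, 1.2, 0.8])
sigma2_m = np.array([0.04, 0.05, 0.04, 0.06, 0.05])

mu_star = mu_m.mean()                                # mixture mean
sigma2_star = (sigma2_m + mu_m**2).mean() - mu_star**2  # mixture variance

# Equivalently: sigma2_star = mean aleatoric variance + member disagreement,
# i.e. sigma2_m.mean() + mu_m.var() -- the same split as in the variance
# decomposition at the start of this unit.
```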
Recall from Unit 7. Split conformal and CQR were introduced as the distribution-free coverage layer of the probabilistic toolbox.
Why it shows up here: in 2026 practice, the default UQ stack for a regression neural network is quantile heads plus CQR, layered on top of whatever point estimate this unit's methods produce.
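A split-conformal wrapper needs only a held-out calibration set and a nonconformity score. This sketch uses a made-up toy predictor `f` with absolute-residual scores, which give symmetric intervals of constant width; CQR would instead conformalize quantile heads to get adaptive widths:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical point predictor (stand-in for any trained model from this unit).
f = lambda x: 2.0 * x

# Held-out calibration data, exchangeable with test data.
x_cal = rng.uniform(0, 1, 500)
y_cal = 2.0 * x_cal + rng.normal(0, 0.1, 500)

alpha = 0.1                                    # target miscoverage
scores = np.abs(y_cal - f(x_cal))              # nonconformity scores
n = len(scores)
# Finite-sample-corrected quantile: ceil((n+1)(1-alpha))/n level of the scores.
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)

# Prediction interval for a new x*: [f(x*) - q, f(x*) + q], with guaranteed
# >= 1 - alpha marginal coverage under exchangeability.
x_test = rng.uniform(0, 1, 2000)
y_test = 2.0 * x_test + rng.normal(0, 0.1, 2000)
coverage = np.mean(np.abs(y_test - f(x_test)) <= q)
```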


| Method | Type | Cost | Calibration | Scalability |
|---|---|---|---|---|
| GP | Exact Bayesian | \(O(N^3)\) | Excellent | Small \(N\) |
| MC Dropout | Approx. Bayesian | \(T \times\) inference | Good | Any |
| Deep ensemble | Frequentist | \(M \times\) training | Very good | Any |
| MDN | Direct | 1× training | Requires tuning | Any |
| Conformal / CQR | Distribution-free wrapper | 1 calibration pass (\(\sim 10^3\) pts) | Guaranteed (finite-sample, marginal) | Any (model-agnostic) |

[PLACEHOLDER: Active Learning animation/sequence] - Panel 1: Initial GP with high uncertainty - Panel 2: Selection of point with max variance - Panel 3: Updated GP with reduced uncertainty after adding point


© Philipp Pelz - Mathematical Foundations of AI & ML