Mathematical Foundations of AI & ML
Unit 7: The Probabilistic View of Learning; Conformal Prediction

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


Title + Unit 7 positioning

  • Units 1–6 built the optimization machinery for learning.
  • Unit 7 introduces the probabilistic foundations that underlie that machinery.
  • Probability is the language of uncertainty — and learning is fundamentally about reasoning under uncertainty.
  • We close the unit with conformal prediction: a distribution-free coverage guarantee that any downstream UQ method can plug into.

Recap: what risk minimization assumes

  • Unit 1: \(\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \mathbb{E}_{(\mathbf{x},y) \sim P}[L(f_{\boldsymbol{\theta}}(\mathbf{x}), y)]\).
  • The expectation is over a probability distribution \(P\) of data.
  • Until now, we treated this as a mathematical abstraction. Now we make it concrete.

Learning outcomes for Unit 7

By the end of this lecture, students can:

  • classify uncertainty as aleatory or epistemic and explain why this matters,
  • write the Gaussian in 1D and multivariate form and explain its maximum-entropy property,
  • compute and interpret KL divergence between distributions, in particular the closed form between two Gaussians,
  • derive the MLE for Gaussian parameters and connect it to MSE minimization,
  • apply Bayes’ theorem to update prior beliefs into posterior distributions,
  • apply split conformal prediction to wrap any predictor with a finite-sample coverage guarantee, and recognise when the exchangeability assumption breaks.

Why probability is the language of learning

  • Data is inherently noisy — repeated measurements give different results.
  • Models are uncertain — finite data cannot determine parameters exactly.
  • Probability provides a consistent, rigorous framework for quantifying both.
  • Without probability, we cannot define what “learning from data” means.

Aleatory uncertainty — definition

  • Aleatory (from Latin alea = dice): irreducible randomness in the data-generating process.
  • Examples: thermal noise in sensors, quantum measurement, turbulent flow variability.
  • No amount of additional data or better models can eliminate aleatory uncertainty.
  • It sets a floor on achievable prediction error (the Bayes error — formally treated in Unit 8 with the bias-variance decomposition).

Epistemic uncertainty — definition

  • Epistemic (from Greek episteme = knowledge): uncertainty from limited knowledge.
  • Reducible by collecting more data, improving the model, or adding features.
  • Examples: parameter uncertainty with small \(N\), model misspecification, missing variables.
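  • Epistemic uncertainty from limited data shrinks as the training set grows; uncertainty from model misspecification or missing variables shrinks only when the model or the feature set improves.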

Why the distinction matters

Uncertainty breakdown

  • Aleatory uncertainty: set appropriate error bars; do not waste resources trying to reduce it.
  • Epistemic uncertainty: invest in data collection or model improvement.
  • Confusing the two leads to wasted effort (trying to reduce noise) or false confidence (ignoring model uncertainty).
  • Engineering systems must handle both types appropriately (Neuer et al. 2024).

Interactive: Aleatory vs. Epistemic Uncertainty

  • The gray dashed line is the true, hidden process.
  • The red dots are sampled data. Their spread around the true process is aleatory uncertainty (irreducible noise).
  • The blue band represents our model’s confidence in the mean (epistemic uncertainty).
  • Try it: Increase N. Notice how the blue band shrinks, but the red dots remain scattered.

The sampling process

  • Data is a collection of realizations from a random process.
  • The sampling rate, resolution, and digitization affect what information is preserved.
  • Insufficient sampling introduces systematic errors that no model can correct.

Nyquist-Shannon theorem

  • To reconstruct a signal with maximum frequency \(f_{\max}\), the sampling rate must satisfy:

\[ f_s > 2 f_{\max} \]

  • Below this rate: aliasing — high-frequency components fold into low frequencies.
  • Relevance to ML: undersampled data contains phantom patterns that models can overfit to (Neuer et al. 2024).
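
A minimal sketch of aliasing, assuming an illustrative 9 Hz sine sampled at 10 Hz (both numbers are chosen here, not taken from the lecture). Sampled below the Nyquist rate, the 9 Hz signal produces exactly the same samples as a sign-flipped 1 Hz sine, which is the kind of phantom pattern a model could fit:

```python
import numpy as np

f_signal = 9.0   # true signal frequency in Hz (illustrative)
f_s = 10.0       # sampling rate in Hz, well below 2 * 9 = 18 Hz
k = np.arange(20)            # sample indices
t = k / f_s                  # sample times in seconds

samples = np.sin(2 * np.pi * f_signal * t)
# The alias appears at |f_s - f_signal| = 1 Hz (with a sign flip for this phase).
alias = -np.sin(2 * np.pi * (f_s - f_signal) * t)

max_diff = np.max(np.abs(samples - alias))
print(f"largest difference between 9 Hz samples and the 1 Hz alias: {max_diff:.2e}")
```

From the samples alone, the two signals cannot be told apart; only a higher sampling rate resolves the ambiguity.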

Engineering example: sensor data and uncertainty sources

  • A temperature sensor measuring a furnace:
    • Aleatory: thermal fluctuations (\(\pm 2\,^{\circ}\mathrm{C}\) at steady state).
    • Epistemic: calibration drift (systematic, correctable with recalibration).
  • An ML model trained on this data must account for both sources to make reliable predictions.

Random variables and probability distributions

  • A random variable \(X\) maps outcomes to numbers.
  • Discrete: probability mass function \(P(X = x)\).
  • Continuous: probability density function \(p(x)\) where \(\int p(x)\,dx = 1\).
  • The PDF gives a relative likelihood at each point, not a probability: probabilities come from integrating \(p(x)\) over an interval, and \(p(x)\) itself can exceed 1.

Expected value and variance

  • Expected value (mean): \(\mu = \mathbb{E}[X] = \int x \, p(x) \, dx\).
  • Variance: \(\sigma^2 = \text{Var}[X] = \mathbb{E}[(X - \mu)^2]\).
  • Standard deviation: \(\sigma = \sqrt{\text{Var}[X]}\) — same units as \(X\).
  • The mean locates the distribution; the variance measures its spread.

Higher moments: skewness and kurtosis

  • Skewness \(= \mathbb{E}\left[\left(\frac{X-\mu}{\sigma}\right)^3\right]\): measures asymmetry (0 for symmetric distributions).
  • Kurtosis \(= \mathbb{E}\left[\left(\frac{X-\mu}{\sigma}\right)^4\right]\): measures tail heaviness (3 for the Gaussian).
  • Excess kurtosis \(= \text{kurtosis} - 3\): deviation from Gaussian tail behavior.
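
As a small illustration, the moments above can be estimated from a sample. This is a sketch that assumes population-style (divide by \(N\)) estimators; the function name `standardized_moments` and the toy data are made up for the example:

```python
import numpy as np

def standardized_moments(x):
    """Return (mean, variance, skewness, kurtosis) using 1/N sample moments."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    var = x.var()                     # ddof=0: divides by N
    z = (x - mu) / np.sqrt(var)       # standardized values
    return mu, var, np.mean(z**3), np.mean(z**4)

mu, var, skew, kurt = standardized_moments([-2, -1, 0, 1, 2])
print(f"mean={mu}, variance={var}, skewness={skew}, kurtosis={kurt}")
print(f"excess kurtosis={kurt - 3}")
```

The toy sample is symmetric, so its skewness is zero, and its kurtosis is below 3, i.e. lighter tails than a Gaussian.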

The Gaussian distribution (1D)

\[ p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]

  • Completely characterized by two parameters: mean \(\mu\) and variance \(\sigma^2\).
  • Symmetric, unimodal, bell-shaped.
  • The 68-95-99.7 rule: probability within \(1\sigma, 2\sigma, 3\sigma\) of the mean.
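
The rule can be checked numerically. For a Gaussian, \(\Pr(|X-\mu| < k\sigma) = \operatorname{erf}(k/\sqrt{2})\); the helper name `prob_within` below is an illustrative choice:

```python
import math

def prob_within(k):
    """P(|X - mu| < k * sigma) for a Gaussian X, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
```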

Interactive: The 1D Gaussian & Empirical Rule

  • Modifying the Mean (\(\mu\)) shifts the distribution left or right. It represents the center of mass.
  • Modifying the Standard Deviation (\(\sigma\)) stretches the distribution.
  • Notice that the peak height drops as it stretches, to ensure the total area (probability) always integrates to 1.

Why the Gaussian is special: maximum entropy

  • Among all distributions with a given mean \(\mu\) and variance \(\sigma^2\), the Gaussian has maximum entropy.
  • Maximum entropy = maximum uncertainty = fewest additional assumptions.
  • Using a Gaussian is therefore the most conservative choice when only mean and variance are known.
  • This is the information-theoretic justification for the Gaussian’s ubiquity (Murphy 2012).
  • (Entropy \(H(p)\) is defined formally in the information-theoretic primer later in this unit.)

Central Limit Theorem connection

  • The suitably standardized sum (or average) of many independent random variables with finite variance converges in distribution to a Gaussian, regardless of the shape of their individual distributions (under mild conditions).
  • This explains why the Gaussian appears everywhere:
    • Measurement errors = sum of many small independent perturbations.
    • Aggregate quantities in materials science follow approximately Gaussian distributions.

Histograms of the mean of \(N\) uniform random variables for \(N=1,2,10\). As \(N\) grows the distribution rapidly becomes Gaussian (Bishop 2006).
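
A rough simulation of the experiment in the figure, as a sketch with an assumed seed and sample size. Instead of plotting histograms it tracks excess kurtosis, which is \(-1.2\) for a single uniform variable and shrinks like \(-1.2/N\) for the mean of \(N\) of them, approaching the Gaussian value 0:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_means(n_terms, n_repeats=100_000):
    """Draw n_repeats means, each averaging n_terms Uniform(0, 1) variables."""
    return rng.uniform(0.0, 1.0, size=(n_repeats, n_terms)).mean(axis=1)

def excess_kurtosis(x):
    z = (x - x.mean()) / x.std()
    return np.mean(z**4) - 3.0

results = {n: excess_kurtosis(sample_means(n)) for n in (1, 2, 10)}
for n, ek in results.items():
    print(f"N={n:2d}: excess kurtosis = {ek:+.3f}")
```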

Multivariate Gaussian distribution

\[ p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-d/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right) \]

  • \(\boldsymbol{\mu} \in \mathbb{R}^d\): mean vector. \(\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}\): covariance matrix (symmetric, positive definite).
  • Level sets are ellipsoids whose axes align with eigenvectors of \(\boldsymbol{\Sigma}\).
  • Eigenvectors \(\mathbf{u}_i\) give the ellipse orientation; eigenvalues \(\lambda_i\) give the axis lengths \(\lambda_i^{1/2}\).

2D Gaussian density ellipse. The principal axes are the eigenvectors \(\mathbf{u}_1, \mathbf{u}_2\) of \(\boldsymbol{\Sigma}\), scaled by \(\lambda_i^{1/2}\) (Murphy 2012).

Covariance matrix: diagonal vs full

  • Diagonal \(\boldsymbol{\Sigma}\): features are uncorrelated; ellipsoids are axis-aligned.
  • Full \(\boldsymbol{\Sigma}\): features are correlated; ellipsoids are rotated.
  • Spherical (\(\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}\)): isotropic; level sets are spheres.
  • The eigenvalues of \(\boldsymbol{\Sigma}\) determine the extent along each principal axis.

Contours of constant density for (a) full, (b) diagonal, and (c) spherical (isotropic) covariance matrices (Bishop 2006).

Interactive: Conditioning a 2D Gaussian

(Standardised to \(\sigma_x=\sigma_y=1\), \(\mu=0\), so the conditioning formula is visible directly.)

  • Slicing the cloud at \(X=x_0\) (red line) leaves a 1D Gaussian — the red conditional curve.
  • Conditional mean moves: \(\;\mathbb{E}[Y\mid X=x_0] = \rho\, x_0\) (the regression line).
  • Conditional variance shrinks: \(\;\mathrm{Var}[Y\mid X=x_0] = 1-\rho^2\) — observing \(X\) reduced our uncertainty about \(Y\).
  • At \(\rho=0\) the slice is unchanged (\(X\) tells us nothing); as \(|\rho|\to 1\) the conditional collapses to a line.

Marginal and conditional Gaussians

  • A key property: marginals and conditionals of a joint Gaussian are also Gaussian.
  • Marginal: integrate out some variables; the result is Gaussian, with the corresponding sub-vector of \(\boldsymbol{\mu}\) and sub-block of \(\boldsymbol{\Sigma}\).
  • Conditional: condition on some variables — the result is Gaussian with updated \(\boldsymbol{\mu}\) and reduced \(\boldsymbol{\Sigma}\).
  • This closure property makes Gaussian models analytically tractable (Bishop 2006).
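
A sketch of the conditioning step, using the standard partitioned-Gaussian formulas \(\boldsymbol{\mu}_{r\mid o} = \boldsymbol{\mu}_r + \boldsymbol{\Sigma}_{ro}\boldsymbol{\Sigma}_{oo}^{-1}(\mathbf{x}_o - \boldsymbol{\mu}_o)\) and \(\boldsymbol{\Sigma}_{r\mid o} = \boldsymbol{\Sigma}_{rr} - \boldsymbol{\Sigma}_{ro}\boldsymbol{\Sigma}_{oo}^{-1}\boldsymbol{\Sigma}_{or}\). The subscripts \(o\) (observed) and \(r\) (remaining), the function name, and the test values \(\rho = 0.8\), \(x_0 = 1.5\) are notation and numbers introduced for this example:

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx_obs, x_obs):
    """Condition N(mu, Sigma) on x[idx_obs] = x_obs.

    Returns the mean and covariance of the remaining coordinates.
    """
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    idx_obs = np.asarray(idx_obs)
    idx_rest = np.setdiff1d(np.arange(mu.size), idx_obs)

    S_rr = Sigma[np.ix_(idx_rest, idx_rest)]
    S_ro = Sigma[np.ix_(idx_rest, idx_obs)]
    S_oo = Sigma[np.ix_(idx_obs, idx_obs)]

    # gain = S_ro @ inv(S_oo), computed with a linear solve instead of an inverse
    gain = np.linalg.solve(S_oo, S_ro.T).T
    mu_cond = mu[idx_rest] + gain @ (np.asarray(x_obs, dtype=float) - mu[idx_obs])
    Sigma_cond = S_rr - gain @ S_ro.T
    return mu_cond, Sigma_cond

rho, x0 = 0.8, 1.5
mu_c, Sigma_c = condition_gaussian([0.0, 0.0], [[1.0, rho], [rho, 1.0]], [0], [x0])
print(mu_c, Sigma_c)   # compare with rho * x0 and 1 - rho**2
```

In the standardised 2D case this reproduces the conditional mean \(\rho\,x_0\) and the conditional variance \(1-\rho^2\).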

Checkpoint: interpret the covariance matrix

  • Given \(\boldsymbol{\Sigma} = \begin{pmatrix} 4 & 3 \\ 3 & 9 \end{pmatrix}\):
    • Feature 1 has variance 4, feature 2 has variance 9.
    • Correlation coefficient: \(\rho = 3/\sqrt{4 \cdot 9} = 0.5\) — moderate positive correlation.
    • The contour ellipse is tilted toward the upper-right.
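
The checkpoint can be verified numerically. This sketch uses `np.linalg.eigh` (for symmetric matrices) to recover the ellipse axes; the variable names are illustrative:

```python
import numpy as np

Sigma = np.array([[4.0, 3.0],
                  [3.0, 9.0]])

std = np.sqrt(np.diag(Sigma))               # standard deviations: 2 and 3
rho = Sigma[0, 1] / (std[0] * std[1])       # correlation coefficient

# eigh returns eigenvalues in ascending order for symmetric matrices
eigvals, eigvecs = np.linalg.eigh(Sigma)
major_axis = eigvecs[:, -1]                 # direction of largest variance
angle_deg = np.degrees(np.arctan2(major_axis[1], major_axis[0])) % 180.0

print(f"rho = {rho:.2f}")
print(f"axis lengths (sqrt of eigenvalues) = {np.sqrt(eigvals)}")
print(f"major-axis angle = {angle_deg:.1f} degrees from the feature-1 axis")
```

An angle between 0 and 90 degrees corresponds to the upper-right tilt stated above.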

Entropy of a distribution

For a continuous distribution \(p(x)\), the (differential) entropy is

\[ H(p) \;=\; -\int p(x)\, \log p(x)\, dx \;=\; -\mathbb{E}_p[\log p(X)] \]

(discrete analogue: \(H(p) = -\sum_x p(x) \log p(x)\)).

  • Intuition: expected surprise \(-\log p(X)\) — rare outcomes carry more information.
  • Larger \(H\) = more uncertainty / less concentration of mass.
  • 1D Gaussian: \(H(\mathcal{N}(\mu,\sigma^2)) = \tfrac{1}{2}\log(2\pi e\, \sigma^2)\) — depends on \(\sigma\), not \(\mu\).
  • This formalizes the earlier claim: among distributions with given mean and variance, \(\mathcal{N}(\mu,\sigma^2)\) maximizes \(H\) (Bishop 2006).
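
A quick numerical illustration of the maximum-entropy claim, comparing three distributions at the same standard deviation. The uniform and Laplace entropies are standard closed forms (\(\log w\) for width \(w\), and \(1 + \log 2b\) for scale \(b\)); the choice of these two comparison distributions is an assumption of this example:

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2) in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)

def uniform_entropy(sigma):
    """Entropy of a uniform distribution with standard deviation sigma.

    A uniform on an interval of width w has variance w^2 / 12 and entropy log(w).
    """
    width = sigma * math.sqrt(12)
    return math.log(width)

def laplace_entropy(sigma):
    """Entropy of a Laplace distribution with standard deviation sigma.

    A Laplace with scale b has variance 2 b^2 and entropy 1 + log(2 b).
    """
    b = sigma / math.sqrt(2)
    return 1 + math.log(2 * b)

for name, h in [("Gaussian", gaussian_entropy(1.0)),
                ("Laplace", laplace_entropy(1.0)),
                ("Uniform", uniform_entropy(1.0))]:
    print(f"{name:8s} H = {h:.4f} nats")
```

With the variance matched, the Gaussian comes out with the largest entropy of the three.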

KL divergence: comparing two distributions

For distributions \(q\) and \(p\) on the same space:

\[ \mathrm{KL}(q \,\|\, p) \;=\; \mathbb{E}_q\!\left[\log \tfrac{q(x)}{p(x)}\right] \;=\; \int q(x)\, \log \tfrac{q(x)}{p(x)}\, dx \]

  • Three load-bearing properties:
    1. \(\mathrm{KL}(q\|p) \ge 0\) (Gibbs inequality, via Jensen’s inequality on \(-\log\)).
    2. \(\mathrm{KL}(q\|p) = 0\) iff \(q = p\) almost everywhere.
    3. Asymmetric in general: \(\mathrm{KL}(q\|p) \ne \mathrm{KL}(p\|q)\).
  • Intuition: extra cost (in nats) of describing samples from \(q\) using a code optimized for \(p\) (Bishop 2006).
  • KL is therefore a directed dissimilarity — not a metric.

KL between two Gaussians (the VAE-relevant case)

For two 1D Gaussians, KL admits a closed form:

\[ \mathrm{KL}\!\left(\mathcal{N}(\mu_1,\sigma_1^2)\,\|\,\mathcal{N}(\mu_2,\sigma_2^2)\right) \;=\; \log\frac{\sigma_2}{\sigma_1} \;+\; \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} \;-\; \frac{1}{2} \]

The form used to regularize variational autoencoders — \(p = \mathcal{N}(\mathbf{0}, I)\) vs. \(q = \mathcal{N}(\boldsymbol{\mu},\, \mathrm{diag}(\sigma_1^2,\dots,\sigma_d^2))\) — is the per-dimension sum:

\[ \mathrm{KL}(q\,\|\,p) \;=\; \tfrac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right) \]

  • Sanity check: vanishes iff \(\mu_j = 0\) and \(\sigma_j = 1\) for all \(j\) — i.e., \(q\) already matches the standard-normal prior.
  • Forward pointer: Unit 11 will use exactly this expression as the regularizer in the VAE loss (Bishop 2006).
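
Both closed forms are easy to check against each other. The sketch below (function names and test values are illustrative) confirms that the VAE expression is the per-dimension sum of the 1D formula with \(\mu_2 = 0\), \(\sigma_2 = 1\), and shows the asymmetry on one example:

```python
import numpy as np

def kl_gauss_1d(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) in nats."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

def kl_diag_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    return 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

mu = np.array([0.5, -1.0])
sigma = np.array([0.8, 1.5])

kl_vae = kl_diag_to_standard_normal(mu, sigma)
kl_sum = sum(kl_gauss_1d(m, s, 0.0, 1.0) for m, s in zip(mu, sigma))
forward = kl_gauss_1d(0.0, 1.0, 1.0, 2.0)
reverse = kl_gauss_1d(1.0, 2.0, 0.0, 1.0)

print(f"VAE form: {kl_vae:.6f}, sum of 1D terms: {kl_sum:.6f}")
print(f"KL(q||p) = {forward:.4f}, KL(p||q) = {reverse:.4f}")
```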

The likelihood function

  • Given observed data \(\mathcal{D} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}\) (assumed i.i.d.):

\[ \mathcal{L}(\boldsymbol{\theta}) = p(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} p(\mathbf{x}_i \mid \boldsymbol{\theta}) \]

  • \(\mathcal{L}(\boldsymbol{\theta})\) is read as a function of the parameters \(\boldsymbol{\theta}\) with the data held fixed; it is not a probability distribution over \(\boldsymbol{\theta}\).
  • It measures how well \(\boldsymbol{\theta}\) explains the observed data.

Log-likelihood

  • Taking the log converts the product into a sum:

\[ \ell(\boldsymbol{\theta}) = \log \mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^{N} \log p(\mathbf{x}_i \mid \boldsymbol{\theta}) \]

  • Sums of log-densities avoid the numerical underflow of long products and are easier to differentiate.
  • Maximizing \(\ell(\boldsymbol{\theta})\) gives the same solution as maximizing \(\mathcal{L}(\boldsymbol{\theta})\).
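
A small demonstration of why the log matters in floating point, using synthetic data (the sample size of 2000 and the seed are arbitrary choices). Each density value is below 0.4, so the product of 2000 of them is far smaller than the smallest representable double:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=2000)   # synthetic i.i.d. data

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def gaussian_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

likelihood = np.prod(gaussian_pdf(x, 0.0, 1.0))          # underflows to 0.0
log_likelihood = np.sum(gaussian_logpdf(x, 0.0, 1.0))    # stays finite

print(likelihood, log_likelihood)
```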

MLE principle

\[ \hat{\boldsymbol{\theta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}) \]

  • Choose the parameters that make the observed data most probable under the model.
  • MLE is the most widely used estimation principle in statistics and machine learning.
  • Set \(\nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}) = 0\) and solve.

Interactive: Maximum Likelihood Estimation

  • Adjust the Mean (\(\mu\)) and Std Dev (\(\sigma\)) to try and trap the red data points under the highest part of the blue curve.
  • Watch the Log-Likelihood gauge increase.
  • The red circles show the individual likelihoods \(p(x_i|\mu,\sigma)\). The likelihood is the product of these heights; the log-likelihood is the sum of their logarithms.
  • Maximizing log-likelihood means pushing the curve up directly over the data points without spreading it too thin!

MLE for Gaussian mean

  • Gaussian log-likelihood (with known \(\sigma^2\)):

\[ \ell(\mu) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2 \]

  • Differentiate w.r.t. \(\mu\), set to zero:

\[ \hat{\mu}_{\text{MLE}} = \frac{1}{N}\sum_{i=1}^{N} x_i = \bar{x} \]

  • The MLE for the mean is the sample mean — intuitive and unbiased.
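
The differentiation step, written out from the log-likelihood above:

\[ \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i - \mu) = 0 \;\;\Longrightarrow\;\; \sum_{i=1}^{N} x_i = N\mu \;\;\Longrightarrow\;\; \hat{\mu}_{\text{MLE}} = \frac{1}{N}\sum_{i=1}^{N} x_i \]

The second derivative is \(-N/\sigma^2 < 0\), so the stationary point is a maximum. Unbiasedness follows from linearity of expectation: \(\mathbb{E}[\bar{x}] = \tfrac{1}{N}\sum_i \mathbb{E}[x_i] = \mu\).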

MLE for Gaussian variance

  • Differentiate w.r.t. \(\sigma^2\):

\[ \hat{\sigma}^2_{\text{MLE}} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu})^2 \]

  • This is the biased sample variance (divides by \(N\), not \(N-1\)).
  • The bias vanishes as \(N \to \infty\) — MLE is consistent.
  • For small \(N\), the unbiased estimator (\(N-1\)) is often preferred.
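
In NumPy the two estimators differ only in the `ddof` argument of `np.var`. A sketch on a small synthetic sample (sample size, seed, and true parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=10)   # small synthetic sample
N = x.size

mu_hat = x.sum() / N                              # MLE of the mean
var_mle = np.sum((x - mu_hat) ** 2) / N           # MLE: divides by N (biased)
var_unbiased = np.sum((x - mu_hat) ** 2) / (N - 1)

# NumPy exposes both conventions through the ddof argument.
print(var_mle, np.var(x, ddof=0))
print(var_unbiased, np.var(x, ddof=1))
```

The two estimates differ by the factor \(N/(N-1)\), which is 10/9 here and tends to 1 as \(N\) grows.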

MLE and MSE: the connection

  • For a regression model \(y = f_{\boldsymbol{\theta}}(\mathbf{x}) + \epsilon\) with \(\epsilon \sim \mathcal{N}(0, \sigma^2)\):

\[ \ell(\boldsymbol{\theta}) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - f_{\boldsymbol{\theta}}(\mathbf{x}_i))^2 \]

  • Maximizing \(\ell(\boldsymbol{\theta})\) w.r.t. \(\boldsymbol{\theta}\) is equivalent to minimizing MSE.
  • This provides the probabilistic justification for using MSE as a loss function.
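
The equivalence can be checked numerically. The sketch below assumes a linear model \(f_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta}^\top \mathbf{x}\) and synthetic data (both are choices made for the example), and verifies that the negative log-likelihood equals a constant plus \(\tfrac{N}{2\sigma^2}\) times the MSE at several parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5                                   # assumed known noise std
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
theta_true = np.array([1.0, -2.0])
y = X @ theta_true + rng.normal(0.0, sigma, 50)

def neg_log_likelihood(theta):
    resid = y - X @ theta
    return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + resid @ resid / (2 * sigma**2)

def mse(theta):
    return np.mean((y - X @ theta) ** 2)

theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes the MSE

# NLL = const + N / (2 sigma^2) * MSE, so both share the same minimizer.
const = 0.5 * len(y) * np.log(2 * np.pi * sigma**2)
gaps = [abs(neg_log_likelihood(t) - (const + len(y) * mse(t) / (2 * sigma**2)))
        for t in (theta_ls, theta_true, np.zeros(2))]
print(f"largest deviation from the identity: {max(gaps):.2e}")
```

Because the two objectives differ only by a positive scale and an additive constant that do not depend on \(\boldsymbol{\theta}\), the least-squares solution is also the MLE.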

MLE for multivariate Gaussian

  • For \(\mathbf{x}_i \in \mathbb{R}^d\):

\[ \hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i, \quad \hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^\top \]

  • Direct extension of the 1D case to vectors and matrices (Murphy 2012).
  • Requires \(N > d\) for \(\hat{\boldsymbol{\Sigma}}\) to be invertible.
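
A sketch of the estimator and of the \(N > d\) requirement, on synthetic data with \(d = 5\) (dimensions, sample sizes, and the function name `gaussian_mle` are illustrative). With \(N = 3\) points the centered data span at most \(N - 1 = 2\) directions, so \(\hat{\boldsymbol{\Sigma}}\) is rank-deficient:

```python
import numpy as np

rng = np.random.default_rng(3)

def gaussian_mle(X):
    """MLE of mean and covariance for rows of X (shape N x d)."""
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / X.shape[0]    # divides by N, not N - 1
    return mu, Sigma

d = 5
X_small = rng.normal(size=(3, d))     # N = 3 <= d
X_large = rng.normal(size=(200, d))   # N = 200 > d

_, Sigma_small = gaussian_mle(X_small)
_, Sigma_large = gaussian_mle(X_large)

rank_small = np.linalg.matrix_rank(Sigma_small)
rank_large = np.linalg.matrix_rank(Sigma_large)
print(f"rank with N=3: {rank_small}, rank with N=200: {rank_large}")
```

The same covariance estimate is available as `np.cov(X, rowvar=False, bias=True)`.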

MLE: properties and limitations

  • Consistency: under standard regularity conditions and a correctly specified model, \(\hat{\theta}_{\text{MLE}} \to \theta_{\text{true}}\) as \(N \to \infty\).
  • Efficiency: asymptotically, the MLE attains the Cramér-Rao lower bound on the variance of unbiased estimators.
  • Limitation: can overfit with small \(N\) — MLE has no built-in regularization.
  • MLE treats all parameter values as equally plausible before seeing data.

Robustness: the outlier problem

  • The Gaussian has light tails — extreme values are extremely unlikely under the model.
  • When outliers are present, MLE distorts \(\hat{\mu}\) and inflates \(\hat{\sigma}^2\) to accommodate them.
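  • A single outlier at distance \(\Delta\) from the rest of the data shifts the sample mean by about \(\Delta/N\), so the shift grows without bound as the outlier becomes more extreme.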
  • Need: a distribution with heavier tails that accommodates outliers without distortion.
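
A sketch of the effect on synthetic data, corrupting one of 100 points by an illustrative offset of 50. For comparison it also reports the median, which is the location MLE under a Laplace model and is used here only as a simple heavy-tailed reference:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=0.0, scale=1.0, size=100)
N = x.size

x_out = x.copy()
delta = 50.0
x_out[0] += delta            # corrupt one point by an illustrative offset

mean_shift = x_out.mean() - x.mean()       # equals delta / N
var_ratio = x_out.var() / x.var()          # Gaussian MLE variance inflates
median_shift = np.median(x_out) - np.median(x)

print(f"mean shift   = {mean_shift:.3f}")
print(f"variance inflation factor = {var_ratio:.1f}")
print(f"median shift = {median_shift:.3f}")
```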

Student’s t-distribution for robust estimation

  • The Student’s t-distribution has a parameter \(\nu\) (degrees of freedom) controlling tail heaviness.
  • \(\nu \to \infty\): converges to Gaussian. \(\nu = 1\): Cauchy distribution (very heavy tails).
  • MLE with Student’s t automatically downweights outliers.
  • Practical recommendation: use \(\nu\) between about 4 and 10 for moderate robustness (Murphy 2012).

Gaussian, Student-\(t\), and Laplace pdfs (left) and log-pdfs (right). The Student-\(t\) has heavier tails than the Gaussian (Murphy 2012).

Fitting Gaussian, Student, and Laplace distributions without (a) and with (b) outliers. The Gaussian fit is strongly affected by the outliers (Murphy 2012).

Bayes’ theorem — statement

\[ p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{p(\mathcal{D})} \]

  • Posterior \(p(\boldsymbol{\theta} \mid \mathcal{D})\): what we believe about \(\boldsymbol{\theta}\) after seeing data.
  • Likelihood \(p(\mathcal{D} \mid \boldsymbol{\theta})\): how probable the data is under each \(\boldsymbol{\theta}\).
  • Prior \(p(\boldsymbol{\theta})\): what we believed before seeing data.
  • Evidence \(p(\mathcal{D})\): normalizing constant.

Components of Bayes’ theorem

  • The prior encodes domain knowledge or assumptions (e.g., “weights should be small”).
  • The likelihood is the same function used in MLE — it connects data to parameters \(\boldsymbol{\theta}\).
  • The posterior combines both: it is a compromise between prior knowledge and data evidence.
  • More data → posterior dominated by likelihood. Less data → posterior dominated by prior.

Beta distribution \(\text{Beta}(\mu|a,b)\) for different hyperparameters. Flat (\(a=b=1\)) = non-informative prior; peaked = strong prior belief (Bishop 2006).

The evidence (marginal likelihood)

\[ p(\mathcal{D}) = \int p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta} \]

  • Integrates the likelihood over all possible parameter values, weighted by the prior.
  • Ensures the posterior integrates to 1.
  • Often intractable for complex models — motivates approximation methods (MCMC, variational inference).
  • Also used for model comparison: models with higher evidence explain the data better.

Bayesian inference for Gaussian mean (known variance)

  • Prior: \(\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)\).
  • Likelihood: \(x_i | \mu \sim \mathcal{N}(\mu, \sigma^2)\) (known \(\sigma^2\)).
  • Posterior: \(\mu | \mathcal{D} \sim \mathcal{N}(\mu_N, \sigma_N^2)\) where:

\[ \mu_N = \frac{\sigma^2 \mu_0 + N \sigma_0^2 \bar{x}}{\sigma^2 + N \sigma_0^2}, \quad \sigma_N^2 = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + N \sigma_0^2} \]

  • This is a conjugate pair: Gaussian prior + Gaussian likelihood = Gaussian posterior (Bishop 2006).

One step of Bayesian update: Beta prior (left) × Bernoulli likelihood (centre) = Beta posterior (right). Conjugate structure yields the same family (Bishop 2006).

Posterior update: visual intuition

  • Before data (\(N=0\)): posterior = prior (wide, uncertain).
  • After a few points (\(N=2\)): posterior narrows, shifts toward sample mean.
  • More data (\(N=10\)): posterior is very narrow, centered near \(\bar{x}\).
  • As \(N \to \infty\): posterior concentrates at \(\hat{\mu}_{\text{MLE}}\) — the prior washes out.

Bayesian inference for a Gaussian mean with known variance. Each curve is the posterior after observing \(N\) data points. The distribution narrows as \(N\) grows (Bishop 2006).

Interactive: Bayesian Posterior Update

  • \(N = 0\): The Posterior (red) perfectly matches the Prior (gray).
  • Small \(N\): The Posterior is a compromise between the Prior and the Data Likelihood (blue).
  • Large \(N\): The Likelihood narrows dramatically, pulling the Posterior entirely towards the Data Mean (\(2.5\)). The Prior is “washed out”.
  • Strong Prior (small \(\sigma_0^2\)): The Prior resists the data pull much longer. Try changing Prior Var to 0.1 and notice how much data it takes to shift the belief!

Bayesian vs frequentist comparison

Aspect          | Frequentist          | Bayesian
Parameters      | Fixed, unknown       | Random variables
Inference       | Point estimate + CI  | Full posterior distribution
Prior knowledge | Not incorporated     | Formally included
Uncertainty     | Sampling variability | Posterior width
Interpretation  | Long-run frequency   | Degree of belief

  • Frequentist: Objective, based on repeated trials.
  • Bayesian: Subjective, based on evidence update.
  • Choice: Often depends on data availability and prior confidence.

Credible interval vs confidence interval

  • 95% Bayesian credible interval: “Given the data, there is a 95% probability that \(\theta\) lies in this interval.”
  • 95% frequentist confidence interval: “If we repeated the experiment many times, 95% of such intervals would contain the true \(\theta\).”
  • The Bayesian interpretation is often more natural for engineering decisions.

MAP estimation

  • Maximum A Posteriori: find the mode of the posterior:

\[ \hat{\boldsymbol{\theta}}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} \, p(\boldsymbol{\theta} \mid \mathcal{D}) = \arg\max_{\boldsymbol{\theta}} \left[\log p(\mathcal{D} \mid \boldsymbol{\theta}) + \log p(\boldsymbol{\theta})\right] \]

  • MAP is a point estimate — it summarizes the posterior by its peak.
  • MAP = MLE when the prior is uniform (non-informative).

MAP closes the regularization loop from Unit 3

  • Unit 3 asserted: Gaussian prior → Ridge, Laplace prior → Lasso. Here we derive why.
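  • Take a Gaussian likelihood with noise variance \(\sigma^2\) and a zero-mean Gaussian prior \(\boldsymbol{\theta} \sim \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})\). Negating \(\log p(\mathcal{D}\mid\boldsymbol{\theta}) + \log p(\boldsymbol{\theta})\) and dropping terms that do not depend on \(\boldsymbol{\theta}\), \(\hat{\boldsymbol{\theta}}_{\text{MAP}}\) minimizes: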

\[ \underbrace{\tfrac{1}{2\sigma^2}\sum_i (y_i - f_{\boldsymbol{\theta}}(\mathbf{x}_i))^2}_{\text{negative log-likelihood}} \;+\; \underbrace{\tfrac{1}{2\tau^2}\|\boldsymbol{\theta}\|_2^2}_{\text{negative log-prior}} \]

  • So the penalty strength is not a free knob: \(\lambda = \sigma^2/\tau^2\) = (noise variance)/(prior variance).
  • MAP keeps only the mode; the posterior also carries the width \(\sigma_N^2\) — the uncertainty, which is the whole reason we went Bayesian. Caveat: mode ≠ mean for skewed posteriors, and the mode is not reparametrisation-invariant.
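
A numerical check of the Ridge connection, assuming a linear model \(f_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta}^\top\mathbf{x}\) and illustrative values for \(\sigma\) and \(\tau\). The Ridge solution computed with \(\lambda = \sigma^2/\tau^2\) should make the gradient of the negative log-posterior vanish:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, tau = 0.5, 2.0                  # noise std and prior std (illustrative)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, 0.0, -1.5]) + rng.normal(0.0, sigma, 30)

lam = sigma**2 / tau**2                # penalty implied by the two variances

# Ridge closed form: minimizes ||y - X theta||^2 + lam * ||theta||^2
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

def neg_log_posterior_grad(theta):
    """Gradient of (1 / 2 sigma^2) * RSS + (1 / 2 tau^2) * ||theta||^2."""
    return -X.T @ (y - X @ theta) / sigma**2 + theta / tau**2

grad_norm = np.linalg.norm(neg_log_posterior_grad(theta_ridge))
print(f"lambda = {lam:.4f}, gradient norm at the Ridge solution = {grad_norm:.2e}")
```

A broad prior (large \(\tau\)) or low noise (small \(\sigma\)) gives a small \(\lambda\), and the estimate approaches ordinary least squares.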

When to be Bayesian vs frequentist

  • Small data, safety-critical: Bayesian — uncertainty quantification is essential.
  • Large data, fast iteration: MLE/frequentist — posterior approximates MLE anyway.
  • Model comparison: Bayesian evidence is a principled selection criterion.
  • Engineering practice: often a pragmatic mix — MLE for training, Bayesian for uncertainty.

Predictive distribution

  • Instead of predicting with a point estimate, integrate over parameter uncertainty:

\[ p(\mathbf{x}_{\text{new}} \mid \mathcal{D}) = \int p(\mathbf{x}_{\text{new}} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid \mathcal{D}) \, d\boldsymbol{\theta} \]

  • The predictive distribution is wider than the distribution under a point estimate.
  • It honestly reflects both data noise (aleatory) and parameter uncertainty (epistemic).
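
For the Gaussian-mean model with known variance treated earlier in this unit, the integral has a closed form (a standard result, stated here as a worked instance):

\[ p(x_{\text{new}} \mid \mathcal{D}) = \int \mathcal{N}(x_{\text{new}} \mid \mu, \sigma^2)\, \mathcal{N}(\mu \mid \mu_N, \sigma_N^2)\, d\mu = \mathcal{N}\!\left(x_{\text{new}} \mid \mu_N,\; \sigma^2 + \sigma_N^2\right) \]

The predictive variance is the sum of the aleatory part \(\sigma^2\) and the epistemic part \(\sigma_N^2\). As \(N \to \infty\), \(\sigma_N^2 \to 0\) and only the noise floor \(\sigma^2\) remains.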

Checkpoint: update a prior

  • Setup: Prior \(\mu_0 = 0\), \(\sigma_0^2 = 10\). Known noise \(\sigma^2 = 1\). Five observations with \(\bar{x} = 3.2\).
  • Compute \(\mu_N\) and \(\sigma_N^2\).
  • \(\sigma_N^2 = \frac{1 \cdot 10}{1 + 5 \cdot 10} = \frac{10}{51} \approx 0.196\).
  • \(\mu_N = \frac{1 \cdot 0 + 5 \cdot 10 \cdot 3.2}{1 + 50} = \frac{160}{51} \approx 3.14\).
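
The same computation as a short function, a sketch that transcribes the posterior formulas for the Gaussian mean with known variance (the function and argument names are illustrative):

```python
def gaussian_mean_posterior(mu0, var0, noise_var, xbar, n):
    """Posterior N(mu_n, var_n) for a Gaussian mean with known noise variance."""
    denom = noise_var + n * var0
    mu_n = (noise_var * mu0 + n * var0 * xbar) / denom
    var_n = noise_var * var0 / denom
    return mu_n, var_n

mu_n, var_n = gaussian_mean_posterior(mu0=0.0, var0=10.0, noise_var=1.0, xbar=3.2, n=5)
print(f"mu_N = {mu_n:.3f}, sigma_N^2 = {var_n:.3f}")
```

With this weak prior the posterior mean sits close to the sample mean 3.2, pulled only slightly toward \(\mu_0 = 0\).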

Stochastic enrichment of input data

  • Add Gaussian noise to training inputs to simulate measurement uncertainty:
    • \(\tilde{x}_i = x_i + \epsilon\), \(\epsilon \sim \mathcal{N}(0, \sigma_{\text{noise}}^2)\).
  • This augments the training set and makes the model robust to input perturbations.
  • Especially effective when the noise level matches real deployment conditions (Neuer et al. 2024).
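
A minimal sketch of this augmentation. The function name `augment_with_noise`, the number of copies, and the noise level are hypothetical choices; in practice \(\sigma_{\text{noise}}\) would be set from the known measurement uncertainty:

```python
import numpy as np

def augment_with_noise(X, y, noise_std, n_copies, rng):
    """Stack n_copies noisy versions of X below the originals (targets repeated)."""
    X = np.asarray(X, dtype=float)
    noisy = [X + rng.normal(0.0, noise_std, size=X.shape) for _ in range(n_copies)]
    X_aug = np.concatenate([X, *noisy], axis=0)
    y_aug = np.tile(np.asarray(y), n_copies + 1)
    return X_aug, y_aug

rng = np.random.default_rng(6)
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 4.0])
X_aug, y_aug = augment_with_noise(X, y, noise_std=0.1, n_copies=4, rng=rng)
print(X_aug.shape, y_aug.shape)
```

The original points are kept unchanged at the top of the augmented set, so the clean data remain part of training.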

Noise injection for model robustness.

Mixture-density networks

  • Standard networks predict a single value \(\hat{y}\) — they cannot express multi-modal uncertainty.
  • A mixture-density network predicts the parameters of a Gaussian mixture:

\[ p(y|x) = \sum_{k=1}^{K} \pi_k(x) \, \mathcal{N}(y | \mu_k(x), \sigma_k^2(x)) \]

  • Mixing coefficients \(\pi_k\), means \(\mu_k\), and variances \(\sigma_k^2\) are all functions of input \(x\) (Neuer et al. 2024).
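
A sketch of the mixture log-likelihood that such a network would be trained on, for a single input and a 1D target. The arrays stand in for the outputs of three network heads; the softmax parameterisation of \(\pi_k\) and the log-parameterisation of \(\sigma_k\) are common choices assumed here, not prescribed by the source:

```python
import numpy as np

def mdn_log_likelihood(y, logits, mu, log_sigma):
    """log p(y | x) for a 1D Gaussian mixture with K components.

    logits, mu, log_sigma: arrays of shape (K,), standing in for the three
    output heads of a network evaluated at one input x.
    """
    logits = np.asarray(logits, dtype=float)
    mu = np.asarray(mu, dtype=float)
    log_sigma = np.asarray(log_sigma, dtype=float)

    log_pi = logits - np.logaddexp.reduce(logits)    # log-softmax: weights sum to 1
    sigma = np.exp(log_sigma)                        # exp keeps sigma positive
    log_norm = -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((y - mu) / sigma) ** 2
    return np.logaddexp.reduce(log_pi + log_norm)    # stable log-sum-exp over components

# Bimodal example: two equally weighted modes at -2 and +2
logits = np.array([0.0, 0.0])
mu = np.array([-2.0, 2.0])
log_sigma = np.log(np.array([0.5, 0.5]))
ll_mode = mdn_log_likelihood(2.0, logits, mu, log_sigma)
ll_between = mdn_log_likelihood(0.0, logits, mu, log_sigma)
print(f"log p at a mode: {ll_mode:.3f}, log p between the modes: {ll_between:.3f}")
```

A single-Gaussian predictor fitted to such data would place its mean between the modes, exactly where the mixture assigns low probability.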

MDN predicting a bimodal distribution.

Process corridors via 2D histograms

  • In manufacturing, define acceptable parameter ranges as probability contours.
  • A 2D histogram of (process parameter, quality metric) shows the process corridor.
  • Points outside the corridor flag anomalies or process drift.
  • This converts probabilistic thinking into actionable quality control (Neuer et al. 2024).

Process corridor showing safe probability regions.

Materials example: property prediction with uncertainty

  • Predicting tensile strength of a new alloy composition.
  • Point prediction: 450 MPa. But how confident are we?
  • With uncertainty: 450 ± 35 MPa (95% credible interval).
  • Epistemic uncertainty is large in composition regions far from training data — flagging extrapolation.

Practical diagnostic: calibration plots

  • A well-calibrated model’s predicted \(p\)% confidence intervals should contain \(p\)% of test points.
  • Plot: predicted confidence level vs observed coverage.
  • Perfect calibration = diagonal line.
  • Overconfident models: predicted intervals are too narrow (points fall outside too often).
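
A sketch of the numbers behind such a plot, assuming Gaussian predictive intervals and synthetic data with true noise standard deviation 1. The function name `observed_coverage` and the two "models" (one with the correct \(\sigma\), one with \(\sigma\) too small) are constructed for illustration:

```python
import numpy as np
from statistics import NormalDist

def observed_coverage(y, mu_pred, sigma_pred, levels):
    """Fraction of targets inside the central Gaussian interval at each level."""
    z = np.abs((y - mu_pred) / sigma_pred)            # standardized residuals
    half_widths = [NormalDist().inv_cdf(0.5 + p / 2) for p in levels]
    return np.array([np.mean(z <= h) for h in half_widths])

rng = np.random.default_rng(7)
n = 20_000
mu_pred = rng.uniform(-1, 1, n)
y = mu_pred + rng.normal(0.0, 1.0, n)                 # true noise std is 1

levels = np.array([0.5, 0.8, 0.9, 0.95])
calibrated = observed_coverage(y, mu_pred, 1.0, levels)       # correct sigma
overconfident = observed_coverage(y, mu_pred, 0.5, levels)    # sigma too small

for p, c, o in zip(levels, calibrated, overconfident):
    print(f"nominal {p:.2f}: calibrated {c:.3f}, overconfident {o:.3f}")
```

Plotting `levels` against each coverage array gives the calibration curve: the calibrated model lies near the diagonal, the overconfident one below it.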

Distribution-Free Coverage: Conformal Prediction

The gap calibration plots leave open.

  • Bayesian / MAP intervals depend on the model being right. Mis-specify the prior or the likelihood and the credible interval is no longer trustworthy.
  • Calibration plots diagnose miscalibration but do not fix it for a new test point.
  • We want a wrapper that takes any point predictor and produces an interval with a finite-sample, distribution-free coverage guarantee.

Conformal prediction in one sentence.

  • Pick miscoverage level \(\alpha\). Conformal prediction outputs \(C(\mathbf{x})\) such that, for any exchangeable new \((X, Y)\), \[\Pr\!\left(Y \in C(X)\right) \;\geq\; 1 - \alpha.\]
  • No assumption on the data distribution. No assumption on the model.
  • Only assumption: exchangeability of calibration and test data (typically i.i.d. from the same source) (Angelopoulos and Bates 2023).

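The guarantee is marginal: it holds on average over the calibration and test draws, not for each individual input \(\mathbf{x}\). Exact conditional coverage is not guaranteed by any method in this unit; CQR, introduced below, adapts the interval width to the input and aims to improve conditional behaviour in practice.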

Split conformal in 5 lines

The algorithm. Given a trained predictor \(\hat f\) and a separate calibration set \(\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}\):

  1. Compute nonconformity scores \(s_i = |y_i - \hat f(\mathbf{x}_i)|\) for \(i = 1, \dots, n\).
  2. Compute the empirical quantile \[ \hat q = \mathrm{Quantile}\!\left(\{s_i\};\; \tfrac{\lceil (n+1)(1-\alpha) \rceil}{n}\right). \]
  3. For a new \(\mathbf{x}\), output the interval \[ C(\mathbf{x}) = \big[\hat f(\mathbf{x}) - \hat q,\; \hat f(\mathbf{x}) + \hat q\big]. \]

For any exchangeable new \((X, Y)\): \[\Pr\!\left(Y \in C(X)\right) \;\geq\; 1 - \alpha.\]

The 5-line Python.

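# Held-out calibration set (x_cal, y_cal);
# trained model `model`; alpha = 0.1
import numpy as np

scores = np.abs(y_cal - model.predict(x_cal))
n = len(scores)
level = np.ceil((n + 1) * (1 - alpha)) / n
# level > 1 occurs for very small n: no finite interval has the guarantee.
# method="higher" returns an observed score instead of interpolating.
qhat = np.inf if level > 1 else np.quantile(
    scores, level, method="higher")

# At test time:
def conformal_interval(x):
    yhat = model.predict(x)
    return yhat - qhat, yhat + qhat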

\(\hat q\) is a single scalar. The interval width is the same for every test point in vanilla split CP — the next slide (CQR) is the standard fix (Angelopoulos and Bates 2023).
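
An end-to-end sketch on synthetic data that checks the coverage empirically. The data generator and the deliberately crude `predict` function are hypothetical stand-ins for a real dataset and trained model; the point is that coverage holds even when the predictor is poor:

```python
import numpy as np

rng = np.random.default_rng(8)

def make_data(n):
    x = rng.uniform(-3, 3, n)
    y = np.sin(x) + rng.normal(0.0, 0.3, n)
    return x, y

def predict(x):
    """Hypothetical stand-in for a trained model: a deliberately crude linear fit."""
    return 0.3 * x

alpha = 0.1
x_cal, y_cal = make_data(500)
x_test, y_test = make_data(20_000)

scores = np.abs(y_cal - predict(x_cal))
n = len(scores)
level = np.ceil((n + 1) * (1 - alpha)) / n
# method="higher" picks an observed score rather than interpolating
qhat = np.inf if level > 1 else np.quantile(scores, level, method="higher")

lower, upper = predict(x_test) - qhat, predict(x_test) + qhat
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"qhat = {qhat:.3f}, empirical coverage = {coverage:.3f}")
```

The empirical coverage fluctuates around \(1-\alpha\) from one calibration draw to the next; the guarantee concerns the average over such draws. A worse predictor does not break coverage, it only makes \(\hat q\), and hence the interval, larger.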

Conformalized Quantile Regression (CQR) — adaptive interval widths

Why vanilla split CP is too coarse.

  • \(\hat q\) is the same constant for every \(\mathbf{x}\).
  • Easy inputs get unnecessarily wide intervals; hard inputs may still under-cover conditionally.
  • We want intervals that widen near hard regions of input space — i.e., locally heteroscedastic coverage.

CQR fix (Romano et al. 2019).

  1. Train a quantile-regression model predicting \(\hat q_{\alpha/2}(\mathbf{x})\) and \(\hat q_{1-\alpha/2}(\mathbf{x})\), e.g., a NN with two heads using the pinball loss \(\ell_\tau(r) = \max(\tau r, (\tau - 1) r)\) on the residual \(r = y - \hat q_\tau(\mathbf{x})\), or a quantile gradient-boosted tree.
  2. Compute conformity scores \[ s_i = \max\!\big(\hat q_{\alpha/2}(\mathbf{x}_i) - y_i,\; y_i - \hat q_{1-\alpha/2}(\mathbf{x}_i)\big). \]
  3. Take \(\hat q = \mathrm{Quantile}\!\left(s_i;\; \tfrac{\lceil (n+1)(1-\alpha) \rceil}{n}\right)\) and output \[ C(\mathbf{x}) = \big[\hat q_{\alpha/2}(\mathbf{x}) - \hat q,\; \hat q_{1-\alpha/2}(\mathbf{x}) + \hat q\big]. \]
  • Marginal coverage is preserved; intervals adapt to local input difficulty.
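
A sketch of steps 2 and 3 on synthetic heteroscedastic data. Training a quantile model is out of scope here, so `q_lo` and `q_hi` are hypothetical stand-ins for trained quantile heads, deliberately made too narrow so that the conformal correction has something to fix:

```python
import numpy as np

rng = np.random.default_rng(9)
alpha = 0.1

def noise_std(x):
    return 0.1 + 0.4 * np.abs(x)          # heteroscedastic: noisier away from 0

def make_data(n):
    x = rng.uniform(-2, 2, n)
    return x, np.sin(x) + rng.normal(0.0, 1.0, n) * noise_std(x)

def pinball_loss(y, q_pred, tau):
    """Pinball loss with residual r = y - q_pred (the training loss in step 1)."""
    r = y - q_pred
    return np.mean(np.maximum(tau * r, (tau - 1) * r))

# Hypothetical stand-ins for trained quantile heads, deliberately too narrow.
def q_lo(x):
    return np.sin(x) - 1.0 * noise_std(x)

def q_hi(x):
    return np.sin(x) + 1.0 * noise_std(x)

x_cal, y_cal = make_data(1000)
x_test, y_test = make_data(20_000)

# Step 2: conformity scores (positive when y falls outside the raw band)
scores = np.maximum(q_lo(x_cal) - y_cal, y_cal - q_hi(x_cal))
n = len(scores)
level = np.ceil((n + 1) * (1 - alpha)) / n
qhat = np.inf if level > 1 else np.quantile(scores, level, method="higher")

# Step 3: widen (or shrink, if qhat < 0) the band by the same qhat everywhere
lower, upper = q_lo(x_test) - qhat, q_hi(x_test) + qhat
coverage = np.mean((y_test >= lower) & (y_test <= upper))
widths = upper - lower
print(f"qhat = {qhat:.3f}, coverage = {coverage:.3f}")
print(f"interval width ranges from {widths.min():.2f} to {widths.max():.2f}")
print(f"pinball loss of the lower head: {pinball_loss(y_cal, q_lo(x_cal), alpha / 2):.3f}")
```

Unlike vanilla split CP, the width now varies with \(\mathbf{x}\) because it inherits the shape of the quantile heads; \(\hat q\) only supplies a constant correction.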

The 2026 default UQ stack for a regression NN: quantile heads + CQR. Calibrate once, ship with finite-sample coverage.

Failure mode — exchangeability under drift

When the guarantee breaks.

  • The coverage proof needs exchangeability of \(\{(X_1, Y_1), \dots, (X_n, Y_n), (X_{n+1}, Y_{n+1})\}\).
  • I.i.d. data: exchangeable. Time-series with drift, train/test from different labs, post-deployment shifts: not exchangeable.
  • Under covariate shift the marginal coverage can fall below \(1 - \alpha\); under harder shifts the guarantee can be lost entirely.

Mitigations (named, not derived).

  • Weighted conformal prediction — reweight calibration scores by an importance ratio (assumes known shift structure).
  • Online conformal — adapt the quantile \(\hat q\) from streaming feedback.
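  • Jackknife+ / CV+: replace the single calibration split with leave-one-out or cross-validated scores (Angelopoulos and Bates 2023). This addresses data efficiency rather than drift: exchangeability is still required, and the worst-case guarantee weakens to \(1 - 2\alpha\).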

Engineering rule of thumb. Report split-conformal / CQR intervals and the exchangeability assumption you are making. When the assumption is shaky, instrument for coverage tracking after deployment.

Lecture-essential vs exercise content split

  • Lecture: uncertainty taxonomy, Gaussian distribution, MLE derivation, Bayesian framework, MAP-regularization connection, split conformal + CQR.
  • Exercise: noise injection and Nyquist demo, MLE implementation in NumPy, Bayesian posterior updating, MLE-MSE equivalence proof, calibration plots, 5-line split-conformal wrapper around a baseline regressor.

Exam-aligned summary: must-know statements

  1. Aleatory uncertainty is irreducible; epistemic uncertainty is reducible with more data.
  2. The Gaussian is the maximum-entropy distribution for given mean and variance.
  3. MLE maximizes the probability of the observed data under the model.
  4. For Gaussian noise, MLE is equivalent to MSE minimization.
  5. Bayes’ theorem: posterior \(\propto\) likelihood \(\times\) prior.
  6. Conjugate priors yield closed-form posteriors (e.g., Gaussian-Gaussian).
  7. MAP with a Gaussian prior is equivalent to Ridge regression.
  8. The predictive distribution integrates over parameter uncertainty.
  9. Student’s t-distribution provides robustness to outliers.
  10. Calibration plots assess whether predicted uncertainties match observed frequencies.
  11. KL divergence is non-negative, zero iff the two distributions agree, and asymmetric; the closed form between Gaussians is the regularizer in the VAE loss.
  12. Split conformal prediction wraps any predictor with a finite-sample, distribution-free marginal-coverage guarantee under exchangeability; CQR adapts the width to local input difficulty.


References + reading assignment for next unit

  • Required reading before Unit 8:
    • Neuer: Ch. 4.5.9 (overfitting and cross-validation)
    • McClarren: Ch. 2.4 (Ridge, Lasso, elastic net)
  • Optional depth:
    • Murphy: Ch. 6.4.4, 6.5.3 (bias-variance, CV for \(\lambda\) selection)
    • Bishop: Ch. 3.2 (bias-variance decomposition)
    • Angelopoulos and Bates (2023): §1–§3 (conformal prediction)
  • Next unit: Generalization, Bias-Variance, Regularization, Tree Ensembles — using the probabilistic machinery to reason about overfitting.
Angelopoulos, Anastasios N., and Stephen Bates. 2023. “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” Foundations and Trends in Machine Learning 16 (4): 494–591. https://doi.org/10.1561/2200000101.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.
Romano, Yaniv, Evan Patterson, and Emmanuel J. Candès. 2019. “Conformalized Quantile Regression.” Advances in Neural Information Processing Systems 32.