ECLIPSE Presentations – Mathematical Foundations of AI & ML Unit 7: The Probabilistic View of Learning; Conformal Prediction

Title + Unit 7 positioning

Units 1–6 built the optimization machinery for learning.
Unit 7 introduces the probabilistic foundations that those losses and optimizers implicitly rest on.
Probability is the language of uncertainty — and learning is fundamentally about reasoning under uncertainty.
We close the unit with conformal prediction: a distribution-free coverage guarantee that any downstream uncertainty-quantification (UQ) method can plug into.

Open by naming the pivot: Units 1–6 answered “how do we fit a model” (loss, gradients, optimizers). This unit answers “what does the data mean” — every loss we minimized is secretly a probabilistic statement, and making that explicit is what lets us put error bars on predictions.
Set the destination early: the unit ends with conformal prediction, a guarantee that holds even if our probability model is wrong. Tell students that arc — “we build the probabilistic worldview, then give them a safety net that survives its failure” — so the long Gaussian/MLE/Bayes middle has a clear payoff.
Audience anchor: this is a materials-science cohort. Promise concrete returns — why MSE is the “right” loss, why ridge regression is not arbitrary, how to report a trustworthy ±interval to a metallurgist. Keep that promissory note visible all lecture.
Timing: this is a content-dense 90 min. Budget ~10 min framing+uncertainty, ~25 min Gaussian/entropy/KL, ~20 min MLE, ~20 min Bayes, ~15 min conformal. Flag the interactives as the pace-recovery points you can compress if behind.

Recap: what risk minimization assumes

Unit 1: \(\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \mathbb{E}_{(\mathbf{x},y) \sim P}[L(f_{\boldsymbol{\theta}}(\mathbf{x}), y)]\).
The expectation is over a probability distribution \(P\) of data.
Until now, we treated this as a mathematical abstraction. Now we make it concrete.

This slide is the “aha” the whole unit rests on: the \(\mathbb{E}_{P}\) we wrote casually in Unit 1 was never decorative — it presupposes a data distribution \(P\). Learning is estimating properties of \(P\) from finite samples.
Make the abstraction concrete with one sentence: “We never see \(P\); we see a handful of draws from it, and we gamble that minimizing average loss on those draws minimizes it on the unseen rest.” That gamble is exactly what the rest of the course (generalization, Unit 8) interrogates.
Common student misconception to preempt: that the training set is the problem. Reframe — the training set is a noisy window onto \(P\); the real target is performance under \(P\).
Transition line: “To reason about \(P\) we need its vocabulary — distributions, expectations, likelihood, Bayes. That vocabulary is today.”

Learning outcomes for Unit 7

By the end of this lecture, students can:

classify uncertainty as aleatory or epistemic and explain why this matters,
write the Gaussian in 1D and multivariate form and explain its maximum-entropy property,
compute and interpret KL divergence between distributions, in particular the closed form between two Gaussians,
derive the MLE for Gaussian parameters and connect it to MSE minimization,
apply Bayes’ theorem to update prior beliefs into posterior distributions,
apply split conformal prediction to wrap any predictor with a finite-sample coverage guarantee, and recognise when the exchangeability assumption breaks.

Why probability is the language of learning

Data is inherently noisy — repeated measurements give different results.
Models are uncertain — finite data cannot determine parameters exactly.
Probability provides a consistent, rigorous framework for quantifying both.
Without probability, we cannot define what “learning from data” means.

Aleatory uncertainty — definition

Aleatory (from Latin alea = dice): irreducible randomness in the data-generating process.
Examples: thermal noise in sensors, quantum measurement, turbulent flow variability.
No amount of additional data or better models can eliminate aleatory uncertainty.
It sets a floor on achievable prediction error (the Bayes error — formally treated in Unit 8 with the bias-variance decomposition).

Epistemic uncertainty — definition

Epistemic (from Greek episteme = knowledge): uncertainty from limited knowledge.
Reducible by collecting more data, improving the model, or adding features.
Examples: parameter uncertainty with small \(N\), model misspecification, missing variables.
The parameter-uncertainty part shrinks as the training set grows; misspecification and missing variables call for a better model or new features instead.

Why the distinction matters

Uncertainty breakdown

Aleatory uncertainty: set appropriate error bars; do not waste resources trying to reduce it.
Epistemic uncertainty: invest in data collection or model improvement.
Confusing the two leads to wasted effort (trying to reduce noise) or false confidence (ignoring model uncertainty).
Engineering systems must handle both types appropriately (Neuer et al. 2024).

Interactive: Aleatory vs. Epistemic Uncertainty

Interactive Plot
Interpretation

viewof n_samples_unc = Inputs.range([5, 500], {value: 20, step: 1, label: "Sample Size (N) 📉"})
viewof aleatory_noise = Inputs.range([0, 2], {value: 0.5, step: 0.1, label: "Aleatory Noise (σ) 🎲"})

true_func = (x) => Math.sin(x * Math.PI) + 0.5 * x;

// Generate Data
unc_data = {
  const points = [];
  const rng = d3.randomNormal(0, aleatory_noise);
  for (let i = 0; i < n_samples_unc; i++) {
    const x = d3.randomUniform(-2, 2)();
    points.push({x: x, y: true_func(x) + rng()});
  }
  return points.sort((a,b) => a.x - b.x);
}

// Stylized epistemic band: no model is fitted here.
// The half-width is that of a 95% confidence interval for a mean,
// which shrinks as 1 / sqrt(N).
unc_ci_width = aleatory_noise * 1.96 / Math.sqrt(n_samples_unc);
  
Plot.plot({
  width: 1400,
  height: 700,
  x: {domain: [-2.1, 2.1], label: "x"},
  y: {domain: [-3, 3], label: "y"},
  marks: [
    // True function
    Plot.line(d3.range(-2, 2.1, 0.1), {x: d => d, y: d => true_func(d), stroke: "lightgray", strokeWidth: 3, strokeDasharray: "6,4", title: "True Process"}),
    // Epistemic Uncertainty Band
    Plot.areaY(d3.range(-2, 2.1, 0.1), {
      x: d => d, 
      y1: d => true_func(d) - unc_ci_width,
      y2: d => true_func(d) + unc_ci_width,
      fill: "cyan", 
      fillOpacity: 0.4, 
      title: "Epistemic Uncertainty (Model Confidence)"
    }),
    // Aleatory Noise (Data points)
    Plot.dot(unc_data, {x: "x", y: "y", r: 4, fill: "red", fillOpacity: 0.6, title: "Observations (Aleatory Noise)"})
  ]
})

The gray dashed line is the true, hidden process.
The red dots are sampled data. Their spread around the true process is aleatory uncertainty (irreducible noise).
The cyan band represents our model’s confidence in the mean (epistemic uncertainty).
Try it: Increase N. Notice how the cyan band shrinks, but the red dots remain scattered.

Demo script (do it live, ~90 s): start N≈20, σ≈0.5. Then drag N to 500 and narrate “the cyan confidence band collapses: epistemic uncertainty is being bought down by data.” Then reset N and drag σ up: “the red scatter explodes, and the band widens with it because its half-width is 1.96σ/√N. More data can shrink the band again, but nothing we do shrinks the scatter: aleatory uncertainty is immune to anything we do.”
The punchline to say out loud: increasing N shrinks the band but the dots stay scattered — the visual proof of the previous two slides. Have students predict before you drag.
Honesty note for sharp students: the band here is a stylized \(1.96\,\sigma/\sqrt{N}\), not a fitted model’s true posterior — the qualitative behavior is right, the exact shape is illustrative. Say this so no one over-reads the figure.
Recovery point: if running behind, this interactive can be shown for 30 s instead of 2 min without losing the thread.

The sampling process

Data is a collection of realizations from a random process.
The sampling rate, resolution, and digitization affect what information is preserved.
Insufficient sampling introduces systematic errors that no model can correct.

Nyquist-Shannon theorem

To reconstruct a signal with maximum frequency \(f_{\max}\), the sampling rate must satisfy:

\[ f_s > 2 f_{\max} \]

Below this rate: aliasing — high-frequency components fold into low frequencies.
Relevance to ML: undersampled data contains phantom patterns that models can overfit to (Neuer et al. 2024).
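
A minimal sketch of aliasing, assuming a 9 Hz sine sampled at only 10 Hz (both numbers are illustrative, not from the lecture). Sample by sample, the result is indistinguishable from a sign-flipped 1 Hz sine, which is the kind of phantom pattern a model could fit.

```javascript
// Sample a sine of frequency f (Hz) at rate sampleRate (Hz) for n samples.
function sampleSine(f, sampleRate, n) {
  return Array.from({ length: n }, (_, k) => Math.sin(2 * Math.PI * f * k / sampleRate));
}

const sampleRate = 10;                                      // well below 2 * 9 Hz
const fastSignal = sampleSine(9, sampleRate, 20);           // 9 Hz signal, undersampled
const slowAlias = sampleSine(1, sampleRate, 20).map(v => -v); // 1 Hz sine, sign flipped

// Largest sample-by-sample difference between the two sequences.
const maxAliasGap = Math.max(...fastSignal.map((v, k) => Math.abs(v - slowAlias[k])));
console.log(maxAliasGap < 1e-9); // true: the sampled sequences coincide numerically
```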

Engineering example: sensor data and uncertainty sources

A temperature sensor measuring a furnace:
- Aleatory: thermal fluctuations (\(\pm 2\,^{\circ}\mathrm{C}\) at steady state).
- Epistemic: calibration drift (systematic, correctable with recalibration).
An ML model trained on this data must account for both sources to make reliable predictions.

Random variables and probability distributions

A random variable \(X\) maps outcomes to numbers.
Discrete: probability mass function \(P(X = x)\).
Continuous: probability density function \(p(x)\) where \(\int p(x)\,dx = 1\).
The PDF gives relative likelihood — not probability — at each point.

This is the formal vocabulary slide — keep it crisp, the payoff is downstream. The one point students must internalize: a PDF is not a probability. \(p(x)\) can exceed 1; only \(\int p\,dx\) over an interval is a probability. This trips up nearly everyone the first time and causes real bugs (e.g. “likelihood > 1, is that wrong?” — no).
Discrete vs continuous: the PMF/PDF distinction is what determines whether you sum or integrate — and whether you write cross-entropy as a sum (classification, Unit 4) or a density log-likelihood (regression, this unit). Make that connection so it isn’t abstract.
Transition: “with random variables defined, we can finally summarize a distribution with numbers — expectation and variance — which is where every loss function secretly lives.”
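
To make "a PDF is not a probability" concrete, here is a small numerical sketch. The narrow Gaussian with σ = 0.1 and the midpoint-rule helper `integrate` are illustrative choices, not part of the slides.

```javascript
// Gaussian density with mean mu and standard deviation sigma.
function normalPdf(x, mu, sigma) {
  const z = (x - mu) / sigma;
  return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2 * Math.PI));
}

// Midpoint-rule approximation of the integral of f over [a, b].
function integrate(f, a, b, steps = 20000) {
  const h = (b - a) / steps;
  let total = 0;
  for (let i = 0; i < steps; i++) total += f(a + (i + 0.5) * h);
  return total * h;
}

const narrowSigma = 0.1;
const peakDensity = normalPdf(0, 0, narrowSigma);                             // about 3.99, above 1
const totalMass = integrate(x => normalPdf(x, 0, narrowSigma), -1, 1);        // about 1
const massNearMean = integrate(x => normalPdf(x, 0, narrowSigma), -0.1, 0.1); // about 0.68
console.log({ peakDensity, totalMass, massNearMean });
```

The density exceeds 1 at the peak, yet every integral over an interval stays between 0 and 1.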

Expected value and variance

Expected value (mean): \(\mu = \mathbb{E}[X] = \int x \, p(x) \, dx\).
Variance: \(\sigma^2 = \text{Var}[X] = \mathbb{E}[(X - \mu)^2]\).
Standard deviation: \(\sigma = \sqrt{\text{Var}[X]}\) — same units as \(X\).
The mean locates the distribution; the variance measures its spread.

Higher moments: skewness and kurtosis

Skewness \(= \mathbb{E}\left[\left(\frac{X-\mu}{\sigma}\right)^3\right]\): measures asymmetry (0 for symmetric distributions).
Kurtosis \(= \mathbb{E}\left[\left(\frac{X-\mu}{\sigma}\right)^4\right]\): measures tail heaviness (3 for the Gaussian).
Excess kurtosis \(= \text{kurtosis} - 3\): deviation from Gaussian tail behavior.
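
A short sketch of these definitions applied to data, assuming divide-by-N sample moments and two made-up toy datasets.

```javascript
// Standardized central moment of order k: mean of ((x - mean) / sd)^k,
// using the population (divide-by-N) standard deviation.
function standardizedMoment(xs, k) {
  const n = xs.length;
  const mean = xs.reduce((s, x) => s + x, 0) / n;
  const sd = Math.sqrt(xs.reduce((s, x) => s + (x - mean) ** 2, 0) / n);
  return xs.reduce((s, x) => s + ((x - mean) / sd) ** k, 0) / n;
}

const symmetricData = [1, 2, 3, 4, 5];
const rightSkewedData = [0, 0, 0, 0, 10];

console.log(standardizedMoment(symmetricData, 3));     // about 0: no asymmetry
console.log(standardizedMoment(symmetricData, 4) - 3); // about -1.3: lighter tails than a Gaussian
console.log(standardizedMoment(rightSkewedData, 3));   // about 1.5: long right tail
```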

The Gaussian distribution (1D)

\[ p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]

Completely characterized by two parameters: mean \(\mu\) and variance \(\sigma^2\).
Symmetric, unimodal, bell-shaped.
The 68-95-99.7 rule: about 68%, 95%, and 99.7% of the probability mass lies within \(1\sigma\), \(2\sigma\), and \(3\sigma\) of the mean, respectively.

Interactive: The 1D Gaussian & Empirical Rule

Interactive Plot
Interpretation

viewof g_mean = Inputs.range([-3, 3], {value: 0, step: 0.1, label: "Mean (μ)"})
viewof g_std = Inputs.range([0.1, 3], {value: 1, step: 0.1, label: "Std Dev (σ)"})
viewof show_intervals = Inputs.checkbox(["1σ (68%)", "2σ (95%)", "3σ (99.7%)"], {label: "Show Intervals", value: ["1σ (68%)"]})

gaussian_pdf = (x, mu, sigma) => {
  const variance = sigma * sigma;
  return (1 / Math.sqrt(2 * Math.PI * variance)) * Math.exp(-Math.pow(x - mu, 2) / (2 * variance));
}

g_x_vals = d3.range(-6, 6.05, 0.05);
g_data = g_x_vals.map(x => ({x: x, y: gaussian_pdf(x, g_mean, g_std)}));

Plot.plot({
  width: 800,
  height: 350,
  x: {domain: [-6, 6], label: "x"},
  y: {domain: [0, 4.2], label: "Density p(x)"},
  marks: [
    // 3 sigma
    show_intervals.includes("3σ (99.7%)") ? Plot.areaY(g_data.filter(d => d.x >= g_mean - 3*g_std && d.x <= g_mean + 3*g_std), {x: "x", y: "y", fill: "#ffbaba", fillOpacity: 0.5}) : null,
    // 2 sigma
    show_intervals.includes("2σ (95%)") ? Plot.areaY(g_data.filter(d => d.x >= g_mean - 2*g_std && d.x <= g_mean + 2*g_std), {x: "x", y: "y", fill: "#ff7b7b", fillOpacity: 0.6}) : null,
    // 1 sigma
    show_intervals.includes("1σ (68%)") ? Plot.areaY(g_data.filter(d => d.x >= g_mean - 1*g_std && d.x <= g_mean + 1*g_std), {x: "x", y: "y", fill: "#ff5252", fillOpacity: 0.8}) : null,
    
    // PDF Line
    Plot.line(g_data, {x: "x", y: "y", stroke: "white", strokeWidth: 3}),
    
    // Mean Line
    Plot.ruleX([g_mean], {stroke: "white", strokeDasharray: "4,4"})
  ]
})

Modifying the Mean (\(\mu\)) shifts the distribution left or right. It represents the center of mass.
Modifying the Standard Deviation (\(\sigma\)) stretches the distribution.
Notice that the peak height drops as the curve stretches, so that the total area (probability) stays equal to 1.

Why the Gaussian is special: maximum entropy

Among all distributions with a given mean \(\mu\) and variance \(\sigma^2\), the Gaussian has maximum entropy.
Maximum entropy = maximum uncertainty = fewest additional assumptions.
Using a Gaussian is therefore the most conservative choice when only mean and variance are known.
This is the information-theoretic justification for the Gaussian’s ubiquity (Murphy 2012).
(Entropy \(H(p)\) is defined formally in the information-theoretic primer later in this unit.)

Central Limit Theorem connection

The suitably standardized sum (or average) of many independent random variables with finite variance converges in distribution to a Gaussian, largely regardless of their individual distributions.
This explains why the Gaussian appears everywhere:
- Measurement errors = sum of many small independent perturbations.
- Aggregate quantities in materials science follow approximately Gaussian distributions.

Histograms of the mean of \(N\) uniform random variables for \(N=1,2,10\). As \(N\) grows the distribution rapidly becomes Gaussian (Bishop 2006).
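
The histogram experiment can be approximated numerically. This is a rough sketch that uses `Math.random()`, so the printed numbers vary from run to run; the trial count of 50000 is an arbitrary choice.

```javascript
// Draw `trials` averages of n independent Uniform(0, 1) samples.
function uniformMeans(n, trials) {
  const means = new Array(trials);
  for (let t = 0; t < trials; t++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += Math.random();
    means[t] = sum / n;
  }
  return means;
}

// Variance and excess kurtosis of a sample (divide-by-N moments).
function spreadAndTails(xs) {
  const n = xs.length;
  const mean = xs.reduce((s, x) => s + x, 0) / n;
  const m2 = xs.reduce((s, x) => s + (x - mean) ** 2, 0) / n;
  const m4 = xs.reduce((s, x) => s + (x - mean) ** 4, 0) / n;
  return { variance: m2, excessKurtosis: m4 / (m2 * m2) - 3 };
}

for (const n of [1, 2, 10]) {
  const { variance, excessKurtosis } = spreadAndTails(uniformMeans(n, 50000));
  // Theory: variance = 1 / (12 n); excess kurtosis = -1.2 / n, approaching the Gaussian value 0.
  console.log(n, variance.toFixed(4), excessKurtosis.toFixed(2));
}
```

The excess kurtosis moving toward 0 as \(N\) grows is one numerical signature of the histograms becoming Gaussian.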

Multivariate Gaussian distribution

\[ p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-d/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right) \]

\(\boldsymbol{\mu} \in \mathbb{R}^d\): mean vector. \(\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}\): covariance matrix (symmetric, positive definite).
Level sets are ellipsoids whose axes align with eigenvectors of \(\boldsymbol{\Sigma}\).
Eigenvectors \(\mathbf{u}_i\) give the ellipse orientation; eigenvalues \(\lambda_i\) give the axis lengths \(\lambda_i^{1/2}\).

2D Gaussian density ellipse. The principal axes are the eigenvectors \(\mathbf{u}_1, \mathbf{u}_2\) of \(\boldsymbol{\Sigma}\), scaled by \(\lambda_i^{1/2}\) (Murphy 2012).

This is a direct callback to Unit 2 (PCA/SVD) and Unit 6 (Hessian eigenstructure): the quadratic form \((\mathbf{x}-\boldsymbol\mu)^\top\boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\) is the same eigen-geometry — eigenvectors set ellipse orientation, eigenvalues set axis lengths. Say “you have seen this matrix three times now in three disguises” — it consolidates the course’s central linear-algebra thread.
The quadratic form \((\mathbf{x}-\boldsymbol\mu)^\top\boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\) is the squared Mahalanobis distance: \(\boldsymbol\Sigma^{-1}\) rescales each direction by its variance so “far” means “statistically surprising,” not “large in raw units.” This is the right mental model for anomaly detection / process corridors later in the unit.
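Practical note: \(\boldsymbol\Sigma\) being symmetric positive definite is what makes the density and the ellipsoid picture valid. With small N the sample covariance stays positive semidefinite but becomes singular, so \(\boldsymbol\Sigma^{-1}\) does not exist; this is a real bug students hit (see the \(N>d\) condition on the multivariate-MLE slide).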

Covariance matrix: diagonal vs full

Diagonal \(\boldsymbol{\Sigma}\): features are uncorrelated; ellipsoids are axis-aligned.
Full \(\boldsymbol{\Sigma}\): features are correlated; ellipsoids are rotated.
Spherical (\(\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}\)): isotropic; level sets are spheres.
The eigenvalues of \(\boldsymbol{\Sigma}\) determine the extent along each principal axis.

Contours of constant density for (a) full, (b) diagonal, and (c) spherical (isotropic) covariance matrices (Bishop 2006).

The three-case taxonomy maps directly to modeling choices students will make: spherical = “I assume all features equivalent and independent” (rarely true), diagonal = “independent but differently scaled” (the naive-Bayes / mean-field assumption), full = “I model correlations” (most expressive, needs the most data).
The cost ladder is the real lesson: full \(\boldsymbol\Sigma\) has \(d(d{+}1)/2\) parameters, quadratic in dimension. This is why high-dimensional Gaussian models default to diagonal/low-rank; it recalls Gaussian-mixture clustering (Unit 5) and foreshadows the diagonal-covariance approximation in VAEs (Unit 11).
Quick interpretive question to the room: “Diagonal Σ — can features still be dependent?” Answer: uncorrelated ≠ independent in general, but for a joint Gaussian, zero covariance does imply independence. That Gaussian-only equivalence is worth stating cleanly; it’s a classic exam trap.

viewof cond_rho = Inputs.range([-0.95, 0.95], {value: 0.7, step: 0.05, label: "Correlation (ρ)"})
viewof cond_x0  = Inputs.range([-3, 3], {value: 1.0, step: 0.1, label: "Observed x"})

(Standardised to unit variances, \(\sigma_x=\sigma_y=1\), and zero means, \(\mu=0\), so the conditioning formula is visible directly.)

cond_data = {
  const N = 500, pts = [];
  const L21 = cond_rho, L22 = Math.sqrt(Math.max(1e-9, 1 - cond_rho * cond_rho));
  const rng = d3.randomNormal(0, 1);
  for (let i = 0; i < N; i++) {
    const z1 = rng(), z2 = rng();
    pts.push({x: z1, y: L21 * z1 + L22 * z2});
  }
  return pts;
}

// Conditional p(y | x = x0): mean = rho*x0, var = 1 - rho^2
cond_mu  = cond_rho * cond_x0;
cond_var = 1 - cond_rho * cond_rho;
cond_curve = d3.range(-4, 4.05, 0.05).map(y => ({
  y: y,
  // density scaled into the plot's x-units for display alongside the cloud
  d: cond_x0 + 2.2 * (1 / Math.sqrt(2 * Math.PI * cond_var)) *
       Math.exp(-Math.pow(y - cond_mu, 2) / (2 * cond_var))
}));

Plot.plot({
  width: 600, height: 600,
  x: {domain: [-4, 4], label: "X"},
  y: {domain: [-4, 4], label: "Y"},
  aspectRatio: 1,
  marks: [
    Plot.dot(cond_data, {x: "x", y: "y", r: 3, fill: "steelblue", fillOpacity: 0.25}),
    Plot.density(cond_data, {x: "x", y: "y", stroke: "white", thresholds: 5}),
    Plot.ruleX([cond_x0], {stroke: "#ff7b7b", strokeWidth: 2, strokeDasharray: "4,3"}),
    Plot.line(cond_curve, {x: "d", y: "y", stroke: "#ff7b7b", strokeWidth: 3}),
    Plot.dot([{x: cond_x0, y: cond_mu}], {x: "x", y: "y", r: 7, fill: "#ff7b7b"})
  ]
})

Slicing the cloud at \(X=x_0\) (red line) leaves a 1D Gaussian — the red conditional curve.
Conditional mean moves: \(\;\mathbb{E}[Y\mid X=x_0] = \rho\, x_0\) (the regression line).
Conditional variance shrinks: \(\;\mathrm{Var}[Y\mid X=x_0] = 1-\rho^2\) — observing \(X\) reduced our uncertainty about \(Y\).
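At \(\rho=0\) the slice is unchanged (\(X\) tells us nothing); as \(|\rho|\to 1\) the conditional variance goes to zero and the conditional collapses to a point at \(\rho\,x_0\), while the cloud itself collapses onto a line.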

This is the slide the rest of the unit runs on, so make conditioning visceral, not algebraic. Demo (~60 s): set ρ≈0.7, drag x — narrate “the red curve is what we’d believe about Y after measuring X = this value.” The dot is the conditional mean = the regression prediction; the curve’s width is the leftover uncertainty.
Two examinable facts to say out loud, both readable off the screen: conditioning shifts the mean by ρ·x₀ (this is literally linear regression) and shrinks the variance by the factor (1−ρ²) (this is “data reduces uncertainty”). Contrast with the Unit-2 covariance widget: that one showed the shape of the cloud; this one shows what learning from an observation does to it.
Forward-point hard: this exact operation (“Gaussian in, observe something, Gaussian out, narrower”) IS the Bayesian update (next block, prior→posterior) and IS a Gaussian process (Unit 12). Tell them they are watching the engine of the next two weeks. ρ→0 means an uninformative observation, the analogue of a flat likelihood: the posterior simply equals the prior.

Marginal and conditional Gaussians

A key property: marginals and conditionals of a joint Gaussian are also Gaussian.
Marginal: integrate out some variables; the result is Gaussian, with the matching sub-vector of \(\boldsymbol{\mu}\) and sub-matrix of \(\boldsymbol{\Sigma}\).
Conditional: condition on some variables — the result is Gaussian with updated \(\boldsymbol{\mu}\) and reduced \(\boldsymbol{\Sigma}\).
This closure property makes Gaussian models analytically tractable (Bishop 2006).
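
For reference, the standard partitioned-Gaussian formulas behind these statements. The block labels \(a\) and \(b\) are notation introduced here for illustration. Split \(\mathbf{x}\) into blocks \(\mathbf{x}_a, \mathbf{x}_b\) with

\[ \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb} \end{pmatrix} \]

Then

\[ p(\mathbf{x}_a) = \mathcal{N}\!\left(\boldsymbol{\mu}_a,\; \boldsymbol{\Sigma}_{aa}\right), \qquad p(\mathbf{x}_a \mid \mathbf{x}_b) = \mathcal{N}\!\left(\boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}(\mathbf{x}_b - \boldsymbol{\mu}_b),\; \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\boldsymbol{\Sigma}_{ba}\right) \]

In the standardised 2D case (\(\mu=0\), unit variances, correlation \(\rho\)) this reduces to conditional mean \(\rho\,x_0\) and conditional variance \(1-\rho^2\).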

Checkpoint: interpret the covariance matrix

Given \(\boldsymbol{\Sigma} = \begin{pmatrix} 4 & 3 \\ 3 & 9 \end{pmatrix}\):
- Feature 1 has variance 4, feature 2 has variance 9.
- Correlation coefficient: \(\rho = 3/\sqrt{4 \cdot 9} = 0.5\) — moderate positive correlation.
- The contour ellipse is tilted toward the upper-right.
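
The checkpoint numbers can be verified with a few lines. This sketch uses the closed-form eigen-decomposition of a symmetric 2×2 matrix; the function name `describeCovariance2x2` is hypothetical.

```javascript
// Correlation, eigenvalues, and major-axis angle of a symmetric 2x2 covariance
// matrix [[a, b], [b, c]], using the closed-form 2x2 eigen-decomposition.
function describeCovariance2x2(a, b, c) {
  const rho = b / Math.sqrt(a * c);
  const halfTrace = (a + c) / 2;
  const radius = Math.sqrt(((a - c) / 2) ** 2 + b * b);
  const majorAngleDeg = 0.5 * Math.atan2(2 * b, a - c) * 180 / Math.PI;
  return { rho, lambdaMajor: halfTrace + radius, lambdaMinor: halfTrace - radius, majorAngleDeg };
}

const checkpoint = describeCovariance2x2(4, 3, 9);
console.log(checkpoint);
// rho = 0.5; eigenvalues about 10.41 and 2.59;
// major axis about 65 degrees from the feature-1 axis (tilted toward the upper-right)
```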

Entropy of a distribution

For a continuous distribution \(p(x)\), the (differential) entropy is

\[ H(p) \;=\; -\int p(x)\, \log p(x)\, dx \;=\; -\mathbb{E}_p[\log p(X)] \]

(discrete analogue: \(H(p) = -\sum_x p(x) \log p(x)\)).

Intuition: expected surprise \(-\log p(X)\) — rare outcomes carry more information.
Larger \(H\) = more uncertainty / less concentration of mass.
1D Gaussian: \(H(\mathcal{N}(\mu,\sigma^2)) = \tfrac{1}{2}\log(2\pi e\, \sigma^2)\) — depends on \(\sigma\), not \(\mu\).
This formalizes the earlier claim: among distributions with given mean and variance, \(\mathcal{N}(\mu,\sigma^2)\) maximizes \(H\) (Bishop 2006).
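
A quick numerical illustration of the maximum-entropy claim, comparing three families at the same variance. The Laplace and uniform entropy formulas are standard closed forms; choosing them as comparison points is an assumption of this sketch.

```javascript
// Closed-form differential entropies (in nats) of three zero-mean families,
// each parameterized so that its variance equals sigma^2.
const entropyGaussian = sigma => 0.5 * Math.log(2 * Math.PI * Math.E * sigma * sigma);
const entropyLaplace = sigma => 1 + Math.log(2 * (sigma / Math.SQRT2)); // scale b = sigma / sqrt(2)
const entropyUniform = sigma => Math.log(sigma * Math.sqrt(12));        // width = sigma * sqrt(12)

const sigmaRef = 1;
console.log(entropyGaussian(sigmaRef).toFixed(4)); // 1.4189
console.log(entropyLaplace(sigmaRef).toFixed(4));  // 1.3466
console.log(entropyUniform(sigmaRef).toFixed(4));  // 1.2425
```

At equal variance the Gaussian has the largest entropy of the three, consistent with the maximum-entropy property.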

KL divergence: comparing two distributions

For distributions \(q\) and \(p\) on the same space:

\[ \mathrm{KL}(q \,\|\, p) \;=\; \mathbb{E}_q\!\left[\log \tfrac{q(x)}{p(x)}\right] \;=\; \int q(x)\, \log \tfrac{q(x)}{p(x)}\, dx \]

Three load-bearing properties:
1. \(\mathrm{KL}(q\|p) \ge 0\) (Gibbs inequality, via Jensen’s inequality on \(-\log\)).
2. \(\mathrm{KL}(q\|p) = 0\) iff \(q = p\) almost everywhere.
3. Asymmetric in general: \(\mathrm{KL}(q\|p) \ne \mathrm{KL}(p\|q)\).
Intuition: extra cost (in nats) of describing samples from \(q\) using a code optimized for \(p\) (Bishop 2006).
KL is therefore a directed dissimilarity — not a metric.

The three properties are the exam content; drill them. Non-negativity + (zero iff equal) is what licenses KL as a training objective (“drive it to zero = match the distributions”). Asymmetry is the one students get wrong on exams and in code.
Make asymmetry concrete: \(\mathrm{KL}(q\|p)\) with the expectation under q penalizes \(q\) putting mass where \(p\) has none (mode-seeking / zero-forcing); the reverse is mean-seeking. This is the reason variational inference and VAEs use \(\mathrm{KL}(q\|p)\) specifically — forward-pointer to Unit 11, don’t derive.
The coding interpretation (extra nats to encode q-samples with a p-optimal code) is worth one sentence — it grounds “divergence” as a real, operational cost, not an abstract metric. And stress: not a metric (no symmetry, no triangle inequality).
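
The three properties can be checked on a two-outcome example. The distributions `qDist` and `pDist` below are made up for illustration.

```javascript
// KL(q || p) in nats for two discrete distributions on the same outcomes.
// Terms with q[i] = 0 contribute 0 by the convention 0 * log 0 = 0.
function klDiscrete(q, p) {
  return q.reduce((sum, qi, i) => (qi > 0 ? sum + qi * Math.log(qi / p[i]) : sum), 0);
}

const qDist = [0.5, 0.5];
const pDist = [0.9, 0.1];

console.log(klDiscrete(qDist, pDist).toFixed(4)); // 0.5108
console.log(klDiscrete(pDist, qDist).toFixed(4)); // 0.3681
console.log(klDiscrete(qDist, qDist));            // 0
```

Both directions are non-negative, they differ from each other, and the divergence of a distribution from itself is zero.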

KL between two Gaussians (the VAE-relevant case)

For two 1D Gaussians, KL admits a closed form:

\[ \mathrm{KL}\!\left(\mathcal{N}(\mu_1,\sigma_1^2)\,\|\,\mathcal{N}(\mu_2,\sigma_2^2)\right) \;=\; \log\frac{\sigma_2}{\sigma_1} \;+\; \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} \;-\; \frac{1}{2} \]

The form used to regularize variational autoencoders — \(p = \mathcal{N}(\mathbf{0}, I)\) vs. \(q = \mathcal{N}(\boldsymbol{\mu},\, \mathrm{diag}(\sigma_1^2,\dots,\sigma_d^2))\) — is the per-dimension sum:

\[ \mathrm{KL}(q\,\|\,p) \;=\; \tfrac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right) \]

Sanity check: vanishes iff \(\mu_j = 0\) and \(\sigma_j = 1\) for all \(j\) — i.e., \(q\) already matches the standard-normal prior.
Forward pointer: Unit 11 will use exactly this expression as the regularizer in the VAE loss (Bishop 2006).
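
A sketch that checks the two formulas against each other: summing the 1D closed form over dimensions with \(\mu_2 = 0\), \(\sigma_2 = 1\) should reproduce the per-dimension expression. The example means and standard deviations are arbitrary.

```javascript
// Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) ) for 1D Gaussians (s = standard deviation).
function klGaussian1d(mu1, s1, mu2, s2) {
  return Math.log(s2 / s1) + (s1 * s1 + (mu1 - mu2) ** 2) / (2 * s2 * s2) - 0.5;
}

// KL( N(mu, diag(sigma^2)) || N(0, I) ) as the per-dimension sum.
function klToStandardNormal(mus, sigmas) {
  return 0.5 * mus.reduce((sum, mu, j) => {
    const v = sigmas[j] * sigmas[j];
    return sum + mu * mu + v - Math.log(v) - 1;
  }, 0);
}

const encoderMeans = [0.5, -1.0];
const encoderSigmas = [0.8, 1.5];
const klViaSum = klToStandardNormal(encoderMeans, encoderSigmas);
const klViaPairs = encoderMeans.reduce(
  (sum, mu, j) => sum + klGaussian1d(mu, encoderSigmas[j], 0, 1), 0);
console.log(klViaSum, klViaPairs);               // the two routes agree
console.log(klToStandardNormal([0, 0], [1, 1])); // 0 when q equals the prior
```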

The likelihood function

Given observed data \(\mathcal{D} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}\) (assumed i.i.d.):

\[ \mathcal{L}(\boldsymbol{\theta}) = p(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} p(\mathbf{x}_i \mid \boldsymbol{\theta}) \]

\(\mathcal{L}(\boldsymbol{\theta})\) is read as a function of the parameters \(\boldsymbol{\theta}\) with the data held fixed; it is not a probability distribution over \(\boldsymbol{\theta}\).
It measures how well \(\boldsymbol{\theta}\) explains the observed data.

Log-likelihood

Taking the log converts the product into a sum:

\[ \ell(\boldsymbol{\theta}) = \log \mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^{N} \log p(\mathbf{x}_i \mid \boldsymbol{\theta}) \]

Sums are numerically stable (a product of many small densities underflows to zero) and easier to differentiate.
Maximizing \(\ell(\boldsymbol{\theta})\) gives the same solution as maximizing \(\mathcal{L}(\boldsymbol{\theta})\).
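
The stability point is easy to demonstrate. This sketch assumes a standard normal model and 2000 identical observations at \(x = 1\), chosen only to keep the arithmetic transparent.

```javascript
// Standard normal density, used as the model p(x | theta) in this sketch.
const stdNormalPdf = x => Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI);

// 2000 identical observations at x = 1 keep the arithmetic easy to follow.
const observations = new Array(2000).fill(1.0);

const likelihoodProduct = observations.reduce((prod, x) => prod * stdNormalPdf(x), 1);
const logLikelihoodSum = observations.reduce((sum, x) => sum + Math.log(stdNormalPdf(x)), 0);

console.log(likelihoodProduct); // 0: the product underflows double precision
console.log(logLikelihoodSum);  // about -2837.9: the sum of logs stays finite
```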

MLE principle

\[ \hat{\boldsymbol{\theta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}) \]

Choose the parameters that make the observed data most probable under the model.
MLE is the most widely used estimation principle in statistics and machine learning.
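When a closed form exists, set \(\nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}) = 0\) and solve; otherwise maximize \(\ell\) numerically with the gradient-based optimizers from Units 1–6.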

Interactive: Maximum Likelihood Estimation

Interactive Fit
Try it yourself!

viewof mle_mu = Inputs.range([-4, 4], {value: 0, step: 0.1, label: "Guess Mean (μ)"})
viewof mle_sigma = Inputs.range([0.1, 3], {value: 1, step: 0.1, label: "Guess Std Dev (σ)"})

mle_fixed_data = [{x: -0.5}, {x: 0.2}, {x: 1.1}, {x: 1.5}, {x: 2.2}]

// Closed-form Gaussian MLE for the fixed data (the optimum the sliders should reach)
mle_true_mu = d3.mean(mle_fixed_data, d => d.x);
mle_true_var = d3.mean(mle_fixed_data, d => Math.pow(d.x - mle_true_mu, 2));
mle_true_sigma = Math.sqrt(mle_true_var);

// Log-likelihood for the current guess, accumulated in log space so that
// a far-off guess with small sigma cannot underflow to log(0) = -Infinity
mle_log_likelihood = {
  const variance = mle_sigma * mle_sigma;
  let ll = 0;
  for (const p of mle_fixed_data) {
    ll += -0.5 * Math.log(2 * Math.PI * variance) - Math.pow(p.x - mle_mu, 2) / (2 * variance);
  }
  return ll;
}

// Log-likelihood evaluated at the MLE: the largest value the gauge can show
mle_max_ll = {
  let ll = 0;
  for(let p of mle_fixed_data) {
    const variance = mle_true_var;
    const pdf = (1 / Math.sqrt(2 * Math.PI * variance)) * Math.exp(-Math.pow(p.x - mle_true_mu, 2) / (2 * variance));
    ll = ll + Math.log(pdf);
  }
  return ll;
}

mle_pdf_curve = d3.range(-5, 5.05, 0.05).map(x => {
  const variance = mle_sigma * mle_sigma;
  return {x: x, y: (1 / Math.sqrt(2 * Math.PI * variance)) * Math.exp(-Math.pow(x - mle_mu, 2) / (2 * variance))}
});

html`
<div style="margin-bottom: 20px;">
  <strong>Current Log-Likelihood: <span style="color: ${mle_log_likelihood > mle_max_ll - 0.5 ? '#a8ff9e' : '#ff9e9e'}">${mle_log_likelihood.toFixed(2)}</span></strong><br>
  <progress value="${Math.max(0, mle_log_likelihood + 30)}" max="${mle_max_ll + 30}" style="width: 100%; height: 20px; accent-color: ${mle_log_likelihood > mle_max_ll - 0.5 ? '#a8ff9e' : '#ff9e9e'};"></progress>
</div>
`

Plot.plot({
  width: 800,
  height: 350,
  x: {domain: [-5, 5], label: "Data Value (x)"},
  y: {domain: [0, 1.5], label: "Likelihood p(x|μ,σ)"},
  marks: [
    Plot.ruleY([0]),
    // The guessed PDF
    Plot.areaY(mle_pdf_curve, {x: "x", y: "y", fill: "steelblue", fillOpacity: 0.3}),
    Plot.line(mle_pdf_curve, {x: "x", y: "y", stroke: "white", strokeWidth: 2}),
    
    // The data points projected onto the PDF
    Plot.dot(mle_fixed_data, {
      x: "x", 
      y: d => {
        const variance = mle_sigma * mle_sigma;
        return (1 / Math.sqrt(2 * Math.PI * variance)) * Math.exp(-Math.pow(d.x - mle_mu, 2) / (2 * variance));
      }, 
      r: 6, stroke: "#ff7b7b", fill: "none", strokeWidth: 2
    }),
    
    // Droplines to axis
    Plot.ruleX(mle_fixed_data, {
      x: "x", 
      y1: 0, 
      y2: d => {
        const variance = mle_sigma * mle_sigma;
        return (1 / Math.sqrt(2 * Math.PI * variance)) * Math.exp(-Math.pow(d.x - mle_mu, 2) / (2 * variance));
      },
      stroke: "#ff7b7b", strokeDasharray: "2,2"
    }),

    // Data points on axis
    Plot.dot(mle_fixed_data, {x: "x", y: 0, r: 6, fill: "#ff7b7b"})
  ]
})

Adjust the Mean (\(\mu\)) and Std Dev (\(\sigma\)) to trap the red data points under the highest part of the blue curve.
Watch the Log-Likelihood gauge increase.
The red circles show the individual likelihoods \(p(x_i|\mu,\sigma)\). Their product is the likelihood; the log-likelihood is the sum of their logarithms.
Maximizing the log-likelihood means pushing the curve up directly over the data points without spreading it too thin!

MLE for Gaussian mean

Gaussian log-likelihood (with known \(\sigma^2\)):

\[ \ell(\mu) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2 \]

Differentiate w.r.t. \(\mu\), set to zero:

\[ \hat{\mu}_{\text{MLE}} = \frac{1}{N}\sum_{i=1}^{N} x_i = \bar{x} \]

The MLE for the mean is the sample mean — intuitive and unbiased.
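
The differentiation step, written out as a sketch:

\[ \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i - \mu) = 0 \;\;\Longrightarrow\;\; \sum_{i=1}^{N} x_i = N\mu \;\;\Longrightarrow\;\; \hat{\mu}_{\text{MLE}} = \bar{x} \]

The second derivative \(\partial^2 \ell / \partial \mu^2 = -N/\sigma^2\) is negative, so this stationary point is a maximum.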

MLE for Gaussian variance

Differentiate w.r.t. \(\sigma^2\) and set to zero:

\[ \hat{\sigma}^2_{\text{MLE}} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu})^2 \]

This is the biased sample variance (divides by \(N\), not \(N-1\)).
The bias vanishes as \(N \to \infty\) — MLE is consistent.
For small \(N\), the unbiased estimator (\(N-1\)) is often preferred.
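
A small sketch of the two variance estimators, assuming the five observations from the interactive MLE fit as the sample.

```javascript
// Five observations, the same values as in the interactive MLE fit.
const mleSample = [-0.5, 0.2, 1.1, 1.5, 2.2];

const sampleSize = mleSample.length;
const muHat = mleSample.reduce((s, x) => s + x, 0) / sampleSize;
const sumSquares = mleSample.reduce((s, x) => s + (x - muHat) ** 2, 0);
const varMle = sumSquares / sampleSize;            // divides by N (biased)
const varUnbiased = sumSquares / (sampleSize - 1); // divides by N - 1

console.log(muHat.toFixed(3), varMle.toFixed(3), varUnbiased.toFixed(3)); // 0.900 0.908 1.135
```

With \(N = 5\) the two estimates differ by the factor \(N/(N-1) = 1.25\); the gap closes as \(N\) grows.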

MLE and MSE: the connection

For a regression model \(y = f_{\boldsymbol{\theta}}(\mathbf{x}) + \epsilon\) with \(\epsilon \sim \mathcal{N}(0, \sigma^2)\):

\[ \ell(\boldsymbol{\theta}) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - f_{\boldsymbol{\theta}}(\mathbf{x}_i))^2 \]

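For any fixed \(\sigma^2\), the first term is constant in \(\boldsymbol{\theta}\), so maximizing \(\ell(\boldsymbol{\theta})\) w.r.t. \(\boldsymbol{\theta}\) is equivalent to minimizing the sum of squared errors, and hence the MSE.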
This provides the probabilistic justification for using MSE as a loss function.

This is the intellectual climax of the unit’s first half — pause and make it land. The MSE we minimized for six units was never arbitrary: it is exactly the Gaussian negative log-likelihood with the constants dropped. “Least squares = assuming Gaussian noise.” Write the cancellation on the board so they see the squared-error term fall out of the exponent.
Immediate consequence to state: if your noise is not Gaussian (heavy-tailed, skewed), MSE is the wrong loss — this is the bridge to the robustness/Student-t slides and to why MAE or Huber exist. The probabilistic view tells you which loss to pick, not just how to minimize it.
Forward link: same logic gives cross-entropy from a Bernoulli/categorical likelihood — mention it as a one-liner so classification losses also feel derived, not decreed.
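
The equivalence can also be checked numerically. This sketch assumes a one-parameter model \(f_\theta(x) = \theta x\), made-up data near \(y = 2x\), and a known noise level; a coarse grid search stands in for a real optimizer.

```javascript
// Toy data roughly following y = 2x; the model is f_theta(x) = theta * x.
const xsReg = [0, 1, 2, 3, 4];
const ysReg = [0.1, 1.9, 4.2, 5.8, 8.1];
const noiseSigma = 0.5; // assumed known noise standard deviation

const mse = theta =>
  xsReg.reduce((s, x, i) => s + (ysReg[i] - theta * x) ** 2, 0) / xsReg.length;

// Gaussian negative log-likelihood: N/2 * log(2 pi sigma^2) + N * MSE / (2 sigma^2).
const gaussianNll = theta => {
  const n = xsReg.length;
  return 0.5 * n * Math.log(2 * Math.PI * noiseSigma ** 2) + (n * mse(theta)) / (2 * noiseSigma ** 2);
};

// Grid search over theta: both criteria pick the same grid point.
const thetaGrid = Array.from({ length: 401 }, (_, k) => k * 0.01); // 0.00 ... 4.00
const argminOver = f => thetaGrid.reduce((best, t) => (f(t) < f(best) ? t : best), thetaGrid[0]);
console.log(argminOver(mse), argminOver(gaussianNll));
```

Because the negative log-likelihood is an increasing affine function of the MSE, changing `noiseSigma` rescales the objective but does not move the minimizer.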

MLE for multivariate Gaussian

For \(\mathbf{x}_i \in \mathbb{R}^d\):

\[ \hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i, \quad \hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^\top \]

Direct extension of the 1D case to vectors and matrices (Murphy 2012).
Requires \(N > d\) (with the points not confined to a lower-dimensional affine subspace) for \(\hat{\boldsymbol{\Sigma}}\) to be invertible.
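
A minimal 2D sketch of the invertibility condition, using made-up points: with \(N = d = 2\) the MLE covariance has zero determinant, while three non-collinear points give an invertible estimate.

```javascript
// MLE covariance (divide by N) for 2D points, returned as [[sxx, sxy], [sxy, syy]].
function covarianceMle2d(points) {
  const n = points.length;
  const mx = points.reduce((s, p) => s + p[0], 0) / n;
  const my = points.reduce((s, p) => s + p[1], 0) / n;
  let sxx = 0, sxy = 0, syy = 0;
  for (const [x, y] of points) {
    sxx += (x - mx) ** 2 / n;
    sxy += (x - mx) * (y - my) / n;
    syy += (y - my) ** 2 / n;
  }
  return [[sxx, sxy], [sxy, syy]];
}

const det2 = m => m[0][0] * m[1][1] - m[0][1] * m[1][0];

const twoPoints = [[0, 0], [2, 1]];           // N = d = 2
const threePoints = [[0, 0], [2, 1], [1, 3]]; // N = 3 > d, not collinear
console.log(det2(covarianceMle2d(twoPoints)));   // 0: singular, no inverse
console.log(det2(covarianceMle2d(threePoints))); // positive: invertible
```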

MLE: properties and limitations

Consistency: \(\hat{\theta}_{\text{MLE}} \to \theta_{\text{true}}\) as \(N \to \infty\).
Efficiency: asymptotically attains the Cramér-Rao lower bound, the smallest variance any unbiased estimator can have (under regularity conditions).
Limitation: can overfit with small \(N\) — MLE has no built-in regularization.
MLE uses no prior: it coincides with the MAP estimate under a flat prior, i.e. all parameter values are treated as equally plausible before seeing data.

Robustness: the outlier problem

The Gaussian has light tails — extreme values are extremely unlikely under the model.
When outliers are present, MLE distorts \(\hat{\mu}\) and inflates \(\hat{\sigma}^2\) to accommodate them.
A single outlier at distance \(\Delta\) from the mean shifts \(\hat{\mu}\) by roughly \(\Delta/N\), so the shift is unbounded as the outlier moves further away.
Need: a distribution with heavier tails that accommodates outliers without distortion.

Student’s t-distribution for robust estimation

The Student’s t-distribution has a parameter \(\nu\) (degrees of freedom) controlling tail heaviness.
\(\nu \to \infty\): converges to Gaussian. \(\nu = 1\): Cauchy distribution (very heavy tails).
MLE with Student’s t automatically downweights outliers.
Practical recommendation: use \(\nu \approx 4{-}10\) for moderate robustness (Murphy 2012).

Gaussian, Student-\(t\), and Laplace pdfs (left) and log-pdfs (right). The Student-\(t\) has heavier tails than the Gaussian (Murphy 2012).

Fitting Gaussian, Student, and Laplace distributions without (a) and with (b) outliers. The Gaussian fit is strongly affected by the outliers (Murphy 2012).
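
The downweighting can be sketched for the location parameter alone. This example assumes fixed \(\nu = 4\) and fixed unit scale (a full fit would also estimate the scale), uses the iteratively reweighted averaging update for the Student-t location, and injects a made-up outlier at 25.

```javascript
// Location estimate under a Student-t likelihood with fixed degrees of freedom nu
// and fixed scale, via iteratively reweighted averaging (the EM update for the location).
function studentTLocation(xs, nu = 4, scale = 1, iterations = 50) {
  const sorted = [...xs].sort((a, b) => a - b);
  let mu = sorted[Math.floor(sorted.length / 2)]; // start from a median-like value
  for (let it = 0; it < iterations; it++) {
    // Points far from mu (in units of scale) receive small weights.
    const weights = xs.map(x => (nu + 1) / (nu + ((x - mu) / scale) ** 2));
    const wSum = weights.reduce((s, w) => s + w, 0);
    mu = xs.reduce((s, x, i) => s + weights[i] * x, 0) / wSum;
  }
  return mu;
}

const withOutlier = [-0.5, 0.2, 1.1, 1.5, 2.2, 25]; // last value is an injected outlier
const gaussianMean = withOutlier.reduce((s, x) => s + x, 0) / withOutlier.length;
const robustLocation = studentTLocation(withOutlier);
console.log(gaussianMean.toFixed(2));   // 4.92: dragged toward the outlier
console.log(robustLocation.toFixed(2)); // stays near the bulk of the data
```

The weight \((\nu+1)/(\nu + r^2)\) shrinks as the scaled residual \(r\) grows, which is the mechanism behind the automatic downweighting.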

Bayes’ theorem — statement

\[ p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{p(\mathcal{D})} \]

Posterior \(p(\boldsymbol{\theta} \mid \mathcal{D})\): what we believe about \(\boldsymbol{\theta}\) after seeing data.
Likelihood \(p(\mathcal{D} \mid \boldsymbol{\theta})\): how probable the data is under each \(\boldsymbol{\theta}\).
Prior \(p(\boldsymbol{\theta})\): what we believed before seeing data.
Evidence \(p(\mathcal{D})\): normalizing constant.

Open the Bayesian block with the reframing that resolves the MLE limitation left on the board: MLE asked “which single θ best explains the data?”; Bayes asks “what is my full belief distribution over θ after seeing the data?” Point estimate → distribution. That shift is the whole block.
Read the four pieces as a sentence, not symbols: “belief after = (how well θ explains data) × (belief before) ÷ (normalizer).” Engineers retain the sentence, not the fraction.
Name the controversy honestly: the prior is a modeling choice and people object to its subjectivity. Pre-empt it — “the prior is an assumption you state explicitly and can defend, unlike the hidden assumptions MLE already makes (e.g. Gaussian noise).” This disarms the standard objection before a student raises it.

Components of Bayes’ theorem

The prior encodes domain knowledge or assumptions (e.g., “weights should be small”).
The likelihood is the same function used in MLE — it connects data to parameters \(\boldsymbol{\theta}\).
The posterior combines both: it is a compromise between prior knowledge and data evidence.
More data → posterior dominated by likelihood. Less data → posterior dominated by prior.

Beta distribution \(\text{Beta}(\mu|a,b)\) for different hyperparameters. Flat (\(a=b=1\)) = non-informative prior; peaked = strong prior belief (Bishop 2006).

The evidence (marginal likelihood)

\[ p(\mathcal{D}) = \int p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta} \]

Integrates the likelihood over all possible parameter values, weighted by the prior.
Ensures the posterior integrates to 1.
Often intractable for complex models — motivates approximation methods (MCMC, variational inference).
Also used for model comparison: models with higher evidence explain the data better.
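
A small sketch of the evidence integral for a case where it is tractable. The coin-flip model (Bernoulli likelihood for one specific sequence, Beta prior) and all function names are illustrative assumptions. The code compares the closed form \(B(a+k,\, b+n-k)/B(a,b)\) with a brute-force midpoint-rule integration of likelihood times prior.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def evidence_closed_form(k, n, a=1.0, b=1.0):
    """p(D) for k successes in n Bernoulli trials under a Beta(a, b) prior."""
    return math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))


def evidence_numeric(k, n, a=1.0, b=1.0, grid=100_000):
    """The same integral, approximated by the midpoint rule over theta in (0, 1)."""
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        theta = (i + 0.5) * h
        likelihood = theta ** k * (1.0 - theta) ** (n - k)
        log_prior = ((a - 1.0) * math.log(theta)
                     + (b - 1.0) * math.log(1.0 - theta) - log_beta(a, b))
        total += likelihood * math.exp(log_prior) * h
    return total


k, n = 7, 10
ev_flat = evidence_closed_form(k, n, a=1.0, b=1.0)    # flat prior
ev_num = evidence_numeric(k, n, a=1.0, b=1.0)         # numerical check
ev_fair = evidence_closed_form(k, n, a=50.0, b=50.0)  # prior concentrated near 0.5
print(ev_flat, ev_num, ev_fair)
```

Comparing `ev_flat` with `ev_fair` is a toy instance of evidence-based model comparison: same data, same likelihood, two different priors. For models without a conjugate structure only the numerical route is available, and it scales poorly with the number of parameters.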

Bayesian inference for Gaussian mean (known variance)

Prior: \(\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)\).

Likelihood: \(x_i | \mu \sim \mathcal{N}(\mu, \sigma^2)\) (known \(\sigma^2\)).

Posterior: \(\mu | \mathcal{D} \sim \mathcal{N}(\mu_N, \sigma_N^2)\) where:

\[ \mu_N = \frac{\sigma^2 \mu_0 + N \sigma_0^2 \bar{x}}{\sigma^2 + N \sigma_0^2}, \quad \sigma_N^2 = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + N \sigma_0^2} \]

This is a conjugate pair: Gaussian prior + Gaussian likelihood = Gaussian posterior (Bishop 2006).

One step of Bayesian update: Beta prior (left) × Bernoulli likelihood (centre) = Beta posterior (right). Conjugate structure yields the same family (Bishop 2006).

Don’t grind the algebra — interpret it. Rewrite \(\mu_N\) on the board as a precision-weighted average: posterior mean = (prior precision · prior mean + data precision · x̄) / total precision, where precision = 1/variance. That one re-expression makes every Bayesian-Gaussian result intuitive and is the form they should memorize.
Conjugacy is the only reason this is closed-form. Define it crisply: prior and posterior in the same family. Materials-relevant honest caveat: real models are rarely conjugate, hence the approximation methods flagged earlier. Conjugacy is the teaching sandbox, not the deployment reality.
Sanity checks to say aloud: σ₀²→∞ (no prior) recovers x̄ = the MLE; N→∞ also recovers the MLE. “Bayes contains MLE as a limiting case” — this reassures, not threatens, the frequentist-trained student.
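
Written out, the precision-weighted form described above is an algebraic rearrangement of the posterior formulas on the slide (precision = 1/variance):

\[ \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}, \qquad \mu_N = \sigma_N^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{N\,\bar{x}}{\sigma^2} \right) \]

Precisions add, and each mean is weighted by its own precision. Letting \(\sigma_0^2 \to \infty\) removes the prior term and leaves \(\mu_N = \bar{x}\).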

Posterior update: visual intuition

Before data (\(N=0\)): posterior = prior (wide, uncertain).
After a few points (\(N=2\)): posterior narrows, shifts toward sample mean.
More data (\(N=10\)): posterior is very narrow, centered near \(\bar{x}\).
As \(N \to \infty\): posterior concentrates at \(\hat{\mu}_{\text{MLE}}\) — the prior washes out.

Bayesian inference for a Gaussian mean with known variance. Each curve is the posterior after observing \(N\) data points. The distribution narrows as \(N\) grows (Bishop 2006).

Interactive: Bayesian Posterior Update

Interactive Update
Intuition

viewof bayes_prior_mu = Inputs.range([-5, 5], {value: 0, step: 0.1, label: "Prior Mean (μ₀)"})
viewof bayes_prior_var = Inputs.range([0.1, 10], {value: 3, step: 0.1, label: "Prior Var (σ₀²)"})
viewof bayes_data_mu = Inputs.range([-5, 5], {value: 2.5, step: 0.1, label: "Data Mean (Sample x̄)"})
viewof bayes_data_var = Inputs.range([0.1, 10], {value: 1, step: 0.1, label: "Data Noise (σ²)"})
viewof bayes_N = Inputs.range([0, 50], {value: 3, step: 1, label: "Samples Observed (N)"})

bayes_post_var = 1.0 / ( (1.0/bayes_prior_var) + (bayes_N/bayes_data_var) )
bayes_post_mu = bayes_post_var * ( (bayes_prior_mu/bayes_prior_var) + (bayes_N * bayes_data_mu / bayes_data_var) )

bayes_x_vals = d3.range(-8, 8.05, 0.05);

bayes_curves = {
  const data = [];
  for(let x of bayes_x_vals) {
    // Prior
    const p_prior = (1 / Math.sqrt(2 * Math.PI * bayes_prior_var)) * Math.exp(-Math.pow(x - bayes_prior_mu, 2) / (2 * bayes_prior_var));
    
    // Likelihood (conceptually, the likelihood of the mean parameter given the data)
    // Scaled for visualization so it fits on same plot
    const likelihood_var = bayes_data_var / (bayes_N > 0 ? bayes_N : 0.0001);
    const p_like_unscaled = (1 / Math.sqrt(2 * Math.PI * likelihood_var)) * Math.exp(-Math.pow(x - bayes_data_mu, 2) / (2 * likelihood_var));
    // Scale likelihood so its peak height is 0.5 (the Gaussian normaliser cancels), for visual clarity against the prior
    const p_like = bayes_N === 0 ? 0 : p_like_unscaled * Math.sqrt(2 * Math.PI * likelihood_var) * 0.5;

    // Posterior
    const p_post = (1 / Math.sqrt(2 * Math.PI * bayes_post_var)) * Math.exp(-Math.pow(x - bayes_post_mu, 2) / (2 * bayes_post_var));
    
    data.push({x: x, val: p_prior, type: "Prior P(θ)"});
    if (bayes_N > 0) data.push({x: x, val: p_like, type: "Likelihood (scaled)"});
    data.push({x: x, val: p_post, type: "Posterior P(θ|D)"});
  }
  return data;
}

Plot.plot({
  width: 800,
  height: 400,
  x: {domain: [-8, 8], label: "Mean Parameter (μ)"},
  y: {domain: [0, 1.2], label: "Density"},
  color: {
    domain: ["Prior P(θ)", "Likelihood (scaled)", "Posterior P(θ|D)"],
    range: ["#888888", "#5ca7ff", "#ff4d4d"]
  },
  marks: [
    Plot.line(bayes_curves, {x: "x", y: "val", stroke: "type", strokeWidth: 3}),
    Plot.areaY(bayes_curves, {x: "x", y: "val", fill: "type", fillOpacity: 0.15}),
    
    // Dashed rules mark the prior mean and the data mean; the solid rule marks the posterior mean (the MAP estimate)
    Plot.ruleX([bayes_prior_mu], {stroke: "#888888", strokeDasharray: "4,4"}),
    bayes_N > 0 ? Plot.ruleX([bayes_data_mu], {stroke: "#5ca7ff", strokeDasharray: "4,4"}) : null,
    Plot.ruleX([bayes_post_mu], {stroke: "#ff4d4d", strokeWidth: 2})
  ]
})

\(N = 0\): The Posterior (red) perfectly matches the Prior (gray).
Small \(N\): The Posterior is a compromise between the Prior and the Data Likelihood (blue).
Large \(N\): The Likelihood narrows dramatically, pulling the Posterior entirely towards the Data Mean (\(2.5\)). The Prior is “washed out”.
Strong Prior (small \(\sigma_0^2\)): The Prior resists the data pull much longer. Try changing Prior Var to 0.1 and notice how much data it takes to shift the belief!

Bayesian vs frequentist comparison

Aspect	Frequentist	Bayesian
Parameters	Fixed, unknown	Random variables
Inference	Point estimate + CI	Full posterior distribution
Prior knowledge	Not incorporated	Formally included
Uncertainty	Sampling variability	Posterior width
Interpretation	Long-run frequency	Degree of belief

Frequentist: Objective, based on repeated trials.
Bayesian: Subjective, based on evidence update.
Choice: Often depends on data availability and prior confidence.

Credible interval vs confidence interval

95% Bayesian credible interval: “Given the data, there is a 95% probability that \(\theta\) lies in this interval.”
95% frequentist confidence interval: “If we repeated the experiment many times, 95% of such intervals would contain the true \(\theta\).”
The Bayesian interpretation is often more natural for engineering decisions.

MAP estimation

Maximum A Posteriori: find the mode of the posterior:

\[ \hat{\boldsymbol{\theta}}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} \, p(\boldsymbol{\theta} \mid \mathcal{D}) = \arg\max_{\boldsymbol{\theta}} \left[\log p(\mathcal{D} \mid \boldsymbol{\theta}) + \log p(\boldsymbol{\theta})\right] \]

MAP is a point estimate — it summarizes the posterior by its peak.
MAP = MLE when the prior is uniform (non-informative).

MAP closes the regularization loop from Unit 3

Unit 3 asserted: Gaussian prior → Ridge, Laplace prior → Lasso. Here we derive why.
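Plug a Gaussian likelihood and a zero-mean Gaussian prior \(\boldsymbol{\theta} \sim \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})\) into \(\hat{\boldsymbol{\theta}}_{\text{MAP}} = \arg\max[\log p(\mathcal{D}\mid\boldsymbol{\theta}) + \log p(\boldsymbol{\theta})]\). Flipping the sign, MAP minimizes (up to additive constants):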

\[ \underbrace{\tfrac{1}{2\sigma^2}\sum_i (y_i - f_{\boldsymbol{\theta}}(\mathbf{x}_i))^2}_{\text{negative log-likelihood}} \;+\; \underbrace{\tfrac{1}{2\tau^2}\|\boldsymbol{\theta}\|_2^2}_{\text{negative log-prior}} \]

So the penalty strength is not a free knob: \(\lambda = \sigma^2/\tau^2\) = (noise variance)/(prior variance).
MAP keeps only the mode; the posterior also carries the width \(\sigma_N^2\) — the uncertainty, which is the whole reason we went Bayesian. Caveat: mode ≠ mean for skewed posteriors, and the mode is not reparametrisation-invariant.
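
A numerical check of \(\lambda = \sigma^2/\tau^2\), as a sketch under simplifying assumptions: a single weight, no intercept, and made-up data. The brute-force minimizer of the negative log-posterior should land on the closed-form ridge solution \(\theta = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)\).

```python
def neg_log_posterior(theta, xs, ys, sigma2, tau2):
    """Negative log-posterior (up to constants) for y = theta * x + Gaussian noise."""
    nll = sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / (2.0 * sigma2)
    neg_log_prior = theta ** 2 / (2.0 * tau2)
    return nll + neg_log_prior


def ridge_1d(xs, ys, lam):
    """Closed-form ridge solution for a single weight without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]
sigma2, tau2 = 0.25, 1.0   # noise variance, prior variance
lam = sigma2 / tau2        # implied ridge penalty

theta_ridge = ridge_1d(xs, ys, lam)
theta_ols = ridge_1d(xs, ys, 0.0)

# Brute-force MAP: minimise the negative log-posterior on a fine grid over [0, 4].
grid = [i / 10000.0 for i in range(40001)]
theta_map = min(grid, key=lambda t: neg_log_posterior(t, xs, ys, sigma2, tau2))

print(f"OLS {theta_ols:.4f}  ridge {theta_ridge:.4f}  MAP (grid) {theta_map:.4f}")
```

Shrinking `tau2` (a tighter prior) raises `lam` and pulls the estimate further toward zero.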

The intellectual payoff, but framed as the debt being repaid: Unit 3 told them “ridge = Gaussian prior” as a stated fact and explicitly pointed here for the derivation. Open by saying “remember the IOU from Unit 3 — we cash it now.” Do the cancellation on the board: the squared-error term is the Gaussian NLL exponent, the L2 penalty is the Gaussian-prior exponent, add the logs, and λ falls out as a ratio of variances, not a hyperparameter you tune blind.
The quantitative punch is \(\lambda = \sigma^2/\tau^2\): large λ ⇔ tight prior (\(\tau^2\) small) ⇔ “I strongly believe weights are near zero.” This is the same knob they dragged on the Bayesian-update interactive — close that loop explicitly rather than re-listing the Ridge/Lasso table (Unit 3 owns that table).
The deeper Unit-7-only point: MAP discards the posterior width. Frequentist ridge gives a number; the Bayesian view gives that same number plus an error bar. That gap is exactly what the predictive-distribution and conformal slides fill. Forward-point, don’t derive.

When to be Bayesian vs frequentist

Small data, safety-critical: Bayesian — uncertainty quantification is essential.
Large data, fast iteration: MLE/frequentist — posterior approximates MLE anyway.
Model comparison: Bayesian evidence is a principled selection criterion.
Engineering practice: often a pragmatic mix — MLE for training, Bayesian for uncertainty.

Predictive distribution

Instead of predicting with a point estimate, integrate over parameter uncertainty:

\[ p(\mathbf{x}_{\text{new}} \mid \mathcal{D}) = \int p(\mathbf{x}_{\text{new}} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid \mathcal{D}) \, d\boldsymbol{\theta} \]

The predictive distribution is wider than the distribution under a point estimate.
It honestly reflects both data noise (aleatory) and parameter uncertainty (epistemic).
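
For the Gaussian-mean model used earlier in this unit, the integral has a closed form: the predictive distribution is \(\mathcal{N}(\mu_N,\; \sigma^2 + \sigma_N^2)\). The sketch below, with illustrative numbers and names, checks this by Monte Carlo: draw \(\mu\) from the posterior, then draw \(x_{\text{new}}\) given \(\mu\).

```python
import random


def posterior_gaussian_mean(mu0, var0, var_noise, xbar, n):
    """Posterior N(mu_n, var_n) for a Gaussian mean with known noise variance."""
    var_n = 1.0 / (1.0 / var0 + n / var_noise)
    mu_n = var_n * (mu0 / var0 + n * xbar / var_noise)
    return mu_n, var_n


var_noise = 1.0
mu_n, var_n = posterior_gaussian_mean(mu0=0.0, var0=4.0, var_noise=var_noise, xbar=2.0, n=3)

plug_in_var = var_noise             # point estimate: noise only
predictive_var = var_noise + var_n  # noise plus parameter uncertainty

# Monte Carlo version of the integral over the posterior.
rng = random.Random(0)
draws = []
for _ in range(50_000):
    mu = rng.gauss(mu_n, var_n ** 0.5)
    draws.append(rng.gauss(mu, var_noise ** 0.5))
mc_mean = sum(draws) / len(draws)
mc_var = sum((d - mc_mean) ** 2 for d in draws) / len(draws)

print(f"plug-in var {plug_in_var:.3f}  predictive var {predictive_var:.3f}  MC var {mc_var:.3f}")
```

The gap between `plug_in_var` and `predictive_var` is exactly \(\sigma_N^2\), so it is largest at small \(N\) and vanishes as data accumulates.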

The single most under-appreciated slide for engineers: a point prediction plus a fixed σ under-reports uncertainty because it ignores that θ itself is uncertain. The integral adds parameter (epistemic) uncertainty on top of noise (aleatory), so the predictive distribution is correctly wider. Tie the integral’s two uncertainty sources back to the very first taxonomy slide; this is the unit coming full circle.
Practical translation: “plug-in” intervals (predict with θ̂, report ±σ) are systematically overconfident, especially at small N or when extrapolating. This is the mechanism behind confident-but-wrong extrapolation in materials property models.
This also exposes the gap conformal will fill: the predictive integral is usually intractable and still assumes the model is right. Foreshadow: “next we get honest intervals without trusting the model.”

Checkpoint: update a prior

Setup: Prior \(\mu_0 = 0\), \(\sigma_0^2 = 10\). Known noise \(\sigma^2 = 1\). Five observations with \(\bar{x} = 3.2\).
Compute \(\mu_N\) and \(\sigma_N^2\).
\(\sigma_N^2 = \frac{1 \cdot 10}{1 + 5 \cdot 10} = \frac{10}{51} \approx 0.196\).
\(\mu_N = \frac{1 \cdot 0 + 5 \cdot 10 \cdot 3.2}{1 + 50} = \frac{160}{51} \approx 3.14\).
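
The checkpoint can be verified in a few lines. This sketch uses only the standard library and also reports the central 95% credible interval of the resulting Gaussian posterior; the variable names are illustrative.

```python
from statistics import NormalDist

mu0, var0 = 0.0, 10.0  # prior mean and prior variance
var_noise = 1.0        # known noise variance
n, xbar = 5, 3.2       # number of observations and their sample mean

var_n = (var_noise * var0) / (var_noise + n * var0)
mu_n = (var_noise * mu0 + n * var0 * xbar) / (var_noise + n * var0)

# Central 95% credible interval of the Gaussian posterior N(mu_n, var_n).
posterior = NormalDist(mu=mu_n, sigma=var_n ** 0.5)
ci_low, ci_high = posterior.inv_cdf(0.025), posterior.inv_cdf(0.975)

print(f"mu_N = {mu_n:.3f}, var_N = {var_n:.3f}")
print(f"95% credible interval: [{ci_low:.2f}, {ci_high:.2f}]")
```

With a prior this wide, \(\mu_N\) sits just below \(\bar{x} = 3.2\): five observations already dominate.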

Stochastic enrichment of input data

Add Gaussian noise to training inputs to simulate measurement uncertainty:
- \(\tilde{x}_i = x_i + \epsilon\), \(\epsilon \sim \mathcal{N}(0, \sigma_{\text{noise}}^2)\).
This augments the training set and makes the model robust to input perturbations.
Especially effective when the noise level matches real deployment conditions (Neuer et al. 2024).
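
A minimal sketch of this augmentation in plain Python. The function name, the number of noisy copies, and the toy data are assumptions for illustration. Only the inputs are perturbed; the targets are copied unchanged.

```python
import random


def augment_with_noise(xs, ys, sigma_noise, copies=5, seed=0):
    """Return the original pairs plus `copies` noisy replicas of each input.

    Targets are left unchanged: only the inputs are perturbed.
    """
    rng = random.Random(seed)
    aug_x, aug_y = list(xs), list(ys)
    for _ in range(copies):
        for x, y in zip(xs, ys):
            aug_x.append(x + rng.gauss(0.0, sigma_noise))
            aug_y.append(y)
    return aug_x, aug_y


x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [0.1, 0.9, 2.1, 2.9]
x_aug, y_aug = augment_with_noise(x_train, y_train, sigma_noise=0.05, copies=5)
print(len(x_train), "->", len(x_aug), "training pairs")
```

In practice `sigma_noise` would be set from the known measurement uncertainty of the sensor or instrument, in line with the point about matching deployment conditions.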

Mixture-density networks

Standard networks predict a single value \(\hat{y}\) — they cannot express multi-modal uncertainty.
A mixture-density network predicts the parameters of a Gaussian mixture:

\[ p(y|x) = \sum_{k=1}^{K} \pi_k(x) \, \mathcal{N}(y | \mu_k(x), \sigma_k^2(x)) \]

Mixing coefficients \(\pi_k\), means \(\mu_k\), and variances \(\sigma_k^2\) are all functions of input \(x\) (Neuer et al. 2024).
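
To see why a single predicted value fails on bimodal targets, the sketch below evaluates the mixture density for one input. The mixture parameters are hypothetical network outputs, invented for illustration; no network is trained here.

```python
import math


def gaussian_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(y, pis, mus, sigmas):
    """p(y | x) for one input x, given the mixture parameters emitted for that x."""
    return sum(p * gaussian_pdf(y, m, s) for p, m, s in zip(pis, mus, sigmas))


def mixture_mean(pis, mus):
    """Mean of the mixture: what a squared-error point regressor would target."""
    return sum(p * m for p, m in zip(pis, mus))


# Hypothetical network output for one input: two equally likely outcomes.
pis, mus, sigmas = [0.5, 0.5], [200.0, 400.0], [10.0, 10.0]

mean_prediction = mixture_mean(pis, mus)
density_at_mean = mixture_pdf(mean_prediction, pis, mus, sigmas)
density_at_mode = mixture_pdf(mus[0], pis, mus, sigmas)
nll_at_mode = -math.log(density_at_mode)  # per-sample negative log-likelihood

print(f"mean prediction: {mean_prediction}")
print(f"density at mean: {density_at_mean:.3e}, density at a mode: {density_at_mode:.3e}")
```

The mean lands between the two modes, where the density is essentially zero: the averaged prediction is a value that almost never occurs.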

The motivating failure: a standard regressor outputs one number even when the physics is genuinely bimodal (e.g., two stable phases, two processing outcomes for the same input). Averaging two modes gives a prediction in the forbidden middle — a confidently wrong answer. MDNs fix this by predicting a distribution, not a point.
Connect to everything prior: the network now outputs the parameters (π_k, μ_k, σ_k) of a Gaussian mixture — it’s the maximum-entropy/Gaussian machinery from earlier, made input-dependent and multi-modal. It’s also the supervised cousin of the GMM clustering from Unit 5 and a stepping stone to the VAE decoder (Unit 11).
Forward pointer: this is the preview of mixture-density / heteroscedastic outputs that Unit 12 develops for uncertainty. One sentence; don’t derive the (numerically delicate) MDN loss here.

Process corridors via 2D histograms

In manufacturing, define acceptable parameter ranges as probability contours.
A 2D histogram of (process parameter, quality metric) shows the process corridor.
Points outside the corridor flag anomalies or process drift.
This converts probabilistic thinking into actionable quality control (Neuer et al. 2024).
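
One possible way to turn the 2D histogram into an acceptance region, sketched with the standard library only. The corridor here is the set of most-populated bins that together hold 95% of historical points. The process variables, bin widths, and synthetic data are all assumptions for illustration.

```python
import random
from collections import Counter


def fit_corridor(points, bin_width, coverage=0.95):
    """Return the set of 2D bins that together hold `coverage` of the points,
    filling from the most populated bin downwards (a highest-density region)."""
    counts = Counter((int(x // bin_width[0]), int(y // bin_width[1])) for x, y in points)
    corridor, covered = set(), 0
    for bin_id, count in counts.most_common():
        if covered >= coverage * len(points):
            break
        corridor.add(bin_id)
        covered += count
    return corridor


def in_corridor(point, corridor, bin_width):
    x, y = point
    return (int(x // bin_width[0]), int(y // bin_width[1])) in corridor


# Synthetic history: temperature (deg C) vs hardness (HV), linearly related plus noise.
rng = random.Random(1)
history = []
for _ in range(5000):
    temp = rng.gauss(850.0, 10.0)
    hardness = 120.0 + 0.5 * (temp - 850.0) + rng.gauss(0.0, 2.0)
    history.append((temp, hardness))

bins = (5.0, 2.0)
corridor = fit_corridor(history, bins)
typical_ok = in_corridor((850.0, 120.0), corridor, bins)
drifted_ok = in_corridor((850.0, 135.0), corridor, bins)  # right temperature, wrong hardness
print("typical batch in corridor:", typical_ok)
print("drifted batch in corridor:", drifted_ok)
```

The drifted point has an unremarkable temperature and is flagged only because of the joint distribution, which is the point of using a 2D corridor rather than two separate 1D limits.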

Process corridor showing safe probability regions.

This is the “probability becomes a factory floor decision” slide — the concrete answer to the unit’s opening promise. A 2D histogram of (process parameter, quality) is an empirical joint distribution; the high-density region is the process corridor (an acceptance region), and points outside flag drift/anomaly.
Tie back to the multivariate Gaussian: the corridor is the Mahalanobis-distance ellipse made empirical — same concept (statistically-surprising = far in σ-units), no Gaussian assumption needed. This is anomaly detection without a model, which also rhymes with the distribution-free spirit of the conformal section coming next.
Engineering value statement: this converts “the model is uncertain” into “this batch is out of spec” — actionable, auditable quality control. Keep it brief; it’s an application illustration, not new theory.

Materials example: property prediction with uncertainty

Predicting tensile strength of a new alloy composition.
Point prediction: 450 MPa. But how confident are we?
With uncertainty: 450 ± 35 MPa (95% credible interval).
Epistemic uncertainty is large in composition regions far from training data — flagging extrapolation.

Practical diagnostic: calibration plots

A well-calibrated model’s predicted \(p\)% confidence intervals should contain \(p\)% of test points.
Plot: predicted confidence level vs observed coverage.
Perfect calibration = diagonal line.
Overconfident models: predicted intervals are too narrow (points fall outside too often).
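
The numbers behind a calibration plot can be computed as follows. This is a sketch assuming Gaussian predictive intervals and synthetic data; the "overconfident" model reports half the true standard deviation.

```python
import random
from statistics import NormalDist


def observed_coverage(y_true, y_pred, sd_pred, level):
    """Fraction of points inside the central `level` Gaussian predictive interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    hits = sum(abs(y - m) <= z * s for y, m, s in zip(y_true, y_pred, sd_pred))
    return hits / len(y_true)


# Synthetic test set: the true noise sd around the predictions is 2.0.
rng = random.Random(0)
y_pred = [rng.uniform(100.0, 500.0) for _ in range(20_000)]
y_true = [m + rng.gauss(0.0, 2.0) for m in y_pred]

levels = [0.5, 0.8, 0.9, 0.95]
honest = [observed_coverage(y_true, y_pred, [2.0] * len(y_pred), p) for p in levels]
overconfident = [observed_coverage(y_true, y_pred, [1.0] * len(y_pred), p) for p in levels]

for p, h, o in zip(levels, honest, overconfident):
    print(f"nominal {p:.2f}  honest {h:.3f}  overconfident {o:.3f}")
```

Plotting `levels` against `honest` gives points near the diagonal; `overconfident` falls clearly below it.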

Distribution-Free Coverage: Conformal Prediction

The gap calibration plots leave open.

Bayesian / MAP intervals depend on the model being right. Mis-specify the prior or the likelihood and the credible interval is no longer trustworthy.
Calibration plots diagnose miscalibration but do not fix it for a new test point.
We want a wrapper that takes any point predictor and produces an interval with a finite-sample, distribution-free coverage guarantee.

Conformal prediction in one sentence.

Pick miscoverage level \(\alpha\). Conformal prediction outputs \(C(\mathbf{x})\) such that, for any exchangeable new \((X, Y)\), \[\Pr\!\left(Y \in C(X)\right) \;\geq\; 1 - \alpha.\]
No assumption on the data distribution. No assumption on the model.
Only assumption: exchangeability of calibration and test data (typically i.i.d. from the same source) (Angelopoulos and Bates 2023).

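The guarantee is marginal: it holds on average over the test distribution, not separately for each input \(\mathbf{x}\). CQR, two slides from now, adapts the interval width to \(\mathbf{x}\) and usually improves conditional behaviour in practice, but its formal guarantee is still marginal.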

Hoisting conformal into the probabilistic unit (instead of leaving it in Unit 12) is deliberate: ML-PC u07 (time series), u08 (generalisation), and u11 (automation) all assume the audience already knows split conformal. Promoting it here resolves three cross-track forward dependencies.
Historical pointer (one sentence): the framework goes back to Vovk, Gammerman, Shafer; practical uptake exploded post-2019 with split conformal and CQR.
For materials: when you tell a metallurgist your hardness prediction is “120 \(\pm\) 8 HV with 90% coverage”, and that 90% holds on a held-out lab batch regardless of whether your NN is well-calibrated, that lands. Bayesian “credible interval” is harder to defend in a regulated context.
Forward link: ML-PC u07, u08, u11 use conformal as a black-box wrapper; this slide block is where they expect students to have seen the derivation.

Split conformal in 5 lines

The algorithm. Given a trained predictor \(\hat f\) and a separate calibration set \(\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}\):

Compute nonconformity scores \(s_i = |y_i - \hat f(\mathbf{x}_i)|\) for \(i = 1, \dots, n\).
Compute the empirical quantile \[ \hat q = \mathrm{Quantile}\!\left(\{s_i\};\; \tfrac{\lceil (n+1)(1-\alpha) \rceil}{n}\right). \]
For a new \(\mathbf{x}\), output the interval \[ C(\mathbf{x}) = \big[\hat f(\mathbf{x}) - \hat q,\; \hat f(\mathbf{x}) + \hat q\big]. \]

For any exchangeable new \((X, Y)\): \[\Pr\!\left(Y \in C(X)\right) \;\geq\; 1 - \alpha.\]

The 5-line Python.

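import numpy as np

# Held-out calibration set (x_cal, y_cal);
# trained model `model`; alpha = 0.1
scores = np.abs(y_cal - model.predict(x_cal))
n = len(scores)
level = np.ceil((n + 1) * (1 - alpha)) / n
# method="higher" (NumPy >= 1.22) returns an actual calibration score
# instead of interpolating, which the coverage proof requires.
# level > 1 means n is too small for this alpha: the interval is infinite.
qhat = (np.quantile(scores, level, method="higher")
        if level <= 1 else np.inf)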

# At test time:
def conformal_interval(x):
    yhat = model.predict(x)
    return yhat - qhat, yhat + qhat

\(\hat q\) is a single scalar. The interval width is the same for every test point in vanilla split CP — the next slide (CQR) is the standard fix (Angelopoulos and Bates 2023).
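
An end-to-end check of the guarantee on synthetic data, written with the standard library only so it runs without a fitted model. The `predict` function is a deliberately biased stand-in for a trained model, and the data generator is an assumption for illustration. Coverage should still come out near \(1 - \alpha\).

```python
import math
import random


def conformal_quantile(scores, alpha):
    """The ceil((n+1)(1-alpha))-th smallest score; infinite if n is too small."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[k - 1] if k <= n else math.inf


def predict(x):
    """Stand-in for a trained model (deliberately biased: the true slope is 2.0)."""
    return 1.8 * x


def sample(rng, size):
    xs = [rng.uniform(0.0, 10.0) for _ in range(size)]
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(42)
alpha = 0.1
x_cal, y_cal = sample(rng, 1000)   # calibration set
x_test, y_test = sample(rng, 5000) # fresh, exchangeable test set

scores = [abs(y - predict(x)) for x, y in zip(x_cal, y_cal)]
qhat = conformal_quantile(scores, alpha)

covered = sum(abs(y - predict(x)) <= qhat for x, y in zip(x_test, y_test))
coverage = covered / len(x_test)
print(f"qhat = {qhat:.2f}, empirical test coverage = {coverage:.3f}")
```

The model's bias does not break coverage; it only makes `qhat`, and therefore every interval, wider than a better model would need.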

Proof in one sentence: by exchangeability, the rank of the test-point score \(s_{n+1}\) among \(\{s_1, \dots, s_n, s_{n+1}\}\) is uniformly distributed on \(\{1, \dots, n+1\}\); thresholding at the \(\lceil (n+1)(1-\alpha) \rceil\)-th smallest calibration score guarantees marginal coverage.
Practical: \(n\) should be a few hundred. Smaller \(n\) keeps the guarantee, but achievable coverage moves in steps of \(1/(n+1)\), so intervals become more conservative and more variable. The training set is separate from the calibration set: no leakage.
Critical practical points to say aloud:
- “Split conformal works with any trained model, and you don’t retrain when you change \(\alpha\); only \(\hat q\) is recomputed.”
- “It needs exchangeability — if there’s distribution shift between calibration and test, coverage fails.”
Forward link: Unit 12 (uncertainty in predictions) uses this as a black-box wrapper around GP / MC-dropout / ensemble predictors. ML-PC u07, u08, u11 also assume this slide as a prerequisite.

Conformalized Quantile Regression (CQR) — adaptive interval widths

Why vanilla split CP is too coarse.

\(\hat q\) is the same constant for every \(\mathbf{x}\).
Easy inputs get unnecessarily wide intervals; hard inputs may still under-cover conditionally.
We want intervals that widen near hard regions of input space — i.e., locally heteroscedastic coverage.

CQR fix (Romano et al. 2019).

Train a quantile-regression model predicting \(\hat q_{\alpha/2}(\mathbf{x})\) and \(\hat q_{1-\alpha/2}(\mathbf{x})\) — e.g., a NN with two heads using the pinball loss \(\ell_\tau(r) = \max(\tau r, (\tau - 1) r)\), or a quantile gradient-boosted tree.
Compute conformity scores \[ s_i = \max\!\big(\hat q_{\alpha/2}(\mathbf{x}_i) - y_i,\; y_i - \hat q_{1-\alpha/2}(\mathbf{x}_i)\big). \]
Take \(\hat q = \mathrm{Quantile}\!\left(s_i;\; \tfrac{\lceil (n+1)(1-\alpha) \rceil}{n}\right)\) and output \[ C(\mathbf{x}) = \big[\hat q_{\alpha/2}(\mathbf{x}) - \hat q,\; \hat q_{1-\alpha/2}(\mathbf{x}) + \hat q\big]. \]

Marginal coverage is preserved; intervals adapt to local input difficulty.
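
The three steps above, sketched on synthetic heteroscedastic data with the standard library only. The quantile "heads" `q_lo` and `q_hi` are hand-written stand-ins for trained quantile regressors, deliberately too narrow so the conformal correction has something to fix. `pinball_loss` shows the loss such heads would be trained with; it is not used to fit anything here.

```python
import math
import random


def pinball_loss(y, q_pred, tau):
    """Pinball loss for quantile level tau, with residual r = y - q_pred."""
    r = y - q_pred
    return max(tau * r, (tau - 1.0) * r)


# Stand-ins for trained quantile heads. The true noise sd is 0.5 + x, so these
# have the right shape but are too narrow (roughly +/- 1 sd instead of 90%).
def q_lo(x):
    return -(0.5 + x)


def q_hi(x):
    return 0.5 + x


def sample(rng, size):
    xs = [rng.uniform(0.0, 3.0) for _ in range(size)]
    ys = [rng.gauss(0.0, 0.5 + x) for x in xs]
    return xs, ys


def conformal_quantile(scores, alpha):
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[k - 1] if k <= n else math.inf


rng = random.Random(7)
alpha = 0.1
x_cal, y_cal = sample(rng, 1000)
x_test, y_test = sample(rng, 5000)

# CQR score: how far y falls outside the predicted band (negative if inside).
scores = [max(q_lo(x) - y, y - q_hi(x)) for x, y in zip(x_cal, y_cal)]
qhat = conformal_quantile(scores, alpha)


def cqr_interval(x):
    return q_lo(x) - qhat, q_hi(x) + qhat


coverage = sum(cqr_interval(x)[0] <= y <= cqr_interval(x)[1]
               for x, y in zip(x_test, y_test)) / len(x_test)
width_easy = cqr_interval(0.0)[1] - cqr_interval(0.0)[0]
width_hard = cqr_interval(3.0)[1] - cqr_interval(3.0)[0]
print(f"coverage {coverage:.3f}, width at x=0: {width_easy:.2f}, at x=3: {width_hard:.2f}")
```

Unlike vanilla split conformal, the width now depends on \(\mathbf{x}\); the scalar `qhat` only corrects the band by a constant margin on each side.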

The 2026 default UQ stack for a regression NN: quantile heads + CQR. Calibrate once, ship with finite-sample coverage.

Failure mode — exchangeability under drift

When the guarantee breaks.

The coverage proof needs exchangeability of \(\{(X_1, Y_1), \dots, (X_n, Y_n), (X_{n+1}, Y_{n+1})\}\).
I.i.d. data: exchangeable. Time-series with drift, train/test from different labs, post-deployment shifts: not exchangeable.
Under covariate shift the marginal coverage can drop below \(1 - \alpha\); under harder shifts the guarantee can be lost entirely.

Mitigations (named, not derived).

Weighted conformal prediction — reweight calibration scores by an importance ratio (assumes known shift structure).
Online conformal — adapt the quantile \(\hat q\) from streaming feedback.
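Jackknife+ / CV+: replace the single calibration split with leave-one-out or cross-validated scoring, so no data is sacrificed to calibration. They still assume exchangeability, so they help with small samples rather than with drift (Angelopoulos and Bates 2023).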

Engineering rule of thumb. Report split-conformal / CQR intervals and the exchangeability assumption you are making. When the assumption is shaky, instrument for coverage tracking after deployment.
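
One way to instrument coverage tracking, as a minimal sketch: a rolling window over deployed intervals that raises an alarm when empirical coverage falls clearly below target. The class name, window size, and tolerance are illustrative assumptions, and the two "phases" are synthetic.

```python
from collections import deque


class CoverageMonitor:
    """Rolling empirical coverage of deployed prediction intervals."""

    def __init__(self, target, window=200, tolerance=0.05):
        self.target = target
        self.tolerance = tolerance
        self.hits = deque(maxlen=window)

    def update(self, lower, upper, y_observed):
        """Record whether a newly observed label fell inside its interval."""
        self.hits.append(lower <= y_observed <= upper)

    def coverage(self):
        return sum(self.hits) / len(self.hits) if self.hits else None

    def alarm(self):
        """True once the window is full and coverage is below target - tolerance."""
        full = len(self.hits) == self.hits.maxlen
        return full and self.coverage() < self.target - self.tolerance


monitor = CoverageMonitor(target=0.9, window=100, tolerance=0.05)

# Phase 1: intervals behave as promised (9 of every 10 labels fall inside).
for i in range(100):
    monitor.update(-1.0, 1.0, 0.0 if i % 10 else 5.0)
alarm_before = monitor.alarm()

# Phase 2: drift, only every second label now falls inside.
for i in range(100):
    monitor.update(-1.0, 1.0, 0.0 if i % 2 else 5.0)
alarm_after = monitor.alarm()

print("alarm before drift:", alarm_before, "| alarm after drift:", alarm_after)
```

This only detects the problem; it requires that ground-truth labels eventually arrive after deployment, and the response (recalibrate, reweight, or retrain) is a separate decision.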

Lecture-essential vs exercise content split

Lecture: uncertainty taxonomy, Gaussian distribution, MLE derivation, Bayesian framework, MAP-regularization connection, split conformal + CQR.
Exercise: noise injection and Nyquist demo, MLE implementation in NumPy, Bayesian posterior updating, MLE-MSE equivalence proof, calibration plots, 5-line split-conformal wrapper around a baseline regressor.

Exam-aligned summary: must-know statements

Aleatory uncertainty is irreducible; epistemic uncertainty is reducible with more data.
The Gaussian is the maximum-entropy distribution for given mean and variance.
MLE maximizes the probability of the observed data under the model.
For Gaussian noise, MLE is equivalent to MSE minimization.
Bayes’ theorem: posterior \(\propto\) likelihood \(\times\) prior.
Conjugate priors yield closed-form posteriors (e.g., Gaussian-Gaussian).
MAP with a Gaussian prior is equivalent to Ridge regression.
The predictive distribution integrates over parameter uncertainty.
Student’s t-distribution provides robustness to outliers.
Calibration plots assess whether predicted uncertainties match observed frequencies.
KL divergence is non-negative, zero iff the two distributions agree, and asymmetric; the closed form between Gaussians is the regularizer in the VAE loss.
Split conformal prediction wraps any predictor with a finite-sample, distribution-free marginal-coverage guarantee under exchangeability; CQR adapts the width to local input difficulty.

This is the revision sheet — read each statement and have the room supply the one-line justification from memory. If they can justify all 12, the unit landed. The four “equivalence” items (2 Gaussian=maxent, 4 MLE=MSE, 7 MAP=ridge, 11 KL closed form=VAE regularizer) are the conceptual spine — weight them.
Statement 12 (conformal: distribution-free, finite-sample, marginal coverage under exchangeability; CQR for adaptive width) is the newest and the most cross-load-bearing — ML-PC u07/u08/u11 depend on it. Drill the assumption (“exchangeability”) as hard as the guarantee; the assumption is what students forget and misuse.
Close the lecture by re-stating the arc in one breath: built the probabilistic worldview (Gaussian, MLE, Bayes), saw it justifies the losses and regularizers we’d used on faith, then armed them with conformal — a guarantee that survives the worldview being wrong. Then point to the Unit 8 reading: this machinery now gets turned on overfitting itself.

References + reading assignment for next unit

Required reading before Unit 8:
- Neuer: Ch. 4.5.9 (overfitting and cross-validation)
- McClarren: Ch. 2.4 (Ridge, Lasso, elastic net)
Optional depth:
- Murphy: Ch. 6.4.4, 6.5.3 (bias-variance, CV for \(\lambda\) selection)
- Bishop: Ch. 3.2 (bias-variance decomposition)
- Angelopoulos and Bates (2023), §1–§3 (conformal prediction)
Next unit: Generalization, Bias-Variance, Regularization, Tree Ensembles — using the probabilistic machinery to reason about overfitting.

Angelopoulos, Anastasios N., and Stephen Bates. 2023. “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” Foundations and Trends in Machine Learning 16 (4): 494–591. https://doi.org/10.1561/2200000101.

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.

Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.

Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.

Romano, Yaniv, Evan Patterson, and Emmanuel J. Candès. 2019. “Conformalized Quantile Regression.” Advances in Neural Information Processing Systems 32.
