Mathematical Foundations of AI & ML
Unit 3: Regression and Classification as Loss Minimization

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


Where we are in the triad

Just done (Unit 2):

  • Linear regression in matrix form, normal equations.
  • Ridge & Lasso closed forms; L1 vs L2 geometry.
  • Multicollinearity, pseudo-inverse, kernel hint.

Coming later:

  • Unit 6 — full optimization deep dive (momentum, Adam, conditioning, saddle points).
  • Unit 7 — full probabilistic learning (MLE, MAP, posterior) + conformal prediction.
  • Unit 8 — bias-variance decomposition.
  • ML-PC Unit 2 — already saw Gaussian → MSE, Poisson → Poisson NLL, Bayes/MAP table.

Learning outcomes

By the end of this unit, students can:

  • Recall the ERM principle and write down the supervised learning objective.
  • Apply gradient descent, SGD/minibatch SGD, and Newton’s method to small problems.
  • Derive the Newton update from a 2nd-order Taylor expansion and explain its single-step convergence on quadratics.
  • Identify the right loss function for a given noise model (Gaussian, Bernoulli, Poisson).
  • Analyze how the exponential-family / GLM framework unifies regression and classification under one likelihood.
  • Recognize Runge’s phenomenon and choose between polynomial, RBF, and spline bases.

The supervised learning framework

  • Data: \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N\) drawn iid from an unknown \(p(\mathbf{x}, y)\).
  • Hypothesis: parameterized predictor \(f_{\mathbf{w}}: \mathbf{x} \mapsto \hat{y}\).
  • Loss: \(L(\hat{y}, y)\) scores a single prediction.
  • Population risk: \(R(\mathbf{w}) = \mathbb{E}_{(\mathbf{x},y) \sim p}[L(f_{\mathbf{w}}(\mathbf{x}), y)]\) — what we want.
  • Empirical risk: \(\hat R(\mathbf{w}) = \frac{1}{N}\sum_i L(f_{\mathbf{w}}(\mathbf{x}_i), y_i)\) — what we can compute.
  • ERM: \(\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \hat R(\mathbf{w})\).
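
A minimal numpy sketch of ERM for a linear predictor with squared loss (the data here are synthetic and illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # N = 100 samples, d = 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)   # labels drawn from an unknown p(x, y)

def empirical_risk(w, X, y):
    """hat R(w) = (1/N) * sum_i L(f_w(x_i), y_i) with squared loss."""
    return np.mean((X @ w - y) ** 2)

print(empirical_risk(np.zeros(3), X, y))      # risk of the all-zero predictor
print(empirical_risk(w_true, X, y))           # close to the irreducible noise level
```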

Optimization landscape: convex vs non-convex

  • Convex: any local minimum is global. (E.g. MSE for linear regression.)
  • Non-convex: local minima, saddle points, plateaus. (Anything with a hidden layer.)
  • Practical message: convex problems have one correct answer; non-convex problems have one we can find.

Gradient descent

  • Idea: to minimize, step in the steepest descent direction.
  • First-order Taylor: \(f(\mathbf{w} - \eta \nabla f(\mathbf{w})) \approx f(\mathbf{w}) - \eta \|\nabla f(\mathbf{w})\|^2 < f(\mathbf{w})\) for small \(\eta > 0\).
  • Update: \(\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla f(\mathbf{w}_t)\).
  • Learning rate \(\eta\):
    • too small → slow;
    • too large → overshoots, diverges.
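
A bare-bones gradient-descent loop on the empirical MSE risk (a sketch; the data, learning rate, and step count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -1.0]) + 0.1 * rng.normal(size=200)

def grad_mse(w, X, y):
    """Gradient of (1/N) * sum_i (x_i^T w - y_i)^2 with respect to w."""
    return 2.0 / len(y) * X.T @ (X @ w - y)

eta = 0.1                                   # too small -> slow; too large -> diverges
w = np.zeros(2)
for t in range(100):
    w = w - eta * grad_mse(w, X, y)         # w_{t+1} = w_t - eta * grad f(w_t)

print(w)                                    # close to [3, -1]
```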

Stochastic gradient descent (SGD)

  • Cost of full GD: \(\nabla \hat R = \frac{1}{N}\sum_i \nabla L_i\) requires \(\mathcal{O}(N)\) work per step.
  • Stochastic estimator: pick one \(i\) uniformly at random; use \(\nabla L_i\) as the gradient.
  • Unbiased: \(\mathbb{E}_i[\nabla L_i] = \nabla \hat R\) — in expectation we’re still doing GD.
  • Behaviour: noisy steps, fast initial progress, eventually bounces near the minimum.

Minibatch SGD

  • Compromise: average over a minibatch of size \(b\) (typically 32–256).
  • Update: \(\displaystyle \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta}{b}\sum_{i \in \mathcal{B}_t} \nabla L_i(\mathbf{w}_t)\).
  • Why \(b\) matters:
    1. Variance reduction — gradient estimate concentrates around the true gradient.
    2. Vectorization — modern GPUs do \(b\) samples in parallel almost for free.
  • The default for every modern deep-learning training loop.
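
A minibatch-SGD sketch for the same kind of objective (batch size, learning rate, and step count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, b = 10_000, 5, 64                       # dataset size, dimension, minibatch size
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)

w, eta = np.zeros(d), 0.05
for t in range(2_000):
    idx = rng.integers(0, N, size=b)          # draw minibatch B_t uniformly at random
    Xb, yb = X[idx], y[idx]
    w -= eta * (2.0 / b) * Xb.T @ (Xb @ w - yb)   # unbiased estimate of the full gradient

print(np.linalg.norm(w - w_true))             # small, but the iterate bounces near the minimum
```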

Beyond vanilla SGD — see Unit 6

  • Momentum, Nesterov, RMSProp, AdaGrad, Adam, AdamW — all first-order, all standard.
  • Conditioning, saddle points, plateaus, mode connectivity — landscape pathologies.
  • Unit 6 owns this with its own deep-dive interactives.
  • For the rest of this unit we ask a different question: what does second-order information buy us?

When first-order is slow: the Hessian

  • Hessian \(\mathbf{H} = \nabla^2 f\) — the matrix of second derivatives. It encodes local curvature.
  • Eigenvalues of \(\mathbf{H}\) are the principal curvatures. Their ratio is the condition number \(\kappa = \lambda_{\max}/\lambda_{\min}\).
  • GD’s pain: in an elongated bowl (\(\kappa \gg 1\)), GD oscillates across the steep direction while crawling along the shallow one. Convergence rate scales like \(\frac{\kappa - 1}{\kappa + 1}\).
  • Unit 6 owns the full conditioning treatment; here we use it only as motivation for Newton.

Newton’s method

  • Second-order Taylor: \(f(\mathbf{w} + \Delta) \approx f(\mathbf{w}) + \nabla f^T \Delta + \tfrac{1}{2} \Delta^T \mathbf{H}\, \Delta\).
  • Minimize the quadratic in \(\Delta\): \(\Delta = -\mathbf{H}^{-1} \nabla f\).
  • Update: \(\boxed{\;\mathbf{w}_{t+1} = \mathbf{w}_t - \mathbf{H}^{-1} \nabla f(\mathbf{w}_t)\;}\)
  • Key property: if \(f\) is a quadratic, the update lands on the minimum in a single step — independent of the starting point and of \(\kappa\).
  • For non-quadratic \(f\): Newton converges quadratically near the optimum (error squares each iteration).

Interactive: Newton vs Gradient Descent

\(f(x,y) = a\,x^2 + b\,y^2\) on a 2D ill-conditioned bowl (\(b = 1\) fixed).

  • GD update: \(\mathbf{w} \leftarrow \mathbf{w} - \eta\,\nabla f\).
  • Newton update: \(\mathbf{w} \leftarrow \mathbf{w} - \mathbf{H}^{-1} \nabla f\) — for this quadratic, \(\mathbf{H} = \mathrm{diag}(2a, 2b)\).
  • Watch Newton land on the origin in one step regardless of \(a\), while GD oscillates as \(a\) grows.
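
A numerical version of this comparison (a sketch; \(a\), \(\eta\), and the step count are illustrative):

```python
import numpy as np

a, b = 50.0, 1.0                              # f(x, y) = a*x^2 + b*y^2, kappa = a/b = 50
grad = lambda w: np.array([2 * a * w[0], 2 * b * w[1]])
H = np.diag([2 * a, 2 * b])                   # Hessian of the quadratic

w0 = np.array([1.0, 1.0])

# Newton: one step lands exactly on the minimum (the origin), for any a.
print(w0 - np.linalg.solve(H, grad(w0)))      # [0. 0.]

# GD: stability requires eta < 1/a, so the steep direction dictates a tiny step
# and progress along the shallow direction is slow.
eta, w = 0.45 / a, w0.copy()
for _ in range(100):
    w = w - eta * grad(w)
print(w)                                      # x ~ 0, but y has only decayed to ~0.16
```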

Newton’s catch

  • Memory: \(\mathbf{H}\) is \(D \times D\), i.e. \(\mathcal{O}(D^2)\) to store.
  • Compute: inverting (or solving with) \(\mathbf{H}\) is \(\mathcal{O}(D^3)\).
  • For a deep network with \(D = 10^9\) parameters, \(\mathbf{H}\) has \(10^{18}\) entries. Not happening.
  • For medium-dimensional convex problems (\(D \lesssim 10^4\)), Newton is a realistic and excellent choice.

Quasi-Newton: BFGS and L-BFGS

  • Idea: don’t form \(\mathbf{H}\). Approximate \(\mathbf{H}^{-1}\) from a sequence of gradient differences.
  • BFGS update: maintain \(\mathbf{B}_t \approx \mathbf{H}^{-1}\), update it rank-2 each step using \(\mathbf{s}_t = \mathbf{w}_{t+1} - \mathbf{w}_t\) and \(\mathbf{y}_t = \nabla f_{t+1} - \nabla f_t\).
  • L-BFGS (limited-memory BFGS): keep only the last \(m\) pairs \((\mathbf{s}_k, \mathbf{y}_k)\) — memory drops from \(\mathcal{O}(D^2)\) to \(\mathcal{O}(mD)\).
  • Workhorse for medium-dimensional convex problems and for fine-tuning small models: scipy.optimize.minimize(method='L-BFGS-B'), torch.optim.LBFGS (usage sketch below).
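
A minimal L-BFGS usage sketch with scipy, on a ridge-regression objective (the data and \(\lambda\) are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
Phi = rng.normal(size=(500, 50))                  # design matrix: N = 500, M = 50
y = Phi @ rng.normal(size=50) + 0.1 * rng.normal(size=500)
lam = 1e-2

def objective(w):
    """Ridge objective; returning (value, gradient) lets L-BFGS reuse both."""
    r = Phi @ w - y
    return 0.5 * r @ r + 0.5 * lam * w @ w, Phi.T @ r + lam * w

res = minimize(objective, x0=np.zeros(50), method="L-BFGS-B", jac=True)
print(res.success, res.nit)                       # convergence flag, iteration count
```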

Newton on a GLM = IRLS — preview

  • For generalized linear models (next: §7), Newton’s method has a remarkably clean form.
  • The Hessian factors cleanly through the design matrix — each Newton step becomes a weighted least-squares solve.
  • That’s the Iteratively Reweighted Least Squares (IRLS) algorithm. We’ll close the loop with this in §7.

Checkpoint: optimization

  1. A learning rate \(\eta\) is “too large.” What’s the symptom you’d see in a GD trajectory?
      a. Slow, monotone decrease toward the minimum.
      b. The trajectory oscillates and may diverge.
      c. Stuck at a saddle point.
      d. Newton’s method takes over.
  2. Why does Newton’s method converge in one step on a quadratic, regardless of conditioning?
      a. Because the Hessian is the identity for quadratics.
      b. Because the second-order Taylor approximation is the function for quadratics, so we land on the exact minimum of the local model.
      c. Because the gradient is zero at the minimum.
      d. It doesn’t — that’s a myth.
  3. We don’t use plain Newton’s method to train deep networks. Why not?
      a. Newton diverges on non-convex losses.
      b. Storing and inverting a \(D\times D\) Hessian for \(D = 10^9\) parameters is infeasible (\(\mathcal{O}(D^2)\) memory, \(\mathcal{O}(D^3)\) inversion).
      c. Adam is provably faster.
      d. Newton’s method requires labeled data.

Loss as decision proxy

  • A loss \(L(\hat y, y)\) is a quantitative penalty for being wrong — not a fundamental quantity. We design it.
  • Different application contexts → different penalties:
    • Should errors grow quadratically (smooth tails) or linearly (robust to outliers)?
    • Are false positives and false negatives equally costly?
    • Do we need calibrated probabilities, or just labels?
  • The loss encodes the engineering question you’re asking.

Mean squared error (MSE)

  • \(L_{\text{MSE}}(\hat y, y) = (\hat y - y)^2\).
  • Geometry: smooth convex bowl — gradient descent’s ideal landscape.
  • Probabilistic identity: minimizing MSE = MLE assuming iid Gaussian residuals \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\).
  • See ML-PC Unit 2 §26 for the full Gaussian → MSE derivation; see Unit 7 for the formal MLE machinery.

Mean absolute error (MAE)

  • \(L_{\text{MAE}}(\hat y, y) = |\hat y - y|\).
  • Linear penalty → much less sensitive to outliers than MSE.
  • Probabilistic identity: MLE under Laplacian residuals.
  • Optimization caveat: non-differentiable at zero. Use sub-gradient methods, or smooth via Huber.

Huber loss

  • Piecewise definition with parameter \(\delta\): \[L_\delta(\hat y, y) = \begin{cases} \tfrac{1}{2}(\hat y - y)^2 & |\hat y - y| \le \delta \\ \delta(|\hat y - y| - \tfrac{1}{2}\delta) & |\hat y - y| > \delta \end{cases}\]
  • Quadratic in the small-error regime (smooth optimization), linear in the tails (outlier-robust).
  • Standard tool for industrial / engineering data where most points are clean but occasional glitches occur (Neuer et al. 2024).
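
The three penalties side by side (a sketch over raw residuals \(r = \hat y - y\)):

```python
import numpy as np

def mse(r):
    return r ** 2

def mae(r):
    return np.abs(r)

def huber(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond, matching the piecewise definition."""
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

residuals = np.array([-0.5, 0.1, 0.3, 8.0])       # last entry is an outlier
for name, loss in [("MSE", mse), ("MAE", mae), ("Huber", huber)]:
    print(name, loss(residuals).sum())            # the outlier dominates MSE only
```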

Interactive: Drag-the-outlier (MSE / MAE / Huber)

We have a dataset with a few points. Try dragging the red “outlier” point up and down.

Notice how:

  • MSE (Blue) is pulled strongly by the outlier to minimize the huge quadratic penalty.
  • MAE (Green) ignores the outlier completely, picking the median line.
  • Huber (Orange) compromises based on \(\delta\).

Beyond Gaussian: heteroscedastic and count data

  • Heteroscedastic noise (variance varies with signal): predict \(\sigma^2\) alongside \(\hat y\); loss becomes \(\frac{(y - \hat y)^2}{2\hat\sigma^2} + \tfrac{1}{2}\log\hat\sigma^2\).
  • Poisson / count data (low-dose imaging, photon counting): use Poisson NLL \(L = \hat\mu - y\log\hat\mu\), not MSE. See ML-PC Unit 2 §27.
  • Heavy-tailed noise: use Student-\(t\) or log-cosh.
  • The principle: loss = NLL of the noise model. We’ll formalize this in §7.

The 0–1 loss problem

  • For binary classification with \(y \in \{0, 1\}\), the natural loss is \(L = \mathbf{1}\{\hat y \ne y\}\).
  • The bug: non-differentiable, gradient is zero almost everywhere — gradient descent has nothing to follow.
  • The fix: use a surrogate loss — a smooth differentiable function that bounds the 0–1 loss from above.
  • Surrogates also let us output calibrated probabilities, not just labels.

Cross-entropy = Bernoulli / Categorical NLL

  • Binary: model \(p(y=1\mid\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x})\) with sigmoid \(\sigma\).
  • Cross-entropy loss: \(L = -[y \log p + (1-y)\log(1-p)]\).
  • Identity: this is the negative log-likelihood of a Bernoulli\((p)\) on \(y\).
  • Multi-class: softmax + categorical cross-entropy = NLL of a Categorical distribution.
  • Behaviour: confident-but-wrong predictions are punished hugely (NLL → ∞). See Unit 7 for the full MLE framework.
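
A small sketch of the binary cross-entropy, showing how a confident-but-wrong prediction blows up (the logits are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(p, y, eps=1e-12):
    """Bernoulli NLL: -[y log p + (1 - y) log(1 - p)], clipped for numerical safety."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(binary_cross_entropy(sigmoid(0.5), 1))      # mildly confident and right: ~0.47
print(binary_cross_entropy(sigmoid(-8.0), 1))     # confidently wrong: ~8.0
```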

Interactive: Cross-entropy & decision boundary

Adjust the logistic regression model’s weights to classify the blue (\(y=1\)) and red (\(y=0\)) points.

The background color shows predicted probability \(P(y=1|\mathbf{x}) = \sigma(w_1 x_1 + w_2 x_2 + b)\). Notice how strongly the loss explodes if you push a red point deep into the “confident blue” region!

Margin-based view: hinge & calibration

  • Hinge loss (\(y \in \{-1,+1\}\)): \(L = \max(0, 1 - y\hat y)\). Goal: correct classification with a margin, not just correct labels — the SVM principle.
  • Once correctly classified beyond the margin, the gradient is zero — no further “improvement” is needed. Crisp decision-boundary geometry, but no probabilities.
  • Proper scoring rules (Brier, log-loss): the loss is minimized in expectation by the true probability. Cross-entropy is proper; misclassification rate is not.
  • Choose the loss to match what you need — labels (hinge) or calibrated probabilities (cross-entropy).

The linearity principle

  • A “linear model” is linear in the parameters \(\mathbf{w}\), not in the inputs \(\mathbf{x}\).
  • Replace raw \(\mathbf{x}\) with a feature vector \(\boldsymbol\phi(\mathbf{x})\). The predictor becomes \[f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T \boldsymbol\phi(\mathbf{x}).\]
  • The model is now non-linear in \(\mathbf{x}\) but all the linear-regression machinery still applies to \(\mathbf{w}\).
  • This is one of the most useful conceptual moves in ML.

Formal basis-function expansion

  • Map \(\boldsymbol\phi: \mathbb{R}^d \to \mathbb{R}^M\), \(\boldsymbol\phi(\mathbf{x}) = (\phi_1(\mathbf{x}), \ldots, \phi_M(\mathbf{x}))^T\).
  • Design matrix: \(\boldsymbol\Phi \in \mathbb{R}^{N \times M}\) with \(\boldsymbol\Phi_{ij} = \phi_j(\mathbf{x}_i)\).
  • Normal equations (from Unit 2, applied to \(\boldsymbol\Phi\)): \[\hat{\mathbf{w}} = (\boldsymbol\Phi^T \boldsymbol\Phi)^{-1} \boldsymbol\Phi^T \mathbf{y}.\]
  • The same OLS / Ridge / Lasso closed forms — they just live in feature space now.
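
A sketch of the normal equations in feature space, with a polynomial feature map as the running example (the degree and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=40))
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=40)

M = 5
Phi = np.vander(x, M + 1, increasing=True)        # Phi[i, j] = phi_j(x_i) = x_i**j

# Literal normal equations w = (Phi^T Phi)^{-1} Phi^T y.
# Fine at low degree; prefer np.linalg.lstsq / QR when Phi^T Phi is ill-conditioned.
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(w_hat)                                      # M + 1 = 6 coefficients
```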

Polynomial basis

  • Univariate: \(\phi_j(x) = x^{j}\) for \(j = 0, 1, \ldots, M\).
  • The design matrix is the Vandermonde matrix \(\boldsymbol\Phi_{ij} = x_i^{j}\).
  • Recovers polynomial regression. Nice and familiar — but watch out:

Runge’s phenomenon

  • High-degree polynomials interpolated at evenly spaced points oscillate wildly near the boundary.
  • Classic example: fit \(f(x) = 1/(1 + 25 x^2)\) on \([-1, 1]\) with a degree-15 polynomial → the boundary error grows with the degree instead of shrinking (checked numerically in the sketch after this list).
  • Lesson: higher complexity ≠ better fit. Global polynomials are a poor basis when you need flexibility and boundedness.
  • We’ll see two fixes: localize the basis (RBFs, splines) or constrain the weights (regularization, §6).
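
A numerical check of the phenomenon (a sketch; the degrees and equispaced node counts are illustrative):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + 25.0 * x ** 2)         # Runge's function

x_dense = np.linspace(-1, 1, 1001)
for deg in (5, 10, 15):
    x_nodes = np.linspace(-1, 1, deg + 1)         # equispaced interpolation nodes
    coeffs = np.polyfit(x_nodes, f(x_nodes), deg) # exact interpolant of degree `deg`
    max_err = np.max(np.abs(np.polyval(coeffs, x_dense) - f(x_dense)))
    print(deg, max_err)                           # max (boundary) error grows with degree
```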

Interactive: Runge’s phenomenon — polynomial fitting

Let’s fit a polynomial to noisy training data to see the Bias-Variance tradeoff live.

  • Degree 1-2: Underfitting (High Bias). The model is too simple to capture the intrinsic curve.
  • Degree 3-4: The “Sweet Spot”. Fits the true curve well.
  • Degree 9-10: Overfitting (High Variance). The model memorizes the noise, leading to wild oscillations between data points. Test error explodes!

Radial basis functions (RBFs)

  • Pick centers \(\boldsymbol\mu_1, \ldots, \boldsymbol\mu_M\) and a bandwidth \(\sigma\).
  • Each basis function is localized: \[\phi_k(\mathbf{x}) = \exp\!\left(-\tfrac{\|\mathbf{x} - \boldsymbol\mu_k\|^2}{2\sigma^2}\right).\]
  • \(\sigma\) controls width: small \(\sigma\) → sharp local “bumps”, large \(\sigma\) → smooth global features.
  • Center placement matters: equispaced, \(k\)-means on data, or every data point is a center (recovers a kernel method — see Unit 2).

Splines

  • Idea: stitch low-degree polynomials together at fixed knots with continuity constraints.
  • Cubic spline: piecewise degree-3 polynomials, continuous in value, 1st, and 2nd derivative — visually indistinguishable from a smooth curve.
  • B-spline basis: an explicit set of \(M\) basis functions that span the same space; each B-spline has local support, so the design matrix is sparse and conditioning is excellent.
  • Why splines fix Runge: the basis is local. Increasing complexity adds knots, not global oscillation modes.

Interactive: Basis function explorer

Same data, three bases. Notice that all three fits use the same normal equations \(\hat{\mathbf{w}} = (\boldsymbol\Phi^T\boldsymbol\Phi)^{-1}\boldsymbol\Phi^T\mathbf{y}\) — only \(\boldsymbol\Phi\) changes.

  • Polynomial → global, prone to Runge.
  • RBF → localized; bandwidth \(\sigma\) trades smoothness vs flexibility.
  • B-spline → piecewise polynomial with local support; clean conditioning.
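
A sketch of exactly that point: two different feature maps, one solver (B-splines omitted for brevity; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=50))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=50)

def phi_poly(x, M=9):
    return np.vander(x, M + 1, increasing=True)             # global polynomial basis

def phi_rbf(x, centers=np.linspace(-1, 1, 10), sigma=0.2):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))  # local bumps

def fit(Phi, y):
    """Same least-squares solve for every basis; only Phi changes."""
    return np.linalg.lstsq(Phi, y, rcond=None)[0]

w_poly, w_rbf = fit(phi_poly(x), y), fit(phi_rbf(x), y)
print(w_poly.shape, w_rbf.shape)                             # (10,) and (10,)
```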

The bias-variance picture (one-frame summary)

  • High bias (underfit): basis is too rigid — wrong systematic shape.
  • High variance (overfit): basis is too flexible — fits noise.
  • Total expected error = \(\text{Bias}^2 + \text{Variance} + \sigma^2_{\text{irreducible}}\).
  • Sweet spot depends on \(N\): more data lets you afford more flexibility.
  • Full decomposition + math: Unit 8.

graph TD
    Complexity[Model Complexity →]
    Total[Total Error]
    Bias["Bias²"]
    Var[Variance]
    Bias --> Total
    Var --> Total
    Complexity --> Bias
    Complexity --> Var
    style Total fill:#f96,stroke:#333

Connection to kernels

  • If \(M\) is huge (or infinite), forming \(\boldsymbol\Phi^T\boldsymbol\Phi\) is impossible.
  • Kernel trick: any algorithm that depends only on \(\langle\boldsymbol\phi(\mathbf{x}_i), \boldsymbol\phi(\mathbf{x}_j)\rangle\) can compute that inner product without ever materializing \(\boldsymbol\phi\).
  • \(k(\mathbf{x},\mathbf{x}') = \exp(-\|\mathbf{x}-\mathbf{x}'\|^2/(2\sigma^2))\) → infinite-dim feature space, finite computation.
  • See Unit 2 “Kernel hint from inner products” — full treatment in advanced courses (Gaussian processes, SVMs).

Bridge: complex models need a constraint

  • We can make our hypothesis class arbitrarily flexible by stacking basis functions.
  • With finite data, more flexibility = more variance = worse generalization (without help).
  • The fix: constrain the parameters. Penalize complexity.
  • That’s regularization — and it has a beautifully clean Bayesian interpretation.

Checkpoint: basis functions

  1. We say RBF regression is a “linear model.” In what sense?
      a. The map \(\mathbf{x} \mapsto \hat y\) is a straight line.
      b. The model is linear in the parameters \(\mathbf{w}\), even though it’s nonlinear in \(\mathbf{x}\).
      c. Both (a) and (b).
      d. Neither — RBF is a fundamentally nonlinear model.
  2. Why does Runge’s phenomenon happen?
      a. Because we’re using the wrong loss.
      b. Because high-degree polynomials are global — local data changes affect the whole curve, and equispaced interpolation forces large oscillations near the boundary.
      c. Because polynomial regression overfits noise.
      d. Because the design matrix is singular at high degree.
  3. Going from a polynomial basis to a B-spline basis with the same \(M\):
      a. Changes the model class entirely; OLS no longer applies.
      b. Changes only the design matrix \(\boldsymbol\Phi\); the normal equations are identical.
      c. Forces us to use gradient descent.
      d. Adds regularization automatically.

Recap from Unit 2: Ridge & Lasso

  • Ridge (\(\ell_2\) penalty \(\lambda\|\mathbf{w}\|_2^2\)): \[\hat{\mathbf{w}}_{\text{ridge}} = (\boldsymbol\Phi^T\boldsymbol\Phi + \lambda\mathbf{I})^{-1}\boldsymbol\Phi^T\mathbf{y}.\] Spectral effect: shifts every eigenvalue by \(+\lambda\).
  • Lasso (\(\ell_1\) penalty \(\lambda\|\mathbf{w}\|_1\)): no closed form, but the constraint region has corners → exact sparsity.
  • Constraint geometry (sphere vs diamond) — see Unit 2 for the full picture.

We do not rederive these here — Unit 2 owns the derivations.

The MAP interpretation: every regularizer is a prior

  • MAP estimate: \(\hat{\mathbf{w}} = \arg\max_{\mathbf{w}}\left[\log p(\mathbf{y}\mid\mathbf{w}) + \log p(\mathbf{w})\right]\).
  • The first term is the log-likelihood (the negative of the loss); the second term is the log-prior.
  • Gaussian prior \(\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})\) → \(\log p(\mathbf{w}) \propto -\tfrac{1}{2\tau^2}\|\mathbf{w}\|_2^2\) → Ridge.
  • Laplace prior \(\mathbf{w}_j \sim \text{Laplace}(0, b)\) → \(\log p(\mathbf{w}) \propto -\tfrac{1}{b}\|\mathbf{w}\|_1\) → Lasso.
  • Punchline: every loss is an NLL; every regularizer is a prior. See ML-PC Unit 2 §28 for the full ML↔︎Bayes table; Unit 7 for the formal posterior treatment.
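
Spelling out the Gaussian case (a short derivation; \(\sigma^2\) is the noise variance of the Gaussian likelihood):

\[
-\log p(\mathbf{w}\mid\mathbf{y}) = \frac{1}{2\sigma^2}\|\mathbf{y}-\boldsymbol\Phi\mathbf{w}\|_2^2 + \frac{1}{2\tau^2}\|\mathbf{w}\|_2^2 + \text{const}
\quad\Longrightarrow\quad
\hat{\mathbf{w}}_{\text{MAP}} = \hat{\mathbf{w}}_{\text{ridge}} \;\text{ with }\; \lambda = \frac{\sigma^2}{\tau^2}.
\]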

What lives elsewhere (and why)

  • Dropout, batch-norm, label smoothing, focal loss, early stopping, data augmentation → deep-learning units. Implementation tricks, not foundations.
  • Full bias-variance decomposition → Unit 8.
  • Full Bayesian (posterior, predictive, evidence) → Unit 7.
  • First-order optimizer zoo (Adam, RMSProp, Nesterov) → Unit 6.
  • Physical noise → loss derivations (Gaussian, Poisson, Weibull) → ML-PC Unit 2.
  • This unit’s contribution: the abstraction that ties them together — coming next.

Exponential family: canonical form

  • A distribution belongs to the exponential family if its density has the form \[p(y\mid\eta) = h(y)\,\exp\!\left(\eta^T T(y) - A(\eta)\right).\]
  • \(\eta\): the natural parameter (the parameter we’ll learn).
  • \(T(y)\): the sufficient statistic of the data.
  • \(A(\eta)\): the log-partition / cumulant function (normalizer).
  • \(h(y)\): the base measure (does not depend on \(\eta\)).

Three examples in canonical form

| Distribution | \(T(y)\) | \(A(\eta)\) | \(\eta\) in terms of the usual parameter |
|---|---|---|---|
| Gaussian (known \(\sigma^2\); shown for \(\sigma^2 = 1\)) | \(y\) | \(\eta^2/2\) | \(\eta = \mu\) |
| Bernoulli | \(y\) | \(\log(1 + e^{\eta})\) | \(\eta = \log\frac{p}{1-p}\) (logit) |
| Poisson | \(y\) | \(e^{\eta}\) | \(\eta = \log\mu\) |

  • Verify: plug each into \(p(y\mid\eta) = h(y)\exp(\eta\, T(y) - A(\eta))\) and you recover the original density.
  • \(h(y)\) for Poisson is \(1/y!\); for Bernoulli, \(h(y) = 1\); for the Gaussian (taking \(\sigma^2 = 1\)), \(h(y) = \frac{1}{\sqrt{2\pi}}\exp(-y^2/2)\).

The unification table

| Distribution | Canonical link | \(\mu(\eta)\) | NLL → recovered loss |
|---|---|---|---|
| Gaussian | identity | \(\mu = \eta\) | \(\tfrac{1}{2}(y - \mu)^2\) → MSE |
| Bernoulli | logit | \(\mu = \sigma(\eta) = \frac{1}{1+e^{-\eta}}\) | \(-y\log\mu - (1-y)\log(1-\mu)\) → cross-entropy |
| Poisson | log | \(\mu = e^{\eta}\) | \(\mu - y\log\mu\) → Poisson NLL |

MSE, cross-entropy, and Poisson NLL are not three losses — they are one loss applied to three different distributions. Forward-pointers: ML-PC Unit 2 §26 (Gaussian → MSE), §27 (Poisson → Poisson NLL), §28 (full ML↔︎Bayes table).

IRLS = Newton’s method on a GLM

  • For a GLM with canonical link, the negative log-likelihood \(\ell\) of \(N\) iid samples has gradient \[\nabla_{\mathbf{w}} \ell = \boldsymbol\Phi^T (\boldsymbol\mu - \mathbf{y})\] and Hessian \[\mathbf{H} = \boldsymbol\Phi^T \mathbf{W} \boldsymbol\Phi, \quad \mathbf{W} = \mathrm{diag}(A''(\eta_i)).\]
  • Newton step \(\mathbf{w}_{t+1} = \mathbf{w}_t - \mathbf{H}^{-1}\nabla\ell\) rearranges to \[\boxed{\;\mathbf{w}_{t+1} = (\boldsymbol\Phi^T \mathbf{W}_t \boldsymbol\Phi)^{-1} \boldsymbol\Phi^T \mathbf{W}_t\, \mathbf{z}_t\;}\] with working response \(\mathbf{z}_t = \boldsymbol\Phi\mathbf{w}_t + \mathbf{W}_t^{-1}(\mathbf{y} - \boldsymbol\mu_t)\).
  • This is just weighted least squares on \((\boldsymbol\Phi, \mathbf{z}_t)\) with weights \(\mathbf{W}_t\). Hence Iteratively Reweighted Least Squares.
  • Logistic regression’s classical solver is IRLS — and now we know why: it’s the Newton iteration on a Bernoulli GLM.
  • The loop closes: §2 introduced Newton’s method abstractly; §7 shows it’s the principled solver for any GLM.
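
A compact IRLS implementation for logistic regression (a sketch; it assumes the data are not perfectly separable, so the weights \(\mathbf{W}\) stay strictly positive):

```python
import numpy as np

def irls_logistic(Phi, y, n_iter=15):
    """Newton / IRLS for a Bernoulli GLM with the canonical (logit) link."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        eta = Phi @ w
        mu = 1.0 / (1.0 + np.exp(-eta))       # mu = sigma(eta)
        W = mu * (1.0 - mu)                   # A''(eta_i) for the Bernoulli family
        z = eta + (y - mu) / W                # working response z_t
        # Weighted least squares on (Phi, z) with weights W:
        w = np.linalg.solve(Phi.T @ (W[:, None] * Phi), Phi.T @ (W * z))
    return w

# Toy 1-D logistic regression with an intercept column.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(2.0 * x - 1.0))))
Phi = np.column_stack([np.ones_like(x), x])
print(irls_logistic(Phi, y))                  # roughly [-1, 2]
```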

Summary: loss → noise → optimizer

| Loss | Noise model / distribution | Canonical optimizer |
|---|---|---|
| MSE | Gaussian residuals | OLS closed form / GD on a quadratic / Newton (1 step) |
| Cross-entropy (binary) | Bernoulli | IRLS = Newton on the Bernoulli GLM |
| Cross-entropy (multi-class) | Categorical | IRLS / GD on the softmax NLL |
| Poisson NLL | Poisson counts | IRLS = Newton on the Poisson GLM |
| Huber | Heavy-tailed (between Gaussian & Laplacian) | GD / sub-gradient |
| MAE | Laplacian | Sub-gradient / quantile regression |
| Hinge | (No probabilistic interpretation — margin-based) | SGD / coordinate descent |

Three threads — loss, optimizer, model class — meet in the GLM framework.


Notebook companion + references

Week 3 notebook: Regression from Scratch — TensileTestDataset

Implements OLS, gradient descent, Newton’s method, and basis-function regression on a real materials dataset. References: Murphy, ch. 6 (linear regression); Bishop, ch. 3; McElreath, ch. 4–9.

Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.