Mathematical Foundations of AI & ML
Unit 3: Regression and Classification as Loss Minimization

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


Where we are in the triad

Just done (Unit 2):

  • Linear regression in matrix form, normal equations.
  • Ridge & Lasso closed forms; L1 vs L2 geometry.
  • Multicollinearity, pseudo-inverse, kernel hint.

Coming later:

  • Unit 6 — full optimization deep dive (momentum, Adam, conditioning, saddle points).
  • Unit 7 — full probabilistic learning (MLE, MAP, posterior) + conformal prediction.
  • Unit 8 — bias-variance decomposition.
  • ML-PC Unit 2 — already saw Gaussian → MSE, Poisson → Poisson NLL, Bayes/MAP table.

Learning outcomes

By the end of this unit, students can:

  • Recall the ERM principle and write down the supervised learning objective.
  • Apply gradient descent, SGD/minibatch SGD, and Newton’s method to small problems.
  • Derive the Newton update from a 2nd-order Taylor expansion and explain its single-step convergence on quadratics.
  • Identify the right loss function for a given noise model (Gaussian, Bernoulli, Poisson).
  • Analyze how the exponential-family / GLM framework unifies regression and classification under one likelihood.
  • Recognize Runge’s phenomenon and choose between polynomial, RBF, and spline bases.

The supervised learning framework

  • Data: \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N\) drawn iid from an unknown \(p(\mathbf{x}, y)\).
  • Hypothesis: parameterized predictor \(f_{\mathbf{w}}: \mathbf{x} \mapsto \hat{y}\).
  • Loss: \(L(\hat{y}, y)\) scores a single prediction.
  • Population risk: \(R(\mathbf{w}) = \mathbb{E}_{(\mathbf{x},y) \sim p}[L(f_{\mathbf{w}}(\mathbf{x}), y)]\) — what we want.
  • Empirical risk: \(\hat R(\mathbf{w}) = \frac{1}{N}\sum_i L(f_{\mathbf{w}}(\mathbf{x}_i), y_i)\) — what we can compute.
  • ERM: \(\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \hat R(\mathbf{w})\).
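
A minimal numpy sketch of ERM for a linear predictor with squared loss (the data here are synthetic and illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # N = 100 samples, d = 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)   # labels drawn from an unknown p(x, y)

def empirical_risk(w, X, y):
    """hat R(w) = (1/N) * sum_i L(f_w(x_i), y_i) with squared loss."""
    return np.mean((X @ w - y) ** 2)

print(empirical_risk(np.zeros(3), X, y))      # risk of the all-zero predictor
print(empirical_risk(w_true, X, y))           # close to the irreducible noise level
```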

Optimization landscape: convex vs non-convex

  • Convex: any local minimum is global. (E.g. MSE for linear regression.)
  • Non-convex: local minima, saddle points, plateaus. (Anything with a hidden layer.)
  • Practical message: convex problems have one correct answer; non-convex problems have one we can find.

Gradient descent

  • Idea: to minimize, step in the steepest descent direction.
  • First-order Taylor: \(f(\mathbf{w} - \eta \nabla f(\mathbf{w})) \approx f(\mathbf{w}) - \eta \|\nabla f(\mathbf{w})\|^2 < f(\mathbf{w})\) for small \(\eta > 0\).
  • Update: \(\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla f(\mathbf{w}_t)\).
  • Learning rate \(\eta\):
    • too small → slow;
    • too large → overshoots, diverges.
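
A bare-bones gradient-descent loop on the empirical MSE risk (a sketch; the data, learning rate, and step count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -1.0]) + 0.1 * rng.normal(size=200)

def grad_mse(w, X, y):
    """Gradient of (1/N) * sum_i (x_i^T w - y_i)^2 with respect to w."""
    return 2.0 / len(y) * X.T @ (X @ w - y)

eta = 0.1                                   # too small -> slow; too large -> diverges
w = np.zeros(2)
for t in range(100):
    w = w - eta * grad_mse(w, X, y)         # w_{t+1} = w_t - eta * grad f(w_t)

print(w)                                    # close to [3, -1]
```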

Stochastic gradient descent (SGD)

  • Cost of full GD: \(\nabla \hat R = \frac{1}{N}\sum_i \nabla L_i\) requires \(\mathcal{O}(N)\) work per step.
  • Stochastic estimator: pick one \(i\) uniformly at random; use \(\nabla L_i\) as the gradient.
  • Unbiased: \(\mathbb{E}_i[\nabla L_i] = \nabla \hat R\) — in expectation we’re still doing GD.
  • Behaviour: noisy steps, fast initial progress, eventually bounces near the minimum.

Minibatch SGD

  • Compromise: average over a minibatch of size \(b\) (typically 32–256).
  • Update: \(\displaystyle \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta}{b}\sum_{i \in \mathcal{B}_t} \nabla L_i(\mathbf{w}_t)\).
  • Why \(b\) matters:
    1. Variance reduction — gradient estimate concentrates around the true gradient.
    2. Vectorization — modern GPUs do \(b\) samples in parallel almost for free.
  • The default for every modern deep-learning training loop.
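
A minibatch-SGD sketch for the same kind of objective (batch size, learning rate, and step count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, b = 10_000, 5, 64                       # dataset size, dimension, minibatch size
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)

w, eta = np.zeros(d), 0.05
for t in range(2_000):
    idx = rng.integers(0, N, size=b)          # draw minibatch B_t uniformly at random
    Xb, yb = X[idx], y[idx]
    w -= eta * (2.0 / b) * Xb.T @ (Xb @ w - yb)   # unbiased estimate of the full gradient

print(np.linalg.norm(w - w_true))             # small, but the iterate bounces near the minimum
```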

Beyond vanilla SGD — see Unit 6

  • Momentum, Nesterov, RMSProp, AdaGrad, Adam, AdamW — all first-order, all standard.
  • Conditioning, saddle points, plateaus, mode connectivity — landscape pathologies.
  • Unit 6 owns this with its own deep-dive interactives.
  • For the rest of this unit we ask a different question: what does second-order information buy us?

When first-order is slow: the Hessian

  • Hessian \(\mathbf{H} = \nabla^2 f\) — the matrix of second derivatives. It encodes local curvature.
  • Eigenvalues of \(\mathbf{H}\) are the principal curvatures. Their ratio is the condition number \(\kappa = \lambda_{\max}/\lambda_{\min}\).
  • GD’s pain: in an elongated bowl (\(\kappa \gg 1\)), GD oscillates across the steep direction while crawling along the shallow one. Convergence rate scales like \(\frac{\kappa - 1}{\kappa + 1}\).
  • Unit 6 owns the full conditioning treatment; here we use it only as motivation for Newton.

Newton’s method

  • Second-order Taylor: \(f(\mathbf{w} + \Delta) \approx f(\mathbf{w}) + \nabla f^T \Delta + \tfrac{1}{2} \Delta^T \mathbf{H}\, \Delta\).
  • Minimize the quadratic in \(\Delta\): \(\Delta = -\mathbf{H}^{-1} \nabla f\).
  • Update: \(\boxed{\;\mathbf{w}_{t+1} = \mathbf{w}_t - \mathbf{H}^{-1} \nabla f(\mathbf{w}_t)\;}\)
  • Key property: if \(f\) is a quadratic, the update lands on the minimum in a single step — independent of the starting point and of \(\kappa\).
  • For non-quadratic \(f\): Newton converges quadratically near the optimum (error squares each iteration).

Interactive: Newton vs Gradient Descent

\(f(x,y) = a\,x^2 + b\,y^2\) on a 2D ill-conditioned bowl (\(b = 1\) fixed).

  • GD update: \(\mathbf{w} \leftarrow \mathbf{w} - \eta\,\nabla f\).
  • Newton update: \(\mathbf{w} \leftarrow \mathbf{w} - \mathbf{H}^{-1} \nabla f\) — for this quadratic, \(\mathbf{H} = \mathrm{diag}(2a, 2b)\).
  • Watch Newton land on the origin in one step regardless of \(a\), while GD oscillates as \(a\) grows.
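
A numerical version of this comparison (a sketch; \(a\), \(\eta\), and the step count are illustrative):

```python
import numpy as np

a, b = 50.0, 1.0                              # f(x, y) = a*x^2 + b*y^2, kappa = a/b = 50
grad = lambda w: np.array([2 * a * w[0], 2 * b * w[1]])
H = np.diag([2 * a, 2 * b])                   # Hessian of the quadratic

w0 = np.array([1.0, 1.0])

# Newton: one step lands exactly on the minimum (the origin), for any a.
print(w0 - np.linalg.solve(H, grad(w0)))      # [0. 0.]

# GD: stability requires eta < 1/a, so the steep direction dictates a tiny step
# and progress along the shallow direction is slow.
eta, w = 0.45 / a, w0.copy()
for _ in range(100):
    w = w - eta * grad(w)
print(w)                                      # x ~ 0, but y has only decayed to ~0.16
```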

Newton’s catch

  • Memory: \(\mathbf{H}\) is \(D \times D\), i.e. \(\mathcal{O}(D^2)\) to store.
  • Compute: inverting (or solving with) \(\mathbf{H}\) is \(\mathcal{O}(D^3)\).
  • For a deep network with \(D = 10^9\) parameters, \(\mathbf{H}\) has \(10^{18}\) entries. Not happening.
  • For medium-dimensional convex problems (\(D \lesssim 10^4\)), Newton is a realistic and excellent choice.

Quasi-Newton: BFGS and L-BFGS

  • Idea: don’t form \(\mathbf{H}\). Approximate \(\mathbf{H}^{-1}\) from a sequence of gradient differences.
  • BFGS update: maintain \(\mathbf{B}_t \approx \mathbf{H}^{-1}\), update it rank-2 each step using \(\mathbf{s}_t = \mathbf{w}_{t+1} - \mathbf{w}_t\) and \(\mathbf{y}_t = \nabla f_{t+1} - \nabla f_t\).
  • L-BFGS (limited-memory BFGS): keep only the last \(m\) pairs \((\mathbf{s}_k, \mathbf{y}_k)\) — memory drops from \(\mathcal{O}(D^2)\) to \(\mathcal{O}(mD)\).
  • Workhorse for medium-dimensional convex problems and for fine-tuning small models: scipy.optimize.minimize(method='L-BFGS-B'), torch.optim.LBFGS (usage sketch below).
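
A minimal L-BFGS usage sketch with scipy, on a ridge-regression objective (the data and \(\lambda\) are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
Phi = rng.normal(size=(500, 50))                  # design matrix: N = 500, M = 50
y = Phi @ rng.normal(size=50) + 0.1 * rng.normal(size=500)
lam = 1e-2

def objective(w):
    """Ridge objective; returning (value, gradient) lets L-BFGS reuse both."""
    r = Phi @ w - y
    return 0.5 * r @ r + 0.5 * lam * w @ w, Phi.T @ r + lam * w

res = minimize(objective, x0=np.zeros(50), method="L-BFGS-B", jac=True)
print(res.success, res.nit)                       # convergence flag, iteration count
```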

Newton on a GLM = IRLS — preview

  • For generalized linear models (next: §7), Newton’s method has a remarkably clean form.
  • The Hessian factors cleanly through the design matrix — each Newton step becomes a weighted least-squares solve.
  • That’s the Iteratively Reweighted Least Squares (IRLS) algorithm. We’ll close the loop with this in §7.

Checkpoint: optimization

  1. A learning rate \(\eta\) is “too large.” What’s the symptom you’d see in a GD trajectory?
      a. Slow, monotone decrease toward the minimum.
      b. The trajectory oscillates and may diverge.
      c. Stuck at a saddle point.
      d. Newton’s method takes over.
  2. Why does Newton’s method converge in one step on a quadratic, regardless of conditioning?
      a. Because the Hessian is the identity for quadratics.
      b. Because the second-order Taylor approximation is the function for quadratics, so we land on the exact minimum of the local model.
      c. Because the gradient is zero at the minimum.
      d. It doesn’t — that’s a myth.
  3. We don’t use plain Newton’s method to train deep networks. Why not?
      a. Newton diverges on non-convex losses.
      b. Storing and inverting a \(D\times D\) Hessian for \(D = 10^9\) parameters is infeasible (\(\mathcal{O}(D^2)\) memory, \(\mathcal{O}(D^3)\) inversion).
      c. Adam is provably faster.
      d. Newton’s method requires labeled data.

Loss as decision proxy

  • A loss \(L(\hat y, y)\) is a quantitative penalty for being wrong — not a fundamental quantity. We design it.
  • Different application contexts → different penalties:
    • Should errors grow quadratically (smooth tails) or linearly (robust to outliers)?
    • Are false positives and false negatives equally costly?
    • Do we need calibrated probabilities, or just labels?
  • The loss encodes the engineering question you’re asking.

Mean squared error (MSE)

  • \(L_{\text{MSE}}(\hat y, y) = (\hat y - y)^2\).
  • Geometry: smooth convex bowl — gradient descent’s ideal landscape.
  • Probabilistic identity: minimizing MSE = MLE assuming iid Gaussian residuals \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\).
  • See ML-PC Unit 2 §26 for the full Gaussian → MSE derivation; see Unit 7 for the formal MLE machinery.

Mean absolute error (MAE)

  • \(L_{\text{MAE}}(\hat y, y) = |\hat y - y|\).
  • Linear penalty → much less sensitive to outliers than MSE.
  • Probabilistic identity: MLE under Laplacian residuals.
  • Optimization caveat: non-differentiable at zero. Use sub-gradient methods, or smooth via Huber.

Huber loss

  • Piecewise definition with parameter \(\delta\): \[L_\delta(\hat y, y) = \begin{cases} \tfrac{1}{2}(\hat y - y)^2 & |\hat y - y| \le \delta \\ \delta(|\hat y - y| - \tfrac{1}{2}\delta) & |\hat y - y| > \delta \end{cases}\]
  • Quadratic in the small-error regime (smooth optimization), linear in the tails (outlier-robust).
  • Standard tool for industrial / engineering data where most points are clean but occasional glitches occur (Neuer et al. 2024).
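
The three penalties side by side (a sketch over raw residuals \(r = \hat y - y\)):

```python
import numpy as np

def mse(r):
    return r ** 2

def mae(r):
    return np.abs(r)

def huber(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond, matching the piecewise definition."""
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

residuals = np.array([-0.5, 0.1, 0.3, 8.0])       # last entry is an outlier
for name, loss in [("MSE", mse), ("MAE", mae), ("Huber", huber)]:
    print(name, loss(residuals).sum())            # the outlier dominates MSE only
```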

Interactive: Drag-the-outlier (MSE / MAE / Huber)

We have a dataset with a few points. Try dragging the red “outlier” point up and down.

Notice how:

  • MSE (Blue) is pulled strongly by the outlier to minimize the huge quadratic penalty.
  • MAE (Green) ignores the outlier completely, picking the median line.
  • Huber (Orange) compromises based on \(\delta\).

Beyond Gaussian: heteroscedastic and count data

  • Heteroscedastic noise (variance varies with signal): predict \(\sigma^2\) alongside \(\hat y\); loss becomes \(\frac{(y - \hat y)^2}{2\hat\sigma^2} + \tfrac{1}{2}\log\hat\sigma^2\).
  • Poisson / count data (low-dose imaging, photon counting): use Poisson NLL \(L = \hat\mu - y\log\hat\mu\), not MSE. See ML-PC Unit 2 §27.
  • Heavy-tailed noise: use Student-\(t\) or log-cosh.
  • The principle: loss = NLL of the noise model. We’ll formalize this in §7.

The 0–1 loss problem

  • For binary classification with \(y \in \{0, 1\}\), the natural loss is \(L = \mathbf{1}\{\hat y \ne y\}\).
  • The bug: non-differentiable, gradient is zero almost everywhere — gradient descent has nothing to follow.
  • The fix: use a surrogate loss — a smooth differentiable function that bounds the 0–1 loss from above.
  • Surrogates also let us output calibrated probabilities, not just labels.

Cross-entropy = Bernoulli / Categorical NLL

  • Binary: model \(p(y=1\mid\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x})\) with sigmoid \(\sigma\).
  • Cross-entropy loss: \(L = -[y \log p + (1-y)\log(1-p)]\).
  • Identity: this is the negative log-likelihood of a Bernoulli\((p)\) on \(y\).
  • Multi-class: softmax + categorical cross-entropy = NLL of a Categorical distribution.
  • Behaviour: confident-but-wrong predictions are punished hugely (NLL → ∞). See Unit 7 for the full MLE framework.
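
A small sketch of the binary cross-entropy, showing how a confident-but-wrong prediction blows up (the logits are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(p, y, eps=1e-12):
    """Bernoulli NLL: -[y log p + (1 - y) log(1 - p)], clipped for numerical safety."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(binary_cross_entropy(sigmoid(0.5), 1))      # mildly confident and right: ~0.47
print(binary_cross_entropy(sigmoid(-8.0), 1))     # confidently wrong: ~8.0
```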

Interactive: Cross-entropy & decision boundary

Adjust the logistic regression model’s weights to classify the blue (\(y=1\)) and red (\(y=0\)) points.

The background color shows predicted probability \(P(y=1|\mathbf{x}) = \sigma(w_1 x_1 + w_2 x_2 + b)\). Notice how strongly the loss explodes if you push a red point deep into the “confident blue” region!

Margin-based view: hinge & calibration

  • Hinge loss (\(y \in \{-1,+1\}\)): \(L = \max(0, 1 - y\hat y)\). Goal: correct classification with a margin, not just correct labels — the SVM principle.
  • Once correctly classified beyond the margin, the gradient is zero — no further “improvement” is needed. Crisp decision-boundary geometry, but no probabilities.
  • Proper scoring rules (Brier, log-loss): the loss is minimized in expectation by the true probability. Cross-entropy is proper; misclassification rate is not.
  • Choose the loss to match what you need — labels (hinge) or calibrated probabilities (cross-entropy).

The linearity principle

  • A “linear model” is linear in the parameters \(\mathbf{w}\), not in the inputs \(\mathbf{x}\).
  • Replace raw \(\mathbf{x}\) with a feature vector \(\boldsymbol\phi(\mathbf{x})\). The predictor becomes \[f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T \boldsymbol\phi(\mathbf{x}).\]
  • The model is now non-linear in \(\mathbf{x}\) but all the linear-regression machinery still applies to \(\mathbf{w}\).
  • This is one of the most useful conceptual moves in ML.

Formal basis-function expansion

  • Map \(\boldsymbol\phi: \mathbb{R}^d \to \mathbb{R}^M\), \(\boldsymbol\phi(\mathbf{x}) = (\phi_1(\mathbf{x}), \ldots, \phi_M(\mathbf{x}))^T\).
  • Design matrix: \(\boldsymbol\Phi \in \mathbb{R}^{N \times M}\) with \(\boldsymbol\Phi_{ij} = \phi_j(\mathbf{x}_i)\).
  • Normal equations (from Unit 2, applied to \(\boldsymbol\Phi\)): \[\hat{\mathbf{w}} = (\boldsymbol\Phi^T \boldsymbol\Phi)^{-1} \boldsymbol\Phi^T \mathbf{y}.\]
  • The same OLS / Ridge / Lasso closed forms — they just live in feature space now.
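
A sketch of the normal equations in feature space, with a polynomial feature map as the running example (the degree and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=40))
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=40)

M = 5
Phi = np.vander(x, M + 1, increasing=True)        # Phi[i, j] = phi_j(x_i) = x_i**j

# Literal normal equations w = (Phi^T Phi)^{-1} Phi^T y.
# Fine at low degree; prefer np.linalg.lstsq / QR when Phi^T Phi is ill-conditioned.
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(w_hat)                                      # M + 1 = 6 coefficients
```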

Polynomial basis

  • Univariate: \(\phi_j(x) = x^{j}\) for \(j = 0, 1, \ldots, M\).
  • The design matrix is the Vandermonde matrix \(\boldsymbol\Phi_{ij} = x_i^{j}\).
  • Recovers polynomial regression. Nice and familiar — but watch out:

Runge’s phenomenon

  • High-degree polynomials interpolated at evenly spaced points oscillate wildly near the boundary.
  • Classic example: fit \(f(x) = 1/(1 + 25 x^2)\) on \([-1, 1]\) with a degree-15 polynomial → the boundary error grows with the degree instead of shrinking (checked numerically in the sketch after this list).
  • Lesson: higher complexity ≠ better fit. Global polynomials are a poor basis when you need flexibility and boundedness.
  • We’ll see two fixes: localize the basis (RBFs, splines) or constrain the weights (regularization, §6).
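
A numerical check of the phenomenon (a sketch; the degrees and equispaced node counts are illustrative):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + 25.0 * x ** 2)         # Runge's function

x_dense = np.linspace(-1, 1, 1001)
for deg in (5, 10, 15):
    x_nodes = np.linspace(-1, 1, deg + 1)         # equispaced interpolation nodes
    coeffs = np.polyfit(x_nodes, f(x_nodes), deg) # exact interpolant of degree `deg`
    max_err = np.max(np.abs(np.polyval(coeffs, x_dense) - f(x_dense)))
    print(deg, max_err)                           # max (boundary) error grows with degree
```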

Interactive: Runge’s phenomenon — polynomial fitting

Let’s fit a polynomial to noisy training data to see the Bias-Variance tradeoff live.

  • Degree 1-2: Underfitting (High Bias). The model is too simple to capture the intrinsic curve.
  • Degree 3-4: The “Sweet Spot”. Fits the true curve well.
  • Degree 9-10: Overfitting (High Variance). The model memorizes the noise, leading to wild oscillations between data points. Test error explodes!

Radial basis functions (RBFs)

  • Pick centers \(\boldsymbol\mu_1, \ldots, \boldsymbol\mu_M\) and a bandwidth \(\sigma\).
  • Each basis function is localized: \[\phi_k(\mathbf{x}) = \exp\!\left(-\tfrac{\|\mathbf{x} - \boldsymbol\mu_k\|^2}{2\sigma^2}\right).\]
  • \(\sigma\) controls width: small \(\sigma\) → sharp local “bumps”, large \(\sigma\) → smooth global features.
  • Center placement matters: equispaced, \(k\)-means on data, or every data point is a center (recovers a kernel method — see Unit 2).

Splines

  • Idea: stitch low-degree polynomials together at fixed knots with continuity constraints.
  • Cubic spline: piecewise degree-3 polynomials, continuous in value, 1st, and 2nd derivative — visually indistinguishable from a smooth curve.
  • B-spline basis: an explicit set of \(M\) basis functions that span the same space; each B-spline has local support, so the design matrix is sparse and conditioning is excellent.
  • Why splines fix Runge: the basis is local. Increasing complexity adds knots, not global oscillation modes.

Interactive: Basis function explorer

Same data, three bases. Notice that all three fits use the same normal equations \(\hat{\mathbf{w}} = (\boldsymbol\Phi^T\boldsymbol\Phi)^{-1}\boldsymbol\Phi^T\mathbf{y}\) — only \(\boldsymbol\Phi\) changes.

  • Polynomial → global, prone to Runge.
  • RBF → localized; bandwidth \(\sigma\) trades smoothness vs flexibility.
  • B-spline → piecewise polynomial with local support; clean conditioning.
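
A sketch of exactly that point: two different feature maps, one solver (B-splines omitted for brevity; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=50))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=50)

def phi_poly(x, M=9):
    return np.vander(x, M + 1, increasing=True)             # global polynomial basis

def phi_rbf(x, centers=np.linspace(-1, 1, 10), sigma=0.2):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))  # local bumps

def fit(Phi, y):
    """Same least-squares solve for every basis; only Phi changes."""
    return np.linalg.lstsq(Phi, y, rcond=None)[0]

w_poly, w_rbf = fit(phi_poly(x), y), fit(phi_rbf(x), y)
print(w_poly.shape, w_rbf.shape)                             # (10,) and (10,)
```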

The bias-variance picture (one-frame summary)

  • High bias (underfit): basis is too rigid — wrong systematic shape.
  • High variance (overfit): basis is too flexible — fits noise.
  • Total expected error = \(\text{Bias}^2 + \text{Variance} + \sigma^2_{\text{irreducible}}\).
  • Sweet spot depends on \(N\): more data lets you afford more flexibility.
  • Full decomposition + math: Unit 8.

graph TD
    Complexity[Model Complexity →]
    Total[Total Error]
    Bias["Bias²"]
    Var[Variance]
    Bias --> Total
    Var --> Total
    Complexity --> Bias
    Complexity --> Var
    style Total fill:#f96,stroke:#333

Connection to kernels

  • If \(M\) is huge (or infinite), forming \(\boldsymbol\Phi^T\boldsymbol\Phi\) is impossible.
  • Kernel trick: any algorithm that depends only on \(\langle\boldsymbol\phi(\mathbf{x}_i), \boldsymbol\phi(\mathbf{x}_j)\rangle\) can compute that inner product without ever materializing \(\boldsymbol\phi\).
  • \(k(\mathbf{x},\mathbf{x}') = \exp(-\|\mathbf{x}-\mathbf{x}'\|^2/(2\sigma^2))\) → infinite-dim feature space, finite computation.
  • See Unit 2 “Kernel hint from inner products” — full treatment in advanced courses (Gaussian processes, SVMs).

Bridge: complex models need a constraint

  • We can make our hypothesis class arbitrarily flexible by stacking basis functions.
  • With finite data, more flexibility = more variance = worse generalization (without help).
  • The fix: constrain the parameters. Penalize complexity.
  • That’s regularization — and it has a beautifully clean Bayesian interpretation.

Checkpoint: basis functions

  1. We say RBF regression is a “linear model.” In what sense?
      a. The map \(\mathbf{x} \mapsto \hat y\) is a straight line.
      b. The model is linear in the parameters \(\mathbf{w}\), even though it’s nonlinear in \(\mathbf{x}\).
      c. Both (a) and (b).
      d. Neither — RBF is a fundamentally nonlinear model.
  2. Why does Runge’s phenomenon happen?
      a. Because we’re using the wrong loss.
      b. Because high-degree polynomials are global — local data changes affect the whole curve, and equispaced interpolation forces large oscillations near the boundary.
      c. Because polynomial regression overfits noise.
      d. Because the design matrix is singular at high degree.
  3. Going from a polynomial basis to a B-spline basis with the same \(M\):
      a. Changes the model class entirely; OLS no longer applies.
      b. Changes only the design matrix \(\boldsymbol\Phi\); the normal equations are identical.
      c. Forces us to use gradient descent.
      d. Adds regularization automatically.

Recap from Unit 2: Ridge & Lasso

  • Ridge (\(\ell_2\) penalty \(\lambda\|\mathbf{w}\|_2^2\)): \[\hat{\mathbf{w}}_{\text{ridge}} = (\boldsymbol\Phi^T\boldsymbol\Phi + \lambda\mathbf{I})^{-1}\boldsymbol\Phi^T\mathbf{y}.\] Spectral effect: shifts every eigenvalue by \(+\lambda\).
  • Lasso (\(\ell_1\) penalty \(\lambda\|\mathbf{w}\|_1\)): no closed form, but the constraint region has corners → exact sparsity.
  • Constraint geometry (sphere vs diamond) — see Unit 2 for the full picture.

We do not rederive these here — Unit 2 owns the derivations.

The MAP interpretation: every regularizer is a prior

  • MAP estimate: \(\hat{\mathbf{w}} = \arg\max_{\mathbf{w}}\left[\log p(\mathbf{y}\mid\mathbf{w}) + \log p(\mathbf{w})\right]\).
  • The first term is the log-likelihood (the negative of the loss); the second term is the log-prior.
  • Gaussian prior \(\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})\) → \(\log p(\mathbf{w}) \propto -\tfrac{1}{2\tau^2}\|\mathbf{w}\|_2^2\) → Ridge.
  • Laplace prior \(\mathbf{w}_j \sim \text{Laplace}(0, b)\) → \(\log p(\mathbf{w}) \propto -\tfrac{1}{b}\|\mathbf{w}\|_1\) → Lasso.
  • Punchline: every loss is an NLL; every regularizer is a prior. See ML-PC Unit 2 §28 for the full ML↔︎Bayes table; Unit 7 for the formal posterior treatment.
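
Spelling out the Gaussian case (a short derivation; \(\sigma^2\) is the noise variance of the Gaussian likelihood):

\[
-\log p(\mathbf{w}\mid\mathbf{y}) = \frac{1}{2\sigma^2}\|\mathbf{y}-\boldsymbol\Phi\mathbf{w}\|_2^2 + \frac{1}{2\tau^2}\|\mathbf{w}\|_2^2 + \text{const}
\quad\Longrightarrow\quad
\hat{\mathbf{w}}_{\text{MAP}} = \hat{\mathbf{w}}_{\text{ridge}} \;\text{ with }\; \lambda = \frac{\sigma^2}{\tau^2}.
\]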

What lives elsewhere (and why)

  • Dropout, batch-norm, label smoothing, focal loss, early stopping, data augmentation → deep-learning units. Implementation tricks, not foundations.
  • Full bias-variance decomposition → Unit 8.
  • Full Bayesian (posterior, predictive, evidence) → Unit 7.
  • First-order optimizer zoo (Adam, RMSProp, Nesterov) → Unit 6.
  • Physical noise → loss derivations (Gaussian, Poisson, Weibull) → ML-PC Unit 2.
  • This unit’s contribution: the abstraction that ties them together — coming next.

Exponential family: canonical form

  • A distribution belongs to the exponential family if its density has the form \[p(y\mid\eta) = h(y)\,\exp\!\left(\eta^T T(y) - A(\eta)\right).\]
  • \(\eta\): the natural parameter (the parameter we’ll learn).
  • \(T(y)\): the sufficient statistic of the data.
  • \(A(\eta)\): the log-partition / cumulant function (normalizer).
  • \(h(y)\): the base measure (does not depend on \(\eta\)).

Three examples in canonical form

| Distribution | \(T(y)\) | \(A(\eta)\) | \(\eta\) in terms of the usual parameter |
|---|---|---|---|
| Gaussian (known \(\sigma^2\); shown for \(\sigma^2 = 1\)) | \(y\) | \(\eta^2/2\) | \(\eta = \mu\) |
| Bernoulli | \(y\) | \(\log(1 + e^{\eta})\) | \(\eta = \log\frac{p}{1-p}\) (logit) |
| Poisson | \(y\) | \(e^{\eta}\) | \(\eta = \log\mu\) |

  • Verify: plug each into \(p(y\mid\eta) = h(y)\exp(\eta\, T(y) - A(\eta))\) and you recover the original density.
  • \(h(y)\) for Poisson is \(1/y!\); for Bernoulli, \(h(y) = 1\); for the Gaussian (taking \(\sigma^2 = 1\)), \(h(y) = \frac{1}{\sqrt{2\pi}}\exp(-y^2/2)\).

The unification table

| Distribution | Canonical link | \(\mu(\eta)\) | NLL → recovered loss |
|---|---|---|---|
| Gaussian | identity | \(\mu = \eta\) | \(\tfrac{1}{2}(y - \mu)^2\) → MSE |
| Bernoulli | logit | \(\mu = \sigma(\eta) = \frac{1}{1+e^{-\eta}}\) | \(-y\log\mu - (1-y)\log(1-\mu)\) → cross-entropy |
| Poisson | log | \(\mu = e^{\eta}\) | \(\mu - y\log\mu\) → Poisson NLL |

MSE, cross-entropy, and Poisson NLL are not three losses — they are one loss applied to three different distributions. Forward-pointers: ML-PC Unit 2 §26 (Gaussian → MSE), §27 (Poisson → Poisson NLL), §28 (full ML↔︎Bayes table).

IRLS = Newton’s method on a GLM

  • For a GLM with canonical link, the negative log-likelihood \(\ell\) of \(N\) iid samples has gradient \[\nabla_{\mathbf{w}} \ell = \boldsymbol\Phi^T (\boldsymbol\mu - \mathbf{y})\] and Hessian \[\mathbf{H} = \boldsymbol\Phi^T \mathbf{W} \boldsymbol\Phi, \quad \mathbf{W} = \mathrm{diag}(A''(\eta_i)).\]
  • Newton step \(\mathbf{w}_{t+1} = \mathbf{w}_t - \mathbf{H}^{-1}\nabla\ell\) rearranges to \[\boxed{\;\mathbf{w}_{t+1} = (\boldsymbol\Phi^T \mathbf{W}_t \boldsymbol\Phi)^{-1} \boldsymbol\Phi^T \mathbf{W}_t\, \mathbf{z}_t\;}\] with working response \(\mathbf{z}_t = \boldsymbol\Phi\mathbf{w}_t + \mathbf{W}_t^{-1}(\mathbf{y} - \boldsymbol\mu_t)\).
  • This is just weighted least squares on \((\boldsymbol\Phi, \mathbf{z}_t)\) with weights \(\mathbf{W}_t\). Hence Iteratively Reweighted Least Squares.
  • Logistic regression’s classical solver is IRLS — and now we know why: it’s the Newton iteration on a Bernoulli GLM.
  • The loop closes: §2 introduced Newton’s method abstractly; §7 shows it’s the principled solver for any GLM.
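
A compact IRLS implementation for logistic regression (a sketch; it assumes the data are not perfectly separable, so the weights \(\mathbf{W}\) stay strictly positive):

```python
import numpy as np

def irls_logistic(Phi, y, n_iter=15):
    """Newton / IRLS for a Bernoulli GLM with the canonical (logit) link."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        eta = Phi @ w
        mu = 1.0 / (1.0 + np.exp(-eta))       # mu = sigma(eta)
        W = mu * (1.0 - mu)                   # A''(eta_i) for the Bernoulli family
        z = eta + (y - mu) / W                # working response z_t
        # Weighted least squares on (Phi, z) with weights W:
        w = np.linalg.solve(Phi.T @ (W[:, None] * Phi), Phi.T @ (W * z))
    return w

# Toy 1-D logistic regression with an intercept column.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(2.0 * x - 1.0))))
Phi = np.column_stack([np.ones_like(x), x])
print(irls_logistic(Phi, y))                  # roughly [-1, 2]
```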

Summary: loss → noise → optimizer

| Loss | Noise model / distribution | Canonical optimizer |
|---|---|---|
| MSE | Gaussian residuals | OLS closed form / GD on a quadratic / Newton (1 step) |
| Cross-entropy (binary) | Bernoulli | IRLS = Newton on the Bernoulli GLM |
| Cross-entropy (multi-class) | Categorical | IRLS / GD on the softmax NLL |
| Poisson NLL | Poisson counts | IRLS = Newton on the Poisson GLM |
| Huber | Heavy-tailed (between Gaussian & Laplacian) | GD / sub-gradient |
| MAE | Laplacian | Sub-gradient / quantile regression |
| Hinge | (No probabilistic interpretation — margin-based) | SGD / coordinate descent |

Three threads — loss, optimizer, model class — meet in the GLM framework.


Notebook companion + references

Week 3 notebook: Regression from Scratch — TensileTestDataset

Implements OLS, gradient descent, Newton’s method, and basis-function regression on a real materials dataset. References: Murphy, ch. 6 (linear regression); Bishop, ch. 3; McElreath, ch. 4–9.

Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.