Mathematical Foundations of AI & ML
Unit 5: Backpropagation and Gradient Flow
FAU Erlangen-Nürnberg
By the end of this lecture, students can:

- apply the chain rule to composite and multivariate functions,
- derive the backpropagation recursion for the weights and biases of a feedforward network,
- compare the computational cost of backpropagation with finite differences,
- explain vanishing and exploding gradients and common mitigations (ReLU-family activations, gradient clipping).
\[ \frac{dy}{dx} = f'(g(x)) \cdot g'(x) \]
\[ \frac{\partial J}{\partial x} = \sum_{i=1}^{k} \frac{\partial J}{\partial u_i} \frac{\partial u_i}{\partial x} \]
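As a concrete instance of the univariate rule applied to a single neuron, take \(y = \sigma(wx + b)\), so the outer function is \(\sigma\) and the inner function is \(g(x) = wx + b\):

\[ \frac{dy}{dx} = \sigma'(wx + b) \cdot w, \qquad \frac{dy}{dw} = \sigma'(wx + b) \cdot x. \]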
\[ J = \frac{1}{N} \sum_{i=1}^{N} L(\hat{y}_i, y_i) \]
\[ \delta_o = (\hat{y} - y) \sigma'(z_o) \]
\[ \frac{\partial J}{\partial w_{ok}} = \delta_o a_k^{(L-1)} \]
\[ \frac{\partial J}{\partial w_{kj}} = \delta_k a_j^{(\ell-1)} \]
where:
\[ \delta_k = \sigma'(z_k) \sum_m w_{mk}^{(\ell+1)} \delta_m^{(\ell+1)} \]
\[ \delta_i^{(\ell)} = \sigma'(z_i^{(\ell)}) \sum_j w_{ji}^{(\ell \to \ell+1)} \delta_j^{(\ell+1)} \]
\[ \frac{\partial J}{\partial b_k^{(\ell)}} = \delta_k^{(\ell)} \]
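The recursion above translates almost line for line into code. The following is a minimal sketch in plain JavaScript (not one of the interactive cells below), assuming a fully connected network with sigmoid hidden activations; the names `weights`, `zs`, `activations`, and `deltaOut` are illustrative and not taken from the lecture's reference implementation.

```js
// Minimal backward pass for a fully connected network with sigmoid activations.
// Conventions (illustrative): activations[0] is the input x, activations[l+1] = sigma(zs[l]),
// weights[l][k][j] connects unit j of layer l to unit k of layer l+1,
// deltaOut[k] = dJ/dz_k at the output layer.
const sigmaPrime = z => {
  const s = 1 / (1 + Math.exp(-z));
  return s * (1 - s);
};

function backward(weights, zs, activations, deltaOut) {
  const L = weights.length;            // number of weight layers
  const gradW = new Array(L);
  const gradB = new Array(L);
  let delta = deltaOut;                // delta of the topmost layer

  for (let l = L - 1; l >= 0; l--) {
    // dJ/dw_{kj} = delta_k * a_j   and   dJ/db_k = delta_k
    gradW[l] = delta.map(dk => activations[l].map(aj => dk * aj));
    gradB[l] = delta.slice();

    if (l > 0) {
      // delta_j^(l) = sigma'(z_j^(l)) * sum_k w_{kj}^(l+1) * delta_k^(l+1)
      delta = zs[l - 1].map((z, j) =>
        sigmaPrime(z) * delta.reduce((s, dk, k) => s + weights[l][k][j] * dk, 0)
      );
    }
  }
  return { gradW, gradB };
}
```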
graph TD
subgraph ForwardPass [Forward]
direction TB
F1[Input x] --> F2[Layer 1]
F2 --> F3[...]
F3 --> F4[Layer L]
end
ForwardPass --> Loss[Loss J]
Loss --> BP1[Output Delta δL]
subgraph BackwardPass [Backward]
direction TB
BP1 --> BP2[Layer L-1 δ]
BP2 --> BP3[...]
BP3 --> BP4[Layer 1 δ]
end
BackwardPass --> Grad[Accumulate Gradients]
Grad --> Update[Update Weights]
style ForwardPass fill:#e1f5fe,stroke:#01579b
style BackwardPass fill:#fff3e0,stroke:#e65100
//| echo: false
//| panel: input
viewof i_x1 = Inputs.range([-2, 2], {value: 1.0, step: 0.1, label: "x1"})
viewof i_x2 = Inputs.range([-2, 2], {value: -1.0, step: 0.1, label: "x2"})
viewof w_11 = Inputs.range([-2, 2], {value: 0.5, step: 0.1, label: "w11 (x1->h)"})
viewof w_21 = Inputs.range([-2, 2], {value: -0.5, step: 0.1, label: "w21 (x2->h)"})
viewof w_out = Inputs.range([-2, 2], {value: 1.0, step: 0.1, label: "w_out (h->y)"})
viewof showGrads = Inputs.toggle({label: "Show Gradients (Backward Pass)", value: false})
//| echo: false
netCalc = {
// Forward pass
const z_h = i_x1 * w_11 + i_x2 * w_21;
const a_h = Math.max(0, z_h); // ReLU
const z_y = a_h * w_out;
const a_y = z_y; // Linear output
const y_target = 1.0;
const loss = 0.5 * Math.pow(a_y - y_target, 2);
// Backward pass
const dL_dy = (a_y - y_target); // dL/da_y = dL/dz_y
const dL_dwout = dL_dy * a_h;
const dL_dah = dL_dy * w_out;
const dL_dzh = z_h > 0 ? dL_dah : 0; // ReLU derivative
const dL_dw11 = dL_dzh * i_x1;
const dL_dw21 = dL_dzh * i_x2;
return { z_h, a_h, z_y, a_y, loss, dL_dy, dL_dwout, dL_dah, dL_dzh, dL_dw11, dL_dw21 };
}
//| echo: false
html`
<div style="font-family: sans-serif; background: #222; padding: 20px; border-radius: 8px; color: #eee; text-align: center;">
<h3 style="margin-top: 0">Loss $J = \\frac{1}{2}(\\hat{y} - 1)^2 = ${netCalc.loss.toFixed(3)}$</h3>
<div style="display: flex; justify-content: space-around; align-items: center; margin-top: 30px;">
<!-- Input Layer -->
<div style="display: flex; flex-direction: column; gap: 40px;">
<div style="background: #4e79a7; padding: 15px; border-radius: 50%;">x1 = ${i_x1.toFixed(1)}</div>
<div style="background: #4e79a7; padding: 15px; border-radius: 50%;">x2 = ${i_x2.toFixed(1)}</div>
</div>
<!-- Weights 1 -->
<div style="display: flex; flex-direction: column; gap: 40px; font-size: 0.8em; color: #aaa;">
<div>w11 = ${w_11.toFixed(1)} <br> ${showGrads ? `<span style="color:#e15759">∇=${netCalc.dL_dw11.toFixed(2)}</span>` : ""}</div>
<div>w21 = ${w_21.toFixed(1)} <br> ${showGrads ? `<span style="color:#e15759">∇=${netCalc.dL_dw21.toFixed(2)}</span>` : ""}</div>
</div>
<!-- Hidden Layer -->
<div style="background: #f28e2b; padding: 15px; border-radius: 50%; color: #222; font-weight: bold;">
h <br>
<span style="font-size: 0.8em">z=${netCalc.z_h.toFixed(2)}</span><br>
<span style="font-size: 0.8em">a=${netCalc.a_h.toFixed(2)}</span><br>
${showGrads ? `<span style="color:#e15759; font-size: 0.8em">δ=${netCalc.dL_dzh.toFixed(2)}</span>` : ""}
</div>
<!-- Weights 2 -->
<div style="font-size: 0.8em; color: #aaa;">
w_out = ${w_out.toFixed(1)} <br> ${showGrads ? `<span style="color:#e15759">∇=${netCalc.dL_dwout.toFixed(2)}</span>` : ""}
</div>
<!-- Output Layer -->
<div style="background: #59a14f; padding: 15px; border-radius: 50%;">
y_hat = ${netCalc.a_y.toFixed(2)} <br>
${showGrads ? `<span style="color:#e15759; font-size: 0.8em">δ=${netCalc.dL_dy.toFixed(2)}</span>` : ""}
</div>
</div>
</div>
`
\[ \mathbf{W} \leftarrow \mathbf{W} - \eta \nabla_{\mathbf{W}} J \]
| Method | Forward passes | Backward passes | Total cost |
|---|---|---|---|
| Finite differences | \(W + 1\) | 0 | \(O(W^2)\) |
| Backpropagation | 1 | 1 | \(O(W)\) |
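To make the cost comparison tangible, here is a small, self-contained sketch of the finite-difference approach in plain JavaScript; `lossFn`, `params`, and the example loss are illustrative. Each parameter requires its own perturbed forward pass (two with central differences), which is where the \(O(W^2)\) total cost comes from; in practice this is used only to spot-check backpropagated gradients.

```js
// Finite-difference gradient: perturb one parameter at a time and rerun the forward pass.
// Central differences need 2W evaluations (forward differences need W + 1); either way the
// total cost scales quadratically in the number of parameters, unlike backpropagation.
function numericalGradient(lossFn, params, eps = 1e-5) {
  return params.map((p, i) => {
    const plus = params.slice();  plus[i]  = p + eps;
    const minus = params.slice(); minus[i] = p - eps;
    return (lossFn(plus) - lossFn(minus)) / (2 * eps);
  });
}

// Example: J(w) = 0.5 * (w1 * w2 - 1)^2 with analytic gradient [w2 * (w1*w2 - 1), w1 * (w1*w2 - 1)].
const J = ([w1, w2]) => 0.5 * (w1 * w2 - 1) ** 2;
console.log(numericalGradient(J, [0.5, -1.0]));   // ≈ [1.5, -0.75], matching the analytic values
```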
\[ \delta_k^{(L)} = \hat{y}_k - y_k \quad \text{(softmax + cross-entropy)} \]
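This compact form is worth deriving once: for a one-hot target \(y\), the softmax Jacobian and the \(-y_k/\hat{y}_k\) factor from the cross-entropy loss cancel almost entirely:

\[
L = -\sum_k y_k \log \hat{y}_k, \qquad \hat{y}_k = \frac{e^{z_k}}{\sum_j e^{z_j}}
\quad\Longrightarrow\quad
\frac{\partial L}{\partial z_k} = \sum_j \left(-\frac{y_j}{\hat{y}_j}\right)\hat{y}_j\left(\mathbf{1}[j = k] - \hat{y}_k\right) = \hat{y}_k \sum_j y_j - y_k = \hat{y}_k - y_k .
\]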
The complete training loop:
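As a self-contained sketch, the loop below trains the same toy network as the interactive demo above (two inputs, one ReLU hidden unit, linear output, squared-error loss) on a single example by plain gradient descent; the initial weights, learning rate, and step count are illustrative choices, not values from the lecture.

```js
// Forward pass, backward pass, parameter update, repeated. Same toy network as netCalc above.
let w11 = 0.3, w21 = -0.2, wOut = 0.8;            // illustrative initial weights
const x1 = 1.0, x2 = -1.0, yTarget = 1.0;          // one training example
const eta = 0.1;                                   // learning rate

for (let step = 0; step < 100; step++) {
  // 1. Forward pass
  const zH = x1 * w11 + x2 * w21;
  const aH = Math.max(0, zH);                      // ReLU
  const yHat = aH * wOut;                          // linear output
  const loss = 0.5 * (yHat - yTarget) ** 2;

  // 2. Backward pass (same deltas as derived above)
  const dL_dy    = yHat - yTarget;
  const dL_dwOut = dL_dy * aH;
  const dL_dzH   = zH > 0 ? dL_dy * wOut : 0;      // ReLU gate
  const dL_dw11  = dL_dzH * x1;
  const dL_dw21  = dL_dzH * x2;

  // 3. Gradient-descent update: w <- w - eta * dJ/dw
  wOut -= eta * dL_dwOut;
  w11  -= eta * dL_dw11;
  w21  -= eta * dL_dw21;

  if (step % 20 === 0) console.log(step, loss.toFixed(4));
}
```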
\[ \frac{\partial J}{\partial \mathbf{a}^{(\ell)}} = \left[ \prod_{m=\ell+1}^{L} \left(\mathbf{W}^{(m)}\right)^{\top} \text{diag}\!\left(\sigma'(\mathbf{z}^{(m)})\right) \right] \frac{\partial J}{\partial \mathbf{a}^{(L)}} \]
\[ \prod_{m=1}^{L} 0.25 = 0.25^L \to 0 \quad \text{as } L \to \infty \]
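The activation explorer below reads a selector named `actFunc` that is not defined in this excerpt; it was presumably an input cell. A minimal control consistent with the four cases handled in the cell might look like this (the label and default value are assumptions):

```{ojs}
//| echo: false
//| panel: input
viewof actFunc = Inputs.select(["Sigmoid", "Tanh", "ReLU", "Leaky ReLU"], {label: "Activation function", value: "Sigmoid"})
```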
//| echo: false
actFuncData = {
const xs = d3.range(-5, 5.1, 0.1);
return xs.map(x => {
let y, dy;
if (actFunc === "Sigmoid") {
y = 1 / (1 + Math.exp(-x));
dy = y * (1 - y);
} else if (actFunc === "Tanh") {
y = Math.tanh(x);
dy = 1 - y * y;
} else if (actFunc === "ReLU") {
y = Math.max(0, x);
dy = x > 0 ? 1 : 0;
} else { // Leaky ReLU
y = x > 0 ? x : 0.1 * x;
dy = x > 0 ? 1 : 0.1;
}
return { x, y, dy, act: actFunc };
});
}
//| echo: false
Plot.plot({
grid: true,
height: 500,
x: { domain: [-5, 5], label: "z (pre-activation)" },
y: { domain: [-1.5, 1.5], label: "Value" },
color: { legend: true, domain: ["Activation f(z)", "Derivative f'(z)"], range: ["#4e79a7", "#e15759"] },
marks: [
Plot.line(actFuncData, {x: "x", y: "y", stroke: () => "Activation f(z)", strokeWidth: 3}),
Plot.line(actFuncData, {x: "x", y: "dy", stroke: () => "Derivative f'(z)", strokeWidth: 3, strokeDasharray: "5,5"}),
Plot.ruleX([0], {strokeOpacity: 0.2}),
Plot.ruleY([0], {strokeOpacity: 0.2}),
...(actFunc === "Sigmoid" ? [
Plot.ruleY([0.25], {stroke: "#e15759", strokeDasharray: "2,2", strokeOpacity: 0.5}),
Plot.text([[-3.5, 0.3]], {text: () => "Max derivative = 0.25", fill: "#e15759"})
] : [])
]
})
\[ \prod_{m=1}^{L} \|\mathbf{W}^{(m)}\| \cdot \|\sigma'(\mathbf{z}^{(m)})\| \to \infty \]
\[ \nabla_{\theta} J \leftarrow \frac{\tau}{\|\nabla_{\theta} J\|} \nabla_{\theta} J \quad \text{if } \|\nabla_{\theta} J\| > \tau \]
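A direct transcription of this clipping rule, as a small sketch in plain JavaScript with an illustrative flat parameter vector:

```js
// Rescale the gradient so its global norm does not exceed the threshold tau.
function clipGradientNorm(grad, tau) {
  const norm = Math.sqrt(grad.reduce((s, g) => s + g * g, 0));
  return norm > tau ? grad.map(g => (tau / norm) * g) : grad;
}

console.log(clipGradientNorm([3, 4], 1));   // [0.6, 0.8]: norm rescaled from 5 down to 1
```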
\[ \sigma'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases} \]
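The zero branch is what allows units to "die": once \(z \leq 0\) for every input, no gradient flows through the unit at all. The leaky variant plotted in the explorer above keeps a small slope \(\alpha\) (0.1 in the demo) on the negative side:

\[ \text{LeakyReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha & \text{if } z \leq 0 \end{cases} \]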
Question: A 10-layer network uses sigmoid activations and weights initialized with \(|w^{(\ell)}_{jk}| \leq 1\). What happens to the gradient at layer 1 during the backward pass?
Answer: C — Since \(\sigma'(z) \leq 0.25\), the product of 10 such terms is at most \(0.25^{10} \approx 9.5 \times 10^{-7}\).
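The gradient-flow simulation below depends on `simAct` and `initScale`, whose input cells are not part of this excerpt. Controls consistent with how the two names are used might look like the following (ranges, defaults, and labels are assumptions):

```{ojs}
//| echo: false
//| panel: input
viewof simAct = Inputs.select(["Sigmoid", "Tanh", "ReLU"], {label: "Activation", value: "Sigmoid"})
viewof initScale = Inputs.range([0.1, 5], {value: 1.0, step: 0.1, label: "Weight init scale"})
```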
//| echo: false
simData = {
const L = 15;
let layerGrads = [];
let current_grad = 1.0;
// Approximate average derivative across the active region
let act_deriv_avg = 1.0;
if (simAct === "Sigmoid") act_deriv_avg = 0.15;
else if (simAct === "Tanh") act_deriv_avg = 0.5;
else if (simAct === "ReLU") act_deriv_avg = 0.6; // Accounts for ~50% dead neurons
// Effective multiplication factor per layer backward
let factor = initScale * act_deriv_avg;
for (let l = L; l >= 1; l--) {
let log_val = Math.max(-20, Math.min(20, Math.log10(current_grad + 1e-30)));
layerGrads.push({ layer: l, "Gradient Magnitude (log10)": log_val });
current_grad = current_grad * factor;
}
return layerGrads.reverse();
}
//| echo: false
Plot.plot({
grid: true,
height: 450,
x: { domain: [0, 16], label: "Layer (1 = Input, 15 = Output)", tickFormat: "d", ticks: 15 },
y: { domain: [-21, 21], label: "Gradient Norm (Log10 scale)" },
marks: [
Plot.ruleY([0], {stroke: "white", strokeDasharray: "3,3", strokeOpacity: 0.5}),
Plot.line(simData, {x: "layer", y: "Gradient Magnitude (log10)", stroke: "#e15759", strokeWidth: 3, marker: "circle"}),
Plot.text(simData.filter(d => d.layer === 1 || d.layer === 15), {
x: "layer",
y: d => d["Gradient Magnitude (log10)"] + (d["Gradient Magnitude (log10)"] > 0 ? 1.5 : -1.5),
text: d => `10^${Math.round(d["Gradient Magnitude (log10)"])}`,
fill: "white"
})
]
})
Verify your gradients against autograd or finite differences.
Week 5: Manual Backprop & Gradient Flow — DigitsDataset