graph LR
x(("x")) --> plus["+"]
y(("y")) --> plus
plus --> mult["*"]
z(("z")) --> mult
mult --> J(("J"))
style plus fill:#f9f,stroke:#333,stroke-width:2px
style mult fill:#f9f,stroke:#333,stroke-width:2px
FAU Erlangen-Nürnberg
By the end of this lecture, students can:
\[ \frac{dy}{dx} = f'(g(x)) \cdot g'(x) \]
\[ \frac{\partial J}{\partial x} = \sum_{i=1}^{k} \frac{\partial J}{\partial u_i} \frac{\partial u_i}{\partial x} \]
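The chain rule can be traced by hand on the small graph $J = (x + y)\,z$ shown above. A minimal sketch in plain JavaScript (function and variable names are mine, not from the lecture code), one forward and one backward pass:

```javascript
// Forward and backward pass through the graph J = (x + y) * z.
// Each backward step multiplies the upstream gradient by a local derivative.
function gradJ(x, y, z) {
  const plus = x + y;          // forward: "+" node
  const J = plus * z;          // forward: "*" node
  const dJ_dplus = z;          // backward through "*": d(plus*z)/d(plus) = z
  const dJ_dz = plus;          // backward through "*": d(plus*z)/dz = plus
  const dJ_dx = dJ_dplus * 1;  // backward through "+": local derivative 1
  const dJ_dy = dJ_dplus * 1;
  return { J, dJ_dx, dJ_dy, dJ_dz };
}
// Example: x = 2, y = 3, z = 4  →  J = 20, dJ/dx = 4, dJ/dz = 5
```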
\[ J = \frac{1}{N} \sum_{i=1}^{N} L(\hat{y}_i, y_i) \]
\[ \delta_o = (\hat{y} - y) \sigma'(z_o) \]
\[ \frac{\partial J}{\partial w_{ok}} = \delta_o a_k^{(L-1)} \]
\[ \frac{\partial J}{\partial w_{kj}} = \delta_k a_j^{(\ell-1)} \]
where:
\[ \delta_k = \sigma'(z_k) \sum_m w_{mk}^{(\ell+1)} \delta_m^{(\ell+1)} \]
\[ \delta_i^{(\ell)} = \sigma'(z_i^{(\ell)}) \sum_j w_{ji}^{(\ell \to \ell+1)} \delta_j^{(\ell+1)} \]
\[ \frac{\partial J}{\partial b_k^{(\ell)}} = \delta_k^{(\ell)} \]
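The delta recurrences above can be exercised end-to-end on a minimal 1-2-1 sigmoid network and checked against a central finite difference. Everything in this sketch (the weights, the names, the single training pair) is an illustrative assumption, not part of the lecture code:

```javascript
// 1-2-1 sigmoid network: delta_o = (yhat - y) sigma'(z_o),
// delta_k = sigma'(z_k) sum_m w_mk delta_m, dJ/dw = delta * upstream activation.
const sigmoid = z => 1 / (1 + Math.exp(-z));

const W1 = [0.5, -0.3];  // input -> hidden (2 units), made-up values
const W2 = [0.8, 0.2];   // hidden -> output
const x = 1.0, y = 1.0;  // single sample and target

function forwardBackward(W1, W2, x, y) {
  const z1 = W1.map(w => w * x);
  const a1 = z1.map(sigmoid);
  const z2 = W2[0] * a1[0] + W2[1] * a1[1];
  const yhat = sigmoid(z2);
  const J = 0.5 * (yhat - y) ** 2;
  const d_o = (yhat - y) * yhat * (1 - yhat);            // output delta
  const d_h = a1.map((a, k) => a * (1 - a) * W2[k] * d_o); // hidden deltas
  const gW2 = a1.map(a => d_o * a);                        // dJ/dW2
  const gW1 = d_h.map(d => d * x);                         // dJ/dW1
  return { J, gW1, gW2 };
}

// Central finite-difference check on W1[0]:
const eps = 1e-6;
const g = forwardBackward(W1, W2, x, y).gW1[0];
const Jp = forwardBackward([W1[0] + eps, W1[1]], W2, x, y).J;
const Jm = forwardBackward([W1[0] - eps, W1[1]], W2, x, y).J;
// (Jp - Jm) / (2 * eps) should agree with g to ~1e-8
```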
graph TD
subgraph ForwardPass [Forward]
direction TB
F1["Input x"] --> F2["Layer 1"]
F2 --> F3["..."]
F3 --> F4["Layer L"]
end
ForwardPass --> Loss["Loss J"]
Loss --> BP1["Output delta at L"]
subgraph BackwardPass [Backward]
direction TB
BP1 --> BP2["Layer L-1 delta"]
BP2 --> BP3["..."]
BP3 --> BP4["Layer 1 delta"]
end
BackwardPass --> Grad["Accumulate Gradients"]
Grad --> Update["Update Weights"]
style ForwardPass fill:#e1f5fe,stroke:#01579b
style BackwardPass fill:#fff3e0,stroke:#e65100
viewof i_x1 = Inputs.range([-2, 2], {value: 1.0, step: 0.1, label: "x1"})
viewof i_x2 = Inputs.range([-2, 2], {value: -1.0, step: 0.1, label: "x2"})
viewof w_11 = Inputs.range([-2, 2], {value: 0.5, step: 0.1, label: "w11 (x1->h)"})
viewof w_21 = Inputs.range([-2, 2], {value: -0.5, step: 0.1, label: "w21 (x2->h)"})
viewof w_out = Inputs.range([-2, 2], {value: 1.0, step: 0.1, label: "w_out (h->y)"})
viewof showGrads = Inputs.toggle({label: "Show Gradients (Backward Pass)", value: false})

netCalc = {
// Forward pass
const z_h = i_x1 * w_11 + i_x2 * w_21;
const a_h = Math.max(0, z_h); // ReLU
const z_y = a_h * w_out;
const a_y = z_y; // Linear output
const y_target = 1.0;
const loss = 0.5 * Math.pow(a_y - y_target, 2);
// Backward pass
const dL_dy = (a_y - y_target); // dL/da_y = dL/dz_y
const dL_dwout = dL_dy * a_h;
const dL_dah = dL_dy * w_out;
const dL_dzh = z_h > 0 ? dL_dah : 0; // ReLU derivative
const dL_dw11 = dL_dzh * i_x1;
const dL_dw21 = dL_dzh * i_x2;
return { z_h, a_h, z_y, a_y, loss, dL_dy, dL_dwout, dL_dah, dL_dzh, dL_dw11, dL_dw21 };
}

html`
<div style="font-family: sans-serif; background: #222; padding: 20px; border-radius: 8px; color: #eee; text-align: center;">
<h3 style="margin-top: 0">Loss $J = \\frac{1}{2}(\\hat{y} - 1)^2 = ${netCalc.loss.toFixed(3)}$</h3>
<div style="display: flex; justify-content: space-around; align-items: center; margin-top: 30px;">
<!-- Input Layer -->
<div style="display: flex; flex-direction: column; gap: 40px;">
<div style="background: #4e79a7; padding: 15px; border-radius: 50%;">$x_1 = ${i_x1.toFixed(1)}$</div>
<div style="background: #4e79a7; padding: 15px; border-radius: 50%;">$x_2 = ${i_x2.toFixed(1)}$</div>
</div>
<!-- Weights 1 -->
<div style="display: flex; flex-direction: column; gap: 40px; font-size: 0.8em; color: #aaa;">
<div>$w_{11} = ${w_11.toFixed(1)}$ <br> ${showGrads ? `<span style="color:#e15759">$\\nabla_{w_{11}} J = ${netCalc.dL_dw11.toFixed(2)}$</span>` : ""}</div>
<div>$w_{21} = ${w_21.toFixed(1)}$ <br> ${showGrads ? `<span style="color:#e15759">$\\nabla_{w_{21}} J = ${netCalc.dL_dw21.toFixed(2)}$</span>` : ""}</div>
</div>
<!-- Hidden Layer -->
<div style="background: #f28e2b; padding: 15px; border-radius: 50%; color: #222; font-weight: bold;">
$h$ <br>
<span style="font-size: 0.8em">$z_h = ${netCalc.z_h.toFixed(2)}$</span><br>
<span style="font-size: 0.8em">$a_h = ${netCalc.a_h.toFixed(2)}$</span><br>
${showGrads ? `<span style="color:#e15759; font-size: 0.8em">$\\delta_h = ${netCalc.dL_dzh.toFixed(2)}$</span>` : ""}
</div>
<!-- Weights 2 -->
<div style="font-size: 0.8em; color: #aaa;">
$w_{\\mathrm{out}} = ${w_out.toFixed(1)}$ <br> ${showGrads ? `<span style="color:#e15759">$\\nabla_{w_{\\mathrm{out}}} J = ${netCalc.dL_dwout.toFixed(2)}$</span>` : ""}
</div>
<!-- Output Layer -->
<div style="background: #59a14f; padding: 15px; border-radius: 50%;">
$\\hat{y} = ${netCalc.a_y.toFixed(2)}$ <br>
${showGrads ? `<span style="color:#e15759; font-size: 0.8em">$\\delta_y = ${netCalc.dL_dy.toFixed(2)}$</span>` : ""}
</div>
</div>
</div>
`

\[ \mathbf{W} \leftarrow \mathbf{W} - \eta \nabla_{\mathbf{W}} J \]
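A direct transcription of the update rule, assuming the weights and their gradients have already been flattened into arrays (the helper name is mine):

```javascript
// One gradient-descent step: W <- W - eta * grad(W), element-wise.
function sgdStep(weights, grads, eta) {
  return weights.map((w, i) => w - eta * grads[i]);
}
// e.g. sgdStep([1.0, -0.5], [0.2, -0.4], 0.1) → approximately [0.98, -0.46]
```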
| Method | Forward passes | Backward passes | Total cost |
|---|---|---|---|
| Finite differences | \(W + 1\) | 0 | \(O(W^2)\) |
| Backpropagation | 1 | 1 | \(O(W)\) |
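The $W + 1$ forward-pass count for finite differences can be made concrete by instrumenting a tiny gradient checker. This is an illustrative sketch using one-sided differences; the function name is an assumption:

```javascript
// One-sided finite differences over W parameters: one baseline evaluation
// plus one perturbed evaluation per parameter, so W + 1 total passes.
function finiteDiffGrad(f, w, eps = 1e-6) {
  let evals = 0;
  const g = (v) => { evals += 1; return f(v); };  // count every forward pass
  const base = g(w);
  const grad = w.map((wi, i) => {
    const wp = w.slice();
    wp[i] += eps;
    return (g(wp) - base) / eps;
  });
  return { grad, evals };  // evals === w.length + 1
}
```

Since each forward pass itself costs $O(W)$, the total is $O(W^2)$, matching the table.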
\[ \delta_k^{(L)} = \hat{y}_k - y_k \quad \text{(softmax + cross-entropy)} \]
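A one-line check of why the softmax/cross-entropy pairing yields this clean form: differentiating the cross-entropy loss through the softmax, the normalizer's derivative cancels the log's,

\[ J = -\sum_k y_k \log \hat{y}_k, \qquad \hat{y}_k = \frac{e^{z_k}}{\sum_j e^{z_j}} \;\;\Rightarrow\;\; \frac{\partial J}{\partial z_k} = \hat{y}_k - y_k. \]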
\[ \frac{\partial J}{\partial \mathbf{a}^{(\ell)}} = \left( \prod_{m=\ell}^{L-1} \text{diag}(\sigma'(\mathbf{z}^{(m)})) \cdot \mathbf{W}^{(m+1)} \right) \cdot \frac{\partial J}{\partial \mathbf{a}^{(L)}} \]
\[ \prod_{m=1}^{L} 0.25 = 0.25^L \to 0 \quad \text{as } L \to \infty \]
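The shrinkage can be computed directly (a two-line sketch, assuming every layer contributes exactly the sigmoid's maximum derivative of 0.25):

```javascript
// How fast repeated multiplication by sigma'_max = 0.25 kills the gradient.
for (const L of [5, 10, 20]) {
  console.log(L, Math.pow(0.25, L));
}
// 0.25^10 ≈ 9.5e-7: after ten sigmoid layers the signal is essentially gone.
```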
viewof actFunc = Inputs.select(["Sigmoid", "Tanh", "ReLU", "Leaky ReLU"], {label: "Activation function", value: "Sigmoid"}) // restored selector: the cell below branches on exactly these four options

actFuncData = {
const xs = d3.range(-5, 5.1, 0.1);
return xs.map(x => {
let y, dy;
if (actFunc === "Sigmoid") {
y = 1 / (1 + Math.exp(-x));
dy = y * (1 - y);
} else if (actFunc === "Tanh") {
y = Math.tanh(x);
dy = 1 - y * y;
} else if (actFunc === "ReLU") {
y = Math.max(0, x);
dy = x > 0 ? 1 : 0;
} else { // Leaky ReLU
y = x > 0 ? x : 0.1 * x;
dy = x > 0 ? 1 : 0.1;
}
return { x, y, dy, act: actFunc };
});
}

Plot.plot({
grid: true,
height: 500,
x: { domain: [-5, 5], label: "z (pre-activation)" },
y: { domain: [-1.5, 1.5], label: "Value" },
color: { legend: true, domain: ["Activation f(z)", "Derivative f'(z)"], range: ["#4e79a7", "#e15759"] },
marks: [
Plot.line(actFuncData, {x: "x", y: "y", stroke: () => "Activation f(z)", strokeWidth: 3}),
Plot.line(actFuncData, {x: "x", y: "dy", stroke: () => "Derivative f'(z)", strokeWidth: 3, strokeDasharray: "5,5"}),
Plot.ruleX([0], {strokeOpacity: 0.2}),
Plot.ruleY([0], {strokeOpacity: 0.2}),
...(actFunc === "Sigmoid" ? [
Plot.ruleY([0.25], {stroke: "#e15759", strokeDasharray: "2,2", strokeOpacity: 0.5}),
Plot.text([[-3.5, 0.3]], {text: () => "Max derivative = 0.25", fill: "#e15759"})
] : [])
]
})

\[ \prod_{m=1}^{L} \|\mathbf{W}^{(m)}\| \cdot \|\sigma'(\mathbf{z}^{(m)})\| \to \infty \]
\[ \nabla_{\theta} J \leftarrow \frac{\tau}{\|\nabla_{\theta} J\|} \nabla_{\theta} J \quad \text{if } \|\nabla_{\theta} J\| > \tau \]
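The clipping rule translates directly into code. A minimal sketch (the helper name is mine; the rule rescales to norm $\tau$ while preserving direction):

```javascript
// Clip-by-norm: rescale the gradient vector when its L2 norm exceeds tau.
function clipByNorm(grad, tau) {
  const norm = Math.sqrt(grad.reduce((s, g) => s + g * g, 0));
  if (norm <= tau) return grad;            // within budget: leave unchanged
  return grad.map(g => (tau / norm) * g);  // rescale so the new norm is tau
}
// clipByNorm([3, 4], 1) → a vector of norm 1 pointing the same way as [3, 4]
```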
\[ \sigma'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases} \]
Question: A 10-layer network uses sigmoid activations and weights initialized with \(\|\mathbf{W}^{(\ell)}\| \approx 1\). What happens to the gradient at layer 1 during the backward pass?
viewof simAct = Inputs.select(["Sigmoid", "Tanh", "ReLU"], {label: "Activation", value: "Sigmoid"}) // restored selector: options taken from the cell below
viewof initScale = Inputs.range([0.5, 5], {value: 1.0, step: 0.1, label: "Weight init scale"}) // restored control: the range is an assumption

simData = {
const L = 15;
let layerGrads = [];
let current_grad = 1.0;
// Approximate average derivative across the active region
let act_deriv_avg = 1.0;
if (simAct === "Sigmoid") act_deriv_avg = 0.15;
else if (simAct === "Tanh") act_deriv_avg = 0.5;
else if (simAct === "ReLU") act_deriv_avg = 0.6; // Accounts for ~50% dead neurons
// Effective multiplication factor per layer backward
let factor = initScale * act_deriv_avg;
for (let l = L; l >= 1; l--) {
let log_val = Math.max(-20, Math.min(20, Math.log10(current_grad + 1e-30)));
layerGrads.push({ layer: l, "Gradient Magnitude (log10)": log_val });
current_grad = current_grad * factor;
}
return layerGrads.reverse();
}

Plot.plot({
grid: true,
height: 450,
x: { domain: [0, 16], label: "Layer (1 = Input, 15 = Output)", tickFormat: "d", ticks: 15 },
y: { domain: [-21, 21], label: "Gradient Norm (Log10 scale)" },
marks: [
Plot.ruleY([0], {stroke: "white", strokeDasharray: "3,3", strokeOpacity: 0.5}),
Plot.line(simData, {x: "layer", y: "Gradient Magnitude (log10)", stroke: "#e15759", strokeWidth: 3, marker: "circle"}),
Plot.text(simData.filter(d => d.layer === 1 || d.layer === 15), {
x: "layer",
y: d => d["Gradient Magnitude (log10)"] + (d["Gradient Magnitude (log10)"] > 0 ? 1.5 : -1.5),
text: d => `10^${Math.round(d["Gradient Magnitude (log10)"])}`,
fill: "white"
})
]
})

Verify manual gradients against autograd or finite differences.

Required reading before Unit 6:

- Neuer: Ch. 4.5.4–4.5.5
- McClarren: Ch. 5.2–5.3.2

Optional depth:

- Bishop: Ch. 5.3 (error backpropagation)
- Goodfellow et al.: Ch. 6.5 (computational graphs, backpropagation)

Next unit:

- Loss Landscapes and Optimization Behavior
- What does the surface we are descending on actually look like?
Week 5: Manual Backprop & Gradient Flow — DigitsDataset

© Philipp Pelz - Mathematical Foundations of AI & ML