FAU Erlangen-Nürnberg
Institute of Micro- and Nanostructure Research
GroupKFold to prevent the crop-vs-specimen leakage trap.
notebooks/week05_tiny_mlp.ipynb: train a small MLP on a materials-property dataset and see why, with 225 training samples, it does NOT yet beat a linear fit on the correct hand-crafted \(1/\sqrt{d}\) feature (linear fit \(R^2\approx0.91\) vs MLP \(R^2\approx0.65\)), and what that teaches about learned vs domain features.
Left: linear regression on the hand-crafted \(1/\sqrt{d}\) feature achieves \(R^2\approx0.91\) (225 training samples). Right: a small MLP on raw grain size \(d\) only reaches \(R^2\approx0.65\). With limited data, domain knowledge still wins over learned features.
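To see why the \(1/\sqrt{d}\) feature helps a linear model, a minimal sketch can compare a straight-line fit on raw \(d\) with one on \(1/\sqrt{d}\). The data below are synthetic and noise-free, and the constants 100 and 500 are hypothetical, so the resulting \(R^2\) values do not reproduce the figures quoted above. The sketch also compares two linear fits only; it does not train an MLP.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y ~ a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def r2(x, y):
    """Coefficient of determination of the straight-line fit of y on x."""
    a, b = fit_line(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical grain sizes and a synthetic property that is linear in 1/sqrt(d).
d = [float(i) for i in range(1, 51)]
y = [100.0 + 500.0 / math.sqrt(di) for di in d]

r2_raw = r2(d, y)                                   # line in raw d
r2_feat = r2([1.0 / math.sqrt(di) for di in d], y)  # line in 1/sqrt(d)

print(f"R^2 on raw d:       {r2_raw:.3f}")
print(f"R^2 on 1/sqrt(d):   {r2_feat:.3f}")
```

With the correct feature the relationship is exactly linear, so the fit is perfect on this toy data; on raw \(d\) the same linear model cannot follow the curvature.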
Hand-crafted features encode what the engineer knows. Learned features encode what the data contains.
A single perceptron: each input \(x_i\) is multiplied by a weight \(w_i\), the bias \(b\) is added, and the sum passes through an activation function \(\sigma\).
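The forward pass of a single perceptron can be sketched in a few lines. The input, weight, and bias values are illustrative assumptions, and sigmoid is chosen here as one possible \(\sigma\).

```python
import math

def sigmoid(z):
    """Logistic activation, maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, w, b, activation=sigmoid):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

# Illustrative values: z = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0
y = perceptron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
print(y)
```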
Left: AND-like data — one straight line separates the classes. Right: XOR data — no single straight line can separate both classes, regardless of slope or offset.
“A single neuron draws one straight line. A hidden layer bends the space so that line can solve curves.” (Goodfellow et al., 2016)
Left: XOR data — linearly inseparable. Right: an MLP with two hidden layers learns a curved boundary that correctly classifies all four quadrants.
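A hidden layer is enough to represent XOR on the four binary corner points. The sketch below uses a single hidden layer of two ReLU units with hand-chosen weights, which is a simplification of the trained two-hidden-layer network in the figure; the weights are an assumption picked for illustration, not learned values.

```python
def relu(z):
    return max(0.0, z)

def xor_mlp(x1, x2):
    """Two ReLU hidden units followed by a linear output layer."""
    h1 = relu(x1 + x2)         # counts how many inputs are active
    h2 = relu(x1 + x2 - 1.0)   # fires only when both inputs are active
    return h1 - 2.0 * h2       # linear readout on the hidden representation

outputs = [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)
```

In the hidden space \((h_1, h_2)\) the four points become linearly separable, which is what the linear output layer exploits.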
A multi-layer perceptron (MLP): the hidden layers (green, purple) learn internal representations; the output layer (orange) applies a linear map to those representations to produce the prediction.
Five activation functions, ordered from the historical step function to the modern Leaky ReLU. Note how ReLU preserves gradient for positive inputs, while sigmoid and tanh saturate.
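The activation functions can be written out directly. The caption names step, sigmoid, tanh, ReLU, and Leaky ReLU; the Leaky ReLU slope `alpha=0.01` is a common default and an assumption here.

```python
import math

def step(z):
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Small non-zero slope for negative inputs keeps some gradient alive.
    return z if z > 0 else alpha * z

for z in (-2.0, 0.0, 2.0):
    row = [f(z) for f in (step, sigmoid, tanh, relu, leaky_relu)]
    print(z, [round(v, 4) for v in row])
```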
| Task | Output activation | Loss |
|---|---|---|
| Regression | identity | MSE or MAE |
| Binary classification | sigmoid | binary cross-entropy |
| Multi-class | softmax | categorical cross-entropy |
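The pairings in the table can be made concrete with plain-Python versions of the losses. This is a sketch for single examples or short lists with illustrative inputs; the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error for regression with identity output."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p, eps=1e-12):
    """y_true in {0, 1}; p is the sigmoid output in (0, 1)."""
    total = 0.0
    for t, pi in zip(y_true, p):
        pi = min(max(pi, eps), 1.0 - eps)
        total += -(t * math.log(pi) + (1 - t) * math.log(1.0 - pi))
    return total / len(y_true)

def softmax(logits):
    """Subtract the max logit first for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """one_hot marks the true class; probs is the softmax output."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

probs = softmax([0.0, 0.0])
print(mse([1.0, 2.0], [1.0, 4.0]))
print(binary_cross_entropy([1], [0.5]))
print(categorical_cross_entropy([1, 0], probs))
```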
Left: the sigmoid derivative never exceeds 0.25 and decays toward 0 for large \(|z|\), while the ReLU derivative is exactly 1 for \(z>0\). Right: at depth 8, sigmoid gradients have shrunk by four orders of magnitude compared to the first layer; ReLU gradients remain at full strength.
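The order of magnitude can be checked with a back-of-the-envelope calculation. The sketch below multiplies only the activation derivatives along a chain of 8 layers and ignores the weight matrices, which is a simplifying assumption; the sigmoid value is therefore a best-case bound rather than the exact curve in the figure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # maximal at z = 0, where it equals 0.25

depth = 8
sigmoid_bound = sigmoid_grad(0.0) ** depth   # best case: every unit at z = 0
relu_factor = 1.0 ** depth                   # ReLU derivative is 1 for z > 0

print(f"sigmoid, depth {depth}: at most {sigmoid_bound:.2e}")
print(f"ReLU,    depth {depth}: {relu_factor:.2e}")
```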
Gradient descent follows the slope of the loss surface downhill, one small step at a time. The update rule \(w_{t+1} = w_t - \eta\nabla_w\mathcal{L}\) is identical to Week 4; the surface is now non-convex with many local minima.
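The update rule and the effect of non-convexity can be illustrated on a one-dimensional toy loss. The function \(\mathcal{L}(w) = w^4 - 3w^2 + w\), the learning rate, and the two starting points are assumptions chosen so that the loss has two distinct minima.

```python
def loss(w):
    return w ** 4 - 3.0 * w ** 2 + w

def grad(w):
    return 4.0 * w ** 3 - 6.0 * w + 1.0

def gradient_descent(w0, eta=0.01, steps=1000):
    """Repeatedly apply w <- w - eta * dL/dw."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

w_a = gradient_descent(2.0)    # starts on the right-hand slope
w_b = gradient_descent(-2.0)   # starts on the left-hand slope

print(f"start  2.0 -> w = {w_a:.3f}, loss = {loss(w_a):.3f}")
print(f"start -2.0 -> w = {w_b:.3f}, loss = {loss(w_b):.3f}")
```

Both runs end at a point with zero gradient, but in different minima with different loss values, so the result depends on initialization.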
Training loss (blue) and validation R² (orange) versus training epoch. Training loss decreases monotonically; validation R² peaks around epoch 70, the optimal stopping point, and then levels off.
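Patience-based early stopping can be sketched as a small function over a list of per-epoch validation scores. The scores and the patience value below are synthetic and purely illustrative; they are not the curve from the figure.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (best_epoch, stop_epoch) for a higher-is-better validation score.

    Training stops once the score has failed to improve for `patience`
    consecutive epochs; the model from `best_epoch` is the one to keep.
    """
    best_score = float("-inf")
    best_epoch = -1
    waited = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_scores) - 1

# Synthetic validation R^2 values, one per epoch.
scores = [0.10, 0.30, 0.50, 0.60, 0.59, 0.60, 0.58, 0.57]
best_epoch, stop_epoch = early_stopping_epoch(scores, patience=3)
print(best_epoch, stop_epoch)
```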
loss.backward() and gradients appear in .grad attributes.
Computational graph for one neuron: forward pass (black arrows, left to right) computes the loss; backward pass (red dashed arrows, right to left) propagates gradients via the chain rule at each node.
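The backward pass through a one-neuron graph can be written by hand and checked against finite differences. The sketch assumes a sigmoid activation and a squared-error loss for a single sample; both choices, and the numeric values, are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x, y):
    """Forward pass: z = w*x + b, a = sigmoid(z), loss = (a - y)^2."""
    z = w * x + b
    a = sigmoid(z)
    return z, a, (a - y) ** 2

def backward(w, b, x, y):
    """Chain rule applied node by node, from the loss back to w and b."""
    _, a, _ = forward(w, b, x, y)
    dL_da = 2.0 * (a - y)      # loss node
    da_dz = a * (1.0 - a)      # sigmoid node
    dL_dz = dL_da * da_dz
    return dL_dz * x, dL_dz    # dz/dw = x, dz/db = 1

w, b, x, y = 0.5, -0.2, 1.5, 1.0
dw, db = backward(w, b, x, y)

# Central finite differences as an independent check.
eps = 1e-6
dw_num = (forward(w + eps, b, x, y)[2] - forward(w - eps, b, x, y)[2]) / (2 * eps)
db_num = (forward(w, b + eps, x, y)[2] - forward(w, b - eps, x, y)[2]) / (2 * eps)
print(dw, dw_num)
print(db, db_num)
```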
loss.backward(), grad()) implement this automatically for any differentiable code. You write the forward pass; the library writes the backward pass for you.
data_science_for_em/01_intro/01_autograd.qmd for an implementation walkthrough in PyTorch.
StandardScaler on the training set; apply the same scaling to validation and test. (Same hygiene as Week 4.)
GroupKFold strategy from Week 4. A neural network does not exempt you from honest validation.
StandardScaler (fit on train only).
n_iter_no_change = 30.
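The two hygiene rules, scaling fitted on the training set only and splits that keep whole groups together, can be sketched without any dependencies. The helpers `fit_scaler`, `transform`, and `group_folds` are hypothetical stand-ins that mimic the idea behind scikit-learn's `StandardScaler` and `GroupKFold`; they are not those classes, and the round-robin group assignment is a simplification. The data and specimen labels are made up.

```python
def fit_scaler(values):
    """Mean and population standard deviation, computed on training data only."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return mean, (std if std > 0 else 1.0)

def transform(values, mean, std):
    return [(v - mean) / std for v in values]

def group_folds(groups, n_splits):
    """Yield (train_idx, val_idx) so that no group is split across both sets."""
    unique = sorted(set(groups))
    for k in range(n_splits):
        val_groups = set(unique[k::n_splits])
        train_idx = [i for i, g in enumerate(groups) if g not in val_groups]
        val_idx = [i for i, g in enumerate(groups) if g in val_groups]
        yield train_idx, val_idx

# Hypothetical feature values; two crops per specimen.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
groups = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"]

folds = []
for train_idx, val_idx in group_folds(groups, n_splits=2):
    mean, std = fit_scaler([x[i] for i in train_idx])         # fit on train only
    x_train = transform([x[i] for i in train_idx], mean, std)
    x_val = transform([x[i] for i in val_idx], mean, std)     # reuse train stats
    folds.append((train_idx, val_idx, x_train, x_val))
    print(sorted({groups[i] for i in val_idx}), [round(v, 2) for v in x_val])
```

Because crops from one specimen never appear on both sides of a split, the validation score reflects performance on unseen specimens rather than on unseen crops of known ones.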
©Philipp Pelz - FAU Erlangen-Nürnberg - Data Science for Electron Microscopy