FAU Erlangen-Nürnberg
Institute of Micro- and Nanostructure Research
notebooks/week04_leakage_demo.ipynb: fit a regressor two ways and watch the honest score drop.
Every supervised learning algorithm is a choice of loss + a choice of optimiser.
Enter gradient descent — an iterative alternative that never inverts anything.
Gradient descent: take the gradient, step opposite to it, repeat.
MSE punishes large residuals heavily — one bad crop can dominate the loss.
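
The update rule "take the gradient, step opposite to it, repeat" can be sketched in a few lines. This is a minimal illustration, assuming a linear model with MSE loss on synthetic data; the data, the learning rate `eta`, and the step count are illustrative choices, not values from the lecture. The closed-form least-squares solution is computed only as a reference to check the result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem (hypothetical data): y = X @ w_true + noise
N, d = 200, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

def mse(w):
    """Mean squared error of the linear model X @ w."""
    return np.mean((X @ w - y) ** 2)

def mse_grad(w):
    """Gradient of the MSE with respect to w."""
    return (2.0 / N) * X.T @ (X @ w - y)

w = np.zeros(d)
eta = 0.1                      # learning rate (assumed value)
losses = [mse(w)]
for _ in range(200):
    w = w - eta * mse_grad(w)  # step opposite to the gradient
    losses.append(mse(w))

# Reference solution from a direct least-squares solve
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("gradient descent:", np.round(w, 3))
print("least squares:   ", np.round(w_lstsq, 3))
```

No matrix is inverted inside the loop: each step costs one pass over the data, which is why the same recipe still works when a direct solve is too expensive.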




| Optimiser | Per-step cost | Adaptive \(\eta\)? | Momentum? | Typical use |
|---|---|---|---|---|
| Full GD | \(\mathcal{O}(N)\) | No | No | Tiny datasets, convex |
| SGD | \(\mathcal{O}(1)\) | No | No | Rarely used bare |
| Minibatch SGD | \(\mathcal{O}(b)\) | No | Optional | Many DL papers |
| SGD + Momentum | \(\mathcal{O}(b)\) | No | Yes (\(\beta \approx 0.9\)) | Fine-tuned vision models |
| Adam | \(\mathcal{O}(b)\) | Per-param | Yes | Default for most EM projects |
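
To make the "Momentum?" and "Adaptive \(\eta\)?" columns concrete, here is a rough sketch of the two update rules in NumPy. It assumes a small ill-conditioned quadratic as a stand-in loss and uses the full gradient rather than minibatches, so only the update rule differs between the two functions. The function names, step sizes, and step counts are illustrative assumptions; the Adam constants are the commonly used defaults.

```python
import numpy as np

# Ill-conditioned quadratic f(w) = 0.5 * w^T A w (hypothetical test problem)
A = np.diag([1.0, 50.0])

def f(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

def sgd_momentum(w, steps=500, eta=0.01, beta=0.9):
    """Gradient descent with momentum: one global step size eta."""
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)   # accumulate past gradients
        w = w - eta * v
    return w

def adam(w, steps=500, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum plus a per-parameter step size."""
    m = np.zeros_like(w)         # running mean of gradients
    v = np.zeros_like(w)         # running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)   # bias correction for the zero init
        v_hat = v / (1 - beta2**t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter scaling
    return w

w0 = np.array([1.0, 1.0])
w_mom = sgd_momentum(w0.copy())
w_adam = adam(w0.copy())
print("start loss:   ", f(w0))
print("momentum loss:", f(w_mom))
print("Adam loss:    ", f(w_adam))
```

The division by `np.sqrt(v_hat)` is what the table calls "Per-param": each coordinate gets its own effective learning rate, whereas momentum alone shares one `eta` across all coordinates.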

| Regime | Bias | Variance | Cure |
|---|---|---|---|
| Underfit | High | Low | More flexible model / features |
| Good fit | Low | Low | — |
| Overfit | Low | High | More data, fewer parameters, regularisation |
Error
│ test: \____/‾‾‾‾ ← U-shape (optimal somewhere in the middle)
│ train: ‾‾‾‾‾‾‾‾\ ← monotonically decreases with complexity
└───────────────────── Model complexity →
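
The two curves in the sketch above can be reproduced numerically. The following is a minimal sketch, assuming a noisy sine as the ground truth and polynomial degree as the complexity knob; the data generator, sample sizes, and degrees are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(3x) on [-1, 1] (hypothetical ground truth)."""
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * x[:, 0]) + 0.2 * rng.normal(size=n)
    return x, y

X_train, y_train = make_data(30)    # small training set
X_test, y_test = make_data(500)     # large held-out set

degrees = [1, 3, 6, 12]             # increasing model complexity
train_mse, test_mse = [], []
for deg in degrees:
    model = make_pipeline(PolynomialFeatures(deg), LinearRegression())
    model.fit(X_train, y_train)
    train_mse.append(mean_squared_error(y_train, model.predict(X_train)))
    test_mse.append(mean_squared_error(y_test, model.predict(X_test)))
    print(f"degree {deg:2d}: train MSE = {train_mse[-1]:.4f}, "
          f"test MSE = {test_mse[-1]:.4f}")
```

Because each higher-degree model contains the lower-degree ones, the training error cannot increase with degree. The test error carries no such guarantee, which is what produces the U-shape.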

Gold standard: always wrap preprocessing + model in a Pipeline before passing to CV.
```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X (features), y (targets), and specimen_id (one group label per row)
# are assumed to be defined already.
pipe = Pipeline([
    ("scale", StandardScaler()),  # fitted on train fold only: no leakage
    ("model", Ridge(alpha=1.0)),
])

# Random K-fold (only if data points are independent).
# shuffle=True is needed: KFold does not shuffle by default.
random_cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=random_cv, scoring="r2")
print(f"Random K-fold R² = {scores.mean():.3f} ± {scores.std():.3f}")

# Group K-fold (when specimen_id exists)
scores = cross_val_score(pipe, X, y, cv=GroupKFold(n_splits=5),
                         groups=specimen_id, scoring="r2")
print(f"Group K-fold R² = {scores.mean():.3f} ± {scores.std():.3f}")
```

Pipeline reruns StandardScaler.fit on each training fold automatically → no leakage.
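
To see what the Pipeline does per fold, the same procedure can be written out by hand. This is a sketch on synthetic data (the feature scales and coefficients are made up for illustration): the scaler is fitted on the training indices only, then applied to the held-out indices, and the resulting scores are compared with what `cross_val_score` returns for the equivalent Pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales, plus a linear target
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(120, 3))
y = X @ np.array([2.0, 0.3, -0.01]) + rng.normal(size=120)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual version: fit the scaler on the training fold only
manual_scores = []
for train_idx, test_idx in cv.split(X):
    scaler = StandardScaler().fit(X[train_idx])
    model = Ridge(alpha=1.0).fit(scaler.transform(X[train_idx]), y[train_idx])
    y_pred = model.predict(scaler.transform(X[test_idx]))
    manual_scores.append(r2_score(y[test_idx], y_pred))

# Pipeline version: cross_val_score clones and refits the whole pipeline per fold
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge(alpha=1.0))])
pipe_scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")

print("manual:  ", np.round(manual_scores, 4))
print("pipeline:", np.round(pipe_scores, 4))
```

The two score arrays should agree fold by fold, which is the sense in which the Pipeline "does the right thing" without extra bookkeeping.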


Assign a group label (specimen_id) to every data point.
GroupKFold: the entire specimen stays in either train or test — never split across folds.
Materials default: if there is a specimen_id column, your default CV is GroupKFold.
The within-specimen correlation that random CV exploits is noise from the perspective of generalisation: it does not carry over to a new specimen, so ignoring it systematically inflates your score.
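
The inflation can be demonstrated on synthetic data. In this sketch (all numbers are hypothetical), each specimen has its own feature cluster and its own target offset, and the features carry no information about the target beyond specimen identity. A nearest-neighbour regressor then scores well under random K-fold simply by recognising the specimen, and the advantage disappears once whole specimens are held out.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n_specimens, n_per_specimen, n_features = 10, 30, 4

# One group label per data point
specimen_id = np.repeat(np.arange(n_specimens), n_per_specimen)

# Specimen-level structure: a feature cluster centre and a target offset,
# drawn independently of each other (hypothetical generator)
centres = rng.normal(scale=5.0, size=(n_specimens, n_features))
offsets = rng.normal(scale=1.0, size=n_specimens)

X = centres[specimen_id] + 0.1 * rng.normal(size=(specimen_id.size, n_features))
y = offsets[specimen_id] + 0.1 * rng.normal(size=specimen_id.size)

model = KNeighborsRegressor(n_neighbors=3)

# Random K-fold: points from the same specimen land in train and test
random_r2 = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0), scoring="r2"
)

# Group K-fold: each specimen is entirely in train or entirely in test
group_r2 = cross_val_score(
    model, X, y, cv=GroupKFold(n_splits=5), groups=specimen_id, scoring="r2"
)

print(f"random K-fold R² = {random_r2.mean():.3f}")
print(f"group K-fold  R² = {group_r2.mean():.3f}")
```

The gap between the two numbers is the part of the random-CV score that came from memorising specimens rather than from anything that transfers to a new one.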
For each of the three setups below, identify the leakage and the fix.
(a) You standardise all features with StandardScaler().fit_transform(X) on the full dataset, then run 5-fold cross-validation.
Pre-processing leak. The scaler saw all test-set values when computing \(\mu\) and \(\sigma\). Fix: StandardScaler().fit(X_train) inside each fold (use Pipeline).
(b) You collect 100 EBSD maps from the same 5 specimens (20 maps each). You run a random 5-fold CV and report Dice=0.91. (Dice: segmentation metric — see metrics section.) On a 6th specimen, Dice=0.51.
Group leak. Maps from the same specimen appear in both train and test. Fix: use GroupKFold and pass groups=specimen_id to the CV call (e.g. cross_val_score).
(c) You record an in-situ liquid-phase TEM video (1000 frames). You randomly shuffle and split 80/20. Train \(R^2 = 0.97\), deploy \(R^2 = 0.30\).
Temporal leak. Future frames used to predict past ones. Fix: train on first 800 frames; test on last 200 (chronological split).
Pattern: every leakage scenario reduces to one sentence — test-set information influenced the training process.
Temporal leakage: for time-series data (operando EM, in-situ growth), randomly splitting scrambles time order. The model can use “future” frames to predict “past” ones — impossible in deployment. Fix: always split chronologically — train on \(t < t_1\), test on \(t > t_1\).
Pre-processing leakage: fitting a StandardScaler on all data (train + test) before splitting. Test-set statistics leak into the scaler. Fix: fit the scaler on the training data only; sklearn.pipeline.Pipeline does this automatically inside CV.
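
For the temporal case, a chronological split is just index slicing in acquisition order. The sketch below assumes 1000 frames stored in time order, as in the liquid-phase TEM example, and also shows scikit-learn's `TimeSeriesSplit` as one way to get several forward-in-time folds; the fold count is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_frames = 1000
frame_idx = np.arange(n_frames)   # assumed to be in acquisition order

# Single chronological split: first 800 frames train, last 200 test
n_train = 800
train_idx, test_idx = frame_idx[:n_train], frame_idx[n_train:]
print(f"train frames 0..{train_idx.max()}, "
      f"test frames {test_idx.min()}..{test_idx.max()}")

# Several forward-in-time folds: training always precedes testing
tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(frame_idx.reshape(-1, 1)))
for k, (tr, te) in enumerate(folds):
    print(f"fold {k}: train 0..{tr.max()}, test {te.min()}..{te.max()}")
```

Neither variant shuffles, so no test frame ever precedes a training frame.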

For a binary defect-detection task:
| | Predicted: no defect | Predicted: defect |
|---|---|---|
| True: no defect | TN | FP (false alarm) |
| True: defect | FN (missed!) | TP |
\[\text{Accuracy} = \frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}\]
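
As a worked example of the table and formula above, the four counts can be read off with scikit-learn. The labels here are invented for illustration: 95 defect-free images and 5 defective ones, with a hypothetical classifier that raises one false alarm and misses three defects.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical ground truth: 0 = no defect, 1 = defect
y_true = np.array([0] * 95 + [1] * 5)

# Hypothetical predictions: one false alarm, three missed defects
y_pred = y_true.copy()
y_pred[0] = 1         # FP: a defect-free image flagged as defective
y_pred[95:98] = 0     # FN: three defects missed

# With labels=[0, 1], rows are true classes and columns are predictions,
# so ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"accuracy = {accuracy:.2f}")
```

Here the accuracy comes out at 0.96 even though three of the five defects were missed, because the 94 true negatives dominate the numerator.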

specimen_id for every measurement.
If specimen_id exists → GroupKFold; if time-series → chronological split; otherwise 5-fold KFold.
StandardScaler + model inside Pipeline: scaler fitted on train folds only.
notebooks/week04_leakage_demo.ipynb: “Data leakage in EM: crop-level vs. specimen-level splitting.”
GroupKFold on specimen ID.
_shared/exam_mustknow.md: Week 4 statements are now filled.
©Philipp Pelz - FAU Erlangen-Nürnberg - Data Science for Electron Microscopy