FAU Erlangen-Nürnberg
Recap — Units 1 & 2
Today — Unit 3
The data has arrived in your computer. What now?
By the end of this lecture, you will be able to:
“On two occasions I have been asked, ‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’ I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.” — Charles Babbage, 1864
Garbage In, Garbage Out (GIGO): model accuracy is bounded by data quality before any architectural choice.
A representative materials-science failure mode:
The classical pipeline (CRISP-DM, Unit 1):
Steps 2 and 4 typically dominate calendar time on real projects.
Today’s structure
Three short think-pair-share checkpoints along the way.
\[ \underbrace{\xi(\tau)}_{\text{physical state}} \;\to\; \text{probe} \;\to\; \text{detector} \;\to\; \underbrace{\text{ADC} \;\to\; \text{file}}_{\text{digital}} \;\to\; \mathbf{x} \]
Errors can enter at every link:
Strategy: clean at the source first.
Note
Most of these are caught only by visualizing the data. Always plot before fitting.
Where do NaNs come from?
Three repair strategies:
Linear interpolation between neighbours (Neuer et al., 2024, eq. 3.1): \[ x_i \;=\; \tfrac{1}{2}\bigl(x_{i-1} + x_{i+1}\bigr) \]
Numerical marker (temperature sensor, range \([-50, 100]\,°\mathrm{C}\)): \[ x_i^{\text{NaN}} \;=\; -1000\,°\mathrm{C} \] chosen outside the physically possible range, so downstream code can treat it specially.
Note
Trap. Replacing NaN with \(0\) silently inflates the count of zero-readings and biases statistics.
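The three repair strategies can be sketched with pandas (hypothetical temperature trace; the marker value follows the slide):

```python
import numpy as np
import pandas as pd

# Hypothetical temperature trace with two missing readings (sensor range [-50, 100] °C)
t = pd.Series([20.0, 21.0, np.nan, 23.0, 24.0, np.nan, 26.0])

# Strategy 1: linear interpolation between neighbours
t_interp = t.interpolate(method="linear")

# Strategy 2: numerical marker far outside the physical range
t_marked = t.fillna(-1000.0)

# Strategy 3: drop the affected rows entirely
t_dropped = t.dropna()
```

Note that `fillna(0)` would fall into exactly the trap above: zero is a physically plausible temperature.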
Point / global outlier. A single value far from the bulk of the distribution. Tensile test: one stress reading at \(10\,\mathrm{GPa}\) in a \(\sim 500\,\mathrm{MPa}\) dataset.
Contextual outlier. A value that is reasonable globally but anomalous given its neighbours. Time series: a \(300\,\mathrm{K}\) reading inside a furnace ramp at \(1500\,\mathrm{K}\).
Collective outlier. A sub-sequence whose individual values look fine but whose joint behaviour deviates. Stress–strain: an entire curve with the wrong loading rate.
Detection toolbox
For a tensile-test example, see Figure 11.4 in Sandfeld et al., Materials Data Science.
A point well outside the distribution may be:
The decision is not statistical. It is physical.
Heuristic checklist before removing a point
Default rule of thumb: flag, don’t drop. Keep an “outlier” column. Train with and without; report both.
Sources
Why it matters
Detection idioms (pandas)
For images / spectra: exact duplicates miss near-duplicates (different crops, different exposures of the same scene). Use perceptual hashes or learned embeddings.
Materials trap. Many “different” measurements come from the same specimen. They are not duplicates by row, but they are correlated by physics. We will return to this in §5 (group leakage).
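The pandas idioms can be sketched on toy data (hypothetical column names):

```python
import pandas as pd

# Toy measurement log; rows 0 and 2 are exact duplicates (hypothetical data)
df = pd.DataFrame({
    "specimen_id": ["A1", "A2", "A1", "A3"],
    "stress_MPa": [512.0, 498.0, 512.0, 505.0],
})

mask = df.duplicated()           # True for the second occurrence of each duplicate row
df_unique = df.drop_duplicates()

# Duplicates in a key column only (e.g. repeated measurements of one specimen)
per_specimen = df.duplicated(subset="specimen_id")
```

`duplicated(subset=...)` is the first step toward the correlated-specimen problem: it finds repeated entities even when the measured values differ.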
Three goals
Algorithms that care about magnitude
Algorithms that don’t (much)
A general warning
Transformation is a modelling decision. It changes the effective prior the algorithm sees. Document and motivate every transform you apply (§11.5.3 in Sandfeld et al., Materials Data Science).
Centering (mean-subtraction): \[ \tilde{x}_i = x_i - \langle x \rangle, \qquad \langle \tilde{x} \rangle = 0 \]
Shifting (alignment): \[ \tilde{x}_i^{(k)} = x_i^{(k)} - x^{(k)}_{\text{ref}} \]
Materials examples
Caution. Shifting destroys absolute calibration. Keep the offsets if you may need them later (e.g., for absolute energy alignment).
Min–max scaling to \([0, 1]\): \[ \tilde{x}_i \;=\; \frac{x_i - \min(\mathbf{x})}{\max(\mathbf{x}) - \min(\mathbf{x})} \]
Variants
Use when
Avoid when
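A minimal implementation of the formula above (synthetic pixel values; assumes \(\max(\mathbf{x}) > \min(\mathbf{x})\)):

```python
import numpy as np

def minmax_scale(x):
    """Map x linearly onto [0, 1]; assumes max(x) > min(x)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

pixels = np.array([0.0, 64.0, 128.0, 255.0])
scaled = minmax_scale(pixels)
```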
Neuer et al. (2024), eqs. 3.2–3.4:
\[ z_i \;=\; \frac{x_i - \mu}{\sigma}, \qquad \mu = \langle x \rangle, \quad \sigma^2 = \langle (x-\mu)^2 \rangle \]
After standardisation: \(\langle z \rangle = 0\), \(\mathrm{Var}(z) = 1\).
Properties
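A minimal sketch of standardisation with NumPy (synthetic values; assumes \(\sigma > 0\)):

```python
import numpy as np

def standardise(x):
    """z-score: subtract the mean, divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

x = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
z = standardise(x)
```

After the transform, mean and variance match the stated properties by construction.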
Worked example — Motor currents (Neuer et al., 2024)
Problem: a fleet of motors logs current. Some channels record in mA, others in A — three orders of magnitude difference. Raw plot: half the curves look “flat”.
Fix: standardise each curve (or rescale unit families) so all channels share a common axis. The anomalous motor now visibly deviates.
Lesson: the fix is one line of code. The diagnosis required a domain expert and a plot.
Idea: divide each quantity by an intrinsic physical scale, not a statistical one.
\[ \tilde{x} \;=\; \frac{x}{x_{\text{ref}}}, \qquad x_{\text{ref}} \in \{L_0, T_0, c, k_B, \dots\} \]
Examples
Why physicists prefer this
Connect: Unit 13 (PINNs) leans heavily on non-dimensionalisation — it is what lets a single trained model generalise across orders of magnitude.
\[ \tilde{x}_i \;=\; \log(x_i + \epsilon), \qquad x_i > 0 \]
Linearises power laws and exponentials.
Compresses dynamic range.
Practical rules
Connect: the MSLE error metric (§6) is the natural error after a log-transform.
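The grain-size example from the checkpoint below can be sketched directly (hypothetical values; base-10 log chosen so that decades read off as integer steps):

```python
import numpy as np

# Grain sizes spanning three decades (hypothetical values, in metres)
d = np.array([5e-8, 5e-7, 5e-6, 5e-5])

eps = 0.0               # data is strictly positive here; use a small eps if zeros can occur
d_log = np.log10(d + eps)
```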
Forward difference as a discrete kernel (Neuer et al., 2024, eq. 3.10): \[ \frac{d\mathbf{f}}{dx} \;\approx\; \mathbf{f} \ast [-1, +1] \]
Second derivative: \[ \frac{d^2\mathbf{f}}{dx^2} \;\approx\; \mathbf{f} \ast [+1, -2, +1] \]
What it does
Practical caveat
Differentiation amplifies high-frequency noise. Combine with smoothing in one kernel: \[ \mathbf{f} \ast [-1,-1,-1,-1,+1,+1,+1,+1]/4 \]
(Savitzky–Golay filters generalise this idea.)
Materials applications
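Both kernels can be applied with `np.convolve` (synthetic quadratic test signal; note that `np.convolve` flips its kernel, so the forward-difference kernel must be passed reversed):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 6)
f = x**2                            # smooth test signal with f'' = 2 everywhere

# np.convolve flips the kernel, so pass [-1, +1] reversed to get f[i+1] - f[i]
df = np.convolve(f, [1, -1], mode="valid")

# The second-derivative kernel is symmetric, so no flip is needed
d2f = np.convolve(f, [1, -2, 1], mode="valid")

h = x[1] - x[0]
curvature = d2f / h**2              # recovers f'' = 2 (exact for a quadratic)
```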
Continuous Fourier transform (Neuer et al., 2024, eq. 3.14): \[ \hat{x}(\nu) \;=\; \frac{1}{\sqrt{2\pi}}\int x(t)\, e^{-i 2\pi \nu t}\, dt \]
Why we transform
FFT (Cooley–Tukey) computes the DFT in \(\mathcal{O}(N \log N)\) — practical for \(N \sim 10^6\).

Diagnostic value. A specific \(\nu^*\) peak distinguishes “good” from “anomalous” motor cycles even when the time-domain signals look almost identical.
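A minimal FFT diagnosis with NumPy (synthetic two-tone signal; sampling rate and frequencies are assumptions for illustration):

```python
import numpy as np

fs = 1000.0                            # sampling rate in Hz (assumed)
t = np.arange(1000) / fs               # 1 s of samples
x = np.sin(2 * np.pi * 50.0 * t) + 0.5 * np.sin(2 * np.pi * 120.0 * t)

X = np.fft.rfft(x)                     # real-input FFT, O(N log N)
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)

peak_hz = freqs[np.argmax(np.abs(X))]  # dominant spectral line
```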
FFT assumes the signal is periodic over the whole window.
A transient — an acoustic-emission burst, a crack-initiation event, a single phonon pulse — is short in time but broad in frequency. The FFT finds its frequency content but cannot tell you when it happened.
The consequence
Wavelet transform — the Ricker example (Neuer et al., 2024, eq. 3.16): \[ \psi\!\left(\tfrac{t-b}{a}\right) \;\propto\; \frac{d^2}{dt^2}\, e^{-(t-b)^2 / a^2} \]
Continuous wavelet transform (CWT): \[ \mathrm{CWT}[x](a, b) \;=\; \frac{1}{\sqrt{|a|}} \!\int x(t)\, \psi\!\left(\tfrac{t-b}{a}\right) dt \]
Output is a function of width \(a\) (≈ inverse frequency) and position \(b\) (time) — a 2D time–frequency map.
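A naive CWT can be written as one convolution per width, without any special library (the Ricker normalisation below is one common convention, and the burst signal is a hypothetical acoustic-emission-like example):

```python
import numpy as np

def ricker(n, a):
    """Ricker ('Mexican hat') wavelet on n points with width a (one common normalisation)."""
    t = np.arange(n) - (n - 1) / 2.0
    return (1.0 - (t / a) ** 2) * np.exp(-t**2 / (2.0 * a**2))

def cwt_ricker(x, widths):
    """Naive CWT: one convolution per width a; rows index a, columns index position b."""
    out = np.empty((len(widths), len(x)))
    for i, a in enumerate(widths):
        w = ricker(min(10 * int(a), len(x)), a)
        out[i] = np.convolve(x, w / np.sqrt(a), mode="same")
    return out

# Transient burst around sample 200 (short in time, broad in frequency)
x = np.zeros(400)
x[195:205] = np.hanning(10)
coeffs = cwt_ricker(x, widths=np.arange(1, 16))
```

Unlike the FFT, the column index of the strongest coefficient localises the burst in time.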
Many process signals are long, with short interesting windows:
Triggering extracts windows around events.
Why it matters
Connect: in Unit 7 (time-series ML) we will build models that learn the trigger criterion.
| Goal | Tool | When |
|---|---|---|
| Centre at zero | mean subtraction | covariance, FFT phase |
| Bound to \([0,1]\) | min–max | image pixels, bounded sensors |
| Equalise spread | standardisation | kNN, PCA, regularised linear |
| Linearise multiplicative | log | power laws, dynamic range |
| Reveal change | derivative | baseline drift, anomaly |
| Reveal periodicity | FFT | stationary oscillations |
| Reveal localised periodicity | wavelet | transients, AE events |
| Isolate cycles | triggering | repetitive processes |
| Remove unit dependence | non-dimensionalisation | physics-aware ML |
Rule of thumb: match the transform to what you want the model to see.
For each of the four signals below, which preprocessing would you apply first, and why?
(a) What would you apply first to a vibration spectrum from a rolling bearing, sampled at \(20\,\mathrm{kHz}\)?
Answer: FFT to find characteristic fault frequencies; subtract DC first.
(b) Grain-size measurements ranging from \(50\,\mathrm{nm}\) to \(50\,\mathrm{\mu m}\).
Answer: Log-transform to bring 3 decades onto a manageable scale.
(c) Three sensors measuring temperature, pressure, current — fed into a kNN classifier.
Answer: Standardise so no single feature dominates the Euclidean distance.
(d) Acoustic emission during fatigue — long quiet stretches, brief bursts.
Answer: Trigger on amplitude threshold; CWT inside each window for time–frequency content.
A supervised model is only as good as its labels.
In materials science, labels are:
Common scenarios
Note
Insight: the labelling process is itself a measurement chain (Unit 2) — with its own noise model.
Two domain experts rarely agree pixel-for-pixel on:
Note
Quantify it! Compute Dice or IoU between annotators — this gives you the ceiling for any model’s performance when trained on either set of labels.

Hard label — one-hot vector \(\mathbf{y} = [0, 1, 0]\).
Soft label — distribution \(\mathbf{y} = [0.1, 0.7, 0.2]\).
For a classifier producing logits \(z_\ell\), the softmax converts them to probabilities: \[ p(\ell \mid \mathbf{x}) \;=\; \frac{e^{z_\ell(\mathbf{x})}}{\sum_{\ell'} e^{z_{\ell'}(\mathbf{x})}} \]
The output sums to 1 — interpret as a probability over classes.
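A numerically stable softmax (subtracting the max logit before exponentiating avoids overflow; logits are illustrative):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([1.0, 3.0, 0.5])
p = softmax(logits)        # probabilities over the three classes
```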
Why it matters in materials science
Connect: Unit 12 (Gaussian processes, uncertainty-aware regression) treats this principle in full generality.
Bayes’ rule (McClarren, 2021):
\[ \underbrace{p(\mathbf{w} \mid \mathcal{D})}_{\text{posterior}} \;=\; \frac{ \overbrace{p(\mathcal{D} \mid \mathbf{w})}^{\text{likelihood}} \; \overbrace{p(\mathbf{w})}^{\text{prior}} }{ \underbrace{p(\mathcal{D})}_{\text{evidence}} } \]
Reading it
Why this matters today
Connect: Unit 12 (GPs) takes the Bayesian view all the way; today we just need its vocabulary.
Three fits to the same data: too stiff (underfit, high bias), well-balanced, too flexible (overfit, high variance). Adapted from Sandfeld et al. (2024).
For squared-error loss (Sandfeld et al., 2024, the decomposition following the MSE definition):
\[ \mathbb{E}\bigl[(\hat{y} - y)^2\bigr] \;=\; \underbrace{\bigl(\mathbb{E}\hat{y} - y\bigr)^2}_{\mathrm{Bias}^2} \;+\; \underbrace{\mathrm{Var}(\hat{y})}_{\text{Variance}} \;+\; \underbrace{\sigma^2}_{\text{Noise}} \]
Practical reading
Materials reality: \(\sigma^2\) is often large (small samples, expensive measurements). Don’t chase a model below the noise.
Entia non sunt multiplicanda praeter necessitatem.
“Entities should not be multiplied beyond necessity.” — William of Ockham, 14th c.
McClarren’s example (McClarren, 2021):
The lesson
A model with capacity \(\geq\) dataset size can memorise noise. Adding the right kind of bias toward simplicity is the only protection.
Practical consequences
Augment the loss: \[ \mathcal{L}_{\text{reg}}(\mathbf{w}) \;=\; \mathcal{L}_{\text{data}}(\mathbf{w}) \;+\; \lambda\, \Omega(\mathbf{w}) \]
Two canonical penalties
Bayesian re-reading
Materials value. Lasso applied to McClarren’s example zeros out 49 noise features, recovers \(w_1 \approx 3\). The model literally tells you which features matter.
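In the same spirit (not McClarren's exact setup), a Lasso sketch with one informative feature among 50 (all data synthetic; `alpha` is an illustrative choice):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=n)   # only feature 0 carries signal

model = Lasso(alpha=0.1).fit(X, y)
n_zero = int(np.sum(model.coef_ == 0.0))       # noise features driven to exactly zero
```

The L1 penalty produces exact zeros, so the surviving coefficients read as a feature-selection result.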
The simplest validation strategy (Sandfeld et al., 2024):
Quick, cheap, often used.
The risks
Sandfeld’s experiment: 100 random 60/40 splits on the same data → 100 different MSEs, sometimes off by 5×. The holdout gives you one of those 100 numbers.
The recipe (Sandfeld et al., 2024):
Why it’s better than holdout
Cost
Defaults
LOOCV — \(k = N\).

Stratified k-fold
sklearn.model_selection.StratifiedKFold.
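With an imbalanced label (hypothetical 90/10 split between "ok" and "defect"), stratification keeps the class ratio identical in every fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced labels: 90 "ok", 10 "defect" (hypothetical)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_defect_counts = [int(y[test].sum()) for _, test in skf.split(X, y)]
```

A plain `KFold` could by chance put zero defects in a fold; `StratifiedKFold` guarantees two per fold here.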
Definition. Information from the test set influences the training process — directly or indirectly. The reported performance is then optimistic by an unknown amount.
Symptoms
Cause: a discipline failure, not a bug. Almost all real cases come from one of three patterns on the next slides.
Three classes you must know
The wrong way:
The scaler computed \(\mu, \sigma\) on \(X\) — including the test rows. Test-set statistics leaked into training.
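The standard fix is to split first and wrap the scaler in a pipeline, so that \(\mu, \sigma\) are computed on training data only. A sketch with synthetic data (the Ridge model is an illustrative choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Right way: split first, then let the pipeline fit the scaler on the train split only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X_tr, y_tr)                 # scaler sees X_tr only
score = pipe.score(X_te, y_te)       # test rows never touched the scaler's fit
```

Inside cross-validation the same pipeline refits the scaler on each fold's training portion automatically.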
Time has a direction. You cannot use \(x(t+\Delta t)\) to predict \(y(t)\).
The wrong way: randomly shuffle a time series and split. The resulting training set has data points that come after test-set points — a model can exploit short-range temporal autocorrelation that won’t be there at deployment.
The right way: time-aware split.
─────train─────┃────val───┃───test───→ time
with cut points \(t_1 < t_2\) and end time \(T\).
Walk-forward (rolling) CV for time series:
Materials examples
sklearn.model_selection.TimeSeriesSplit.
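A quick check that `TimeSeriesSplit` respects the arrow of time (toy ordered samples):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)    # 12 time-ordered samples

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

# In every split, all training indices precede all test indices
ok = all(train.max() < test.min() for train, test in splits)
```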
The trap. Multiple data points share an underlying physical entity:
If random splitting puts some of those rows in train and others in test, the model can recognise the entity rather than the property.
The cure: group-based splitting.
The split now respects the physical grouping — the entire specimen is in either train or test, never both.
Note
Materials default: if there is a specimen_id column, your default CV is GroupKFold.
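A sketch mirroring the micrograph scenario (hypothetical `specimen_id` values; 4 rows per specimen):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 20 micrographs from 5 specimens: 4 rows per specimen (hypothetical)
X = np.arange(20).reshape(-1, 1)
y = np.zeros(20)
specimen_id = np.repeat(["S1", "S2", "S3", "S4", "S5"], 4)

gkf = GroupKFold(n_splits=5)
# No specimen may appear in both train and test of the same fold
clean = all(
    set(specimen_id[train]).isdisjoint(specimen_id[test])
    for train, test in gkf.split(X, y, groups=specimen_id)
)
```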
For each scenario, identify the leakage — and the fix.
(a) You scale all features with StandardScaler.fit_transform(X) and then split into train/test.
Pre-processing leak. Fix: fit scaler on train only.
(b) You split a year of process data 80/20 at random and report 0.95 \(R^2\).
Temporal leak. Fix: train on first 80%, test on last 20% chronologically.
(c) You collected 20 micrographs from each of 5 fatigue specimens. Random 5-fold CV gives Dice 0.92; on a 6th specimen, Dice = 0.55.
Group leak. Fix: GroupKFold on specimen_id.
(d) You select the top-50 most correlated features with \(y\) on the full dataset, then run k-fold CV.
Pre-processing leak via feature selection. Fix: feature selection inside each fold.
A loss function trains the model. A metric reports performance. They need not be the same.
A metric encodes a value judgment.
Different problems demand different metrics:
Note
Pick the metric that matches what you actually care about. Defects you must catch → recall. Calibrated property predictions → MSE / \(R^2\). Sample size where outliers dominate → MAE.
\[ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i| \] \[ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 \] \[ \mathrm{RMSE} = \sqrt{\mathrm{MSE}} \]
Properties
Outlier sensitivity
Pick MSE when
Pick MAE when
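The outlier sensitivity is easy to see numerically (hypothetical predictions where one is a large miss):

```python
import numpy as np

y_true = np.array([100.0, 110.0, 120.0, 130.0])
y_pred = np.array([102.0, 108.0, 121.0, 90.0])   # last prediction misses by 40

err = y_true - y_pred
mae = np.mean(np.abs(err))      # 11.25 — the outlier enters linearly
mse = np.mean(err**2)           # 402.25 — the outlier enters squared
rmse = np.sqrt(mse)
```

One bad point dominates MSE/RMSE while only nudging MAE.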
\[ R^2 \;=\; 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \;=\; 1 - \frac{\mathrm{MSE}_{\text{model}}}{\mathrm{MSE}_{\text{baseline}}} \]
Interpretation: “fraction of variance in \(y\) explained by the model.”
Caution: \(R^2\) alone is not enough (McClarren, 2021).
For a binary problem (“defective” = 1):
| | Pred 0 (no defect) | Pred 1 (defect) |
|---|---|---|
| True 0 | TN | FP (Type I) |
| True 1 | FN (Type II) | TP |
\[ \mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} \]
The cost is asymmetric.
A balanced metric like accuracy hides this asymmetry. Precision and recall reveal it.
For multi-class: confusion matrix is \(K \times K\). Off-diagonal = misclassifications. Heat-map it.
\[ \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \] Of everything I called positive, how much actually was positive?
\[ \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \] Of everything that was actually positive, how much did I find?
Synonyms
They trade off. Lower the decision threshold → recall ↑, precision ↓.
Pick by use case
Materials examples
\[ \mathrm{F1} \;=\; \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \;=\; \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}} \]
The harmonic mean of precision and recall — close to the minimum of the two, so a model with one near-zero score is heavily punished.
In segmentation, the same quantity is called the Dice coefficient.
Dice for segmentation
For predicted region \(A\) and true region \(B\): \[ \mathrm{Dice}(A, B) \;=\; \frac{2|A \cap B|}{|A| + |B|} \]
Single-metric trap. A high Dice can hide systematic over- or under-segmentation. Always pair with precision and recall to know which way the model fails.
\[ \mathrm{IoU}(A, B) \;=\; \frac{|A \cap B|}{|A \cup B|} \;=\; \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}} \]
Relation to Dice: \[ \mathrm{Dice} = \frac{2\,\mathrm{IoU}}{1 + \mathrm{IoU}} \]
Both range \([0, 1]\). IoU is always smaller than (or equal to) Dice for the same prediction.
When IoU is the convention
When Dice is the convention
Either is fine — but be explicit.
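Both overlap metrics can be computed from the same TP/FP/FN counts (toy binary masks, a 1D flattening of a hypothetical segmentation):

```python
import numpy as np

# Binary segmentation masks (hypothetical 1D flattening of an image)
pred = np.array([1, 1, 1, 0, 0, 0, 1, 0])
true = np.array([1, 1, 0, 0, 0, 1, 1, 0])

tp = np.sum((pred == 1) & (true == 1))
fp = np.sum((pred == 1) & (true == 0))
fn = np.sum((pred == 0) & (true == 1))

iou = tp / (tp + fp + fn)
dice = 2 * tp / (2 * tp + fp + fn)
```

The monotone relation Dice = 2 IoU / (1 + IoU) holds exactly, so the two never disagree on a ranking, only on the reported number.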
For \(K\) classes with one-hot true label \(\mathbf{y}\) and predicted probabilities \(\hat{\mathbf{p}} = \mathrm{softmax}(\mathbf{z})\) (Sandfeld et al., 2024, eq. 11.20):
\[ \mathcal{L}_{\mathrm{CE}} \;=\; -\sum_{k=1}^{K} y_k \log \hat{p}_k \]
For binary: \[ \mathcal{L} = -[y\log \hat{p} + (1-y)\log(1-\hat{p})] \]
Punishes confident wrong predictions disproportionately — \(\log(0)\) blows up.
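A minimal implementation that also handles the \(\log(0)\) blow-up by clipping (example probabilities are illustrative):

```python
import numpy as np

def cross_entropy(y_onehot, p):
    """Multi-class cross-entropy; clip to avoid log(0) on confident wrong predictions."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(y_onehot * np.log(p))

y = np.array([0.0, 1.0, 0.0])
good = cross_entropy(y, np.array([0.1, 0.8, 0.1]))   # confident and right: small loss
bad = cross_entropy(y, np.array([0.8, 0.1, 0.1]))    # confident and wrong: large loss
```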
Why cross-entropy is the standard loss
Connect: later units use cross-entropy to train CNNs (Unit 5), language-model heads, and even contrastive losses.
df.describe(), plot histograms and scatter plots, check dtypes, sample IDs, units.
The greatest hits of materials-ML failures:
Unit 4: From classical microstructure metrics to learned representations.
Today’s tools you’ll need next week
Required reading
Optional
Exercises (problem set 3)
Implement a group-aware split by hand on (X, y, group_id) triples. Verify against sklearn.GroupKFold.
© Philipp Pelz - Machine Learning in Materials Processing & Characterization