FAU Erlangen-Nürnberg
Recap — Units 1–6 (ML-PC) and MFML W1–W6
Today — Unit 7 (delivered Week 7)
You have a model that trains. The hard questions begin now:
By the end of this lecture you can:
Six sections, ≈90 min
Two checkpoints, one demo
The model’s job is not to fit the training data; it is to predict well on data it has never seen.
Formally Bishop, Christopher M., (2006):
\[ L_{\text{gen}}(\hat f) \;=\; \mathbb{E}_{(x,y)\sim p_{\text{data}}} \!\left[ \ell\bigl(y, \hat f(x)\bigr)\right] \]
We can never compute this — we only ever estimate it from a held-out sample. Today is largely about building those estimators correctly.
Engineering reading Sandfeld, Stefan et al., (2024):

A model with insufficient capacity to capture the underlying structure.
Symptoms on the curve: both losses plateau early at a high value; the gap is small.
Materials examples:


A model with excess capacity that memorizes training noise.
Symptoms: train loss tiny, validation loss large; big gap.
Materials-specific causes:


For squared-error regression with target \(y = f(x) + \varepsilon\), \(\mathbb{E}[\varepsilon]=0\), \(\mathrm{Var}(\varepsilon)=\sigma^2\) Bishop, Christopher M., (2006); Murphy, Kevin P., (2012):
\[ \mathbb{E}\bigl[(y - \hat f(x))^2\bigr] = \underbrace{\bigl(\mathbb{E}[\hat f(x)] - f(x)\bigr)^2}_{\text{Bias}^2} + \underbrace{\mathrm{Var}(\hat f(x))}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible}} \]


The classical “model selection” picture:
Modern caveat (deep nets). In overparameterized networks the U-curve becomes a double descent: error rises near the interpolation threshold, then falls again deep into the overparameterized regime. We name it; we do not derive it. Pragmatically: regularize and CV either way.
Three structural reasons Sandfeld, Stefan et al., (2024):
Two cultural reasons:
Lesson from Unit 6. When data is scarce, simpler is almost always better unless you can pretrain on a related domain.
Add a penalty on complexity to the training loss:
\[ \mathcal{L}_{\text{reg}}(\theta) \;=\; \mathcal{L}_{\text{data}}(\theta) \;+\; \lambda \cdot \Omega(\theta) \]
| \(\Omega(\theta)\) | Bayesian prior | Effect |
|---|---|---|
| \(\|\theta\|_2^2\) | Gaussian | shrinks all weights toward 0 |
| \(\|\theta\|_1\) | Laplace | sparsity — sets weights to 0 |
| \(\|\nabla \theta\|^2\) | smoothness | spatially smooth fields |
Other regularizers Goodfellow, Ian et al., (2016):




The same data, three values of \(\lambda\). Tuning \(\lambda\) is the most common single hyperparameter problem in applied ML.
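A minimal sketch of what tuning \(\lambda\) looks like in practice (scikit-learn calls it `alpha`); the synthetic data and the three values are illustrative, not a prescription:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Illustrative small dataset: 100 samples, 20 features, only 2 truly informative
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

for alpha in [1e-3, 1e-1, 1e1]:              # three values of lambda
    ridge = Ridge(alpha=alpha)               # L2 penalty: shrinks all weights toward 0
    lasso = Lasso(alpha=alpha)               # L1 penalty: sets some weights exactly to 0
    r2_ridge = cross_val_score(ridge, X, y, cv=5).mean()
    r2_lasso = cross_val_score(lasso, X, y, cv=5).mean()
    print(f"alpha={alpha:g}  ridge R2={r2_ridge:.3f}  lasso R2={r2_lasso:.3f}")
```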
Generalization = same distribution, new sample. Robustness = perturbed distribution.
A model is robust if its prediction \(\hat f(x)\) changes only modestly when:
Two flavors of uncertainty (Unit 2 recap) Neuer, Michael et al., (2024):
A robust model is aleatory-tolerant and epistemic-honest.
The detector noise we cannot remove (Unit 2):
Robust ML requirement: prediction insensitive to noise within the measurement’s natural fluctuation envelope.
Diagnostic test.
The model has not seen this region of input space.
A robust model abstains rather than extrapolates with false confidence.
How to detect it (preview of Unit 11):
If \(\mathrm{Var}(\hat f(x))\) is high, refuse to act.
Train and test data are no longer drawn from the same distribution.
Materials examples:
Detection: prediction distribution looks different on test, or input statistics drift (KS-test, MMD, energy distance).
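A minimal per-feature drift check using the two-sample KS test from SciPy; the function name and the significance threshold are illustrative, and MMD or energy distance would follow the same pattern:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(X_train, X_new, alpha=0.01):
    """Flag features whose marginal distribution has shifted (two-sample KS test)."""
    drifted = []
    for j in range(X_train.shape[1]):
        stat, p = ks_2samp(X_train[:, j], X_new[:, j])
        if p < alpha:                        # reject "same distribution" for feature j
            drifted.append((j, stat, p))
    return drifted
```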
Single bad data points can derail a non-robust loss.
| Loss | Sensitivity | Comment |
|---|---|---|
| MSE / OLS | \(\propto r^2\) | one large \(r\) dominates |
| MAE | \(\propto |r|\) | robust |
| Huber | quadratic small \(r\), linear large \(r\) | best of both |
| Tukey biweight | bounded influence | redescending |
Try it on the chalkboard. Add one \(r = 100\) outlier to a 10-point regression. The MSE-fit line moves visibly; the MAE fit barely twitches.
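The chalkboard experiment as a sketch in code, assuming a recent scikit-learn where `QuantileRegressor` is available; the median (quantile = 0.5) fit plays the MAE role:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, QuantileRegressor

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10).reshape(-1, 1)
y = 2.0 * x.ravel() + 0.05 * rng.normal(size=10)
y[-1] = 100.0                                # one gross outlier, r ~ 100

ols = LinearRegression().fit(x, y)                           # squared error: outlier dominates
mae = QuantileRegressor(quantile=0.5, alpha=0.0).fit(x, y)   # median regression, no penalty

print("OLS slope:", ols.coef_[0])            # dragged far away from the true slope of 2
print("MAE slope:", mae.coef_[0])            # stays close to 2
```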
For classification: the analogous knob is the loss margin (hinge vs cross-entropy vs focal loss). Imbalanced datasets (rare defects, rare phases) further amplify outlier effects Sandfeld, Stefan et al., (2024).
The scientific caveat.
Discuss with your neighbor (2 min):
Test of intent: does the mechanism generating the rare point fit our physical model?
Adversarial example: an imperceptibly small input perturbation that flips the prediction.
\[ x' = x + \delta,\quad \|\delta\| < \epsilon,\quad \hat f(x') \neq \hat f(x) \]
When does it matter for ML-PC?
From Unit 6: augmentation = teaching invariances explicitly.
For materials specifically:

Engineering smell test.
If I change the temperature by 1 °C, the predicted yield strength should not jump by 100 MPa.
Operationalize:
\[ \bigl|\hat f(x + \Delta x) - \hat f(x)\bigr| \;\le\; L \cdot \|\Delta x\| \]
— Lipschitz continuity with constant \(L\) matched to physical knowledge.
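A sketch of the smell test as a finite-difference check, assuming a fitted model with a `predict` method and a bound `L_max` taken from physical knowledge (e.g., MPa per K):

```python
import numpy as np

def lipschitz_check(model, x, delta, L_max):
    """Perturb the input by delta and verify |f(x + dx) - f(x)| <= L_max * ||dx||."""
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=float)   # e.g. +1 degree C on the temperature feature only
    jump = abs(model.predict((x + delta).reshape(1, -1))[0]
               - model.predict(x.reshape(1, -1))[0])
    return jump <= L_max * np.linalg.norm(delta), jump
```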
How to enforce / encourage:
The single hold-out problem. With \(N \approx 100\) samples, an 80/20 split tests on only 20 samples — far too noisy, and the estimate depends heavily on which 20 samples land in the test split.
\(k\)-fold CV Sandfeld, Stefan et al., (2024):

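A minimal \(k\)-fold estimate of generalization error with scikit-learn; the synthetic regression data stands in for a real materials dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=100, n_features=8, noise=5.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                         cv=cv, scoring="neg_mean_squared_error")
print(f"CV MSE: {-scores.mean():.2f} +/- {scores.std():.2f}")
```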
Problem. In an imbalanced dataset (e.g., 95% “good” parts, 5% “defect”), random folds may end up with zero defects in a test fold.
Stratified \(k\)-fold: preserve class proportions in each fold.
Materials examples:
Regression analog. Bin the target \(y\) into quantiles, then stratify on the bin. Keeps fold means comparable. Critical when \(y\) has heavy tails (most materials properties do).
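A sketch of the regression analogue: bin \(y\) into quantiles, then stratify the folds on the bin label (the helper name and defaults are illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratified_regression_folds(X, y, n_splits=5, n_bins=5, seed=0):
    """Yield train/test indices with comparable target distributions per fold."""
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    y_bins = np.digitize(y, edges)           # quantile-bin label per sample
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    yield from skf.split(X, y_bins)
```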
Never put images from the same specimen in both train and test folds.
Group-aware k-fold: partition by group ID (specimen, batch, session, instrument), not by row.
GroupKFold in scikit-learn.
Time-aware variant for sequential data: train on past, test on future (“walk-forward CV”).
TimeSeriesSplit with an expanding window.
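A sketch of both leakage-safe splitters, assuming features `X`, targets `y`, and a per-row `specimen_id` array:

```python
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Group-aware: every specimen's images end up entirely in train OR entirely in test
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=specimen_id):
    ...  # fit on train_idx, evaluate on test_idx

# Time-aware: expanding window, always train on the past and test on the future
tss = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tss.split(X):
    ...
```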
Parameters \(\theta\) — learned by the algorithm:
Hyperparameters \(\eta\) — chosen by the human (or HPO algorithm):
Cardinal rule: never tune hyperparameters on the test set.
Two nested optimization problems:
The outer loop is expensive: each evaluation requires a full training run.
Three classes of outer-loop methods:
Modern variants: Hyperband, BOHB, population-based training.
Try every combination on a regular lattice.
The curse of dimensionality bites: cost scales as \(L^d\) for \(d\) hyperparameters, \(L\) levels per dim.
When grid search is fine:
When grid search is wrong: any time you have \(\ge 3\) hyperparameters of unknown sensitivity. Use random search.
Sample hyperparameters i.i.d. from a prior.
Intuition. Most hyperparameters do not matter. Random search marginalizes over the unimportant ones; grid search wastes budget on them.
Practical recipe:
The 2-D intuition.
If only 1 of 2 hyperparameters matters, grid search at 5 levels samples that 1 hyperparameter at 5 distinct values; random search at 25 trials samples it at ~25 distinct values. 5× more resolution on the important axis, for free.
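A minimal random-search sketch with log-uniform priors on the hyperparameters that usually matter; the model, ranges, and the `X_train, y_train` names are illustrative:

```python
from scipy.stats import loguniform, randint
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions={
        "learning_rate": loguniform(1e-3, 1e0),   # sampled i.i.d. from a prior
        "n_estimators": randint(50, 500),
        "max_depth": randint(2, 6),
    },
    n_iter=50, cv=5, scoring="neg_mean_squared_error", random_state=0,
)
search.fit(X_train, y_train)                      # tuning data only, never the test set
print(search.best_params_, -search.best_score_)
```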
Build a surrogate model of \(L_{\text{val}}(\eta)\), query where it expects to improve most.
When to use BO:
Tools: Optuna, Hyperopt, BoTorch, scikit-optimize, Ax.
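A sketch of the surrogate-driven loop with Optuna (one of the listed tools); the search space, model, and data names are illustrative:

```python
import optuna
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def objective(trial):
    alpha = trial.suggest_float("alpha", 1e-4, 1e2, log=True)   # log-scale search space
    neg_mse = cross_val_score(Ridge(alpha=alpha), X_train, y_train,
                              cv=5, scoring="neg_mean_squared_error").mean()
    return -neg_mse                               # validation MSE to be minimized

study = optuna.create_study(direction="minimize")  # TPE surrogate decides where to query next
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```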
AutoML: automate the entire pipeline — preprocessing, feature engineering, model class selection, hyperparameter search, ensembling.
auto-sklearn, TPOT, H2O AutoML, AutoGluon.
Architecture search (NAS): search over NN architectures themselves.
Caveats for materials.
Discipline:
For small \(N\): nested cross-validation.
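A sketch of nested CV: the inner loop tunes \(\eta\), the outer loop estimates generalization on samples that never touched the tuning (model and grid are illustrative):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

inner = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                     cv=KFold(n_splits=5, shuffle=True, random_state=0))
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(n_splits=5, shuffle=True, random_state=1))
print(outer_scores.mean(), outer_scores.std())    # honest estimate of generalization
```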
Forbidden moves.
These are test-set leakage. The number you report is no longer an estimate of generalization.
The region in process-parameter space where the product meets specifications.
A process window is the intersection of all “good enough” regions.
Why ML? Each specification corresponds to a model that maps process → property. Combining them gives a window. ML fills the parameter space without running every experiment.
Engineering deliverable. A process window is the output the production team actually wants. Not a test accuracy. Not an \(R^2\). A map: “which \((P, v, T)\) values are safe?”
Pre-ML construction:
Limitations:

Train a classifier or regressor on (process, outcome) pairs:
The boundary = the set where the classifier output transitions from “pass” to “fail” — typically the \(p = 0.5\) contour.
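A sketch of the ML construction, assuming an array `PV` of (power, speed) pairs, a binary pass/fail label `ok` (1 = pass), and illustrative parameter ranges:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(PV, ok)

# Evaluate p(pass) on a dense (P, v) grid; the window is the region above the chosen contour
P_grid, v_grid = np.meshgrid(np.linspace(100, 400, 200),    # W   (illustrative range)
                             np.linspace(200, 2000, 200))   # mm/s (illustrative range)
p_pass = clf.predict_proba(np.c_[P_grid.ravel(), v_grid.ravel()])[:, 1]
window = (p_pass >= 0.5).reshape(P_grid.shape)
# For a conservative window, require p_pass >= 0.9 instead of 0.5
```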
What this gives you that DoE alone does not:
A sharp boundary lies. Use probability contours.
Operational use:

The promise. Whatever you trained in §1–§3 — ridge, random forest, deep ensemble, GP — you can wrap it with split conformal prediction and get:
Five lines of code, one theorem, zero retraining.
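A minimal split-conformal sketch, assuming any fitted point predictor `model`, a held-out calibration set `X_cal, y_cal`, and new inputs `X_new` (NumPy ≥ 1.22 for the `method=` argument):

```python
import numpy as np

alpha = 0.10                                        # target 90% coverage
resid = np.abs(y_cal - model.predict(X_cal))        # calibration residuals
n = len(resid)
q = np.quantile(resid, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

y_hat = model.predict(X_new)
lower, upper = y_hat - q, y_hat + q                 # marginal coverage >= 1 - alpha
```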
Materials deployment recipe
One row added to the model card:
“90% split-conformal coverage, \(\alpha=0.10\), verified on \(n_{\text{cal}}=200\) held-out samples.”
Note
Full method + adaptive widths (CQR) in Unit 11. Today’s slide tells you it exists and that any §1–§3 model can use it.
The canonical AM process map. Laser-powder-bed fusion (L-PBF):
Three regimes (drawn on the chalkboard):

Real specifications combine multiple properties:
\[ \Omega = \bigcap_{i=1}^{m} \{\eta : f_i(\eta) \in [a_i, b_i]\} \]
Each \(f_i\) is a separate learned model with its own uncertainty. The intersection’s confidence is the worst component’s confidence at every point.
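A sketch of the intersection on a parameter grid, assuming a list of fitted property models and their spec bounds (all names illustrative):

```python
import numpy as np

def process_window_mask(models, bounds, eta_grid):
    """Boolean mask of grid points where every property model meets its spec."""
    mask = np.ones(len(eta_grid), dtype=bool)
    for f_i, (a_i, b_i) in zip(models, bounds):
        pred = f_i.predict(eta_grid)
        mask &= (pred >= a_i) & (pred <= b_i)       # intersection of "good enough" regions
    return mask
```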
Trade-off frontier. When the intersection is empty, find the Pareto front: the set of \(\eta\) values that are non-dominated.
This is multi-objective optimization; classical tools: NSGA-II, multi-objective Bayesian optimization with hypervolume acquisition.
Real processes have \(d \gg 2\) parameters. You cannot draw a \(d\)-D window. Strategies:
Caveat. A 2-D slice through a 5-D window can look like a closed contour but be misleading: nearby slices can be empty. Always supplement with marginal probabilities \(p(\Omega \mid \eta_j)\) for each parameter \(j\).
The closed loop.
Distance-to-boundary ≈ \(\frac{p(\eta) - 0.5}{\|\nabla_\eta p(\eta)\|}\) — local linearization.
Examples Neuer, Michael et al., (2024):
This is autonomous process control — Unit 10 picks it up.
The keyhole boundary, with your neighbor (3 min):
You have an L-PBF dataset of \((P, v)\) pairs and porosity measurements. You train two models:
You must deliver a process window for production. Which do you ship?
Hints to consider:
One reasonable answer: Ship Model B. Use Model A as a sanity-check baseline. The 5-point \(R^2\) gap is dwarfed by the value of calibrated uncertainty.
Sensitivity analysis quantifies how much the output changes when an input changes.
Two questions, not one:
Why this is a robustness diagnostic.
At a query point \(x_0\), perturb each input by \(\Delta x_j\):
\[ S_j(x_0) \approx \frac{\partial \hat f}{\partial x_j}\bigg|_{x_0} \;\approx\; \frac{\hat f(x_0 + \Delta x_j) - \hat f(x_0)}{\Delta x_j} \]
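A sketch of the local, one-at-a-time sensitivity at a query point, assuming a fitted model with a `predict` method and a physically meaningful step per input:

```python
import numpy as np

def local_sensitivity(model, x0, steps):
    """Forward-difference S_j = (f(x0 + dx_j e_j) - f(x0)) / dx_j for each input j."""
    x0 = np.asarray(x0, dtype=float)
    f0 = model.predict(x0.reshape(1, -1))[0]
    S = np.zeros_like(x0)
    for j, dx in enumerate(steps):
        x = x0.copy()
        x[j] += dx
        S[j] = (model.predict(x.reshape(1, -1))[0] - f0) / dx
    return S                                  # carries units: output unit per unit of input j
```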
Limitations:
Materials interpretation: \(S_j\) has units. Yield strength (MPa) per unit composition (at%) is a physical sensitivity — compare to thermodynamic models, Hume-Rothery rules, etc.
How does \(x_j\) matter, on average, across the input distribution?
Variance decomposition (Sobol):
\[ \mathrm{Var}(\hat f) \;=\; \sum_j V_j \;+\; \sum_{j<k} V_{jk} \;+\; \cdots \]
Practical use. Rank inputs by \(S_j^T\). Drop or de-prioritize inputs with \(S_j^T \approx 0\).
For materials processes: \(S_j^T\) tells you which control knobs are worth tightening tolerances on.
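A sketch of a global Sobol analysis with SALib (listed under tools at the end of this unit), assuming a cheap-to-evaluate surrogate `model`; the three inputs and their bounds are illustrative:

```python
from SALib.sample import saltelli
from SALib.analyze import sobol

problem = {
    "num_vars": 3,
    "names": ["laser_power", "scan_speed", "hatch_spacing"],   # illustrative inputs
    "bounds": [[100, 400], [200, 2000], [0.05, 0.2]],
}
eta = saltelli.sample(problem, 1024)          # Saltelli sampling scheme
Y = model.predict(eta)                        # evaluate the surrogate on all sample points
Si = sobol.analyze(problem, Y)
print(Si["S1"], Si["ST"])                     # first-order and total Sobol indices
```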
Empirical, model-agnostic, easy.
For each feature \(j\):
Strengths:
Caveats:
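A minimal sketch with scikit-learn's permutation_importance on a held-out validation split; the model and the `X_val, y_val` names are assumptions:

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_val, y_val,
                                n_repeats=20, random_state=0,
                                scoring="neg_mean_squared_error")
# Mean drop in score when feature j is shuffled; a large drop marks an important feature
for j in result.importances_mean.argsort()[::-1]:
    print(j, result.importances_mean[j], result.importances_std[j])
```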
For image inputs, “feature importance” is spatial.
The Husky-vs-Wolf test (Unit 3 recap).
If the saliency map highlights the scale bar, the watermark, or the corner of the field of view → the model is shortcutting. Catch this before deployment.
Materials examples:
Shapley values: fairly distribute the “credit” for a prediction among the input features.
For each feature \(j\) and prediction \(\hat f(x)\):
\[ \phi_j = \sum_{S \subseteq F\setminus\{j\}} \frac{|S|!\,(d-|S|-1)!}{d!} \bigl[\hat f(S\cup\{j\}) - \hat f(S)\bigr] \]
Why SHAP, not just gradient × input:
Tools: shap library; TreeSHAP for trees is exact and fast.
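A minimal sketch with the shap library for a regression tree ensemble; the fitted `model` and validation features `X_val` are assumptions:

```python
import shap

explainer = shap.TreeExplainer(model)         # exact and fast for tree ensembles
shap_values = explainer.shap_values(X_val)    # one phi_j per feature per prediction
shap.summary_plot(shap_values, X_val)         # global view: which features drive predictions
```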
What §5 has been doing. Saliency, Grad-CAM, SHAP all answer “which inputs matter for this prediction.”
What they cannot do. Tell us “what concept the model internally represents in its hidden layers.” A single neuron in a defect-classification CNN typically fires for many unrelated patterns at once — polysemanticity — because the network had more features to encode than neurons available Elhage, Nelson et al., (2022).
The mechanistic-interpretability fix. Train a Sparse Autoencoder (SAE) on the layer’s activations \(h \in \mathbb{R}^d\):
\[ \hat h = D\big(\mathrm{ReLU}(E h - b)\big), \qquad \mathcal{L} = \|h - \hat h\|_2^2 + \lambda \,\big\|\mathrm{ReLU}(E h - b)\big\|_1. \]
The wide (\(d' \gg d\)), sparsely-activating SAE features tend toward monosemanticity — one feature, one concept Templeton, Adly et al., (2024).
Why this lands in materials
Connection to Unit 5. An SAE is exactly the Unit-5 autoencoder + an \(\ell_1\) activation penalty. The architecture is unchanged; the loss adds one term.
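A sketch of that one-term change in PyTorch; the layer width \(d\), dictionary size \(d'\), and \(\lambda\) are illustrative:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d=512, d_hidden=8192):        # d' >> d: overcomplete dictionary
        super().__init__()
        self.enc = nn.Linear(d, d_hidden)
        self.dec = nn.Linear(d_hidden, d)

    def forward(self, h):
        f = torch.relu(self.enc(h))                   # sparse feature activations
        return self.dec(f), f

def sae_loss(h, h_hat, f, lam=1e-3):
    # reconstruction term + L1 penalty on the activations (the single added term)
    return ((h - h_hat) ** 2).sum(-1).mean() + lam * f.abs().sum(-1).mean()
```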
Note
SHAP/Grad-CAM remain the right tools for single-prediction explanations. SAEs are for global model audit — what concepts does the network actually carry around.
Two complementary criteria:
Combined diagnostic:
\[ Q \;=\; \frac{\|\partial \hat f / \partial x_{\text{physical}}\|}{\|\partial \hat f / \partial x_{\text{nuisance}}\|} \]
— physical signal over nuisance noise. Larger is better.
Anti-pattern. A model with low overall sensitivity looks robust but is useless: a constant predictor has zero sensitivity to everything.
The right combination: high signal sensitivity, low nuisance sensitivity.
One coherent story:
Each layer is a check on the previous one. Skip a layer and your published model fails on next month’s batch.
Textbook foundations:
Selected papers:
SALib.
Related ML-PC units:

© Philipp Pelz - Machine Learning in Materials Processing & Characterization