FAU Erlangen-Nürnberg
Materials processes are dynamical.
Centre of gravity for today.
Note
This is the Week 7 lecture, delivered as guided self-study.
The Tuesday 26.05.2026 lecture slot is cancelled (Pfingstdienstag — public holiday), so work through this deck independently. The Thursday 28.05.2026 exercise runs in class as scheduled and consolidates this material. This is a delivered part of the SS26 schedule, not optional reading.
What you can read here on your own.
Where the delivered curriculum picks it up.
By the end of 90 minutes you can:
So far in ML-PC.
Reality of materials manufacture.
Typical channels in a process log.
Heterogeneity is the rule.
MLP.
CNN (1-D).
We need an architecture with memory — a hidden state \(h_t\) updated as new data arrive.
Sampling rate vs. process timescale.
Non-stationarity & autocorrelation.
A point prediction.
A predictive distribution.
Operational consequence. A 90 % prediction interval whose empirical coverage is 60 % is not a forecast — it is a liability. Calibration is non-negotiable for safety-critical loops.
Feed-forward neuron.
\[y = \sigma(Wx + b)\]
Recurrent neuron (McClarren 2021, sec. 7.1).
\[h_t = \sigma(W_h h_{t-1} + W_x x_t + b)\]
Unrolled view.
Forward equations.
\[h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)\] \[\hat y_t = W_{hy} h_t + b_y\]
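The forward equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not a training-ready implementation: the function name `rnn_forward`, the layer sizes and the random initialisation are assumptions made for the example.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y, h0=None):
    """Run a vanilla RNN over x_seq of shape (T, d_in); return (h_seq, y_seq)."""
    n_hidden = W_hh.shape[0]
    h = np.zeros(n_hidden) if h0 is None else h0
    h_seq, y_seq = [], []
    for x_t in x_seq:
        # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        # y_hat_t = W_hy h_t + b_y
        y_seq.append(W_hy @ h + b_y)
        h_seq.append(h)
    return np.stack(h_seq), np.stack(y_seq)

# Illustrative sizes and random weights (assumptions, not course values).
rng = np.random.default_rng(0)
d_in, n_hidden, d_out, T = 3, 8, 1, 20
params = dict(
    W_xh=rng.normal(scale=0.3, size=(n_hidden, d_in)),
    W_hh=rng.normal(scale=0.3, size=(n_hidden, n_hidden)),
    W_hy=rng.normal(scale=0.3, size=(d_out, n_hidden)),
    b_h=np.zeros(n_hidden),
    b_y=np.zeros(d_out),
)
x_demo = rng.normal(size=(T, d_in))
h_seq, y_seq = rnn_forward(x_demo, **params)
```

The same weights are reused at every step, which is exactly what the unrolled view depicts: one layer applied T times.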
Unrolled RNN — a loop becomes a chain of identical layers.
The idea.
Truncated BPTT.
The mechanism (McClarren 2021, sec. 7.1.1).
Why it is fundamental.
Exploding gradients.
Mitigations.
Where they work.
Where they fail.
The first three failure modes motivate LSTM/GRU (next section). The fourth motivates Part 4 — the centre of gravity of today’s lecture.
Goal.
The LSTM idea (Hochreiter and Schmidhuber 1997).
\(C_t\) as a conveyor belt.
\(h_t\) vs \(C_t\).
Forget gate. What to drop from \(C_{t-1}\). \[f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)\]
Input gate. What new information to write. \[i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)\] \[\tilde C_t = \tanh(W_C [h_{t-1}, x_t] + b_C)\]
Cell-state update. \[C_t = f_t \odot C_{t-1} + i_t \odot \tilde C_t\]
Output gate. What to expose as \(h_t\). \[o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\] \[h_t = o_t \odot \tanh(C_t)\]
Each gate is a small sigmoid network on \((h_{t-1}, x_t)\).
The crucial line. \[C_t = f_t \odot C_{t-1} + i_t \odot \tilde C_t\]
When \(f_t \approx 1\) and \(i_t \approx 0\) (forget nothing, write nothing), \(C_t \approx C_{t-1}\) — the identity in the recurrence.
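The four gate equations and the cell-state update can be written as a single step function. The sketch below is a plain NumPy transcription under assumed names (`lstm_step`, dictionaries `W` and `b` keyed by gate); the demo forces \(f_t \approx 1\) and \(i_t \approx 0\) through the biases to show that the cell state is carried through almost unchanged.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step. W[k] has shape (n_hidden, n_hidden + d_in) for k in f, i, C, o."""
    hx = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ hx + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ hx + b["i"])       # input gate
    C_tilde = np.tanh(W["C"] @ hx + b["C"])   # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde        # cell-state update
    o_t = sigmoid(W["o"] @ hx + b["o"])       # output gate
    h_t = o_t * np.tanh(C_t)                  # exposed hidden state
    return h_t, C_t

# Demo with hand-set parameters (assumed values, chosen only to pin the gates).
n_hidden, d_in = 4, 2
W = {k: np.zeros((n_hidden, n_hidden + d_in)) for k in "fiCo"}
b = {k: np.zeros(n_hidden) for k in "fiCo"}
b["f"][:] = 20.0    # forget gate saturated near 1: keep everything
b["i"][:] = -20.0   # input gate saturated near 0: write nothing
C_prev = np.array([0.5, -1.0, 2.0, 0.0])
h_t, C_t = lstm_step(np.ones(d_in), np.zeros(n_hidden), C_prev, W, b)
```

With these settings `C_t` equals `C_prev` to within about 1e-8, while `h_t` is only a gated, squashed view of the cell state.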
Gradient flow.
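A sketch of the argument, under the simplifying assumption that the gates are treated as constants with respect to \(C_{t-1}\) (the indirect paths through \(h_{t-1}\) are ignored):

\[\frac{\partial C_t}{\partial C_{t-1}} \approx \mathrm{diag}(f_t), \qquad \frac{\partial C_t}{\partial C_{t-k}} \approx \prod_{j=0}^{k-1} \mathrm{diag}(f_{t-j})\]

For the plain RNN \(h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)\) the corresponding factor is \(\partial h_t / \partial h_{t-1} = \mathrm{diag}(1 - h_t^2)\, W_{hh}\), so the same weight matrix and a squashing derivative are multiplied in at every step. Along the cell-state path there is no repeated weight matrix, and with \(f_t \approx 1\) the product stays close to the identity.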
Gated Recurrent Unit (Cho et al. 2014).
\[z_t = \sigma(W_z[h_{t-1}, x_t])\] \[r_t = \sigma(W_r[h_{t-1}, x_t])\]
Update. \[\tilde h_t = \tanh(W_h[r_t \odot h_{t-1}, x_t])\] \[h_t = (1-z_t) \odot h_{t-1} + z_t \odot \tilde h_t\]
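The GRU equations translate directly into code. The sketch below follows the bias-free form written above; the function name `gru_step` and the weight shapes are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step; every weight matrix has shape (n_hidden, n_hidden + d_in)."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)                                        # update gate
    r_t = sigmoid(W_r @ hx)                                        # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                    # interpolate old and new

# Sanity check with all-zero weights: z_t = 0.5 and h_tilde = 0, so h_t = 0.5 * h_prev.
n_hidden, d_in = 3, 2
zeros = np.zeros((n_hidden, n_hidden + d_in))
h_prev = np.array([1.0, -2.0, 0.4])
h_t = gru_step(np.ones(d_in), h_prev, zeros, zeros, zeros)
```

The single state \(h_t\) plays both roles that the LSTM splits between \(h_t\) and \(C_t\).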
| Property | RNN | LSTM | GRU |
|---|---|---|---|
| Hidden state | \(h\) | \(h, C\) | \(h\) |
| Gates | 0 | 3 | 2 |
| Params (rel.) | \(1\times\) | \(\sim 4\times\) | \(\sim 3\times\) |
| Long-range | Poor | Good | Good |
| Default? | No | Yes | Yes (compute-bound) |
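The relative parameter counts in the table follow from counting the affine maps applied to \([h_{t-1}, x_t]\): one for the plain RNN, four for the LSTM (three gates plus the candidate) and three for the GRU (two gates plus the candidate). A small sketch with assumed layer sizes; library implementations may count biases slightly differently.

```python
def recurrent_param_count(d_in, n_hidden, n_blocks):
    """Weights plus biases of n_blocks affine maps from [h_{t-1}, x_t] to n_hidden units."""
    return n_blocks * (n_hidden * (n_hidden + d_in) + n_hidden)

# Illustrative sizes (assumptions): 8 input channels, 64 hidden units.
d_in, n_hidden = 8, 64
counts = {
    name: recurrent_param_count(d_in, n_hidden, n_blocks)
    for name, n_blocks in [("RNN", 1), ("LSTM", 4), ("GRU", 3)]
}
```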
Decision rule.
Bidirectional RNN.
Stacked RNN.
Encoder–decoder structure.
Materials-relevant uses.
Where this leads. Attention (Bahdanau et al. 2015) was originally an encoder–decoder fix; transformers (W10) drop the recurrence entirely. We will not develop attention here — keep the RNN as the recurrent baseline.
Why look past LSTM at all.
Materials hook. LPBF melt-pool monitoring at 10 kHz over a 100-layer build \(\Rightarrow 10^7\)+ frames per part. Both LSTM (slow to train) and Transformer (memory blowup) choke. Mamba does not.
The selective-scan trick.
Practical recipe.
mamba_ssm on PyPI; drop-in PyTorch module.
The shape of the trade-off. For long process streams: Mamba. For short windows (a few hundred frames): a small Transformer is still competitive.
Deterministic LSTM.
Probabilistic LSTM.
A 90 % prediction interval is what a control engineer actually needs.
Aleatoric — irreducible.
Epistemic — reducible.
\[\sigma_\mathrm{total}^2(x_{t+1}) = \underbrace{\sigma_\mathrm{aleatoric}^2}_{\text{predicted by the head}} + \underbrace{\sigma_\mathrm{epistemic}^2}_{\text{measured across models / dropout masks}}\]
Synthetic melt-pool signal.
Decomposition recipe.
Synthetic melt-pool signal — the kind of trace we will train on.
Old. \[\hat x_{t+1} = f_\theta(x_{1:t}) \in \mathbb{R}\]
New. \[p_\theta(x_{t+1}\mid x_{1:t})\]
Two parameterisations we will cover.
Both use the same LSTM body; only the final layer changes.
Output two quantities per step.
\[h_t \xrightarrow{\;\text{linear}\;} (\mu_t, \log\sigma_t^2)\]
Predictive distribution.
\[p_\theta(x_{t+1}\mid x_{1:t}) = \mathcal{N}(\mu_t, \sigma_t^2)\]
Loss per step.
\[-\log p(x_{t+1}\mid \mu_t, \sigma_t^2) = \tfrac{(x_{t+1}-\mu_t)^2}{2\sigma_t^2} + \tfrac{1}{2}\log\sigma_t^2 + \mathrm{const.}\]
This is the MLE objective from MFML W8, applied at every step.
Two terms, two roles.
Reduces to MSE when \(\sigma\) is held constant (homoscedastic).
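The per-step loss can be written as a short function of the two head outputs. This is a NumPy sketch with an assumed function name; the constant \(\tfrac{1}{2}\log 2\pi\) is dropped, as in the formula above. The demo checks the homoscedastic case: with \(\sigma = 1\) the mean NLL equals half the MSE.

```python
import numpy as np

def gaussian_nll(x_next, mu, log_var):
    """Per-step Gaussian NLL, omitting the additive constant 0.5 * log(2 * pi)."""
    # First term penalises squared error scaled by the predicted variance,
    # second term penalises claiming a large variance.
    return 0.5 * (x_next - mu) ** 2 / np.exp(log_var) + 0.5 * log_var

# Synthetic targets and predictions (assumed data, for illustration only).
rng = np.random.default_rng(0)
x_next = rng.normal(size=100)
mu = rng.normal(size=100)
nll_unit_sigma = gaussian_nll(x_next, mu, np.zeros(100)).mean()
mse = np.mean((x_next - mu) ** 2)
```

Predicting `log_var` rather than the variance itself keeps \(\sigma_t^2\) positive without any constraint on the linear layer.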
The picture.
\[p(x_{t+1}\mid x_{1:t}) = \sum_{k=1}^K \pi_{k,t}\,\mathcal{N}(\mu_{k,t}, \sigma_{k,t}^2)\]
Head outputs \(3K\) numbers.
\[h_t \to \{\,\pi_{k,t},\, \mu_{k,t},\, \log\sigma_{k,t}^2\,\}_{k=1}^K\]
(Bishop 2006, sec. 5.6; Murphy 2012 ch. 23)
Training. NLL of the mixture. \[\mathcal L = -\sum_t \log \sum_{k=1}^K \pi_{k,t}\,\mathcal{N}(x_{t+1}\mid \mu_{k,t}, \sigma_{k,t}^2)\]
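The mixture NLL can be evaluated entirely in log space, which avoids underflow when a target lies far from every component. The sketch below assumes that the weights \(\pi_{k,t}\) come from a softmax over unconstrained logits; that parameterisation, the helper `logsumexp` and the function name `mdn_nll` are assumptions for illustration.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along an axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(a_max, axis=axis) + np.log(np.sum(np.exp(a - a_max), axis=axis))

def mdn_nll(x_next, logits, mu, log_var):
    """Mixture NLL summed over time. x_next: (T,); logits, mu, log_var: (T, K)."""
    log_pi = logits - logsumexp(logits, axis=1)[:, None]          # log softmax
    log_norm = -0.5 * (np.log(2.0 * np.pi) + log_var
                       + (x_next[:, None] - mu) ** 2 / np.exp(log_var))
    return -np.sum(logsumexp(log_pi + log_norm, axis=1))

# With K = 1 the mixture NLL reduces to the Gaussian NLL (constant included).
rng = np.random.default_rng(0)
x_next = rng.normal(size=50)
mu1 = rng.normal(size=(50, 1))
lv1 = rng.normal(scale=0.1, size=(50, 1))
nll_k1 = mdn_nll(x_next, np.zeros((50, 1)), mu1, lv1)
nll_ref = np.sum(0.5 * (np.log(2.0 * np.pi) + lv1[:, 0]
                        + (x_next - mu1[:, 0]) ** 2 / np.exp(lv1[:, 0])))

# A target 1000 standard deviations away would underflow in probability space.
nll_far = mdn_nll(np.array([1000.0]), np.zeros((1, 2)), np.zeros((1, 2)), np.zeros((1, 2)))
```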
logsumexp for numerical stability.
Inference choices.
Failure: mode collapse. One \(\pi_k \to 1\) while all other weights go to 0, so the mixture degenerates to a single Gaussian.
The trick (Gal and Ghahramani 2016b).
Total predictive variance.
\[\sigma^2_\mathrm{total} = \underbrace{\tfrac{1}{T}\sum_j \sigma^{(j)\,2}_t}_{\text{aleatoric}} + \underbrace{\mathrm{Var}_j(\mu_t^{(j)})}_{\text{epistemic}}\]
Recipe.
Aggregation rule.
\[\mu = \tfrac{1}{K}\sum_k \mu_k\] \[\sigma^2 = \underbrace{\tfrac{1}{K}\sum_k \sigma_k^2}_{\text{aleatoric}} + \underbrace{\tfrac{1}{K}\sum_k(\mu_k-\mu)^2}_{\text{epistemic}}\]
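The aggregation rule is a few array reductions. A minimal sketch, assuming each of the \(K\) ensemble members returns a mean and a variance per time step, stacked into arrays of shape `(K, T)`; the function name is hypothetical.

```python
import numpy as np

def aggregate_ensemble(mu_k, var_k):
    """Combine K member predictions of shape (K, T) into mean and variance parts."""
    mu = mu_k.mean(axis=0)                        # ensemble mean
    aleatoric = var_k.mean(axis=0)                # average predicted noise variance
    epistemic = ((mu_k - mu) ** 2).mean(axis=0)   # disagreement between member means
    return mu, aleatoric, epistemic, aleatoric + epistemic

# Two members, one time step: means 0 and 2, variances 1 and 3.
mu, aleatoric, epistemic, total = aggregate_ensemble(
    np.array([[0.0], [2.0]]), np.array([[1.0], [3.0]])
)
```

Here the members disagree by one unit around the mean, so the epistemic part is 1, the aleatoric part is 2 and the total variance is 3.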
The question. Do my prediction intervals with nominal coverage \(\alpha\) contain the truth at rate \(\alpha\)?
For a held-out test set:
Reading the plot.
[MFML W8: this is the calibration plot you already saw there, applied here to forecast intervals.]
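The coverage check behind the calibration plot can be sketched as follows, assuming a Gaussian predictive distribution with per-step mean and standard deviation. The nominal coverage level (called \(\alpha\) in the question above) is named `p` in the code; the function name and the synthetic data are assumptions.

```python
import numpy as np
from statistics import NormalDist

def empirical_coverage(y, mu, sigma, levels):
    """Fraction of targets inside the central Gaussian interval, per nominal level."""
    coverage = []
    for p in levels:
        z = NormalDist().inv_cdf(0.5 + p / 2.0)   # half-width in units of sigma
        coverage.append(np.mean(np.abs(y - mu) <= z * sigma))
    return np.array(coverage)

# Synthetic held-out set (assumed): the true noise standard deviation is 1.
rng = np.random.default_rng(0)
n = 20000
mu = rng.normal(size=n)
y = mu + rng.normal(size=n)
levels = np.linspace(0.1, 0.9, 9)

cov_calibrated = empirical_coverage(y, mu, np.ones(n), levels)            # honest sigma
cov_overconfident = empirical_coverage(y, mu, 0.5 * np.ones(n), levels)   # sigma too small
```

Plotting each coverage array against `levels` gives the calibration curve: the honest model lies on the diagonal, while the model that reports half the true \(\sigma\) falls well below it, reaching only about 59 % coverage at the nominal 90 % level.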
Platt scaling.

Isotonic regression.

Always recalibrate on a held-out set you did not train on. Otherwise you are fitting calibration on the same data you measured it from — guaranteed-good calibration plot, no real improvement.
The problem with vanilla split conformal.
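Adaptive Conformal Inference (ACI). Gibbs and Candès (2021) update the working miscoverage level \(\alpha_t\) online, where \(\alpha\) is the target miscoverage (e.g. \(\alpha = 0.1\) for 90 % intervals) and \(C_t\) is the interval issued at level \(1-\alpha_t\): \[\alpha_{t+1} = \alpha_t + \gamma\,\big(\alpha - \mathbf{1}[Y_t \notin C_t]\big).\]
After each observed outcome, \(\alpha_t\) moves down if we just missed (the next interval widens) and up if we just covered (the next interval narrows). Long-run target coverage is guaranteed under arbitrary drift; no exchangeability is needed.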
Materials picture.
The only knob. The step size \(\gamma\), typically \(0.005\)–\(0.05\).
The contract. Long-run coverage, not per-step coverage. Intervals can be wide; they will not be wrong on average.
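A minimal online sketch of ACI around any point forecaster. It uses the convention that \(\alpha_t\) is a miscoverage level, so a miss lowers \(\alpha_t\) and widens the next interval. The conformity score (absolute residual), the warm-up length and the function name are assumptions made for this example, not part of the method's definition.

```python
import numpy as np

def aci_intervals(y, y_hat, alpha=0.1, gamma=0.01, warmup=50):
    """Return (half_widths, misses) for steps warmup..len(y)-1 using ACI."""
    scores = np.abs(np.asarray(y, float) - np.asarray(y_hat, float))
    alpha_t = alpha
    half_widths, misses = [], []
    for t in range(warmup, len(scores)):
        if alpha_t <= 0.0:
            q = np.inf                       # cover everything
        elif alpha_t >= 1.0:
            q = -np.inf                      # empty interval
        else:
            q = np.quantile(scores[:t], 1.0 - alpha_t)   # past residuals only
        miss = bool(scores[t] > q)
        half_widths.append(q)
        misses.append(miss)
        # Miss: alpha_t decreases (wider next). Cover: alpha_t increases (narrower next).
        alpha_t += gamma * (alpha - float(miss))
    return np.array(half_widths), np.array(misses)

# Drifting process (assumed): the noise level triples halfway through the stream.
rng = np.random.default_rng(0)
n = 3000
noise_std = np.where(np.arange(n) < 1500, 1.0, 3.0)
y = rng.normal(scale=noise_std)
half_widths, misses = aci_intervals(y, np.zeros(n), alpha=0.1, gamma=0.01)
miss_rate = misses.mean()
```

The long-run miss rate stays close to the 10 % target even though the residual distribution changes mid-stream; individual stretches right after the change can still miss more often, which is the long-run (not per-step) contract.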
Two desiderata, in tension.
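A predictor with infinitely wide intervals never misses and is useless; a predictor that always reports the marginal distribution of the data is perfectly calibrated and still uninformative.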
Proper scoring rules.
The mantra. Maximise sharpness subject to calibration. (Gneiting and Raftery 2007)
Linear-Gaussian state-space model (Bishop 2006 ch. 13; Murphy 2012 ch. 17).
\[z_t = A z_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, Q)\] \[x_t = H z_t + v_t, \quad v_t \sim \mathcal{N}(0, R)\]
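For this model the filtering distribution \(p(z_t \mid x_{1:t})\) stays Gaussian and is computed exactly by the Kalman filter. A minimal sketch of one predict/update step; the names `m` (posterior mean) and `P` (posterior covariance) are notation assumed for the example.

```python
import numpy as np

def kalman_step(m_prev, P_prev, x_t, A, H, Q, R):
    """One Kalman filter step for z_t = A z_{t-1} + w_t, x_t = H z_t + v_t."""
    # Predict: push the previous posterior through the dynamics.
    m_pred = A @ m_prev
    P_pred = A @ P_prev @ A.T + Q
    # Update: correct the prediction with the new observation.
    S = H @ P_pred @ H.T + R                      # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain
    m = m_pred + K @ (x_t - H @ m_pred)
    P = (np.eye(len(m_prev)) - K @ H) @ P_pred
    return m, P

# Scalar example: static state, unit prior variance, unit observation noise.
one = np.array([[1.0]])
m, P = kalman_step(np.array([0.0]), one, np.array([2.0]),
                   A=one, H=one, Q=np.array([[0.0]]), R=one)
```

With equal prior and observation variance the gain is 0.5, so the posterior mean lands halfway between the prior mean 0 and the observation 2, and the variance halves.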
Why mention this in an LSTM lecture.
The score.
\[s_t = -\log p_\theta(x_{t+1}\mid x_{1:t})\]
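With a Gaussian head the score is the per-step NLL, now including the constant so that values are comparable across models. The thresholding rule below (a high quantile of scores on a reference stretch of normal operation) is an assumption for illustration, not something prescribed above; both function names are hypothetical.

```python
import numpy as np

def anomaly_score(x_next, mu, log_var):
    """s_t = -log N(x_{t+1} | mu_t, sigma_t^2), constant included."""
    return 0.5 * (np.log(2.0 * np.pi) + log_var + (x_next - mu) ** 2 / np.exp(log_var))

def flag_anomalies(scores, reference_scores, quantile=0.99):
    """Flag steps whose score exceeds a quantile of scores seen in normal operation."""
    return scores > np.quantile(reference_scores, quantile)

# Reference stretch: unit-variance noise around a perfect mean prediction (assumed data).
rng = np.random.default_rng(0)
reference = anomaly_score(rng.normal(size=1000), np.zeros(1000), np.zeros(1000))

# Test trace: perfectly predicted except for one process event at step 60.
x = np.zeros(100)
x[60] = 8.0
scores = anomaly_score(x, np.zeros(100), np.zeros(100))
flags = flag_anomalies(scores, reference)
```

Because the score divides by the predicted variance, a deviation that is large relative to the model's own stated uncertainty scores high, while the same deviation in a region the model declares noisy does not.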
Contrast with W5 (autoencoder anomaly).
Anomaly score over time — predictive likelihood drops sharply at process events.
The setup.
Baseline performance.
The upgrade.
What the predictive distribution gives us.
Remaining Useful Life.
Why intervals enable maintenance scheduling.
Survival curve.
\[S(t) = P(\Delta T > t \mid x_{1:T_\mathrm{now}})\]
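If the model can produce samples of the remaining lifetime \(\Delta T\) (for instance one per ensemble member or dropout pass), the survival curve can be estimated empirically. A minimal sketch; the sampling source and the function name are assumptions.

```python
import numpy as np

def survival_curve(rul_samples, t_grid):
    """Empirical S(t): fraction of sampled remaining lifetimes that exceed t."""
    rul_samples = np.asarray(rul_samples, float)
    t_grid = np.asarray(t_grid, float)
    return (rul_samples[None, :] > t_grid[:, None]).mean(axis=1)

# Four sampled remaining lifetimes (arbitrary units), evaluated at three horizons.
S = survival_curve([1.0, 2.0, 3.0, 4.0], [0.0, 2.0, 5.0])
```

Here all samples exceed 0, half exceed 2 and none exceed 5, so `S` is 1.0, 0.5 and 0.0 at the three horizons.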
Acting on \(S\).
The toy.
\[y(t) = \sin(\omega t) + \varepsilon, \quad \varepsilon \sim \mathcal N(0, \sigma_\mathrm{noise}^2)\]
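The toy signal and its one-step-ahead training pairs can be generated as follows. The default values of \(\omega\) and \(\sigma_\mathrm{noise}\), the window length and the function names are assumptions chosen for illustration.

```python
import numpy as np

def noisy_sine(n_steps, omega=0.1, sigma_noise=0.2, seed=0):
    """Sample y(t) = sin(omega * t) + eps with eps ~ N(0, sigma_noise^2)."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps)
    return t, np.sin(omega * t) + rng.normal(scale=sigma_noise, size=n_steps)

def make_windows(y, window):
    """Inputs y[i : i + window], targets y[i + window] (one step ahead)."""
    X = np.stack([y[i:i + window] for i in range(len(y) - window)])
    return X, y[window:]

t, y = noisy_sine(2000)
X, targets = make_windows(y, window=50)
```

Since the noise is Gaussian with known standard deviation, a well-calibrated probabilistic head trained on this signal should report a predictive \(\sigma\) close to \(\sigma_\mathrm{noise}\).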
With probabilistic head.
LSTM tracking a noisy sine — the simplest sandbox for probabilistic forecasting.
The picture.
Why probabilistic fusion is the natural framing.
The architecture.
What changes with calibration.
One sentence to take home. “Deterministic for prototypes, probabilistic for control.”
Sliding window.
Padding & masking.
Per-channel standardisation.
Common leakage patterns to avoid.
Deterministic.
Probabilistic.
The mantra (Slide 37 reprise). Use proper scoring rules. MSE on the mean is not one.
Week 7 lecture, delivered as self-study (Tuesday slot cancelled — Pfingstdienstag); the Thursday exercise runs in class.
Today’s anchors.
Forward links.
Exercise (90 min).
Deliverable. A single notebook with the three calibration plots side by side. Each plot must include a 90 % CI line.

© Philipp Pelz - Machine Learning in Materials Processing & Characterization