Machine Learning in Materials Processing & Characterization
Unit 7: Time-Series and Process Monitoring
FAU Erlangen-Nürnberg
Key question: How do we build neural networks that understand order and history?
By the end of this unit, you can:
Slides 03–08
The process history determines the final microstructure — we should learn from it directly.
A single additive manufacturing build can produce terabytes of time-series data across thousands of sensors.

The MLP problem: Flattening a time-series destroys temporal order
Note
An MLP cannot distinguish “heat then quench” from “quench then heat” — both are just a bag of temperature values.
The CNN problem: 1D convolutions capture local patterns, but lack global memory
The fixed-length problem: Both MLPs and CNNs require fixed input dimensions
Key properties of process signals that matter for ML:
What we need: An architecture that naturally handles variable-length, ordered, autocorrelated data with a persistent memory.
Slides 09–20
The hidden state \(h_t\) is the network’s memory:
\[h_t = f(x_t, h_{t-1})\]
This simple recursion is the foundation of all recurrent architectures.
The vanilla RNN computes:
\[h_t = \tanh(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h)\] \[y_t = W_{hy}\, h_t + b_y\]
Crucially: The same weights \(W_{xh}, W_{hh}, W_{hy}\) are shared across all time steps — just like CNN kernels are shared across spatial positions.
To understand training, we “unroll” the loop across time:
graph LR
x1["x<sub>1</sub>"] --> R1["RNN<br>Cell"]
x2["x<sub>2</sub>"] --> R2["RNN<br>Cell"]
x3["x<sub>3</sub>"] --> R3["RNN<br>Cell"]
xT["x<sub>T</sub>"] --> RT["RNN<br>Cell"]
R1 -- "h<sub>1</sub>" --> R2
R2 -- "h<sub>2</sub>" --> R3
R3 -. "..." .-> RT
R1 --> y1["y<sub>1</sub>"]
R2 --> y2["y<sub>2</sub>"]
R3 --> y3["y<sub>3</sub>"]
RT --> yT["y<sub>T</sub>"]
h0["h<sub>0</sub>"] --> R1
style R1 fill:#2d6a4f,stroke:#fff,color:#fff
style R2 fill:#2d6a4f,stroke:#fff,color:#fff
style R3 fill:#2d6a4f,stroke:#fff,color:#fff
style RT fill:#2d6a4f,stroke:#fff,color:#fff
Key observations from the unrolled view:

Example: 3-step sequence, \(h_0 = \mathbf{0}\)
| Step | Computation |
|---|---|
| \(t=1\) | \(h_1 = \tanh(W_{xh} x_1 + W_{hh} \cdot \mathbf{0} + b_h)\) |
| \(t=2\) | \(h_2 = \tanh(W_{xh} x_2 + W_{hh} h_1 + b_h)\) |
| \(t=3\) | \(h_3 = \tanh(W_{xh} x_3 + W_{hh} h_2 + b_h)\) |
| Output | \(y_3 = W_{hy} h_3 + b_y\) |
Notice: \(h_3\) contains information about \(x_1, x_2, x_3\) — but \(x_1\)’s influence has been transformed twice by \(W_{hh}\).
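A minimal NumPy sketch of this three-step unroll with shared weights \(W_{xh}, W_{hh}, W_{hy}\); the sizes and random values are illustrative, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h = 2, 3                      # illustrative input / hidden sizes

W_xh = rng.normal(scale=0.5, size=(n_h, n_x))
W_hh = rng.normal(scale=0.5, size=(n_h, n_h))
W_hy = rng.normal(scale=0.5, size=(1, n_h))
b_h, b_y = np.zeros(n_h), np.zeros(1)

x = rng.normal(size=(3, n_x))        # sequence x_1, x_2, x_3
h = np.zeros(n_h)                    # h_0 = 0

for x_t in x:                        # the same weights are reused at every step
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

y_3 = W_hy @ h + b_y                 # output after the last step
print(y_3)
```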
Unlike MLPs and CNNs, RNNs naturally handle sequences of different lengths:
How it works: Simply run the forward pass for as many steps as the sequence requires. The hidden state accumulates information regardless of length.
For batched training: Pad shorter sequences and use a mask to ignore padded positions (more in Part 5).
\[\frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial W_{hh}} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial \mathcal{L}_t}{\partial h_t} \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \frac{\partial h_k}{\partial W_{hh}}\]
The product \(\prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}\) is the source of all problems.
Scenario: You are monitoring a laser powder bed fusion build. At layer 50 (out of 500), there was a brief power fluctuation that caused incomplete melting. The final part is tested at layer 500.
Question: Can a vanilla RNN at layer 500 still “remember” what happened at layer 50?
Answer: Almost certainly not. The information must survive 450 multiplications by \(W_{hh}\). Unless the eigenvalues of \(W_{hh}\) are very close to 1 in magnitude, the signal either vanishes or explodes.
This is the vanishing gradient problem — and it is a show-stopper for long sequences.
Each factor in the gradient product is:
\[\frac{\partial h_j}{\partial h_{j-1}} = \text{diag}\!\bigl(\tanh'(z_j)\bigr) \cdot W_{hh}\]
Since \(|\tanh'(z)| \leq 1\) and, typically, the largest singular value of \(W_{hh}\) is \(< 1\), this product shrinks exponentially as the gap \(t - k\) grows.
Consequence: The network cannot learn long-range dependencies. It effectively has a “memory horizon” of \(\sim 10\)–\(20\) steps (McClarren 2021).
The opposite case: if the largest singular value of \(W_{hh}\) is \(> 1\):
Practical fix: Gradient clipping — rescale the gradient if its norm exceeds a threshold:
\[\tilde{g} = \begin{cases} g & \text{if } \|g\| \leq \theta \\ \theta \cdot \frac{g}{\|g\|} & \text{if } \|g\| > \theta \end{cases}\]
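In PyTorch, this rescaling is a one-liner applied between the backward pass and the optimizer step; the model, dummy loss, and threshold below are illustrative placeholders:

```python
import torch

model = torch.nn.RNN(input_size=1, hidden_size=32, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 100, 1)           # batch of 8 sequences, 100 steps each
out, _ = model(x)
loss = out.pow(2).mean()             # dummy loss for illustration

optimizer.zero_grad()
loss.backward()
# Rescale the gradient if its norm exceeds the threshold theta = 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```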
Gradient clipping handles the exploding case, but does not fix vanishing gradients. We need a fundamentally different architecture for that.
Next: LSTMs and GRUs — architectures designed to maintain long-term memory.
Slides 21–32
The core insight (Hochreiter & Schmidhuber, 1997):
Tip
Think of the cell state as a conveyor belt running through time. Information can be placed on or removed from the belt via gates, but the belt itself just moves forward — no multiplicative shrinkage.
The LSTM introduces a cell state \(C_t\) alongside the hidden state \(h_t\):
graph LR
C_prev["C<sub>t-1</sub>"] -- "×forget + add new" --> C_next["C<sub>t</sub>"]
h_prev["h<sub>t-1</sub>"] --> Gates["Gates"]
x_t["x<sub>t</sub>"] --> Gates
Gates --> C_next
Gates --> h_next["h<sub>t</sub>"]
C_next --> h_next
style Gates fill:#e76f51,stroke:#fff,color:#fff
style C_prev fill:#264653,stroke:#fff,color:#fff
style C_next fill:#264653,stroke:#fff,color:#fff
Purpose: Decide which parts of the old cell state \(C_{t-1}\) to erase
\[f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)\]
Materials example: When a new casting cycle begins, the forget gate can learn to clear the memory of the previous cycle.
Purpose: Decide what new information to store in the cell state
Step 1 — What to update: \[i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)\]
Step 2 — Candidate values: \[\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)\]
Cell state update (the critical equation): \[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]
This is additive — no vanishing gradient along the \(C_t\) pathway!
Purpose: Decide what to output from the cell state as the new hidden state
\[o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\] \[h_t = o_t \odot \tanh(C_t)\]
Materials example: The cell state might remember the entire thermal history, but the output gate selects only the information relevant for predicting the current microstructural phase.
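The gate equations and the additive cell-state update can be written out directly. A minimal NumPy sketch of a single LSTM time step, with illustrative sizes and random weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM time step following the slide equations."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)            # forget gate
    i_t = sigmoid(W_i @ z + b_i)            # input gate
    C_tilde = np.tanh(W_C @ z + b_C)        # candidate values
    C_t = f_t * C_prev + i_t * C_tilde      # additive cell-state update
    o_t = sigmoid(W_o @ z + b_o)            # output gate
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

# Illustrative sizes: 4 input channels, 8 hidden units
n_x, n_h = 4, 8
rng = np.random.default_rng(1)
W = lambda: rng.normal(scale=0.1, size=(n_h, n_h + n_x))
b = lambda: np.zeros(n_h)
h_t, C_t = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h),
                     W(), W(), W(), W(), b(), b(), b(), b())
```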
Gradient flow through the cell state:
\[\frac{\partial C_t}{\partial C_{t-1}} = f_t\]
Compare to vanilla RNN: \(\frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(\tanh') \cdot W_{hh}\)
| | Vanilla RNN | LSTM (cell state) |
|---|---|---|
| Gradient factor | \(\tanh' \cdot W_{hh}\) | \(f_t\) (learned, near 1) |
| After 100 steps | \(\sim 10^{-30}\) | \(\sim 0.9^{100} \approx 10^{-5}\) |
Cho et al. (2014) simplified the LSTM to two gates:
Update gate: \(z_t = \sigma(W_z [h_{t-1}, x_t])\) — combines forget + input gates
Reset gate: \(r_t = \sigma(W_r [h_{t-1}, x_t])\) — controls how much past to use for candidate
\[\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t])\] \[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\]
| Property | Vanilla RNN | LSTM | GRU |
|---|---|---|---|
| Gates | 0 | 3 (forget, input, output) | 2 (update, reset) |
| Memory states | \(h_t\) only | \(h_t\) and \(C_t\) | \(h_t\) only |
| Parameters per unit | \(n_h^2 + n_h n_x\) | \(4(n_h^2 + n_h n_x)\) | \(3(n_h^2 + n_h n_x)\) |
| Long-range memory | Poor (\(\sim 10\) steps) | Excellent (\(\sim 1000\) steps) | Good (\(\sim 500\) steps) |
| Training speed | Fast | Slowest | Medium |
Rule of thumb: Start with LSTM. Try GRU if training is too slow. Vanilla RNN only for very short sequences or as a pedagogical stepping stone.
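The 1 : 4 : 3 parameter ratio from the table can be checked directly against PyTorch's built-in layers (the sizes below are illustrative):

```python
import torch.nn as nn

n_x, n_h = 16, 64
for name, layer in [("RNN",  nn.RNN(n_x, n_h)),
                    ("LSTM", nn.LSTM(n_x, n_h)),
                    ("GRU",  nn.GRU(n_x, n_h))]:
    n_params = sum(p.numel() for p in layer.parameters())
    print(f"{name:5s}: {n_params} parameters")
# Expected ratio (ignoring biases): roughly 1 : 4 : 3
```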
\[\overrightarrow{h_t} = \text{RNN}_{\text{fwd}}(x_t, \overrightarrow{h_{t-1}}) \qquad \overleftarrow{h_t} = \text{RNN}_{\text{bwd}}(x_t, \overleftarrow{h_{t+1}})\] \[h_t = [\overrightarrow{h_t};\, \overleftarrow{h_t}]\]
When to use: Only for offline analysis where the full sequence is available
\[h_t^{(l)} = \text{RNN}^{(l)}(h_t^{(l-1)}, h_{t-1}^{(l)})\]
Practical tip: 2–3 layers is typical. Beyond that, use dropout between layers to prevent overfitting.
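Both stacking and bidirectionality are constructor flags in PyTorch; a sketch with illustrative sizes, where dropout acts between the stacked layers as recommended above:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(
    input_size=8,        # number of sensor channels
    hidden_size=64,
    num_layers=2,        # stacked LSTM
    dropout=0.2,         # applied between layers 1 and 2
    bidirectional=True,  # offline analysis only: needs the full sequence
    batch_first=True,
)

x = torch.randn(4, 200, 8)            # (batch, time, channels)
out, (h_n, c_n) = lstm(x)
print(out.shape)                      # (4, 200, 128): forward and backward concatenated
```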
Not all tasks map a sequence to a single output. Common patterns:
Encoder-Decoder for many-to-many:
Materials application: Input = current thermal history \(\to\) Output = predicted future temperature profile for the next 100 time steps.
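A compact encoder-decoder sketch for this kind of sequence-to-sequence forecast; the module, its sizes, and the recursive decoding loop are illustrative assumptions, not a reference implementation from the course:

```python
import torch
import torch.nn as nn

class Seq2SeqForecaster(nn.Module):
    """Encode the thermal history, then decode a future temperature profile."""
    def __init__(self, n_channels=1, hidden=64, horizon=100):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(n_channels, hidden, batch_first=True)
        self.decoder = nn.LSTM(n_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_channels)

    def forward(self, history):
        # history: (batch, T_in, n_channels)
        _, state = self.encoder(history)          # summary of the past
        step = history[:, -1:, :]                 # start from the last observation
        preds = []
        for _ in range(self.horizon):             # recursive decoding
            out, state = self.decoder(step, state)
            step = self.head(out)                 # next predicted value
            preds.append(step)
        return torch.cat(preds, dim=1)            # (batch, horizon, n_channels)

model = Seq2SeqForecaster()
future = model(torch.randn(2, 300, 1))            # 300 past steps -> 100 future steps
```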
Slides 33–42
Task: Detect defects during laser powder bed fusion from in-situ sensor data
Anomaly score: \(a_t = |x_{t+1} - \hat{x}_{t+1}|\)
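Given next-step predictions from any forecaster, the anomaly score and a simple threshold test take a few lines of NumPy; the threshold rule below is an illustrative assumption:

```python
import numpy as np

def anomaly_scores(x_true_next, x_pred_next, threshold=None):
    """a_t = |x_{t+1} - x_hat_{t+1}|; flag steps whose score exceeds a threshold."""
    a = np.abs(x_true_next - x_pred_next)
    if threshold is None:
        # Illustrative choice: mean + 3 standard deviations of the scores
        threshold = a.mean() + 3 * a.std()
    return a, a > threshold

a, flags = anomaly_scores(np.random.rand(1000), np.random.rand(1000))
print(f"{flags.sum()} suspicious time steps")
```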

Going beyond detection: Can we predict defects before they occur?
Architecture: Stacked LSTM with many-to-one output
Problem: Given sensor data from a machine or component, predict how many cycles/hours remain before failure
LSTM approach:
Note
RUL prediction is a regression task where the target decreases monotonically. The LSTM must learn degradation trajectories from historical failures and generalize to new operating conditions.
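A minimal many-to-one regression sketch for RUL, assuming fixed-length multichannel sensor windows as input; layer choices and sizes are illustrative:

```python
import torch
import torch.nn as nn

class RULRegressor(nn.Module):
    """Stacked LSTM; only the last hidden state feeds the regression head."""
    def __init__(self, n_channels=14, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, num_layers=2,
                            dropout=0.2, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                         # x: (batch, time, channels)
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1]).squeeze(-1)     # remaining cycles per sample

model = RULRegressor()
rul = model(torch.randn(8, 120, 14))              # e.g. 120-step windows, 14 sensors
```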
Creep: Time-dependent deformation under constant stress at elevated temperature
Corrosion monitoring: Electrochemical impedance spectroscopy (EIS) over months
McClarren’s example (McClarren 2021): Can an RNN recover the frequency of a noisy sine wave?
\[x(t) = A \sin(2\pi f t) + \epsilon(t)\]
Materials relevance: Extracting characteristic frequencies from acoustic emission, vibration analysis, or impedance spectroscopy in noisy industrial environments.
Extension: Predicting the phase shift between two synchronized signals
LSTM advantage: The phase relationship between signals requires comparing events separated in time — exactly the kind of long-range dependency that LSTMs handle well.

McClarren’s benchmark (McClarren 2021): Balancing an inverted pendulum on a cart
Real manufacturing processes have multiple sensors recording simultaneously:
Multivariate LSTM: Input \(x_t \in \mathbb{R}^d\) where \(d\) = number of sensor channels
graph LR
S["Sensors"] --> P["Preprocessing<br>(filtering, scaling)"]
P --> L["LSTM Model<br>(inference)"]
L --> D{"Anomaly<br>Detected?"}
D -- "Yes" --> A["Adjust Parameters<br>(power, speed)"]
D -- "No" --> C["Continue<br>Normal Operation"]
A --> M["Manufacturing<br>Process"]
C --> M
M --> S
style L fill:#2d6a4f,stroke:#fff,color:#fff
style D fill:#e76f51,stroke:#fff,color:#fff
Latency requirements: The entire loop must complete in \(< 1\) ms for high-speed processes
Slides 43–50
The sliding window approach: Convert a long time-series into many training samples
Given a signal \([x_1, x_2, \ldots, x_N]\) with window size \(W\) and prediction horizon \(H\):
| Input (window) | Target |
|---|---|
| \([x_1, \ldots, x_W]\) | \(x_{W+H}\) |
| \([x_2, \ldots, x_{W+1}]\) | \(x_{W+1+H}\) |
| \([x_3, \ldots, x_{W+2}]\) | \(x_{W+2+H}\) |
Critical choices:
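A minimal sketch of the sliding-window construction from the table above; window size \(W\) and horizon \(H\) are the two critical choices:

```python
import numpy as np

def make_windows(x, window=50, horizon=10):
    """Turn a 1D signal into (input window, target) training pairs."""
    X, y = [], []
    for start in range(len(x) - window - horizon + 1):
        X.append(x[start:start + window])          # [x_i, ..., x_{i+W-1}]
        y.append(x[start + window + horizon - 1])  # x_{i+W-1+H}
    return np.array(X), np.array(y)

signal = np.sin(np.linspace(0, 20 * np.pi, 2000))
X, y = make_windows(signal, window=50, horizon=10)
print(X.shape, y.shape)   # (1941, 50) (1941,)
```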
Problem: Batched training requires all sequences in a batch to have the same length
Solution: Pad shorter sequences with zeros and use a mask to ignore padded positions
# Example: Three sequences of different lengths
seq_1 = [0.3, 0.5, 0.7, 0.2, 0.1] # length 5
seq_2 = [0.8, 0.6, 0.4] # length 3
seq_3 = [0.1, 0.9, 0.3, 0.7] # length 4
# After padding (to max length 5)
padded = [[0.3, 0.5, 0.7, 0.2, 0.1], # mask: [1,1,1,1,1]
          [0.8, 0.6, 0.4, 0.0, 0.0],   # mask: [1,1,1,0,0]
          [0.1, 0.9, 0.3, 0.7, 0.0]]   # mask: [1,1,1,1,0]
PyTorch: torch.nn.utils.rnn.pack_padded_sequence handles this efficiently — padded positions are skipped during computation, saving time and preventing the model from learning to predict zeros.
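A sketch of how the padded batch above can be packed before being fed to an LSTM; tensor shapes and the hidden size are illustrative:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

padded = torch.tensor([[0.3, 0.5, 0.7, 0.2, 0.1],
                       [0.8, 0.6, 0.4, 0.0, 0.0],
                       [0.1, 0.9, 0.3, 0.7, 0.0]]).unsqueeze(-1)  # (batch, time, 1)
lengths = torch.tensor([5, 3, 4])

packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)
lstm = torch.nn.LSTM(input_size=1, hidden_size=8, batch_first=True)
packed_out, (h_n, _) = lstm(packed)        # padded positions are never computed
out, _ = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)                           # (3, 5, 8)
```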
Why scaling matters more for RNNs than for other architectures:
Recommended approach:
Tip
Beware of look-ahead leakage: Never compute statistics using future data. In time-series, even the mean and standard deviation must be computed only from past and current observations.
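A minimal sketch of leakage-free scaling: the statistics come only from the training portion of the series, and the chronological split fraction is an illustrative choice:

```python
import numpy as np

signal = np.random.randn(10_000).cumsum()     # stand-in for a process signal
split = int(0.7 * len(signal))                # chronological split, no shuffling

train, test = signal[:split], signal[split:]
mu, sigma = train.mean(), train.std()         # statistics from the training period only

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma             # reuse training statistics: no look-ahead
```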
Scenario: You train an LSTM to predict melt pool temperature 10 steps ahead. Your model achieves MSE = 0.01 on the test set. Your colleague congratulates you.
But wait: What is the MSE of a “naive” baseline that simply predicts \(\hat{x}_{t+10} = x_t\) (the last known value)?
Surprise: For slowly-varying signals, the naive baseline often achieves MSE \(\approx 0.005\) — better than your LSTM!
Lesson: Always compare against baselines. For time-series, the persistence model (\(\hat{x}_{t+H} = x_t\)) and linear trend (\(\hat{x}_{t+H} = x_t + H \cdot \Delta x_t\)) are essential baselines. Report skill score: \(\text{SS} = 1 - \frac{\text{MSE}_\text{model}}{\text{MSE}_\text{baseline}}\).
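A sketch of the persistence baseline and the skill score for a horizon-\(H\) forecast; `model_pred` is assumed to come from your own LSTM:

```python
import numpy as np

def skill_score(y_true, model_pred, x_last):
    """SS = 1 - MSE_model / MSE_persistence for a fixed horizon H."""
    mse_model = np.mean((y_true - model_pred) ** 2)
    mse_persist = np.mean((y_true - x_last) ** 2)   # naive: predict last known value
    return 1.0 - mse_model / mse_persist

H = 10
x = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.05 * np.random.randn(2000)
y_true = x[H:]                 # targets x_{t+H}
x_last = x[:-H]                # persistence forecast x_t
model_pred = x_last            # placeholder: replace with your LSTM predictions
print(skill_score(y_true, model_pred, x_last))   # 0.0 for the placeholder
```

SS > 0 means the model beats persistence; SS ≤ 0 means it does not.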
Single-step prediction (\(H = 1\)): Easy, often trivially good due to autocorrelation
Multi-step prediction (\(H \gg 1\)): Hard, error accumulates over the horizon
Two strategies for multi-step prediction:
Best practice: Report error as a function of horizon \(H\) — plot MSE(\(H\)) to show where the model’s predictive skill degrades.
Problem: Two time-series may have the same shape but different speeds
DTW finds the optimal alignment between two sequences:
\[\text{DTW}(X, Y) = \min_{\pi} \sum_{(i,j) \in \pi} d(x_i, y_j)\]
where \(\pi\) is a warping path that maps indices of \(X\) to indices of \(Y\)
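DTW can be computed with a small dynamic program; a minimal sketch for 1D sequences, using the absolute difference as the local distance \(d(x_i, y_j)\):

```python
import numpy as np

def dtw_distance(x, y):
    """Classic O(len(x)*len(y)) dynamic program for the DTW cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])            # d(x_i, y_j)
            D[i, j] = cost + min(D[i - 1, j],          # insertion
                                 D[i, j - 1],          # deletion
                                 D[i - 1, j - 1])      # match
    return D[n, m]

slow = np.sin(np.linspace(0, 2 * np.pi, 120))          # same shape,
fast = np.sin(np.linspace(0, 2 * np.pi, 80))           # different speed
print(dtw_distance(slow, fast))                        # small despite length mismatch
```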

Architecture hierarchy:
Practical checklist:
Next unit: Generalization and robustness — how to ensure your models work on new data.

© Philipp Pelz - Machine Learning in Materials Processing & Characterization