Machine Learning in Materials Processing & Characterization
Unit 5: Convolutional Neural Networks for Microstructure Analysis
FAU Erlangen-Nürnberg
The problem: Images have spatial structure that MLPs destroy.
By the end of this unit, you can:
Slides 03–08
Materials characterization produces diverse image data:
All share a common property: nearby pixels are correlated — physics ensures spatial continuity.
Example: A modest \(64 \times 64\) image
This is already large. But real micrographs are much bigger…
Realistic example: A \(1024 \times 1024\) SEM image
Note
With only 100 training images, a 537M parameter model will memorize every sample perfectly — and generalize to nothing.

What we want: A feature detector that works the same way regardless of position — translation invariance.
Slides 09–20
\[(I * K)_{m,n} = \sum_{i}\sum_{j} I_{m-i,\, n-j} \cdot K_{i,j}\]
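The definition above can be sketched directly in NumPy. This is a minimal illustration (not a library implementation): a true convolution flips the kernel, then slides it over the "valid" region of the image. The function name `conv2d` is ours, not a framework API.

```python
import numpy as np

def conv2d(I, K):
    """True 2D convolution (kernel flipped), computed on the 'valid' region."""
    Kf = np.flipud(np.fliplr(K))  # flip so that (I*K)_{m,n} = sum_ij I_{m-i,n-j} K_{i,j}
    H, W = I.shape
    k = K.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = np.sum(I[m:m + k, n:n + k] * Kf)
    return out
```

Deep learning frameworks actually implement cross-correlation (no kernel flip); since the kernel weights are learned, the distinction does not matter in practice.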

A single kernel produces one feature map. We use multiple kernels to detect multiple features simultaneously.
Edge detection (Laplacian):
\[K = \begin{pmatrix} 0 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 0 \end{pmatrix}\]
Highlights boundaries between regions.
Blur (Gaussian):
\[K = \frac{1}{16}\begin{pmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{pmatrix}\]
Smooths noise by averaging neighbors.
Horizontal edges: \[K_h = \begin{pmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{pmatrix}\]
Vertical edges: \[K_v = \begin{pmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{pmatrix}\]
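To see a hand-crafted kernel in action, here is a small sketch applying the Laplacian from above to a synthetic step edge (left half 0, right half 1). Flat regions give zero response; the columns adjacent to the step light up. The helper `filter2d` is illustrative, not a library function (cross-correlation suffices here because the Laplacian is symmetric).

```python
import numpy as np

def filter2d(I, K):
    """Apply a 3x3 filter by sliding cross-correlation ('valid' region)."""
    H, W = I.shape
    out = np.zeros((H - 2, W - 2))
    for m in range(H - 2):
        for n in range(W - 2):
            out[m, n] = np.sum(I[m:m + 3, n:n + 3] * K)
    return out

laplacian = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=float)

img = np.zeros((8, 8)); img[:, 4:] = 1.0   # step edge between columns 3 and 4
resp = filter2d(img, laplacian)
# Flat regions -> 0; the two columns straddling the edge -> -1 and +1
```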
In a CNN: The network discovers that these filters (and many others) are useful for the task — automatically, from data.
Higher stride = faster computation, lower resolution output. A design trade-off.
Compare: Processing a \(1024 \times 1024\) image
| Approach | Parameters |
|---|---|
| MLP (512 hidden units) | 537 million |
| One \(3 \times 3\) conv filter | 9 |
| 64 conv filters | 576 |
That’s a 930,000× reduction! Weight sharing is the key: the same 9 weights are reused at every position in the image.
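The numbers in the table are easy to verify directly (weights only, biases omitted for simplicity):

```python
# Parameter counts for a 1024x1024 grayscale input (weights only, no biases)
pixels = 1024 * 1024
mlp_params = pixels * 512          # dense layer with 512 hidden units
one_filter = 3 * 3                 # a single 3x3 conv kernel
conv_layer = 64 * 3 * 3            # 64 such filters
reduction = mlp_params / conv_layer
print(f"{mlp_params:,} vs {conv_layer} -> {reduction:,.0f}x fewer")
```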
A kernel on multi-channel input has shape \(C \times k \times k\). It sums across all channels.

After each convolution, apply a non-linear activation:
\[\text{Feature Map} = \text{ReLU}(I * K + b)\]
Slides 21–30
Note
Weight sharing encodes translation invariance into the architecture — we don’t need to learn the same feature at every position.
Translation Equivariance:
If the input shifts, the feature map shifts by the same amount.
Conv layers are equivariant.
Translation Invariance:
The output doesn’t change when the input shifts.
Classification layers (after pooling + flattening) are invariant.
Equivariance preserves “where” the feature is. Invariance discards “where” and keeps only “what.”
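The equivariance/invariance distinction can be verified numerically. The sketch below uses periodic ("wrap") padding, implemented with `np.roll`, so that shifts are exact: the feature map shifts with the input (equivariance), while its global maximum does not change (invariance). `conv_wrap` is an illustrative helper, not a framework API.

```python
import numpy as np

def conv_wrap(I, K):
    """3x3 cross-correlation with periodic padding (same-size output)."""
    out = np.zeros_like(I, dtype=float)
    for i in range(3):
        for j in range(3):
            out += K[i, j] * np.roll(I, shift=(1 - i, 1 - j), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
I = rng.random((8, 8))
K = rng.random((3, 3))
shifted = np.roll(I, shift=(2, 3), axis=(0, 1))

F = conv_wrap(I, K)
F_shift = conv_wrap(shifted, K)

# Conv layer: shifting the input shifts the feature map identically
equivariant = np.allclose(F_shift, np.roll(F, shift=(2, 3), axis=(0, 1)))
# Global pooling: the pooled output ignores the shift entirely
invariant = np.isclose(F_shift.max(), F.max())
```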
More filters = more features detected. But also more parameters and computation.
With stride and pooling, the receptive field grows faster:
\[r_{\text{eff}} = r + (k - 1) \times \prod_{i} s_i\]
where \(k\) is kernel size and \(s_i\) are strides of preceding layers.
Design rule: The receptive field should be large enough to “see” the features you want to detect. Grain boundary detection needs at least grain-sized receptive fields.

Task: You need to classify microstructure images as “martensitic” or “ferritic.” Your images are \(256 \times 256\) grayscale. You have 200 labeled images.
What’s your concern?
Answer: 200 images is far too few to train a CNN from scratch. Even a modest CNN has millions of parameters. You’ll need transfer learning (Unit 6) or data augmentation. Start with the simplest possible model.
Slides 31–38
\[\text{MaxPool}(x)_{m,n} = \max_{i,j \in \text{window}} x_{m+i, n+j}\]

\[\text{AvgPool}(x)_{m,n} = \frac{1}{|W|}\sum_{i,j \in \text{window}} x_{m+i, n+j}\]
Used more often in later layers. Global Average Pooling (over the entire feature map) is standard for the final layer of modern architectures.
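Both pooling variants reduce to the same windowing; a compact NumPy sketch (non-overlapping windows, stride equal to window size) makes the difference concrete:

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling with window `size` (stride = size)."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]          # drop rows/cols that don't fit
    blocks = x.reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.]])
pool2d(x)              # max:  [[4, 8], [0, 1]]
pool2d(x, mode="avg")  # avg:  [[2.5, 6.5], [0, 1]]
```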
Materials implication: A precipitate at pixel (50, 50) and one at pixel (51, 51) produce the same classification — which is physically correct.
The standard building block of CNNs:
\[\text{Input} \longrightarrow \text{Conv} \longrightarrow \text{ReLU} \longrightarrow \text{Pool} \longrightarrow \text{Output}\]
Typically repeated 3-5 times, with increasing filter count:
| Block | Filters | Spatial Size (from 256×256) |
|---|---|---|
| 1 | 64 | 128×128 |
| 2 | 128 | 64×64 |
| 3 | 256 | 32×32 |
| 4 | 512 | 16×16 |
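The schedule in the table follows a simple rule: each 2×2 stride-2 pooling halves the spatial size, while the filter count doubles. A few lines reproduce it:

```python
# Spatial size per block: pooling halves the dimensions, filters double
size, n_filters = 256, 64
schedule = []
for block in range(1, 5):
    size //= 2
    schedule.append((block, n_filters, size))
    n_filters *= 2
print(schedule)  # [(1, 64, 128), (2, 128, 64), (3, 256, 32), (4, 512, 16)]
```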
Early layers (Block 1-2):
Deep layers (Block 3-4):

These visualization tools are essential for scientific trust — they answer: “What is the CNN actually looking at?”
Slides 39–44

Legacy: Proved that learned features outperform hand-crafted features for vision tasks.

Deeper is not always better… unless you have a trick.
The Residual Block:
\[\mathbf{y} = F(\mathbf{x}) + \mathbf{x}\]
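A toy fully-connected version of the residual block shows why the skip connection helps: if the learned branch \(F\) collapses to zero, the block is exactly the identity, so adding layers can never make the network worse than a shallower one. (Real ResNet blocks use convolutions and batch normalization; this dense sketch with hypothetical weights `W1`, `W2` only illustrates the arithmetic.)

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Toy dense residual block: y = F(x) + x with F(x) = W2 @ relu(W1 @ x)."""
    return W2 @ relu(W1 @ x) + x   # the skip connection adds the input back

x = np.array([1.0, -2.0, 3.0])
W1 = np.eye(3)
W2 = np.zeros((3, 3))              # F collapses to zero ...
out = residual_block(x, W1, W2)    # ... so the block is the identity: out == x
```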
Note
ResNet enabled networks with 100+ layers. The 2015 ImageNet winner had 152 layers.
| Architecture | Year | Innovation | Use Case |
|---|---|---|---|
| LeNet | 1998 | Learned filters | Digit recognition |
| AlexNet | 2012 | ReLU, Dropout, GPU | Image classification |
| ResNet | 2015 | Skip connections | Very deep networks |
| U-Net | 2015 | Encoder-Decoder | Pixel segmentation |
Slides 45–50


The synthetic data captured the topological truth of grain networks — the CNN learned grain boundary detection without ever seeing a real micrograph.


Next week: Strategies to overcome data scarcity — Transfer Learning, Augmentation, and Synthetic Data.
Key Takeaways:
Reading:
Next Week: Unit 6 — Data Scarcity, Transfer Learning & Synthetic Data

© Philipp Pelz - Machine Learning in Materials Processing & Characterization