FAU Erlangen-Nürnberg
Institute of Micro- and Nanostructure Research
notebooks/week06_cnn_inference.ipynb: apply hand-set kernels (Sobel, Laplacian, Gaussian) to a synthetic grain image; inspect feature maps; run a tiny random CNN forward pass; design a kernel to detect a specific boundary type. No training. Fast on CPU.
MLP parameter count vs. a single convolutional layer for images of increasing size. Note the log scale. A 1024×1024 image with 1000 hidden neurons requires ~10⁹ weights; 64 conv filters of size 3×3 on a single input channel require only 576 weights, independent of image size.
| Problem | MLP behaviour | CNN fix |
|---|---|---|
| Parameter explosion | \(D \times M\) weights per layer (\(D\) inputs, \(M\) outputs) | \(k_h k_w C_{out} C_{in}\) kernel weights, independent of image size |
| No spatial structure | All pixels treated equally | Local receptive field: inspect \(k \times k\) patch |
| No translation awareness | Relearn at every location | Weight sharing: one kernel applied everywhere |
CNNs are MLPs with locality and weight sharing built in as hard constraints.
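The parameter counts in the table can be checked with a few lines of Python. This is a minimal sketch that ignores biases; the helper names `dense_weights` and `conv_weights` are illustrative and are not taken from the course notebook.

```python
def dense_weights(height: int, width: int, hidden: int, channels: int = 1) -> int:
    """Weights (no biases) of a dense layer on a flattened image: D x M."""
    return height * width * channels * hidden


def conv_weights(k_h: int, k_w: int, c_in: int, c_out: int) -> int:
    """Weights (no biases) of a conv layer: k_h * k_w * C_in * C_out."""
    return k_h * k_w * c_in * c_out


# The dense count grows with image area; the conv count does not change.
for side in (32, 128, 1024):
    print(f"{side}x{side}: dense={dense_weights(side, side, 1000):,}  "
          f"conv={conv_weights(3, 3, 1, 64):,}")
```

For the 1024×1024 case this gives 1,048,576,000 dense weights against 576 convolutional weights.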
Two-dimensional cross-correlation (the operation CNNs actually use). The 3×3 kernel slides one position at a time. At each position, nine element-wise multiplications are summed to give one output value. The output grid records the detector response at every spatial location.
Convolution applied to a synthetic two-grain microstructure. Left: input image (two grains with different intensities, separated by vertical and horizontal boundaries). Centre-left: vertical Sobel kernel responds strongly at the vertical grain boundary. Centre-right: horizontal Sobel responds at the horizontal boundary. Right: Gaussian blur smooths noise. All kernels are 3×3; weights are hand-set, not trained.
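For input image \(I\) and a \((2\Delta+1)\times(2\Delta+1)\) kernel \(K\) indexed from \(-\Delta\) to \(\Delta\) (so \(\Delta = 1\) for a 3×3 kernel):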
\[ (I \star K)_{m,n} = \sum_{a=-\Delta}^{\Delta}\sum_{b=-\Delta}^{\Delta} K_{a,b}\,I_{m+a,\, n+b} \]
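The double sum translates directly into two nested loops. The sketch below assumes NumPy, an odd-sized square kernel, and "valid" output (no padding, so the output shrinks by \(2\Delta\) in each dimension); the function name `cross_correlate2d` and the test image are illustrative.

```python
import numpy as np


def cross_correlate2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 2-D cross-correlation for an odd-sized square kernel."""
    k = kernel.shape[0]
    delta = k // 2  # kernel half-width, so k = 2 * delta + 1
    h, w = image.shape
    out = np.zeros((h - 2 * delta, w - 2 * delta))
    # (m, n) is the kernel centre in image coordinates.
    for m in range(delta, h - delta):
        for n in range(delta, w - delta):
            patch = image[m - delta:m + delta + 1, n - delta:n + delta + 1]
            out[m - delta, n - delta] = np.sum(kernel * patch)
    return out


# Vertical step edge: dark left half, bright right half.
image = np.zeros((5, 6))
image[:, 3:] = 1.0
sobel_vertical = np.array([[-1, 0, 1],
                           [-2, 0, 2],
                           [-1, 0, 1]], dtype=float)
response = cross_correlate2d(image, sobel_vertical)
print(response)
```

The response is nonzero only in the two output columns whose 3×3 patch straddles the step, which is the "detector response at every spatial location" described above.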
Three classic 3×3 kernels. Left: vertical Sobel, which responds to left–right intensity changes. Centre: horizontal Sobel, which responds to top–bottom changes. Right: Laplacian, which responds to intensity changes in any direction, so it highlights isolated peaks and boundaries alike. Numbers are the kernel weights.
In a trained CNN, the network discovers these (and more complex) filters automatically from labelled examples — no manual kernel design needed.
Stride=1 with same-padding (left): the kernel moves one pixel at a time and output is the same size as input. Stride=2 (right): the kernel skips every other position, halving spatial resolution.
For input \(C_{in}\times H\times W\), kernel size \(k\), padding \(p\), stride \(s\) (the same formula gives \(W_{out}\) with \(W\) in place of \(H\)):
\[ H_{out} = \left\lfloor\frac{H + 2p - k}{s}\right\rfloor + 1 \]
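As a quick check of the formula, the following sketch evaluates it for a 256-pixel axis. The helper name `conv_output_size` is illustrative; integer floor division implements the floor bracket.

```python
def conv_output_size(size: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Output length along one axis: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


print(conv_output_size(256, kernel=3, padding=1, stride=1))  # same-padding
print(conv_output_size(256, kernel=3, padding=1, stride=2))  # stride 2
print(conv_output_size(256, kernel=3))                       # no padding
```

With a 3×3 kernel, padding 1 keeps the size at 256, stride 2 halves it to 128, and no padding shrinks it to 254.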
Dense layer (left): every input node connects to every output node — \(n_{in} \times n_{out}\) unique weights. Conv layer (right): the same 3-weight kernel connects each output node to only a local neighborhood of inputs — 3 shared weights total (for the 1-D case shown).
Equivariance preserves “where.” Invariance discards “where” and keeps only “what.”
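The distinction can be illustrated in one dimension. The sketch below assumes NumPy and uses a hypothetical 3-tap detector; the feature is kept away from the signal ends so that `np.roll` acts as a plain shift.

```python
import numpy as np

signal = np.zeros(12)
signal[3:5] = 1.0                      # a small feature at positions 3-4
kernel = np.array([1.0, 0.0, -1.0])    # hypothetical edge-like detector

shifted = np.roll(signal, 2)           # same feature, moved 2 samples right

response = np.correlate(signal, kernel, mode="valid")
response_shifted = np.correlate(shifted, kernel, mode="valid")

# Equivariance: the response map moves with the input.
print(np.array_equal(response_shifted, np.roll(response, 2)))
# Invariance: a global max over positions ignores the shift.
print(response.max() == response_shifted.max())
```

Both checks print `True`: the convolution output shifts with the input, while the global maximum is unchanged.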
Max-pooling with a 2×2 window. For each non-overlapping 2×2 region, keep the maximum value. Spatial resolution halves in each dimension; channel count is unchanged.
Input feature map (4×4):
\[ \begin{pmatrix} 1 & 3 & 2 & 4 \\ 5 & 6 & 7 & 8 \\ 3 & 2 & 1 & 0 \\ 1 & 2 & 3 & 4 \end{pmatrix} \]
After 2×2 max-pool (stride 2):
\[ \begin{pmatrix} 6 & 8 \\ 3 & 4 \end{pmatrix} \]
Top-left 2×2 block: \(\max(1,3,5,6)=6\). Top-right: \(\max(2,4,7,8)=8\). Bottom-left: \(\max(3,2,1,2)=3\). Bottom-right: \(\max(1,0,3,4)=4\).
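The worked example can be reproduced with a reshape-based sketch. It assumes NumPy and a single-channel map with even height and width; the function name `max_pool_2x2` is illustrative.

```python
import numpy as np


def max_pool_2x2(feature_map: np.ndarray) -> np.ndarray:
    """2x2 max-pooling with stride 2 on an (H, W) map; H and W must be even."""
    h, w = feature_map.shape
    # Axes 1 and 3 index the position inside each non-overlapping 2x2 block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))


x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(x))
```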
| Block | Channels | Spatial size after block (256×256 input) |
|---|---|---|
| 1 | 64 | 128×128 |
| 2 | 128 | 64×64 |
| 3 | 256 | 32×32 |
| 4 | 512 | 16×16 |
Receptive field of a single output neuron grows with depth. One 3×3 conv layer: 3×3 input region. Two stacked layers: 5×5 region. Three layers: 7×7 region. Red star marks the output neuron; blue region is its receptive field in the input image.
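The 3, 5, 7 progression follows from a simple recurrence: each layer widens the field by (kernel − 1) times the current spacing between neighbouring outputs. A minimal sketch, with the illustrative helper name `receptive_field`:

```python
def receptive_field(layers):
    """Receptive-field side length after a stack of (kernel, stride) layers."""
    field, jump = 1, 1  # jump = spacing, in input pixels, between adjacent outputs
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


for depth in (1, 2, 3):
    print(depth, receptive_field([(3, 1)] * depth))
```

Inserting a stride-2 layer increases `jump`, so later layers enlarge the receptive field faster than stride-1 convolutions alone.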
Feature hierarchy on a synthetic grain microstructure. Input (left): two-grain image with boundaries. Layer 1 (centre-left): Laplacian-like edge features highlight all boundaries. Layer 2 (centre-right): neighbourhood-level grain-boundary motifs. Layer 3+ (right): coarse phase/grain-region labels.
| Design choice | What it encodes | EM benefit |
|---|---|---|
| Local receptive field | Features depend on local context | Grain boundaries are local |
| Weight sharing | Same feature appears at many positions | Atomic columns repeat |
| Multiple channels | Many detectors in parallel | Multi-contrast detection |
| Stride / pooling | Coarser scale, larger context | Phase-level reasoning |
| Depth / nonlinearity | Hierarchical feature composition | Atoms → grains → phases |
Timeline from LeNet (1998) to AlexNet (2012) to ResNet (2015). Each box states the year, approximate parameter count, depth, and key innovation. Read left to right as increasing depth, scale, and capability.
ResNet residual block. The main path learns a residual function F(x) = y − x, and the skip connection adds the input x directly to its output. During backpropagation, the skip path passes gradients to earlier layers without attenuation, which mitigates the vanishing-gradient problem and makes networks with 50+ layers trainable.
\[\mathbf{y} = F(\mathbf{x}) + \mathbf{x}\] Instead of learning the target \(\mathbf{y}\), learn the residual \(F(\mathbf{x}) = \mathbf{y} - \mathbf{x}\).
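A one-line derivation shows why the skip path helps gradient flow. Here \(\mathcal{L}\) denotes the training loss and \(\mathbf{I}\) the identity matrix; both symbols are introduced for this illustration. Differentiating \(\mathbf{y} = F(\mathbf{x}) + \mathbf{x}\) and applying the chain rule gives:
\[ \frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \left(\frac{\partial F}{\partial \mathbf{x}} + \mathbf{I}\right)^{\top} \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \]
The identity term passes the upstream gradient through unchanged, even when \(\partial F / \partial \mathbf{x}\) is small.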
U-Net architecture. Left column (blue): encoder — successive Conv+Pool blocks extract features while halving spatial resolution and doubling channel count. Right column (red): decoder — successive upsample+Conv blocks restore spatial resolution. Yellow dashed arrows: skip connections concatenate encoder features into corresponding decoder levels.
Encoder path
Decoder path
Why skip connections are non-negotiable for segmentation: without them, precise boundary locations are lost in the bottleneck compression; the output mask is correct in texture but blurry in boundary position.
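To see what the skip connections carry, it can help to trace tensor shapes through the network. The sketch below is pure Python and rests on an assumed layout that the source does not specify: 64 base channels, four levels, and a decoder that halves the channel count on upsampling before concatenation. The function name `unet_shapes` is hypothetical.

```python
def unet_shapes(size: int = 256, base_channels: int = 64, depth: int = 4):
    """Trace (channels, height, width) through a U-Net-style encoder-decoder.

    Assumed layout (not taken from the source): channels double and resolution
    halves at each encoder level; each decoder level upsamples by 2, halves
    the channels, and concatenates the matching encoder features.
    """
    encoder = []
    channels = base_channels
    for _ in range(depth):
        encoder.append((channels, size, size))  # features kept for the skip
        channels, size = channels * 2, size // 2
    bottleneck = (channels, size, size)

    decoder = []
    for skip_channels, skip_size, _ in reversed(encoder):
        channels, size = channels // 2, size * 2
        if size != skip_size:
            raise ValueError("skip and upsampled feature maps must align")
        concat = (channels + skip_channels, size, size)
        decoder.append({"after_concat": concat,
                        "after_conv": (channels, size, size)})
    return encoder, bottleneck, decoder


encoder, bottleneck, decoder = unet_shapes()
print("bottleneck:", bottleneck)
for level in decoder:
    print(level)
```

Each decoder level receives full-resolution encoder features at the matching scale, which is where the precise boundary positions come from.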
U-Net applied to TEM images of Au nanoparticles on an amorphous support. Left: representative input TEM image. Centre: ground-truth binary mask (crystalline = bright, amorphous = dark). Right: U-Net prediction, a pixel-wise classification that closely matches the ground truth.
U-Net segmentation of TEM images from a published materials science application. The encoder–decoder with skip connections accurately reproduces phase boundaries, even in noisy low-dose images where the boundary is only 1–3 pixels wide.
Grain boundary segmentation pipeline. Left: synthetic Voronoi grain image (used for CNN training). Centre: automatically generated ground-truth grain labels (free — no expert annotation). Right: U-Net predicted grain boundaries from the output mask.
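A synthetic Voronoi image with free labels can be generated in a few lines. This is a sketch assuming NumPy, not the pipeline used for the figure: the image size, grain count, grey levels, noise level, and the neighbour-based boundary definition are all illustrative choices.

```python
import numpy as np


def voronoi_grains(size: int = 64, n_grains: int = 8, seed: int = 0):
    """Return (image, labels, boundary) for a synthetic Voronoi microstructure."""
    rng = np.random.default_rng(seed)
    seeds = rng.uniform(0, size, size=(n_grains, 2))   # grain centres (row, col)
    rows, cols = np.mgrid[0:size, 0:size]
    pixels = np.stack([rows, cols], axis=-1)           # (size, size, 2)
    # Squared distance from every pixel to every seed: (size, size, n_grains).
    dist2 = ((pixels[:, :, None, :] - seeds[None, None, :, :]) ** 2).sum(axis=-1)
    labels = dist2.argmin(axis=-1)                     # nearest seed = grain id

    # A pixel is a boundary pixel if its right or lower neighbour has another label.
    boundary = np.zeros((size, size), dtype=bool)
    boundary[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    boundary[:-1, :] |= labels[:-1, :] != labels[1:, :]

    # One grey level per grain plus mild Gaussian noise (arbitrary choices).
    grey = rng.uniform(0.2, 0.9, size=n_grains)
    image = grey[labels] + rng.normal(0.0, 0.02, size=labels.shape)
    return image, labels, boundary


image, labels, boundary = voronoi_grains()
print(image.shape, labels.max() + 1, boundary.mean())
```

The `labels` and `boundary` arrays come for free from the construction, so no manual annotation is needed to train on such images.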
CNN grain segmentation applied to a real SEM polycrystal image. The network was trained only on synthetic Voronoi grain images (with perfect free labels), yet it correctly identifies grain boundaries in the real SEM image — demonstrating that the topological signature of boundaries transfers from synthetic to real data.
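Validation: split with GroupKFold, grouping by specimen or microscopy session, not with random splits; otherwise images from the same specimen or session leak between the training and test sets.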
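A grouped split might look like the sketch below, which assumes scikit-learn is installed. The dataset is hypothetical: twelve placeholder patches from four made-up specimens (`spec_A` to `spec_D`).

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical dataset: 12 image patches cut from 4 specimens.
X = np.arange(12).reshape(-1, 1)          # placeholder features
y = np.array([0, 1] * 6)                  # placeholder labels
groups = np.repeat(["spec_A", "spec_B", "spec_C", "spec_D"], 3)

splits = list(GroupKFold(n_splits=4).split(X, y, groups=groups))
for fold, (train_idx, test_idx) in enumerate(splits):
    held_out = {str(g) for g in groups[test_idx]}
    in_training = {str(g) for g in groups[train_idx]}
    # No specimen may appear on both sides of the split.
    assert held_out.isdisjoint(in_training)
    print(fold, sorted(held_out))
```

Each fold holds out all patches of one specimen, so the test score reflects performance on a specimen the model has not seen.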
©Philipp Pelz - FAU Erlangen-Nürnberg - Data Science for Electron Microscopy