FAU Erlangen-Nürnberg
Institute of Micro- and Nanostructure Research
notebooks/week08_denoising_autoencoder.ipynb: build synthetic noisy EELS-like spectra, train a small denoising autoencoder on CPU in under 2 minutes, compare AE vs PCA denoising, visualise latent-space phase clusters, and explore the effect of bottleneck dimension in the exercise.

The four families of unsupervised learning. Clustering assigns each spectrum to a discrete group (k-means, GMM). Dimensionality reduction finds a low-dimensional coordinate that summarises each spectrum (PCA, autoencoder). Density estimation models the full probability distribution over spectra. Generative models sample new spectra from that distribution (introduced here as a teaser for Week 12).
Assign \(N\) spectra \(\{x_1, \ldots, x_N\} \subset \mathbb{R}^d\) to \(K\) groups \(C_1, \ldots, C_K\) with centroids \(\mu_1, \ldots, \mu_K\) by minimising the within-cluster sum of squares:
\[ J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2. \]
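Lloyd’s algorithm alternates two steps until assignments stop changing:
- Assignment step: assign each spectrum \(x_i\) to the cluster whose centroid \(\mu_k\) is nearest.
- Update step: recompute each centroid \(\mu_k\) as the mean of the spectra currently assigned to \(C_k\).

Neither step can increase \(J\), and there are only finitely many ways to partition \(N\) spectra, so convergence in finitely many steps is guaranteed, but only to a local minimum that depends on the initialisation.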
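The two alternating steps of Lloyd's algorithm (assign each spectrum to its nearest centroid, then move each centroid to the mean of its members) can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's implementation: the function name `lloyd_kmeans`, the random-data-point initialisation, and the toy data are assumptions made for the example.

```python
import numpy as np


def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """Plain Lloyd's algorithm. Returns (labels, centroids, J)."""
    rng = np.random.default_rng(seed)
    # Assumed initialisation: K distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[k] = members.mean(axis=0)
    J = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, J


# Toy data (hypothetical): two well-separated groups of 16-channel "spectra".
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 16)),
               rng.normal(1.0, 0.1, (50, 16))])
labels, centroids, J = lloyd_kmeans(X, K=2)
print(labels[:5], labels[-5:], round(J, 2))
```

Because the result depends on the initial centroids, it is common to rerun with several seeds and keep the run with the lowest \(J\).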

Elbow method:
- Run k-means for \(K = 1, 2, \ldots, K_{\max}\).
- Plot \(J(K)\) vs \(K\). Look for the elbow: the point where \(J\) stops dropping fast.
- Intuition: adding one more cluster beyond the true \(K\) gives diminishing returns.
Silhouette score: \[s(x_i) = \frac{b(x_i) - a(x_i)}{\max(a(x_i), b(x_i))}\] where \(a(x_i)\) is the mean distance from \(x_i\) to the other members of its own cluster and \(b(x_i)\) is the mean distance from \(x_i\) to the members of the nearest other cluster. Range: \([-1, 1]\), higher is better. Pick the \(K \geq 2\) that maximises the mean silhouette.
Rule: use elbow and silhouette together. If they point to the same \(K\), you can be more confident in that choice.
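To make the silhouette formula concrete, the sketch below evaluates it directly from pairwise Euclidean distances and compares a sensible two-cluster labelling with an arbitrary over-split into four clusters. The helper `mean_silhouette`, the toy data, and both labellings are hypothetical; singleton clusters are given a score of 0, which is a common convention assumed here.

```python
import numpy as np


def mean_silhouette(X, labels):
    """Mean silhouette score; requires at least two clusters."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # assumed convention: s = 0 for a singleton cluster
        # a: mean distance to the other members of the own cluster (D[i, i] = 0).
        a = D[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())


# Toy data (hypothetical): two compact, well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (40, 4)),
               rng.normal(5.0, 0.3, (40, 4))])
labels_k2 = np.repeat([0, 1], 40)         # one label per true group
labels_k4 = np.repeat([0, 2, 1, 3], 20)   # each group split arbitrarily in two

s2 = mean_silhouette(X, labels_k2)
s4 = mean_silhouette(X, labels_k4)
print(s2, s4)
```

The over-split labelling scores much lower because, for every point, the nearest other cluster is the other half of its own group, so \(a \approx b\).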
Hard vs soft assignment on the same spectral dataset. Left: k-means — each point belongs to exactly one phase (coloured circles). Right: GMM — each point carries a probability over phases; points near boundaries are shown as colour mixtures. This matters physically when EELS spectra from a transition zone contain contributions from both adjacent phases.
A weighted sum of \(K\) Gaussian densities: \[ p(x) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(x;\,\mu_k,\,\Sigma_k), \quad \sum_k \pi_k = 1. \]
Intuition: each phase \(k\) is modelled as a Gaussian cloud of spectra centred on the phase-representative spectrum \(\mu_k\) with a covariance \(\Sigma_k\) encoding spectral variability.
Soft assignment (responsibility): \[\gamma_{ik} = P(\text{phase } k \mid x_i) \propto \pi_k\,\mathcal{N}(x_i;\,\mu_k,\Sigma_k).\]
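The responsibilities \(\gamma_{ik}\) can be computed by evaluating each weighted Gaussian in log space and normalising over \(k\). The sketch below does only this step (the E-step); it does not fit the parameters. The function names and the two-component toy parameters are assumptions for illustration.

```python
import numpy as np


def log_gaussian(X, mu, Sigma):
    """Log density of N(mu, Sigma) evaluated at each row of X."""
    d = X.shape[1]
    L = np.linalg.cholesky(Sigma)
    # Solve L y = (x - mu)^T so that ||y||^2 is the squared Mahalanobis distance.
    y = np.linalg.solve(L, (X - mu).T)
    maha = (y ** 2).sum(axis=0)
    log_det = 2.0 * np.log(np.diag(L)).sum()
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)


def responsibilities(X, pis, mus, Sigmas):
    """gamma[i, k] = P(component k | x_i), normalised over k."""
    log_w = np.stack([np.log(pi) + log_gaussian(X, mu, S)
                      for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)
    log_w -= log_w.max(axis=1, keepdims=True)  # subtract the max for stability
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)


# Hypothetical two-phase model in a 2-D feature space.
pis = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [4.0, 0.0]])
Sigmas = np.array([np.eye(2), np.eye(2)])

# One point at each phase centre and one exactly halfway between them.
X = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
gamma = responsibilities(X, pis, mus, Sigmas)
print(gamma)
```

The midpoint receives equal responsibility from both components, which is the soft-assignment behaviour that k-means cannot express.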
| | K-means | GMM |
|---|---|---|
| Assignment | hard (one cluster) | soft (probability vector) |
| Cluster shape | spherical | ellipsoidal (full \(\Sigma\)) |
| Speed | fast | slower (covariance update) |
| Boundary handling | abrupt | smooth |
| Choosing \(K\) | elbow + silhouette | BIC (penalised log-likelihood) |
| High-dim risk | Euclidean dominance | covariance explosion (\(d^2\) params) |
Rule of thumb: k-means for fast exploration; GMM when clusters overlap, vary in shape, or you need probabilities for downstream decisions. In both cases, prefer clustering latent codes over raw spectra (Murphy, 2012).
From high-dimensional spectra to a curved low-dimensional manifold. Left: three-dimensional view of data lying on a curved 2D surface embedded in 3D (analogy for 1024-D spectra on a ~10-D surface). Centre: PCA finds the best flat hyperplane; projecting the curved surface onto it loses structure. Right: a nonlinear autoencoder follows the curved manifold, keeping phases separated along interpretable latent axes.
A deep autoencoder for spectral data. The encoder (blue, left) progressively compresses a 1024-channel spectrum to an 8-dimensional latent code z through successive hidden layers. The decoder (green, right) reconstructs the original 1024 channels from z. The bottleneck (red, centre) is the only constraint: it must retain enough information to reconstruct the input. The MSE reconstruction loss (orange, bottom) is the only supervisory signal — no labels needed.
The autoencoder minimises: \[ \mathcal{L}(\theta, \phi) = \frac{1}{N}\sum_{i=1}^{N} \bigl\|x_i - g_\theta(f_\phi(x_i))\bigr\|^2. \]
No labels — only the inputs themselves serve as targets. This is self-supervised: the supervisory signal is built into the data.
Minimising \(\mathcal{L}\) forces the bottleneck \(z = f_\phi(x)\) to retain enough information to reconstruct \(x\). This is in the spirit of the minimum description length principle: compress as much as possible while losing as little as possible.
In PyTorch this is one line, for example `loss = F.mse_loss(x_hat, x)` with `F = torch.nn.functional`.
For Poisson-noisy EM spectra: consider the Poisson negative log-likelihood as the loss instead of MSE, because the variance in each channel grows with the signal (for Poisson counts the variance equals the mean) (Sandfeld et al., 2024).
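A small NumPy sketch can illustrate both halves of that argument: that Poisson noise has variance equal to the mean, and that the Poisson negative log-likelihood \(\lambda - x \log \lambda\) (dropping the \(\log x!\) term, which does not depend on the prediction) is lowest when the predicted rate \(\lambda\) matches the true one. The spectrum shape, count levels, and function name are hypothetical.

```python
import numpy as np


def poisson_nll(counts, rate, eps=1e-8):
    """Mean Poisson NLL, omitting the log(counts!) term (constant in `rate`)."""
    rate = np.clip(rate, eps, None)
    return float(np.mean(rate - counts * np.log(rate)))


# Hypothetical spectrum in expected counts: flat background plus one peak.
energy = np.linspace(700.0, 740.0, 256)
true_rate = 50.0 + 200.0 * np.exp(-0.5 * ((energy - 707.0) / 2.0) ** 2)

rng = np.random.default_rng(0)
draws = rng.poisson(true_rate, size=(2000, energy.size))

# Poisson statistics: the per-channel variance tracks the per-channel mean.
var_over_mean = draws.var(axis=0) / draws.mean(axis=0)

# For one noisy spectrum, compare the NLL of a correct and two mis-scaled rates.
counts = draws[0]
nll = {scale: poisson_nll(counts, scale * true_rate) for scale in (0.8, 1.0, 1.2)}
print(var_over_mean.mean(), nll)
```

The gradient of this loss with respect to the rate is \(1 - x/\lambda = (\lambda - x)/\lambda\), i.e. the MSE residual divided by the expected variance, so bright channels are not over-weighted. In PyTorch, `torch.nn.PoissonNLLLoss(log_input=False)` computes the same expression, assuming the decoder output is constrained to be non-negative.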
```python
import torch
import torch.nn as nn
import torch.nn.functional as F  # used in the training step below


class SpectralAE(nn.Module):
    """Fully connected autoencoder for 1-D spectra."""

    def __init__(self, input_dim=128, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, latent_dim),   # bottleneck: no activation
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, input_dim),    # output: no activation for spectra
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat, z                  # return both so z can be clustered


# Training step (standard); model, optimizer, x_noisy and x_clean are defined elsewhere:
# optimizer.zero_grad()
# x_hat, z = model(x_noisy)
# loss = F.mse_loss(x_hat, x_clean)   # denoising AE: target is clean!
# loss.backward(); optimizer.step()
```

Left: data lying on a curve embedded in the 2D plane. Centre: linear AE (equivalent to PCA) projects onto the best straight line; many points are far from the line (high reconstruction error, orange residuals). Right: nonlinear AE follows the curve; all points are close (low reconstruction error, green residuals). This is why a nonlinear AE outperforms PCA when the spectral manifold is curved.
Denoising performance on synthetic low-dose EELS spectra near the Fe-L₂₃ edge (700–740 eV). Top-left: clean ground-truth spectrum (two Gaussian peaks: L₃ main edge ~707 eV and L₂ shoulder ~720 eV). Top-right: noisy input at 50 electrons/channel (Poisson noise). Bottom-left: PCA denoising (rank-4) — the MSE title shows the quantitative cost of linear approximation. Bottom-right: denoising AE (k=4) — achieves ≈81% lower MSE than PCA and ≈92% lower than the raw noisy input. Green dashed line is the clean reference in the bottom panels. PCA recovers the spectral shape but at ~5× higher reconstruction error; the AE captures the non-linear inter-phase variation and denoises with far less residual error.
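Rank-\(k\) PCA denoising, the baseline in this comparison, amounts to a truncated SVD of the mean-centred spectra. The sketch below is a hedged illustration on hypothetical toy data (linear mixtures of two Gaussian peaks with Poisson noise); it is not the notebook's dataset and does not reproduce the figure's numbers. Because the toy data are linearly mixed, PCA is close to ideal here, which is exactly the regime where the table below lists it as the quick first look.

```python
import numpy as np


def pca_denoise(X_noisy, rank):
    """Keep only the top-`rank` principal components of the spectra (rows)."""
    mean = X_noisy.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_noisy - mean, full_matrices=False)
    return mean + (U[:, :rank] * s[:rank]) @ Vt[:rank]


# Hypothetical toy data: background plus two peaks with varying weights.
rng = np.random.default_rng(0)
energy = np.linspace(700.0, 740.0, 128)
l3 = np.exp(-0.5 * ((energy - 707.0) / 1.5) ** 2)   # main peak near 707 eV
l2 = np.exp(-0.5 * ((energy - 720.0) / 2.0) ** 2)   # shoulder near 720 eV
weights = rng.uniform(0.5, 1.5, size=(400, 2))
X_clean = 50.0 * (0.2 + weights[:, :1] * l3 + 0.4 * weights[:, 1:] * l2)
X_noisy = rng.poisson(X_clean).astype(float)        # Poisson counting noise

X_pca = pca_denoise(X_noisy, rank=4)
mse_noisy = float(np.mean((X_noisy - X_clean) ** 2))
mse_pca = float(np.mean((X_pca - X_clean) ** 2))
print(mse_noisy, mse_pca)
```

The same `mse_noisy` versus denoised-MSE comparison can be applied to the autoencoder output to obtain the kind of relative improvement quoted in the caption.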
| | PCA denoising | Denoising AE |
|---|---|---|
| Assumption | linear spectral mixing | non-linear manifold |
| Training cost | SVD, seconds | gradient descent, minutes |
| Data required | any N | N > ~500 for best results |
| Recovers rare phases? | risk of erasure | better — non-linear projection |
| Bottleneck dim | elbow of scree plot | sweep \(k\), check recon error |
| EM sweet spot | quick first look | complex mixtures, fine structure |
AE latent space (z₁ vs z₂) for 400 synthetic iron-oxide EELS spectra (4 phases, 100 spectra each: Fe₂O₃ red, FeO blue, Fe₃O₄ orange, Surface/amorphous green). The four phases cluster into well-separated islands without any labels: the only information the AE received was the raw spectra and the instruction to reconstruct them. Stars mark the centroids found by k-means on the 4-D latent codes (silhouette ≈ 0.73). Only z₁ and z₂ are shown; all four latent dimensions contribute to the clustering.
Left: t-SNE (perplexity=30) — 2D embedding of 240 latent codes from a 10-D AE. Four iron-oxide phases form islands. Warning: inter-island distances are not metric — the gap between Fe₂O₃ and FeO in this plot does not reflect their spectral similarity. Right: UMAP (n_neighbors=15) — same codes, tighter clusters, better preservation of global structure. Use UMAP for 2026 pipelines; t-SNE is a useful diagnostic.
Left: reconstruction error distribution for 300 normal iron-oxide spectra (phases A, B, C; blue) and 100 anomalous spectra from Phase D (surface/amorphous; red). The AE was trained on normal spectra only. Anomalies reconstruct poorly — they lie outside the manifold the AE learned — with mean error ~25× higher than normal spectra. The 99th-percentile threshold ≈ 0.0013 (orange dashed line) flags all 100/100 anomalies correctly. Right: simulated reconstruction-error spatial map where the high-error region (bottom rows) marks anomalous Phase D pixels.
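The thresholding rule in this caption (flag a spectrum when its reconstruction error exceeds the 99th percentile of the errors on normal data) is straightforward to sketch. In the example below the "reconstructions" are stand-ins produced by adding small or large residuals to random spectra, not the output of a trained autoencoder; the function names and residual scales are assumptions chosen so that anomalous errors are roughly 25× larger.

```python
import numpy as np


def reconstruction_error(X, X_hat):
    """Per-spectrum mean squared reconstruction error."""
    return np.mean((X - X_hat) ** 2, axis=1)


def flag_anomalies(errors, normal_errors, percentile=99.0):
    """Flag errors above a percentile of the errors seen on normal data."""
    threshold = np.percentile(normal_errors, percentile)
    return errors > threshold, threshold


rng = np.random.default_rng(0)
X_normal = rng.random((300, 128))
X_anom = rng.random((100, 128))
# Stand-in for an autoencoder: small residuals on normal data, large on anomalies.
X_normal_hat = X_normal + rng.normal(0.0, 0.02, X_normal.shape)
X_anom_hat = X_anom + rng.normal(0.0, 0.10, X_anom.shape)

err_normal = reconstruction_error(X_normal, X_normal_hat)
err_anom = reconstruction_error(X_anom, X_anom_hat)

flags_anom, threshold = flag_anomalies(err_anom, err_normal)
flags_normal, _ = flag_anomalies(err_normal, err_normal)
print(threshold, flags_anom.sum(), flags_normal.sum())
```

Note that a 99th-percentile threshold flags about 1% of the normal spectra by construction, so the percentile sets the false-positive rate directly.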
Left: vanilla AE latent space. Clusters form but with gaps between them (question marks). Sampling from the gaps (grey arrows) decodes to nonsense because those regions were never seen during training. Right: VAE latent space. The KL divergence term pulls the encoding distribution toward a Gaussian; the space is continuous and sampling anywhere gives a meaningful spectrum. The full VAE mathematics are covered in Week 12.
| | Vanilla AE | VAE |
|---|---|---|
| Encoder output | a point \(z\) | a distribution \(\mathcal{N}(\mu, \sigma^2)\) |
| Latent space structure | no structure | Gaussian, continuous |
| Sampling | from known points only | from anywhere → meaningful |
| EM use today | denoising, clustering, anomaly | atomic-STEM disentanglement |
| Full math | done ✓ | Week 12 |
Today’s take-away: use a vanilla AE for denoising and clustering. Use a VAE when you need to generate new spectra or interpolate smoothly between phases.
Complete pipeline from raw EELS/EDS spectra to actionable outputs. Raw spectra enter a denoising AE encoder; the latent space \(z\) branches into three outputs: (1) a phase map via k-means clustering, (2) denoised spectra via the decoder, and (3) anomaly flags where reconstruction error exceeds the threshold. All three outputs require zero labels.

©Philipp Pelz - FAU Erlangen-Nürnberg - Data Science for Electron Microscopy