FAU Erlangen-Nürnberg
MFML Unit 6 prerequisite (this week): Optimization for Deep Learning.
Everything in this lecture is a special case of those ideas applied to a pretrained starting point.
Acquisition cost per labeled sample.
Expert annotation cost.
Implications for the ML pipeline.
Bias–variance picture (from MFML).
Three levers, one pipeline.
Typical real-world combination:
[ImageNet pretrained backbone]
│ (transfer)
▼
[fine-tune on synthetic Voronoi]
│ (synthetic data)
▼
[fine-tune on 100 real SEMs]
│ (transfer + augmentation)
▼
deployable model
Pretraining (Task A).
Fine-tuning (Task B).
The continuity claim.
\[\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t \;-\; \eta_t \, \mathbf{m}(\nabla \mathcal{L}_B(\boldsymbol{\theta}_t))\]
The non-trivial part: \(\mathcal{L}_B\) is related to \(\mathcal{L}_A\) (both are image-classification-like losses), but its minimum is at a different point. Fine-tuning is the controlled walk from one minimum to the other.
Pretraining lands you in a flat basin of \(\mathcal{L}_A\).
Why the same point is good for \(\mathcal{L}_B\).
Fine-tuning is a controlled walk in this landscape.
Mental model: fine-tuning = SGD with a really good prior on where the minimum is.
Symptom. Fine-tuning destroys generic features the model painstakingly learned during pretraining.
Mechanism. Optimization on a non-stationary loss.
Cure (MFML W6 toolkit).
The MFML W6 thread. Per-parameter LRs were the central idea of AdaGrad / RMSProp / Adam:
\[\boldsymbol{\theta}^{(i)}_{t+1} = \boldsymbol{\theta}^{(i)}_t - \frac{\eta}{\sqrt{v^{(i)}_t}+\varepsilon}\,g^{(i)}_t\]
Layer-wise LR is the coarsened version of the same idea.
Three-group recipe (the standard).
| Group | Role | LR |
|---|---|---|
| Early backbone | Generic edges, blobs | \(10^{-5}\) |
| Late backbone | Mid-level textures | \(10^{-4}\) |
| Head (new) | Task-specific | \(10^{-3}\) |
Why the 10× steps? Backbone is almost right (small adjustments). Head is random (large adjustments needed). Mid layers interpolate.
Name to know. This is “discriminative fine-tuning” (Howard and Ruder 2018), originally proposed for ULMFiT in NLP and now the default recipe for fine-tuning everywhere.
PyTorch one-liner.
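A minimal sketch of the one-liner, assuming a torchvision ResNet-50 with its `fc` head replaced (the 5-class head and the exact LR values are assumptions):

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V2")   # pretrained backbone
model.fc = nn.Linear(model.fc.in_features, 5)      # new head (5 classes assumed)

# One optimizer, three LRs; parameters not listed in any group receive no updates.
optimizer = torch.optim.AdamW([
    {"params": model.layer1.parameters(), "lr": 1e-5},  # early backbone
    {"params": model.layer4.parameters(), "lr": 1e-4},  # late backbone
    {"params": model.fc.parameters(),     "lr": 1e-3},  # new head
])
```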
Each `params` group can have its own LR, weight decay, and even its own schedule.
Why this works pedagogically.
Optimization-theoretic sanity check. The total effective update is
\[\|\Delta\boldsymbol{\theta}\| \approx \eta \cdot \|\nabla \mathcal{L}_B\|\]
per step per group. Backbone step is \(10^{-2} \times\) head step.
Adam (the default first choice).
Caveat (MFML W6). Adam tends to converge to sharper minima than SGD+momentum, which can cost generalization on small datasets.
SGD + momentum (for the final tightening).
Practical workflow.
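One way the Adam-to-SGD workflow can look (the 80/20 split and LR values are assumptions, not prescriptions):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the fine-tuned network

# Phase 1 (~first 80% of steps): Adam for fast initial progress.
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Phase 2 (~last 20%): rebuild as SGD+momentum for the final tightening
# toward a flatter minimum.
opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```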
AdamW: two extra tensors per parameter.
Lion: one extra tensor per parameter.
Practical takeaway (Chen et al. 2023).
Anti-pattern. Dropping Lion in with the AdamW learning rate — it diverges immediately. The sign-update has constant magnitude, so the LR controls the step size directly.
The fine-tuning warm-up problem.
Fix: linear warm-up.
\[\eta_t = \eta_\text{max} \cdot \min(1, t / T_\text{warm})\]
Cosine annealing for the long tail.
\[\eta_t = \eta_\text{max} \cdot \tfrac{1}{2}\!\left(1 + \cos\!\frac{\pi t}{T}\right)\]
Combined schedule. Warm-up for first ~5 % of steps, cosine decay for the rest. This is the de-facto standard for fine-tuning.
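A sketch of the combined schedule using built-in PyTorch schedulers (step counts and LR are placeholders):

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = nn.Linear(10, 2)  # stand-in network
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

total_steps, warm = 10_000, 500  # warm-up ~5% of total steps
sched = SequentialLR(
    opt,
    schedulers=[
        LinearLR(opt, start_factor=1e-3, total_iters=warm),  # linear warm-up
        CosineAnnealingLR(opt, T_max=total_steps - warm),    # cosine decay to ~0
    ],
    milestones=[warm],
)
# call sched.step() once per optimizer step
```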
MFML W6 result. The SGD update is
\[\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \, \hat{g}_t\]
where \(\hat{g}_t\) is a stochastic gradient with variance \(\sigma^2 / B\) (B = batch size).
Implications for fine-tuning.
The five MFML W6 lessons, applied to TL.
Warm start in a flat basin — pretraining (slide 8).
Per-layer LR — discriminative fine-tuning (slides 10–11).
Schedules + warm-up — preserve pretraining (slide 14).
Adam → SGD+momentum — speed first, generalization last (slide 12).
Small batches — implicit flat-minimum regularizer (slide 15).
Synthesis.
Fine-tuning is continued optimization in a related loss landscape, starting from a flat basin, using per-layer learning rates with warm-up and cosine annealing, transitioning from Adam to SGD+momentum for the final tightening.
“Reusing existing images by applying transformations.” (Sandfeld et al. 2024)
Optimization view. Augmentation modifies the loss:
\[\mathcal{L}_\text{aug}(\boldsymbol{\theta}) = \mathbb{E}_{\alpha,(x,y)}\!\left[\ell(f_{\boldsymbol{\theta}}(T_\alpha x), y)\right]\]
Standard kit.
For microstructures.
Sample augmentations applied to a microstructure image. Each row shows different rotations and flips of the same input.
The rule. \(T_\alpha\) must preserve the label.
Examples where rotation breaks the label:
Examples where flips break the label:
Rule of thumb.
If your physics changes under \(T_\alpha\), don’t augment with \(T_\alpha\).
Elastic deformations.
Cutout / random erasing.
Both encode physics. Elastic = drift; cutout = detector defects / artifacts.
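A hedged Albumentations sketch of both (parameter names and values are assumptions; exact signatures vary across Albumentations versions):

```python
import albumentations as A

physics_aug = A.Compose([
    A.ElasticTransform(alpha=50, sigma=7, p=0.3),  # elastic deformation ~ drift
    A.CoarseDropout(max_holes=4, max_height=16,    # cutout ~ detector defects
                    max_width=16, p=0.3),
])
```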
Standard intensity augmentations.
Why this matters in materials.
Augmenting intensity makes the model microscope-agnostic — usually the single most useful augmentation for cross-instrument generalization.
Physically-motivated noise types.
Match noise type to the expected detector physics (Unit 2 callbacks).
Why this works.
Especially valuable for low-dose imaging (cryo-EM, beam-sensitive samples) where deployment noise is much higher than training noise.
Albumentations recipe (typical).
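One plausible version of the recipe (transform choices and probabilities are assumptions):

```python
import numpy as np
import albumentations as A

aug = A.Compose([
    A.RandomRotate90(p=0.5),            # geometric, label-preserving if isotropic
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),  # intensity: microscope-agnostic features
    A.GaussNoise(p=0.3),                # detector-like noise
])

img  = (np.random.rand(256, 256) * 255).astype(np.uint8)  # stand-in micrograph
mask = np.zeros((256, 256), dtype=np.uint8)               # stand-in label mask
out = aug(image=img, mask=mask)         # identical transform applied to both
img_a, mask_a = out["image"], out["mask"]
```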
Each transform has its own probability `p`; `Compose` handles image and mask consistently.
Torchvision v2 (newer, native PyTorch).
`torchvision.transforms.v2` handles images, masks, and boxes jointly and works with `torch.compile`.
On-the-fly (the default).
Offline (rarely needed).
Label-consistency rule (segmentation/detection).
Whatever transformation you apply to the image, you must apply identically to the mask / boxes / keypoints.
In Albumentations: `aug(image=img, mask=mask)` applies the identical transform to both.
Pitfalls.
Section take-homes.
Augmentation is a way to encode physical invariances as a prior.
Always augment in fine-tuning (essentially free regularization).
Match augmentations to the expected deployment distribution (cross-microscope, low-dose, etc.).
Combine augmentation with TL — the two compound multiplicatively in data efficiency.
“Learning on peas to count lentils.” (Sandfeld et al. 2024)
Optimization view (§2 callback).
Quantitatively. ImageNet-pretrained ResNet on 100 medical/materials images typically beats scratch-trained ResNet on 10 000 images.
ImageNet at a glance.
Hierarchical feature reuse.
The trick: keep the universal early layers, replace the class-specific late ones.
Three backbones, one steel-defect benchmark.
On a NEU-DET-style steel surface defect classification / detection task:
Practical recipe.
```python
# Off-the-shelf DINOv2 features (inference)
import torch
m = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
m.eval()  # frozen feature extractor

# Domain MAE pretraining
import timm
backbone = timm.create_model('vit_base_patch16_224.mae', pretrained=True)
# then continue MAE pretraining on your micrographs
```

Two adaptation paths.
Note
1080Ti budget. DINOv2-small inference and LoRA fine-tuning fit fine. DINOv2-large full fine-tuning does not — use LoRA or stick to the frozen-features path.
Backbone.
Head.
The TL workflow in two lines.
Recipe.
Freeze every backbone parameter (`param.requires_grad = False`).
Implementation (PyTorch).
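A minimal sketch, assuming a torchvision ResNet-50 and a 5-class task:

```python
import torch.nn as nn
from torchvision import models

n_classes = 5                                     # assumed task size
model = models.resnet50(weights="IMAGENET1K_V2")  # pretrained backbone
for param in model.parameters():
    param.requires_grad = False                   # freeze everything
model.fc = nn.Linear(model.fc.in_features, n_classes)  # new head, trainable by default
```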
When to use it.
Optimization view (§2).
Recipe.
Result. Backbone adapts to the new domain.
When to use it.
Optimization view.
The fine-tuning LR menu (typical starting point for ResNet-50 + custom head):
| Layer group | LR (Adam) | LR (SGD+momentum) |
|---|---|---|
| Head (new) | \(10^{-3}\) | \(10^{-2}\) |
| Layer 4 | \(10^{-4}\) | \(10^{-3}\) |
| Layer 3 | \(10^{-4}\) | \(10^{-3}\) |
| Layer 2 | \(10^{-5}\) | \(10^{-4}\) |
| Layer 1 | \(10^{-5}\) | \(10^{-4}\) |
Why it works.
Combine with §2 recipe.
The procedure (ULMFiT-style); a code sketch follows the list.
Freeze all but head; train head to convergence (\(\sim 5\) epochs).
Unfreeze the last backbone block; train (\(\sim 5\) epochs).
Unfreeze the second-to-last block; train.
Continue until all layers are unfrozen.
Optional: final fine-tuning with the full §2 recipe.
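A sketch of the unfreezing schedule for a ResNet-style backbone (block granularity follows the list above; the head size is an assumption):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V2")
model.fc = nn.Linear(model.fc.in_features, 5)   # new head (5 classes assumed)

for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True                      # stage 0: head only

# stages 1..4: unfreeze one block per stage, training ~5 epochs in between
for block in [model.layer4, model.layer3, model.layer2, model.layer1]:
    for p in block.parameters():
        p.requires_grad = True
    # ... train here before unfreezing the next block
```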
Why it works (optimization view).
When to use it. Very small datasets, large domain gap, or when plain fine-tuning is unstable.
Where natural-image features do transfer.
Where they don’t.
Diagnostic. If feature extraction performs worse than scratch training, the pretrained features actively mislead — switch backbone or pretrain on a closer source domain.
A more friendly TL setup.
Why it usually works very well.
Generalizes:
Setup. (Sandfeld et al. 2024) §19.3.1
Outcome.
Predicted nanoparticle segmentation on a held-out TEM frame using the recipe described above.
The catch. Synthetic data is too clean.
The recipe. (Sandfeld et al. 2024) §19.3.2. A code sketch follows the list.
Drop \(N\) random seed points in a 2D box.
Each pixel is assigned to the nearest seed (Voronoi cell).
The cell boundaries are exact grain boundaries.
Optionally: relax seed positions (Lloyd’s algorithm) for more regular grains.
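A minimal NumPy/SciPy sketch of the first three steps (Lloyd relaxation omitted; image size and seed count are assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

def voronoi_microstructure(n_seeds=50, size=256, rng=None):
    rng = np.random.default_rng(rng)
    seeds = rng.uniform(0, size, (n_seeds, 2))     # step 1: random seed points
    yy, xx = np.mgrid[0:size, 0:size]
    pixels = np.column_stack([xx.ravel(), yy.ravel()])
    _, label = cKDTree(seeds).query(pixels)        # step 2: nearest-seed assignment
    grains = label.reshape(size, size)
    # step 3: boundary where a pixel's grain differs from its right/lower neighbour
    bnd = np.zeros_like(grains, dtype=bool)
    bnd[:, :-1] |= grains[:, :-1] != grains[:, 1:]
    bnd[:-1, :] |= grains[:-1, :] != grains[1:, :]
    return grains, bnd
```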
Knobs.
Number of seeds \(N\) \(\to\) grain size distribution.
Seed-placement model (random / blue-noise / clustered) \(\to\) texture.
Boundary thickness, smoothing \(\to\) visual realism.
Anisotropy weights \(\to\) elongated grains (rolled metals).
Generate millions of unique microstructures in seconds; perfect masks for free.
Synthetic polycrystalline microstructure generated by Voronoi tessellation, with the corresponding ground-truth grain-boundary mask.
Step 1: clean geometry. Voronoi mask, binary boundaries.
Step 2: add texture. Per-grain intensity (random or orientation-dependent).
Step 3: add PSF blur. Gaussian blur with realistic \(\sigma_\text{psf}\).
Step 4: add noise. Poisson + Gaussian per detector model.
Step 5: add artifacts. Scan distortions, charging, gradients.
Each step closes a piece of the sim-to-real gap (code sketch below).
Step 1 alone \(\to\) network learns to detect “perfect line on uniform background”; fails completely on real SEM.
Through step 4 \(\to\) network learns to detect “blurred edges in noisy textured field”; works on most real SEM.
Step 5 (artifact simulation) \(\to\) robustness to specific instrument quirks.
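A hedged sketch of steps 2-4 on top of the Voronoi labels from the previous sketch (dose, blur, and noise levels are assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def render_sem_like(grains, sigma_psf=1.5, dose=200.0, read_noise=0.02, rng=None):
    rng = np.random.default_rng(rng)
    intensities = rng.uniform(0.3, 0.9, grains.max() + 1)  # step 2: per-grain texture
    img = intensities[grains]
    img = gaussian_filter(img, sigma_psf)                  # step 3: PSF blur
    img = rng.poisson(img * dose) / dose                   # step 4: Poisson (shot) noise
    img += rng.normal(0, read_noise, img.shape)            #         + Gaussian read noise
    return np.clip(img, 0, 1)

# usage: grains, bnd = voronoi_microstructure(); img = render_sem_like(grains)
```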
The phenomenon.
Why it happens (optimization view).
Closing the gap poses two ML problems: better simulation, or domain adaptation (next slide).
Strategy A: Make synthetic look real.
Strategy B: Domain randomization.
Strategy C: Fine-tune on real.
The pragmatic recommendation: Combine B (random synthesis) + C (real fine-tuning). Skip A unless you have GAN expertise.
Setup. (Sandfeld et al. 2024) Fig 19.11
Outcome.
Take-home. With careful augmentation, the synthetic→real transfer is real. No real labels were used.
Real SEM image (left) and predicted grain-boundary mask (right) from a U-Net trained only on Voronoi synthetic data. The model captures the closed-cell topology characteristic of polycrystalline metals.
Other domains where synthetic data shines.
Across all of these, the recipe is the same:
Take-homes.
Synthetic data closes the labeling gap but opens a sim-to-real gap.
Both gaps must be addressed: aggressive augmentation + (ideally) small real fine-tuning set.
Domain randomization (vary synthesis parameters widely) is the cheapest, most effective sim-to-real tool.
Cascade pretraining (ImageNet \(\to\) synthetic \(\to\) real) is the state-of-the-art workflow.
Optimization view (§2 callback).
Steps.
Pick a pretrained backbone (ResNet-50 / ConvNeXt / ViT-B).
Replace the head for your number of classes / output channels.
Freeze backbone; train head with Adam, LR \(\sim 10^{-3}\).
Unfreeze with discriminative LRs + warm-up + cosine schedule.
Switch to SGD+momentum for the last \(\sim 20\%\) of steps.
Augment throughout (geometric + intensity + noise).
Validate with grouped K-fold; held-out test on a different sample.
Map back to MFML W6.
| Step | MFML W6 idea |
|---|---|
| 1, 3 | Warm start in flat basin |
| 4 | Per-parameter LR + schedule + warm-up |
| 5 | Adam → SGD+momentum (flat minima) |
| 6 | Effective dataset size + invariances |
| 7 | Generalization measurement |
The problem. With \(N=100\) images:
The solution: K-fold cross-validation (Unit 3 callback).
The trap.
The fix: group by specimen.
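With scikit-learn this is one call (stand-in data below):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(100).reshape(-1, 1)           # stand-in: 100 images
y = np.random.randint(0, 2, 100)            # stand-in labels
specimen_id = np.repeat(np.arange(10), 10)  # 10 images per specimen

for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups=specimen_id):
    # no specimen appears on both sides of the split
    assert not set(specimen_id[train_idx]) & set(specimen_id[val_idx])
```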
The bigger principle.
Whatever varies in deployment must vary across train/test splits in your validation.
The idea.
Why this works.
Foreshadowing. Full treatment in W11 (automation) and W14 (Bayesian experiment design). Today: just know it exists and is the smart way to allocate your labeling budget.
The principle.
A small, never-augmented, never-touched test set is non-negotiable.
Why this matters.
The K-fold validation tells you about one slice of the distribution.
The gold-standard test is your external check.
Iterating on the validation set \(\Rightarrow\) overfit the validation set.
The gold-standard test catches that overfitting.
If you check the gold-standard test more than once, it stops being a gold-standard test.
Why early stopping is essential for fine-tuning.
Combine with cosine schedule. Cosine reduces LR toward zero, so by the end of training, gradient steps are small and overfitting is gentle. But early stopping is still cheap insurance.
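A minimal early-stopping loop; `train_one_epoch`, `evaluate`, and the loaders are hypothetical helpers:

```python
import torch

best_val, patience, wait = float("inf"), 5, 0
for epoch in range(100):
    train_one_epoch(model, train_loader)        # hypothetical helper
    val_loss = evaluate(model, val_loader)      # hypothetical helper
    if val_loss < best_val - 1e-4:              # meaningful improvement
        best_val, wait = val_loss, 0
        torch.save(model.state_dict(), "best.pt")  # keep the best checkpoint
    else:
        wait += 1
        if wait >= patience:                    # stop after 5 stagnant epochs
            break
```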
Fine-tuning is continued optimization. Every MFML W6 lesson applies — flat basins, per-layer LRs, schedules, batch noise.
Augment to encode physical invariances. Don’t break physics. Don’t leak.
Transfer from large-scale pretraining (ImageNet) and / or synthetic data. Cascade is best.
Synthetic data closes the labeling gap; aggressive augmentation closes the sim-to-real gap.
Validate with grouped K-fold + a true gold-standard test on a different specimen / microscope.
The recipe is the deliverable: backbone choice → head replace → freeze/unfreeze with discriminative LRs → warm-up + cosine → Adam→SGD switch → augment → group K-fold.
Modern levers when ImageNet is not enough: self-supervised backbones (DINOv2, MAE-on-lab-data) for features, and memory-light optimizers (Lion) for fitting larger models on lab GPUs.
Course textbooks.
Foundational references. (Howard and Ruder 2018) ULMFiT — discriminative fine-tuning; (Loshchilov and Hutter 2017) SGDR — warm restarts and cosine annealing; (Smith 2017) — LR-finder; (Wilson et al. 2017) — marginal value of adaptive methods; (Goyal et al. 2017) — linear-scaling rule; (Tobin et al. 2017) — domain randomization; (Kumar et al. 2022) — fine-tuning vs. linear probing; (Yosinski et al. 2014) — how transferable are features in deep neural networks.
MFML cross-reference. Unit 6 of the MFML course (this same week) covers loss landscapes, SGD/Adam, learning-rate schedules, and the batch–noise trade-off. Every practical recipe in this lecture is a direct application of that material.

© Philipp Pelz - Machine Learning in Materials Processing & Characterization