Machine Learning in Materials Processing & Characterization
Unit 6: Data Scarcity & Transfer Learning

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


01. The Materials Data Bottleneck

  • Expensive data: 1 micrograph ≈ hours of prep + days of labeling
  • Big models vs. small data: A model with \(10^7\) parameters will memorize 100 samples
  • Goal: Build deep models that generalize even when data is scarce

02. Learning Outcomes

By the end of this unit, you can:

  1. Explain why materials data is scarce and how this leads to overfitting
  2. Design physically valid augmentation pipelines
  3. Distinguish feature extraction from fine-tuning in transfer learning
  4. Apply gradual unfreezing and differential learning rates
  5. Evaluate synthetic-to-real transfer approaches
  6. Build a complete small-data training workflow

Part 1: The Small Data Challenge

Slides 03–08

03. The “Big Data” Myth in Materials Science

  • Materials labs generate TBs of raw data (e.g., 4D-STEM datasets)
  • But labeled data is extremely sparse
  • In computer vision (ImageNet): labels are cheap (crowdsourcing)
  • In materials science: labels require PhD-level experts and hours of annotation

04. Why Is Materials Data Scarce?

  1. High acquisition cost: Synchrotron beamtime, specialized TEMs
  2. Limited facility access: Only a few instruments in the world for some techniques
  3. Expert annotation time: Segmenting 100 grains in an SEM takes hours
  4. Reproducibility barriers: Different instruments produce different images

05. The Labeled Data Gap

Domain              Typical Dataset Size    Labels
ImageNet            14,000,000 images       Crowdsourced
Medical imaging     10,000–100,000 images   Expert radiologists
Materials science   50–500 images           PhD microscopists

Standard deep learning (ResNet-50: 25M parameters) is designed for 1M+ images.

If we train from scratch on 100 images → guaranteed overfitting.

06. Overfitting on Small Data

  • Model “memorizes” specific noise and artifacts of those 100 images
  • Fails catastrophically on a new dataset from a different microscope
  • Classic symptoms:
    • Training accuracy: 99%
    • Test accuracy: 55% (barely better than random)

07. The “Small Data” Survival Kit

Three strategies to overcome data scarcity:

  1. Data Augmentation: Multiply data by applying valid transformations
  2. Transfer Learning: Reuse knowledge from large-dataset models
  3. Synthetic Training: Generate labeled data for free using simulations

graph LR
    PT["Pretrained Model<br>(ImageNet)"] --> TL["Transfer Learning"]
    SD["Synthetic Data<br>(Voronoi, Phase Field)"] --> Pre["Pretraining"]
    Pre --> TL
    TL --> FT["Fine-Tuning"]
    Aug["Data Augmentation"] --> FT
    FT --> M["Final Model"]
    style M fill:#e7ad52,color:#000

08. Part 1 Recap

  1. Materials science has a labeled data bottleneck (50-500 images typical)
  2. Standard deep learning overfits massively on small datasets
  3. Three complementary strategies: Augmentation, Transfer, Synthetic
  4. These strategies are not alternatives — use them all together

Part 2: Data Augmentation

Slides 09–20

09. Concept: Artificially Expanding the Dataset

  • “Reusing existing images by applying transformations”
  • A form of oversampling — the same physical content, different pixel arrangements
  • Forces the network to focus on structure, not specific pixel patterns

10. Geometric Transformations

  • Flips: Horizontal, vertical
  • Rotations: 90°, 180°, 270° (or arbitrary angles)
  • Scaling/Cropping: Zoom in/out, random crops
  • Elastic deformation: Simulating sample warping or drift

Each transformation multiplies your effective dataset size. Horizontal and vertical flips alone give 4× more data (original, H, V, H+V).

11. Invariance via Augmentation

  • By rotating images, we force the network to be rotation-invariant
  • Crucial for microstructures where “up” and “down” are arbitrary
  • The augmentation encodes physical knowledge into the training process

Note

Augmentation is a way to tell the network: “This transformation doesn’t change the physics.”

12. When Augmentation Is “Illegal”

Physical reality check: Transformations must not violate materials physics!

  • Don’t rotate if there’s a physical gradient (e.g., surface hardening layer, directional solidification)
  • Don’t flip vertically if gravity matters (e.g., sedimentation structures)
  • Don’t warp if topology is critical (e.g., grain boundary network connectivity)

Think before you augment: “Would this transformation produce a physically plausible image?”

13. Intensity Transformations

  • Brightness jittering: ±10-20% intensity variation
  • Contrast adjustment: Simulating different detector settings
  • Gamma correction: Non-linear intensity mapping

Purpose: Make the model robust to different imaging conditions. A model trained at one brightness level should work at another.

14. Adding “Physical” Noise

Gaussian noise: Electronic/thermal noise

  • Simulates detector readout noise
  • Makes model robust to noisy images

Poisson/Shot noise: Counting statistics

  • Simulates low-dose conditions
  • Important for electron microscopy

Blur: Gaussian or motion blur

  • Simulates defocus or sample drift
  • Forces model to rely on structure, not sharpness
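
A minimal sketch of these three noise models in NumPy/SciPy; the dose, standard deviation, and blur values below are illustrative assumptions, not calibrated instrument parameters:

import numpy as np
from scipy.ndimage import gaussian_filter

def add_physical_noise(img, dose=200.0, gauss_std=0.02, blur_sigma=1.0):
    """Apply shot noise, readout noise, and defocus blur to a [0, 1] float image."""
    # Poisson/shot noise: interpret intensity as expected counts at a given dose
    noisy = np.random.poisson(img * dose) / dose
    # Gaussian readout noise from the detector electronics
    noisy = noisy + np.random.normal(0.0, gauss_std, img.shape)
    # Slight Gaussian blur to mimic defocus or sample drift
    return np.clip(gaussian_filter(noisy, sigma=blur_sigma), 0.0, 1.0)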

15. Advanced Augmentations

  • CutOut / Random Erasing: Mask random regions with zeros
    • Handles occlusions and artifacts (contamination spots)
  • Mixup: Linear combination of two images and their labels
    • \(x' = \lambda x_1 + (1-\lambda) x_2\), \(y' = \lambda y_1 + (1-\lambda) y_2\)
    • Regularizes the model, smooths decision boundaries
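
A minimal PyTorch sketch of batch-level Mixup, assuming the labels y are already one-hot (or soft) vectors; the Beta(0.2, 0.2) prior is a common default choice, not a value prescribed here:

import torch

def mixup_batch(x, y, alpha=0.2):
    """Mix each sample with a randomly permuted partner from the same batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[perm]  # x' = lambda*x1 + (1-lambda)*x2
    y_mixed = lam * y + (1 - lam) * y[perm]  # y' = lambda*y1 + (1-lambda)*y2
    return x_mixed, y_mixed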

16. Implementation: Torchvision & Albumentations

import albumentations as A

transform = A.Compose([
    A.HorizontalFlip(p=0.5),                        # geometric: mirror symmetry
    A.RandomRotate90(p=0.5),                        # geometric: 90° rotations
    A.GaussNoise(var_limit=(10, 50), p=0.3),        # intensity: detector noise
    A.RandomBrightnessContrast(p=0.3),              # intensity: imaging conditions
    A.ElasticTransform(alpha=120, sigma=6, p=0.2),  # geometric: sample warping
])

# Apply to image AND mask simultaneously
augmented = transform(image=image, mask=mask)

17. On-the-fly vs. Offline Augmentation

Offline:

  • Generate augmented images on disk before training
  • Pro: Faster training
  • Con: Fixed set of augmentations

On-the-fly (preferred):

  • Transform images in RAM during each batch
  • Pro: Infinite diversity — each epoch sees different augmentations
  • Con: Slightly slower per batch
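
A minimal sketch of the preferred on-the-fly approach inside a PyTorch Dataset, reusing an albumentations pipeline like the one on slide 16; the class name and the assumption that images and masks arrive as NumPy arrays are illustrative:

import torch
from torch.utils.data import Dataset

class MicrographDataset(Dataset):
    """Draws a fresh random augmentation every time a sample is fetched."""

    def __init__(self, images, masks, transform=None):
        self.images, self.masks, self.transform = images, masks, transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image, mask = self.images[idx], self.masks[idx]
        if self.transform is not None:
            # New random parameters per access -> each epoch sees different data
            out = self.transform(image=image, mask=mask)
            image, mask = out['image'], out['mask']
        return torch.from_numpy(image).float(), torch.from_numpy(mask).long()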

18. The Label Consistency Rule

If you transform the image, you must transform the labels identically!

  • Rotate image → rotate mask
  • Flip image → flip mask
  • Crop image → crop mask at the same location

Intensity augmentations (brightness, noise) don’t affect labels — only geometric ones do.

19. Think About This: Augmentation Design

Scenario: You have 50 SEM images of a laser-welded joint. The weld bead runs left-to-right. You want to classify weld quality (good/defective).

Which augmentations are valid?

  • Horizontal flip: Valid (symmetric about weld center)
  • Vertical flip: Invalid (top surface ≠ bottom)
  • 90° rotation: Invalid (weld direction matters)
  • Brightness jitter: Valid
  • Gaussian noise: Valid

20. Part 2 Recap

  1. Augmentation multiplies your effective dataset size
  2. Geometric transforms encode physical symmetries
  3. Only apply transformations that produce physically plausible images
  4. Noise augmentation prepares models for real experimental conditions
  5. Always transform images and labels together

Part 3: Transfer Learning

Slides 21–32

21. Concept: Knowledge Reuse

“Learning on Peas to count Lentils.” — Sandfeld (2024)

  • Take a model trained on Task A (e.g., classifying dogs vs. cats)
  • Adapt it for Task B (e.g., classifying phases in micrographs)
  • Why does this work? Because early visual features are universal

22. Why ImageNet Features Transfer

ImageNet: over 14 million images in total; the ILSVRC subset used for pretraining has ~1.3 million images across 1000 classes (dogs, cats, cars, buildings…)

The hierarchical features learned on ImageNet:

  • Layer 1: Edges, gradients → universal
  • Layer 2: Textures, corners → mostly universal
  • Layer 3: Object parts → domain-specific
  • Layer 4+: Full objects → very domain-specific

Early layers transfer well. Late layers need adaptation.

23. The Backbone and the Head

graph LR
    I["Input<br>Image"] --> BB["Backbone<br>(ResNet, VGG)<br>Pretrained"]
    BB --> H["Head<br>(FC layers)<br>New"]
    H --> O["Output<br>(your classes)"]
    style BB fill:#4a9eff,color:#fff
    style H fill:#e7ad52,color:#000

  • Backbone: The feature extractor — pretrained on ImageNet
  • Head: The classifier/regressor — newly initialized for your task
  • Replace the head to match your number of classes

24. Strategy 1: Feature Extraction

  • Freeze the entire backbone (no weight updates)
  • Train only the new head on your materials dataset
  • The backbone becomes a fixed feature extractor

When to use: Very small dataset (<100 images), risk of overfitting is high.

Advantage: Fast training, minimal risk of destroying pretrained features.

Disadvantage: Cannot adapt backbone to domain-specific textures.
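
A minimal PyTorch sketch of this strategy for a torchvision ResNet-50; the class count is an illustrative placeholder:

import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the backbone: pretrained weights receive no gradient updates
for param in model.parameters():
    param.requires_grad = False

# Replace the head; the fresh layer is trainable by default
num_classes = 4  # illustrative, e.g. four phase classes
model.fc = nn.Linear(model.fc.in_features, num_classes)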

25. Strategy 2: Fine-Tuning

  • Initialize with pretrained weights
  • Train the entire network (or the last few layers)
  • Use a very low learning rate for the backbone

When to use: Moderate dataset (100-1000 images), enough to adapt the backbone.

Advantage: Backbone adapts to “micrograph-specific” textures.

Risk: Catastrophic forgetting — destroying useful pretrained features with aggressive updates.

26. Differential Learning Rates

  • High LR for the head (\(10^{-3}\)): Learning new classes from scratch
  • Low LR for the backbone (\(10^{-5}\)): Gently adapting existing features
  • Ratio: Typically 100× between head and backbone

optimizer = torch.optim.Adam([
    {'params': model.backbone.parameters(), 'lr': 1e-5},  # gently adapt features
    {'params': model.head.parameters(), 'lr': 1e-3},      # learn new classes
])

27. Gradual Unfreezing

A safer fine-tuning protocol:

  1. Freeze all backbone layers. Train head until convergence.
  2. Unfreeze the last backbone block. Train with low LR.
  3. Unfreeze the next block. Train further.
  4. Repeat until the entire network is fine-tuned.

This prevents catastrophic forgetting of low-level features while allowing high-level adaptation.
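
A sketch of this protocol for a torchvision ResNet (block names follow torchvision's ResNet; training between stages is left as comments):

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 4)  # new head, illustrative class count

# Step 1: freeze everything except the head, train the head until convergence
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ... train the head ...

# Steps 2-4: unfreeze one block at a time, top of the backbone first
for block in (model.layer4, model.layer3, model.layer2, model.layer1):
    for param in block.parameters():
        param.requires_grad = True
    optimizer.add_param_group({'params': block.parameters(), 'lr': 1e-5})
    # ... train a few epochs before unfreezing the next block ...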

28. The Domain Gap: Natural vs. Scientific

Natural images and micrographs differ in:

  • Color: RGB vs. grayscale / 16-bit
  • Perspective: 3D with vanishing points vs. orthographic top-down
  • Textures: Organic, varied vs. crystallographic, periodic
  • Noise: Compression artifacts vs. shot noise

If the domain gap is large, more fine-tuning is needed. Feature extraction alone may not suffice.

29. Cross-Material Transfer

  • Train on a large database of steel micrographs
  • Fine-tune on a small set of aluminum samples
  • Physics intuition: Grain boundary topology is similar across alloy systems

graph LR
    S["Steel Dataset<br>(1000 images)"] --> PT["Pretrain<br>CNN"]
    PT --> FT["Fine-tune"]
    A["Aluminum Dataset<br>(50 images)"] --> FT
    FT --> M["Aluminum<br>Classifier"]
    style M fill:#e7ad52,color:#000

30. Success Story: Au Nanoparticle Segmentation

  • Task: Segment crystalline Au nanoparticles from amorphous TEM background
  • Method: U-Net initialized with ImageNet weights
  • Result: High accuracy despite limited labeled TEM frames

ImageNet pretraining helped even though ImageNet contains no TEM images — the low-level features transferred.

31. Transfer from Simulations

  • Pretrain on simulated data (DFT, molecular dynamics, phase field)
  • Fine-tune on real experiments
  • Advantage: Simulations provide unlimited labeled data at zero annotation cost

The next frontier: physics-simulation-based pretraining for materials ML.

32. Part 3 Recap

  1. Don’t train from scratch — always start with a pretrained backbone
  2. Feature extraction: Freeze backbone, train head only (safest)
  3. Fine-tuning: Adapt backbone with low LR (more powerful)
  4. Gradual unfreezing prevents catastrophic forgetting
  5. Differential learning rates: 100× between head and backbone
  6. Transfer works across domains (natural → scientific) and materials (steel → aluminum)

Part 4: Learning from Synthetic Data

Slides 33–42

33. The “Infinite Data” Dream

  • If we can simulate the microstructure, we can generate unlimited labeled data
  • Perfect masks for free: No expert annotation needed
  • Controllable: We choose the parameters (grain size, phase fraction, noise level)

Note

Synthetic data flips the labeling bottleneck: instead of labeling real images, we generate images from known structures.

34. Generating Grain Microstructures

Voronoi Tessellations:

  • Distribute random seed points in 2D
  • Assign each pixel to its nearest seed → grain regions
  • Parameters: number of seeds (grain count), regularity, boundary thickness
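
A minimal sketch of such a generator, using scipy's cKDTree for the nearest-seed lookup:

import numpy as np
from scipy.spatial import cKDTree

def voronoi_grains(size=256, n_grains=50, seed=0):
    """Return a (size, size) label map of grain IDs via nearest-seed assignment."""
    rng = np.random.default_rng(seed)
    seeds = rng.uniform(0, size, (n_grains, 2))       # random grain centers
    yy, xx = np.mgrid[0:size, 0:size]
    pixels = np.stack([yy.ravel(), xx.ravel()], axis=1)
    _, labels = cKDTree(seeds).query(pixels)          # index of nearest seed
    return labels.reshape(size, size)

labels = voronoi_grains()  # integer grain IDs 0..n_grains-1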

35. From Geometry to Realistic Image

A raw Voronoi diagram doesn’t look like an SEM image. We need to add:

  1. Grain contrast: Random intensity per grain
  2. Boundary appearance: Thickened, possibly bright or dark boundaries
  3. Texture: Per-grain crystallographic texture
  4. Noise: Gaussian + Poisson to simulate detector noise
  5. Blur: Slight defocus
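
A sketch of steps 1, 2, 4, and 5 applied to such a label map (per-grain texture is omitted for brevity; all parameter values are illustrative):

import numpy as np
from scipy.ndimage import gaussian_filter

def render_micrograph(labels, dose=300.0, blur_sigma=0.8, seed=0):
    """Turn an integer grain-label map into a noisy grayscale 'SEM-like' image."""
    rng = np.random.default_rng(seed)
    # 1. Grain contrast: one random intensity per grain ID
    img = rng.uniform(0.3, 0.9, labels.max() + 1)[labels]
    # 2. Boundary appearance: darken pixels whose neighbor has a different label
    boundary = (np.diff(labels, axis=0, prepend=labels[:1]) != 0) | \
               (np.diff(labels, axis=1, prepend=labels[:, :1]) != 0)
    img[boundary] = 0.05
    # 4. Poisson counting noise, 5. slight defocus blur
    img = rng.poisson(img * dose) / dose
    return np.clip(gaussian_filter(img, sigma=blur_sigma), 0.0, 1.0)

image = render_micrograph(labels)  # labels from the Voronoi sketch on slide 34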

36. The Sim-to-Real Gap

  • Synthetic data is often “too clean” or “too regular”
  • Real microstructures have:
    • Non-uniform lighting
    • Sample preparation artifacts (scratches, contamination)
    • Complex grain morphologies that Voronoi can’t capture

CNNs might learn synthetic-only features and fail on real SEMs.

37. Domain Adaptation: Closing the Gap

Making synthetic images look more like real ones:

  • Style transfer: Apply the “style” of real SEMs to synthetic geometry
  • GANs (Generative Adversarial Networks): Train a generator to produce realistic textures
  • Noise modeling: Use measured noise characteristics from real instruments

38. Case Study: SEM Grain Segmentation

  • Model trained only on Voronoi synthetic data
  • Tested on real polycrystalline SEM images
  • Result: Nearly perfect grain boundary segmentation!

The synthetic data captured the topological truth of grain networks — boundaries, junctions, and connectivity patterns.

39. Adaptive Data Generation

  1. Train on 1000 synthetic images
  2. Test on real images → find the hardest cases (worst predictions)
  3. Analyze what makes them hard (unusual grain shapes? specific textures?)
  4. Generate targeted synthetic data mimicking those hard cases
  5. Retrain and iterate

This is a form of active learning for synthetic data generation.

40. Procedural Generation for Spectra

Synthetic data works for more than images:

  • XRD patterns: Simulate peaks with varying noise, background, and peak overlap
  • EELS spectra: Simulate edges with realistic energy loss and plural scattering
  • EDS maps: Simulate elemental distributions with counting noise

The same principle: if you can simulate it, you can label it for free.
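
A sketch for the 1D case: an XRD-like pattern from Gaussian peaks on a sloping background with Poisson counting noise. Peak positions, widths, and heights below are arbitrary illustrations, not a real phase:

import numpy as np

def synthetic_xrd(two_theta, peaks, widths, heights, bg=50.0, seed=0):
    """Gaussian peaks + smooth background + Poisson counting statistics."""
    rng = np.random.default_rng(seed)
    pattern = bg * (1.0 - two_theta / two_theta.max())   # sloping background
    for pos, w, h in zip(peaks, widths, heights):
        pattern += h * np.exp(-0.5 * ((two_theta - pos) / w) ** 2)
    return rng.poisson(pattern).astype(float)            # counting noise, label known

two_theta = np.linspace(20, 80, 3000)
pattern = synthetic_xrd(two_theta, peaks=[28.4, 47.3, 56.1],
                        widths=[0.15, 0.18, 0.20], heights=[900, 450, 300])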

41. Think About This: When Synthetic Data Fails

Scenario: You generate synthetic EBSD maps using a grain growth simulation. Your CNN achieves 95% accuracy on synthetic test data but only 60% on real EBSD maps.

What went wrong?

Possible causes:

  1. Sim-to-real gap: Simulated grain shapes too regular
  2. Missing artifacts: Real EBSD has indexing errors, no-solution pixels
  3. Missing physics: Simulation doesn’t capture twinning or deformation textures
  4. Overfitting to synthetic style: Model learned simulation artifacts

42. Part 4 Recap

  1. Synthetic data provides unlimited labeled data at zero annotation cost
  2. Voronoi tessellations are a simple but effective grain generator
  3. Realism pipeline (contrast, texture, noise, blur) bridges the sim-to-real gap
  4. Domain adaptation (style transfer, GANs) for difficult domain gaps
  5. Adaptive generation focuses on the hardest cases

Part 5: Practical Workflow & Best Practices

Slides 43–50

43. The Complete Fine-Tuning Recipe

  1. Select a pretrained architecture (e.g., ResNet-50, EfficientNet)
  2. Replace the final layer for your number of classes/targets
  3. Freeze all backbone layers
  4. Train the head with standard LR (\(10^{-3}\)), augmented data
  5. Unfreeze gradually, train with low LR (\(10^{-5}\))
  6. Early stopping based on validation loss

Note

This recipe works for 90% of materials classification and segmentation tasks.

44. Validation in the Small Data Regime

  • K-Fold CV is mandatory (Unit 3 review)
  • Be extremely wary of augmentation leakage: don’t have an image and its rotation in both train and test!
  • Always split by specimen, not by individual crop

from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=specimen_ids):
    # Each fold: all crops from one specimen in same fold
    ...

45. Group-Based Splitting Revisited

graph TD
    S1["Specimen 1<br>(10 crops)"] --> Train["Training Fold"]
    S2["Specimen 2<br>(10 crops)"] --> Train
    S3["Specimen 3<br>(10 crops)"] --> Train
    S4["Specimen 4<br>(10 crops)"] --> Test["Test Fold"]
    S5["Specimen 5<br>(10 crops)"] --> Test

  • If you have 5 specimens with 10 crops each = 50 images
  • Split by specimen: 3 train, 2 test (not 40 random/10 random!)
  • Then augment the 30 training images to 300+

46. Early Stopping

  • Monitor validation loss during training
  • Stop when validation loss starts increasing (even if training loss is still decreasing)
  • Small datasets are prone to sudden overfitting late in training

from tensorflow.keras.callbacks import EarlyStopping  # Keras callback; PyTorch needs a manual equivalent

early_stopping = EarlyStopping(
    monitor='val_loss',           # watch validation loss
    patience=10,                  # stop after 10 epochs without improvement
    restore_best_weights=True     # roll back to the best checkpoint
)

47. Active Learning: Maximizing Expert Time

  • Instead of labeling all images equally, let the model guide annotation
  • Uncertainty-based: Label images where the model is most uncertain
  • Diversity-based: Label images that are most different from already-labeled data

Maximizes the value of every expert hour — 50 strategically chosen labels can beat 500 random ones.
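
A minimal sketch of the uncertainty-based variant, ranking unlabeled images by the entropy of the model's softmax output; unlabeled_loader is an assumed DataLoader yielding batches of images:

import torch

@torch.no_grad()
def rank_by_uncertainty(model, unlabeled_loader, device='cpu'):
    """Return sample indices sorted from most to least uncertain."""
    model.eval()
    entropies = []
    for (x,) in unlabeled_loader:                  # batches of unlabeled images
        probs = torch.softmax(model(x.to(device)), dim=1)
        h = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)  # per-sample entropy
        entropies.append(h.cpu())
    return torch.cat(entropies).argsort(descending=True)  # label the top-k first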

48. The “Gold Standard” Test Set

  • Even with TL/augmentation/synthetic data, you need a small, high-quality benchmark test set
  • Hand-labeled by multiple experts
  • From a different instrument or session than training data
  • This is the absolute benchmark — never touched during training

Your model is only as credible as your test set is rigorous.

49. Summary: The Complete Small-Data Strategy

  1. Augment your data to enforce physical invariances
  2. Transfer knowledge from ImageNet or domain-specific pretrained models
  3. Synthetic data provides infinite labels if generated carefully
  4. Validate rigorously: grouped K-fold, early stopping, gold standard test set
  5. Combine all three strategies for maximum effectiveness

50. Unit 6 Summary & Next Steps

Key Takeaways:

  1. Materials science is the land of small data — act accordingly
  2. Augmentation is free — use it always, but respect the physics
  3. Transfer learning is the single most impactful technique
  4. Synthetic data can replace expensive annotation
  5. Validation must be even more rigorous when data is scarce

Reading:

Next Week: Unit 7 — Learning from Processing Data: Time Series & Sequence Models

References

McClarren, Ryan G. 2021. Machine Learning for Engineers: Using Data to Solve Problems for Physical Systems. Springer.
Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.
Sandfeld, Stefan et al. 2024. Materials Data Science. Springer.