Machine Learning in Materials Processing & Characterization
Unit 6: Data Scarcity & Transfer Learning

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


0. The Bottleneck: “Small Data”

Why is Materials Data Scarce?

Today’s Learning Journey:

  • The Problem: Acquisition costs, expert labeling, and overfitting.
  • Data Augmentation: Geometric and physical transformations.
  • Transfer Learning: Reusing knowledge from “Big Data” models.
  • Synthetic Training: Generating labels for free via procedural simulation.
  • Practical Recipe: Fine-tuning best practices.

1. The Small Data Challenge

The Materials Reality

  • Standard Deep Learning (e.g., ImageNet) assumes millions of labeled images.
  • Materials science datasets often contain hundreds of labeled examples or fewer.
  • Result: Models that memorize (overfit) instead of generalizing.

The Survival Kit

  1. Augment: Multiply your data.
  2. Transfer: Start with a pretrained “brain.”
  3. Synthesize: Create digital twins for training.

2. Data Augmentation

Expanding the Feature Space

  • Geometric: Flips, Rotations, Scaling, Cropping.
  • Intensity: Brightness, Contrast, Gamma.
  • Noise: Gaussian, Poisson, Blur.
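The three augmentation families above can be sketched with NumPy alone; in practice you would use a library such as Albumentations or torchvision.transforms, but the logic is the same. The probabilities and noise levels below are illustrative assumptions, not recommended values.

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply random geometric, intensity, and noise transforms to a 2D image."""
    # Geometric: random 90-degree rotation and flips.
    img = np.rot90(img, k=rng.integers(0, 4))
    if rng.random() < 0.5:
        img = np.flipud(img)
    if rng.random() < 0.5:
        img = np.fliplr(img)
    # Intensity: random brightness/contrast via gain and offset.
    gain = rng.uniform(0.8, 1.2)
    offset = rng.uniform(-0.05, 0.05)
    img = gain * img + offset
    # Noise: additive Gaussian, mimicking detector read noise.
    img = img + rng.normal(0.0, 0.02, size=img.shape)
    return np.clip(img, 0.0, 1.0)

rng = np.random.default_rng(0)
batch = [augment(np.full((64, 64), 0.5), rng) for _ in range(8)]
```

Each call draws fresh random parameters, so one original image yields many distinct training samples.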

Physical Validity

  • Transformations must not violate materials physics.
  • Warning: Never place an original image in the training set and its rotated copy in the test set — that is data leakage.
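Leakage is avoided by splitting on *source images*, not on samples: every augmented variant must land on the same side of the split as its original. Libraries provide this (e.g. scikit-learn's `GroupShuffleSplit`); a minimal hand-rolled sketch, assuming each sample carries the id of its source image:

```python
import numpy as np

def group_split(groups, test_fraction=0.2, seed=0):
    """Split sample indices so all samples sharing a group id
    (an original image and its augmented copies) stay together."""
    groups = np.asarray(groups)
    unique = np.unique(groups)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique)
    n_test = max(1, int(round(test_fraction * len(unique))))
    test_groups = set(unique[:n_test])
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    return train_idx, test_idx

# Each original image contributes several augmented samples with the same id.
groups = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
train_idx, test_idx = group_split(groups, test_fraction=0.25)
```

No group id ever appears in both subsets, so augmented copies can never leak across the split.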

3. Transfer Learning

Knowledge Reuse

  • “Learning on Peas to count Lentils.” Sandfeld, Stefan et al., (2024)
  • ImageNet Pretraining: Using models trained on 14M natural images as a starting point.
  • Hierarchical Filters: Early layers detect edges and textures that are universal across imaging modalities.

The Recipe

  • Backbone: Pretrained feature extractor (Frozen).
  • Head: New classifier trained on your scientific data.
  • Fine-Tuning: Gradually unfreezing deeper layers and training with low learning rates.

4. Learning from Synthetic Data

Procedural Generation

  • Grain Networks: Using Voronoi tessellations to generate microstructures. Sandfeld, Stefan et al., (2024)
  • Spectral Simulation: Building peak models with realistic noise.
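A Voronoi microstructure can be generated in a few lines: assign every pixel to its nearest random seed point, and the seed index *is* the grain label, so the segmentation ground truth comes for free. A minimal NumPy sketch (image size and grain count are arbitrary choices):

```python
import numpy as np

def voronoi_grains(size=128, n_grains=20, seed=0):
    """Label image of a synthetic microstructure: each pixel gets the
    index of its nearest seed point (a discrete Voronoi tessellation)."""
    rng = np.random.default_rng(seed)
    seeds = rng.uniform(0, size, size=(n_grains, 2))
    yy, xx = np.mgrid[0:size, 0:size]
    pix = np.stack([yy.ravel(), xx.ravel()], axis=1).astype(float)
    # Squared distance from every pixel to every seed; nearest seed = label.
    d2 = ((pix[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1).reshape(size, size)

labels = voronoi_grains()
# Grain-boundary mask: pixels whose right or lower neighbour differs.
boundary = (labels[:, :-1] != labels[:, 1:])[:-1] | (labels[:-1] != labels[1:])[:, :-1]
```

`labels` and `boundary` form an (image, ground truth) pair — exactly the kind of free annotation a segmentation network needs.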

The “Sim-to-Real” Gap

  • Synthetic data is often “too clean.”
  • Solution: Domain adaptation and adding “physics-informed” noise to simulations.
  • Case Study: Models trained only on Voronoi data segment real SEM grain boundaries.
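One way to add "physics-informed" noise is to push the clean simulation through a crude detector model before training: a point-spread blur, finite-dose Poisson (shot) noise, and Gaussian read noise. A sketch with assumed dose and noise parameters (the box blur is a deliberately simple stand-in for a real PSF):

```python
import numpy as np

def degrade(clean, dose=500.0, blur_sigma=1.0, seed=0):
    """Corrupt a clean simulated image with detector-like physics:
    PSF blur, shot noise at finite dose, and additive read noise."""
    rng = np.random.default_rng(seed)
    # Separable box blur standing in for the point-spread function.
    k = max(1, int(3 * blur_sigma))
    kernel = np.ones(k) / k
    blurred = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="same"), 1, clean)
    blurred = np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="same"), 0, blurred)
    # Shot noise: pixel intensities become Poisson counts at the given dose.
    counts = rng.poisson(np.clip(blurred, 0, None) * dose) / dose
    # Read noise: additive Gaussian from the camera electronics.
    return counts + rng.normal(0.0, 0.01, size=clean.shape)

clean = np.zeros((64, 64))
clean[16:48, 16:48] = 1.0
noisy = degrade(clean)
```

Training on `noisy` rather than `clean` narrows the sim-to-real gap, since the network never sees the "too clean" images it would not encounter at the microscope.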

5. Summary & Handoff

Top Takeaways

  1. Never train from scratch if a pretrained model exists.
  2. Augmentation enforces physical invariances (e.g., rotation).
  3. Synthetic data provides infinite labels—use it for pre-pretraining.
  4. Validation must be rigorous (K-Fold, Group-based) to avoid being fooled by small data.

Exercise Handoff

  • Load a pretrained ResNet-50.
  • Implement an augmentation pipeline using Albumentations.
  • Fine-tune on the Low-Data Microstructure dataset.

References

Sandfeld, Stefan et al. (2024). Materials Data Science. Springer.