Data Science for Electron Microscopy
Lecture 7: Gaussian Processes 2

Philipp Pelz

FAU Erlangen-Nürnberg

Bayesian Optimization & Active Learning

Introduction

  • Bayesian Optimization is a method for optimizing black-box functions that are expensive to evaluate.
  • Useful for tuning hyperparameters in machine learning models.

Gold Mining Analogy

  • Goal: Find the maximum gold content along a line with minimal drillings.
  • Two objectives:
    1. Active Learning: Estimate gold distribution accurately.
    2. Bayesian Optimization: Find the location of maximum gold content.

Initially, we have no idea about the gold distribution. We can learn the gold distribution by drilling at different locations. However, this drilling is costly. Thus, we want to minimize the number of drillings required while still finding the location of maximum gold quickly.

Active Learning

  • Aim: Minimize labeling costs while maximizing modeling accuracy.
  • Strategy: Label the point with the highest model uncertainty (variance).
  • Use Gaussian Process (GP) as a surrogate model for uncertainty estimates.
  • Each new data point updates our surrogate model, moving it closer to the ground truth. The black line and the grey shaded region indicate the mean μ and uncertainty μ±σ in our gold distribution estimate before and after drilling.
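
The variance-based strategy can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's implementation: the zero-mean GP with a unit-variance RBF kernel, the length scale, the jitter `noise`, and all function names are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of locations."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, length_scale=1.0, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP (assumed prior)."""
    K = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_query, length_scale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = 1.0 - np.sum(v ** 2, axis=0)  # prior variance k(x, x) = 1
    return mean, np.sqrt(np.clip(var, 0.0, None))

def next_drill_location(x_train, y_train, x_candidates, **gp_kwargs):
    """Active learning rule: drill where the posterior uncertainty is largest."""
    _, std = gp_posterior(x_train, y_train, x_candidates, **gp_kwargs)
    return x_candidates[np.argmax(std)]
```

Calling `next_drill_location` after every drilling and appending the new measurement to `x_train` and `y_train` reproduces the loop described above.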

Bayesian Optimization

  • Aim: Find the maximum of an unknown function efficiently.
  • Balance exploration (unknown regions) and exploitation (known high-value regions).
  • Key component: Acquisition function.

To solve this problem, we use the following algorithm:

  1. Choose a surrogate model for the true function f and define its prior.
  2. Given the set of observations (function evaluations), use Bayes’ rule to obtain the posterior.
  3. Use an acquisition function α(x), which is a function of the posterior, to decide the next sample point.
  4. Add the newly sampled data to the set of observations and return to step 2 until convergence or until the budget is exhausted.
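
The four steps map onto a short loop. The sketch below is only an illustration under stated assumptions: a zero-mean GP with a unit-variance RBF kernel evaluated on a fixed candidate grid, and an acquisition passed in as a callable `acquisition(mean, std, best)`. The names `bayes_opt`, `posterior`, and the example acquisition `mean + 2 * std` are hypothetical choices, not part of the lecture.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel between two 1-D arrays."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def posterior(x_obs, y_obs, x_grid, ell=1.0, noise=1e-6):
    """Step 2: GP posterior mean and standard deviation on the grid."""
    K = rbf(x_obs, x_obs, ell) + noise * np.eye(len(x_obs))
    K_s = rbf(x_obs, x_grid, ell)
    mean = K_s.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def bayes_opt(f, x_grid, x_init, acquisition, budget=10):
    """Run `budget` rounds of posterior update, acquisition, and evaluation."""
    x_obs = np.array(x_init, dtype=float)
    y_obs = np.array([f(x) for x in x_obs])
    for _ in range(budget):
        mean, std = posterior(x_obs, y_obs, x_grid)       # step 2
        scores = acquisition(mean, std, y_obs.max())      # step 3
        x_next = x_grid[np.argmax(scores)]
        x_obs = np.append(x_obs, x_next)                  # step 4
        y_obs = np.append(y_obs, f(x_next))
    best = np.argmax(y_obs)
    return x_obs[best], y_obs[best]

# Example usage with a toy objective and an illustrative acquisition.
toy_objective = lambda x: float(np.exp(-(x - 2.0) ** 2))
optimistic = lambda mean, std, best: mean + 2.0 * std
x_best, y_best = bayes_opt(toy_objective, np.linspace(0.0, 5.0, 51),
                           [0.0, 5.0], optimistic, budget=15)
```

Step 1 (choosing the surrogate and its prior) is fixed here by the kernel and the zero mean; swapping the `acquisition` argument changes the exploration behaviour without touching the loop.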

Surrogate Models and Gaussian Processes

  • Surrogate models estimate the unknown function.
  • GPs are flexible and provide uncertainty estimates.
  • Update the surrogate model using Bayes’ rule after each evaluation.

Acquisition Functions

Probability of Improvement (PI)

  • Chooses the point with the highest probability of improvement over the current best.
  • Mathematically, the next point is selected as \(\begin{aligned} x_{t+1} & = \text{argmax}_x\, \alpha_{PI}(x)\\ & = \text{argmax}_x\, P(f(x) \geq f(x^+) + \epsilon) \end{aligned}\)
  • Here \(x^+ = \text{argmax}_{x_i \in x_{1:t}} f(x_i)\) is the best point observed so far.
  • PI is simply the upper-tail probability of the surrogate posterior above the threshold \(f(x^+) + \epsilon\). With a GP surrogate, it has the closed form \(x_{t+1} = \text{argmax}_x\, \Phi\left(\frac{\mu_t(x) - f(x^+) - \epsilon}{\sigma_t(x)}\right)\), where \(\Phi\) is the standard normal CDF.
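
For a Gaussian posterior with mean \(\mu_t(x)\) and standard deviation \(\sigma_t(x)\), PI reduces to the standard normal CDF evaluated at \((\mu_t(x) - f(x^+) - \epsilon)/\sigma_t(x)\). A minimal sketch using only the standard library follows; the function names and the handling of \(\sigma = 0\) are assumptions made for the example.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def probability_of_improvement(mu, sigma, f_best, eps=0.0):
    """P(f(x) >= f_best + eps) for a Gaussian posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        # Assumed convention: with no uncertainty the outcome is deterministic.
        return 1.0 if mu >= f_best + eps else 0.0
    return normal_cdf((mu - f_best - eps) / sigma)
```

Raising `eps` lowers PI most strongly where `sigma` is small, which pushes the argmax toward uncertain regions.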

Intuition behind ϵ in PI: ϵ = 0.075

  • Looking at the graph above, we see that we reach the global maximum in a few iterations.
  • Our surrogate has a large uncertainty in x∈[2,4] in the first few iterations.
  • The acquisition function initially exploits regions of high promise, which leaves the region x∈[2,4] unsampled and therefore highly uncertain.
  • This observation also shows that we do not need to construct an accurate estimate of the black-box function to find its maximum.

Intuition behind ϵ in PI: ϵ = 3

  • We see that we made things worse!
  • Our model now uses ϵ=3, and we are unable to exploit when we land near the global maximum. Moreover, with high exploration, the setting becomes similar to active learning.
  • Our quick experiments above help us conclude that ϵ controls the degree of exploration in the PI acquisition function.

Expected Improvement: Introduction

  • Probability of improvement considers how likely an improvement is.
  • Expected Improvement (EI) considers how much we can improve.
  • Key Idea: Choose the next query point with the highest expected improvement over the current max \(f(x^+)\).

EI: Mathematical Formulation

  • Equation: \(x_{t+1} = \arg\min_x \mathbb{E} \left( ||h_{t+1}(x) - f(x^\star) || \ | \ \mathcal{D}_t \right)\)
  • Components:
    • \(f\): Actual ground truth function.
    • \(h_{t+1}\): Posterior mean of the surrogate at the \((t+1)^{\text{th}}\) timestep.
    • \(\mathcal{D}_t\): Training data.
    • \(x^\star\): Actual position where \(f\) takes the maximum value.

EI: Mockus’ Acquisition Function

  • Equation: \(x_{t+1} = \mathrm{argmax}_x \mathbb{E} \left( \max \{ 0, \ h_{t+1}(x) - f(x^+) \} \ | \ \mathcal{D}_t \right)\)
  • Components:
    • \(f(x^+)\): Maximum value encountered so far.

EI: Analytical Expression for GP Surrogate

  • Equation: \(EI(x)=\begin{cases}(\mu_t(x) - f(x^+) - \epsilon)\Phi(Z) + \sigma_t(x)\phi(Z), & \text{if}\ \sigma_t(x) > 0\\ 0, & \text{if}\ \sigma_t(x) = 0 \end{cases}\)

\(Z= \frac{\mu_t(x) - f(x^+) - \epsilon}{\sigma_t(x)}\)
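
The analytical expression translates directly into code. The following is a minimal sketch for a single candidate point, using only the standard library; the function names are assumptions made for the example.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(z):
    """Standard normal density, written phi(Z) in the formula."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def normal_cdf(z):
    """Standard normal CDF, written Phi(Z) in the formula."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def expected_improvement(mu, sigma, f_best, eps=0.0):
    """Closed-form EI for a Gaussian posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        return 0.0  # second case of the piecewise definition
    improvement = mu - f_best - eps
    z = improvement / sigma
    return improvement * normal_cdf(z) + sigma * normal_pdf(z)
```

The first term rewards a high predicted mean, the second rewards high uncertainty, so EI can be large even where the mean does not yet exceed \(f(x^+)\).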

EI: When is EI High?

  • EI is high when:
    • The expected value of \(\mu_t(x) - f(x^+)\) is high.
    • The uncertainty \(\sigma_t(x)\) around a point is high.

EI: Moderating Exploration with \(\epsilon\)

  • Adjusting \(\epsilon\) moderates exploration.
  • Examples:
    • \(\epsilon = 0.01\): Gets close to the global maximum in a few iterations.
    • \(\epsilon = 0.3\): More exploration, less exploitation near the global maximum.
    • \(\epsilon = 3\): Too much exploration; reaches the vicinity of the global maximum quickly but exploits it less.

EI: Visualizations

Intuition behind ϵ in EI: ϵ = 0.01

Intuition behind ϵ in EI: ϵ = 0.3

Intuition behind ϵ in EI: ϵ = 3

Thompson Sampling

  • Samples a function from the posterior and optimizes it.
  • Balances exploration and exploitation naturally.
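
One Thompson sampling step can be sketched as follows: draw a single function from the GP posterior on a candidate grid and query its maximizer. The zero-mean RBF GP, the grid-based sampling, the eigendecomposition used to draw the sample, and the name `thompson_step` are assumptions made for this illustration.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel between two 1-D arrays."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def thompson_step(x_obs, y_obs, x_grid, rng, ell=1.0, noise=1e-6):
    """Sample one posterior function on the grid and return its argmax."""
    K = rbf(x_obs, x_obs, ell) + noise * np.eye(len(x_obs))
    K_s = rbf(x_obs, x_grid, ell)
    mean = K_s.T @ np.linalg.solve(K, y_obs)
    cov = rbf(x_grid, x_grid, ell) - K_s.T @ np.linalg.solve(K, K_s)
    cov = 0.5 * (cov + cov.T)  # remove asymmetry from round-off
    # Draw mean + V sqrt(W) z; clipping guards against tiny negative eigenvalues.
    w, V = np.linalg.eigh(cov)
    z = rng.standard_normal(len(x_grid))
    sample = mean + V @ (np.sqrt(np.clip(w, 0.0, None)) * z)
    return x_grid[np.argmax(sample)], sample
```

Where the posterior is uncertain, the samples vary strongly from draw to draw and occasionally peak there (exploration); where it is confident, the samples follow the mean (exploitation).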

Intuition behind Thompson Sampling

Hyperparameter Tuning

  • Common use of Bayesian Optimization.
  • Examples: SVM, Random Forest, Neural Networks.
  • Bayesian Optimization efficiently searches hyperparameter space.

Example: SVM Hyperparameter Tuning

  • Optimize the SVM hyperparameters \(\gamma\) and \(C\) on a dataset.
  • Compare acquisition functions (PI, EI, UCB).

Summary

  • Bayesian Optimization is powerful for optimizing expensive black-box functions.
  • Key elements: Surrogate model (GP), Acquisition functions (PI, EI, UCB).
  • Applications in hyperparameter tuning for ML models.

Deep Kernel Learning - Combining Neural Networks and Gaussian Processes

The Big Question

MacKay’s Question (1998)

“How can Gaussian processes possibly replace neural networks? Have we thrown the baby out with the bathwater?”

  • Neural networks: Many design choices, lack of principled framework
  • Gaussian processes: Flexible, interpretable, principled learning
  • Can we combine the best of both worlds?

The Evolution of ML Paradigms

Neural Networks

  • Finite adaptive basis functions
  • Multiple layers of highly adaptive features
  • Automatic representation discovery
  • Inductive biases for specific domains

Key Insight

Neural networks can automatically discover meaningful representations in high-dimensional data

Gaussian Processes

  • Infinite fixed basis functions
  • Non-parametric flexibility
  • Automatic complexity calibration
  • Uncertainty quantification

Key Insight

GPs with expressive kernels can discover rich structure without human intervention

The Deep Kernel Learning Idea

Core Concept

Transform the inputs of a base kernel with a deep architecture to create scalable expressive closed-form kernels

Mathematical Formulation

\[k(\mathbf{x}_i, \mathbf{x}_j | \boldsymbol{\theta}) \rightarrow k(g(\mathbf{x}_i, \mathbf{w}), g(\mathbf{x}_j, \mathbf{w}) | \boldsymbol{\theta}, \mathbf{w})\]

Where:

  • \(g(\mathbf{x}, \mathbf{w})\) = deep architecture (CNN, DNN)
  • \(k(\cdot, \cdot | \boldsymbol{\theta})\) = base kernel (RBF, Spectral Mixture)
  • \(\boldsymbol{\gamma} = \{\mathbf{w}, \boldsymbol{\theta}\}\) = all parameters
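
The construction can be illustrated with a toy deep kernel in NumPy. This is a sketch under assumptions: \(g\) is a small fully connected network with tanh activations and random, untrained weights, and the base kernel is an RBF. The layer sizes and the helper names `init_weights`, `mlp_features`, and `deep_kernel` are hypothetical.

```python
import numpy as np

def init_weights(sizes, rng):
    """Random weights w for a fully connected network (illustrative only)."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_features(X, weights):
    """g(x, w): tanh hidden layers followed by a linear output layer."""
    h = X
    for i, (W, b) in enumerate(weights):
        h = h @ W + b
        if i < len(weights) - 1:
            h = np.tanh(h)
    return h

def rbf_kernel(A, B, ell=1.0):
    """Base kernel k(., . | theta) with theta = length scale ell."""
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.clip(sq, 0.0, None) / ell ** 2)

def deep_kernel(X1, X2, weights, ell=1.0):
    """k(g(x_i, w), g(x_j, w) | theta, w): base kernel on learned features."""
    return rbf_kernel(mlp_features(X1, weights), mlp_features(X2, weights), ell)
```

Because the base kernel is applied to deterministic features, the result is still a valid (symmetric, positive semi-definite) kernel for any setting of the weights.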

Key Benefits

  • Scalable: \(\mathcal{O}(n)\) training, \(\mathcal{O}(1)\) prediction
  • Expressive: Combines deep features with kernel flexibility
  • Non-parametric: Automatic complexity calibration
  • Unified: Joint learning through GP marginal likelihood

Deep Kernel Learning

Deep Kernel Architecture

Network Structure

  1. Input Layer: Raw data \(\mathbf{x}\)
  2. Hidden Layers: Deep transformation \(g(\mathbf{x}, \mathbf{w})\)
  3. Kernel Layer: Infinite basis functions via GP
  4. Output: Probabilistic predictions

Infinite Hidden Units

The GP with base kernel provides an infinite number of basis functions in the final layer

Learning Objective

Maximize the marginal likelihood:

\[\log p(\mathbf{y} | \boldsymbol{\gamma}, X) \propto -[\mathbf{y}^{\top}(K_{\boldsymbol{\gamma}}+\sigma^2 I)^{-1}\mathbf{y} + \log|K_{\boldsymbol{\gamma}} + \sigma^2 I|]\]

Gradients via chain rule: \[\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{\partial \mathcal{L}}{\partial K_{\boldsymbol{\gamma}}} \frac{\partial K_{\boldsymbol{\gamma}}}{\partial g(\mathbf{x},\mathbf{w})} \frac{\partial g(\mathbf{x},\mathbf{w})}{\partial \mathbf{w}}\]
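
For reference, the objective can be evaluated stably with a Cholesky factorization. The sketch below computes the full log marginal likelihood, including the factor \(\tfrac{1}{2}\) and the constant \(-\tfrac{n}{2}\log 2\pi\) that the proportional form above omits. The function name and argument layout are assumptions for the example; in practice the gradients would come from an automatic differentiation framework.

```python
import numpy as np

def log_marginal_likelihood(K, y, noise_var):
    """log p(y | gamma, X) for a zero-mean GP with kernel matrix K."""
    n = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    # alpha = (K + sigma^2 I)^{-1} y via two triangular solves
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = y @ alpha                          # y^T (K + sigma^2 I)^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))    # log |K + sigma^2 I|
    return -0.5 * (data_fit + log_det + n * np.log(2.0 * np.pi))
```

The data-fit term rewards kernels that explain the targets, while the log-determinant term penalizes overly flexible kernels; this trade-off is what the slides call automatic complexity calibration.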

Base Kernels: RBF vs Spectral Mixture

RBF Kernel

\[k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{1}{2} \frac{\|\mathbf{x}-\mathbf{x}'\|^2}{\ell^2}\right)\]

Properties:

  • Single length-scale parameter \(\ell\)
  • Smooth, stationary
  • Limited expressiveness

Spectral Mixture Kernel

\[k_{\text{SM}}(\mathbf{x}, \mathbf{x}' | \boldsymbol{\theta}) = \sum_{q=1}^Q a_q \frac{|\Sigma_q|^{1/2}}{(2\pi)^{D/2}} \exp\left(-\frac{1}{2} \|\Sigma_q^{1/2} (\mathbf{x}-\mathbf{x}')\|^2\right) \cos\langle \mathbf{x}-\mathbf{x}', 2\pi \boldsymbol{\mu}_q \rangle\]

Properties:

  • Multiple components \(Q\)
  • Quasi-periodic structure
  • Much more expressive
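
To make the formula concrete, here is a sketch of the spectral mixture kernel restricted to one input dimension (\(D = 1\)), where each \(\Sigma_q\) is a positive scalar. The parameter names `weights` (\(a_q\)), `inv_scales` (\(\Sigma_q\)), and `freqs` (\(\mu_q\)) are assumptions made for the example.

```python
import numpy as np

def spectral_mixture_kernel(x1, x2, weights, inv_scales, freqs):
    """1-D spectral mixture kernel following the formula above with D = 1."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    tau = x1[:, None] - x2[None, :]
    K = np.zeros_like(tau)
    for a_q, sigma_q, mu_q in zip(weights, inv_scales, freqs):
        norm = np.sqrt(sigma_q) / np.sqrt(2.0 * np.pi)  # |Sigma_q|^{1/2} / (2 pi)^{D/2}
        envelope = np.exp(-0.5 * sigma_q * tau ** 2)    # Gaussian decay
        K += a_q * norm * envelope * np.cos(2.0 * np.pi * mu_q * tau)
    return K
```

With a single component at zero frequency the cosine factor is 1 and the kernel reduces to a scaled RBF with \(\ell^2 = 1/\Sigma_q\); non-zero frequencies add the quasi-periodic structure.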

Scalability: KISS-GP

The Scalability Challenge

  • Standard GPs: \(\mathcal{O}(n^3)\) complexity
  • Goal: Linear scaling \(\mathcal{O}(n)\)

KISS-GP Approximation

\[K_{\boldsymbol{\gamma}} \approx M K^{\text{deep}}_{U,U} M^{\top} := K_{\text{KISS}}\]

Where:

  • \(M\) = sparse interpolation matrix (4 non-zero entries per row)
  • \(K_{U,U}\) = covariance over inducing points \(U\)
  • Kronecker + Toeplitz structure for fast MVMs
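
The interpolation idea can be checked numerically in one dimension. The sketch below is a simplification under stated assumptions: it uses linear interpolation (2 non-zero entries per row rather than the 4 mentioned above), a plain RBF kernel in place of the deep kernel, and dense matrices, so it shows the approximation but not the sparse or Toeplitz speed-ups. The function names are hypothetical.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel between two 1-D arrays."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def interpolation_matrix(x, u):
    """Linear interpolation weights M from a regular inducing grid u to points x."""
    h = u[1] - u[0]
    idx = np.clip(np.floor((x - u[0]) / h).astype(int), 0, len(u) - 2)
    frac = (x - u[idx]) / h
    M = np.zeros((len(x), len(u)))
    rows = np.arange(len(x))
    M[rows, idx] = 1.0 - frac      # weight of the left grid neighbour
    M[rows, idx + 1] = frac        # weight of the right grid neighbour
    return M

def ski_kernel(x, u, ell=1.0):
    """K_KISS = M K_UU M^T, the interpolated approximation of K."""
    M = interpolation_matrix(x, u)
    return M @ rbf(u, u, ell) @ M.T
```

Comparing `ski_kernel(x, u)` with the exact `rbf(x, x)` for a fine grid `u` shows the approximation error shrinking as the grid spacing decreases.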

Computational Benefits

  • Training: \(\mathcal{O}(n + h(m))\) where \(h(m) \approx \mathcal{O}(m)\)
  • Prediction: \(\mathcal{O}(1)\) per test point
  • Memory: \(\mathcal{O}(n)\) instead of \(\mathcal{O}(n^2)\)

Experimental Results: UCI Datasets

Key Finding

Deep Kernel Learning consistently outperforms both:

  • Standalone deep neural networks
  • Gaussian processes with expressive kernels

Performance Comparison

  • 16 UCI regression datasets
  • 2M+ training examples (Electric dataset)
  • DKL-SM achieves best performance on most datasets
  • Minimal runtime overhead (~10% additional cost)

Architecture Details

  • Small datasets (\(n < 6,000\)): \([d\text{-}1000\text{-}500\text{-}50\text{-}2]\)
  • Large datasets (\(n > 6,000\)): \([d\text{-}1000\text{-}1000\text{-}500\text{-}50\text{-}2]\)
  • SM kernel: \(Q=4\) (small), \(Q=6\) (large)

Learned Representations

Spectral Density Analysis

  • SM kernel: Discovers two peaks in frequency domain
  • RBF kernel: Single Gaussian, misses important correlations
  • Result: SM captures quasi-periodic structure better

Covariance Matrix Analysis

  • DKL kernels: Strong correlation for similar orientations
  • Standard RBF: Diffuse correlations
  • Metric learning: Learns orientation-aware similarity

Visualization

The learned metric correlates faces with similar rotation angles, overcoming Euclidean distance limitations

Scalability Analysis

Training Time Scaling

  • Linear scaling with data size \(n\)
  • Slope ≈ 1 in log-log plot
  • KISS-GP enables large-scale training

Runtime Comparison

  • DNN: ~7-5000s (depending on dataset size)
  • DKL: ~10-5000s (minimal overhead)
  • Additional cost: ~10% of DNN runtime

Key Benefit

Scalability enables learning from large datasets where expressive representations matter most

Step Function Recovery

Challenge

Recover step function with sharp discontinuities

Problem Characteristics

  • Multiple discontinuities
  • Sharp changes in covariance structure
  • Difficult for smooth kernels

Results

  • GP-RBF: Smooth, misses discontinuities
  • GP-SM: Better, but still smooth
  • DKL-SM: Accurately captures discontinuities with reasonable uncertainty

Key Advantage

DKL provides posterior predictive distributions useful for:

  • Reinforcement learning
  • Bayesian optimization
  • Uncertainty quantification

Key Contributions

1. Scalable Deep Kernels

  • Linear scaling \(\mathcal{O}(n)\) training
  • \(\mathcal{O}(1)\) prediction time
  • Retains non-parametric representation

2. Expressive Power

  • Combines deep architectures with kernel flexibility
  • Spectral mixture base kernels
  • Automatic complexity calibration

3. Unified Learning

  • Joint optimization through GP marginal likelihood
  • No separate pre-training required
  • Drop-in replacement for standard kernels

Summary

Deep Kernel Learning Successfully Combines:

Neural Networks

  • Automatic representation discovery
  • Inductive biases for specific domains
  • Scalable training procedures

Gaussian Processes

  • Non-parametric flexibility
  • Uncertainty quantification
  • Principled learning framework

Result

Scalable, expressive, and principled machine learning approach that consistently outperforms both paradigms alone

Discussion Points

  1. How does DKL compare to other kernel learning approaches?
  2. What are the computational trade-offs?
  3. When would you choose DKL over standalone DNNs or GPs?
  4. How does the choice of base kernel affect performance?

Why automate 4D‑STEM?

  • High‑dimensional data cube → costly in dose & time
  • Human ROI selection = biased & inefficient
  • Active learning can steer the probe to where information content is highest

From Bayesian Optimisation to Autonomous STEM

We have already covered

  1. Bayesian optimisation (BO)
  2. Deep‑Kernel Learning (DKL)

Today we see both in action inside the microscope.

\[\operatorname{EI}(\mathbf{x}) = \big(\mu(\mathbf{x}) - y^{+} - \xi\big)\,\Phi\!\left(\dfrac{\mu(\mathbf{x})-y^{+}-\xi}{\sigma(\mathbf{x})}\right) + \sigma(\mathbf{x})\,\phi\!\left(\dfrac{\mu(\mathbf{x})-y^{+}-\xi}{\sigma(\mathbf{x})}\right)\]

Here \(y^{+}\) is the best value observed so far and \(\xi\) is the exploration parameter, written \(f(x^+)\) and \(\epsilon\) in the earlier EI slides.

4D‑STEM primer

4D‑STEM schematic

DKL workflow for 4D‑STEM

Figure 1: DKL workflow for 4D-STEM: learning (a), prediction (b), and measurement (c). Features are HAADF-STEM image patches; targets are scalarized diffraction patterns from patch centers.
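
The feature side of this workflow, one image patch per candidate probe position, can be sketched as below. The patch size, the choice to skip border pixels, and the name `extract_patches` are assumptions for illustration; the original work may handle borders and patch geometry differently.

```python
import numpy as np

def extract_patches(image, patch_size):
    """Return square patches and their center coordinates (row, col).

    Assumes an odd patch_size; pixels too close to the border are skipped.
    """
    half = patch_size // 2
    n_rows, n_cols = image.shape
    centers = [(r, c)
               for r in range(half, n_rows - half)
               for c in range(half, n_cols - half)]
    patches = np.stack([image[r - half:r + half + 1, c - half:c + half + 1]
                        for r, c in centers])
    return patches, np.array(centers)
```

Each patch is a model input; the target for a visited center is the scalarized diffraction pattern measured at that position.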

Key idea

CNN‑based embedding → GP kernel → BO acquisition

Ground Truth Data Examples

Ground truth 4D-STEM data: single-layer and bilayer graphene with defects. HAADF-STEM images show selected ronchigrams and local patches. CoM-based scalar quantities (CoMx, CoMy, angle, magnitude) encode local electric fields.

Two 4D datasets demonstrate the approach:

  • Single-layer and bilayer graphene with defects and dopant atoms
  • Diffraction patterns recorded at all pixel positions
  • Center of mass (CoM) calculation from central beam
  • Relative CoM shifts → local electric field computation
  • Derived quantities: charge density and electric potential
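
The CoM-based scalar quantities can be computed from a single diffraction pattern as sketched below. This is a minimal illustration: the default reference position (the geometric center of the detector), the pixel units, and the function names are assumptions; converting the shift to an electric field requires calibration that is not shown here.

```python
import numpy as np

def center_of_mass(pattern):
    """Intensity-weighted mean position (x, y) of a 2-D diffraction pattern."""
    pattern = np.asarray(pattern, dtype=float)
    ky, kx = np.indices(pattern.shape)  # row index = y, column index = x
    total = pattern.sum()
    return (kx * pattern).sum() / total, (ky * pattern).sum() / total

def com_scalarizers(pattern, reference=None):
    """CoM shift relative to a reference, plus its magnitude and angle."""
    com_x, com_y = center_of_mass(pattern)
    if reference is None:
        n_rows, n_cols = np.shape(pattern)
        reference = ((n_cols - 1) / 2.0, (n_rows - 1) / 2.0)  # assumed beam center
    dx, dy = com_x - reference[0], com_y - reference[1]
    return {"CoMx": dx, "CoMy": dy,
            "magnitude": float(np.hypot(dx, dy)),
            "angle": float(np.arctan2(dy, dx))}
```

Any of the four returned values can serve as the scalar target that the DKL model learns to predict from the image patches.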

Damage Mitigation Strategy

Intelligent sampling reduces beam damage through:

  • Specimen protection: Between training steps, specimen is blocked from electron irradiation
  • Minimal sampling: Only a small fraction of available points are ever visited
  • HAADF-STEM pre-scan: Complete image space access for the model
  • Controlled dose: Small but unavoidable initial dose compared to full 4D acquisition

Reconstruction from 1 % of data

4D-STEM DKL experimental results on Nion UltraSTEM100. Shows CoM magnitude (a) and angle (b) scalarizers with HAADF images, acquisition functions, predictions, and uncertainties at key steps. Red points mark visited locations. Scale bars: 1 nm.

Tip

DKL recovers CoM‑magnitude map with nanometre detail from <1 % of pixels → 30‑fold dose reduction

Nanobeam strain mapping

Ground truth NBED strain mapping and DKL exploration pathway. Shows Bragg distance scalarizer results with acquisition function preferring boundary measurements.

Nanobeam electron diffraction (NBED) approach:

  • Smaller convergence angle → diffraction discs
  • Bragg disc centers scalarizer function
  • First-order diffraction only considered
  • Boundary preference in acquisition function
  • Vacuum measurements for uncertainty reduction

Note

Key insight: DKL learns to measure near boundaries where strain is highest, even without prior knowledge of material structure

Live autonomous experiment

DKL active learning on MnPS3 with DPC CoM scalarizer. Shows exploration pathway, predictions, and uncertainty at key steps. Periodic interference from sulfur vacancy generation. Scale bars: 5 and 2 nm.

MnPS₃ beam-sensitive material:

  • Layered van der Waals 2D material
  • Sulfur vacancy generation from electron beam
  • Hexagonal interference patterns in HAADF
  • Defocus imaging (−40 nm) for atomic contrast
  • 3% sampling → 30× dose reduction

DKL autonomous exploration:

  • CoM-magnitude vs. CoM-angle scalarizer
  • Model retraining every measurement (~2 s GPU)
  • Boundary preference in acquisition function
  • Structure-property learning from minimal data
  • Real-time adaptation to specimen changes

Tip

Key insight: DKL discovers ordered vacancy superstructures while protecting beam-sensitive specimens

Conclusions

Active learning for 4D-STEM imaging:

  • Deep kernel learning enables discovery of internal field behaviors
  • Demonstrated on twisted bilayer graphene and MnPS₃ patterns
  • Physics-based discovery of unique phenomena in quantum materials
  • Extensible approach for strongly correlated materials

Future opportunities:

  • Pre-trained VAE weights for faster DKL training
  • Physical deconvolution knowledge for latent space structuring
  • Conditioning strategies for more directed physical search
  • Interventional strategies via bootstrapping approaches

Note

Key insight: Active learning transforms 4D-STEM into an autonomous discovery platform for quantum materials research

Take‑aways

  • DL‑guided BO turns the microscope into an autonomous scientist
  • Massive dose/time savings without losing resolution
  • Works on physics‑derived targets (E‑field, strain, etc.)
  • Ready for integration with Edge/Cloud inference & reinforcement control

References

Kevin M. Roccapriore, Ondrej Dyck, Mark P. Oxley, Maxim Ziatdinov, & Sergei V. Kalinin. Automated experiment in 4D-STEM: Exploring emergent physics and structural behaviors. ACS Nano. https://doi.org/10.1021/acsnano.1c11118