Data Science for Electron Microscopy
Lecture 7: Gaussian Processes 2

Philipp Pelz

FAU Erlangen-Nürnberg

Bayesian Optimization & Active Learning

Introduction

Bayesian Optimization is a method for optimizing black-box functions that are expensive to evaluate.
Useful for tuning hyperparameters in machine learning models.

Gold Mining Analogy

Goal: Find the maximum gold content along a line with minimal drillings.
Two objectives:
1. Active Learning: Estimate gold distribution accurately.
2. Bayesian Optimization: Find the location of maximum gold content.

Initially, we have no idea about the gold distribution. We can learn the gold distribution by drilling at different locations. However, this drilling is costly. Thus, we want to minimize the number of drillings required while still finding the location of maximum gold quickly.

Active Learning

Aim: Minimize labeling costs while maximizing modeling accuracy.
Strategy: Label the point with the highest model uncertainty (variance).
Use Gaussian Process (GP) as a surrogate model for uncertainty estimates.

Each new data point updates our surrogate model, moving it closer to the ground truth. The black line and the grey shaded region indicate the mean μ and uncertainty μ±σ in our gold distribution estimate before and after drilling.
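The uncertainty-driven drilling strategy can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's implementation: the drill sites, gold values, zero-mean prior, RBF kernel, and length scale are all assumptions chosen for the example.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of locations."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, length_scale=1.0, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP (assumed prior)."""
    K = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_query, length_scale)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    # Prior variance k(x, x) is 1 for this kernel.
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

# Hypothetical drill sites and measured gold content (arbitrary units).
x_drilled = np.array([0.5, 3.0, 5.5])
y_drilled = np.array([2.1, 3.4, 1.0])
x_grid = np.linspace(0.0, 6.0, 61)

mu, sigma = gp_posterior(x_drilled, y_drilled, x_grid)

# Active learning: drill next where the surrogate is most uncertain.
next_site = x_grid[np.argmax(sigma)]
print("next drilling location:", next_site)
```

The selected location falls in a gap between existing drill sites, because the posterior standard deviation collapses near observed points and grows away from them.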

Bayesian Optimization

Aim: Find the maximum of an unknown function efficiently.
Balance exploration (unknown regions) and exploitation (known high-value regions).
Key component: Acquisition function.

To solve this problem, we follow this algorithm:

1. Choose a surrogate model for the true function f and define its prior.
2. Given the set of observations (function evaluations), use Bayes' rule to obtain the posterior.
3. Use an acquisition function α(x), which is a function of the posterior, to decide the next sample point.
4. Add the newly sampled data to the set of observations and go to step 2 until convergence or until the budget is exhausted.
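The loop above can be sketched as follows. Everything specific here is an assumption for illustration: the toy objective, the zero-mean GP with a unit length-scale RBF kernel, the candidate grid, and the acquisition α(x) = μ(x) + κσ(x) (an upper-confidence-bound rule with an assumed weight κ), which stands in for whichever acquisition function is chosen.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Condition a zero-mean GP prior on the observations (Bayes' rule)."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_query)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def objective(x):
    """Toy stand-in for the expensive black-box function f (hypothetical)."""
    return np.exp(-(x - 2.0) ** 2) + 0.5 * np.exp(-(x - 5.0) ** 2)

def bayes_opt(n_iter=10, kappa=2.0):
    grid = np.linspace(0.0, 6.0, 121)        # candidate sample points
    x_obs = np.array([0.5, 5.5])             # initial evaluations
    y_obs = objective(x_obs)
    for _ in range(n_iter):                  # evaluation budget
        mu, sigma = gp_posterior(x_obs, y_obs, grid)    # posterior
        alpha = mu + kappa * sigma                      # acquisition
        x_next = grid[np.argmax(alpha)]                 # next sample point
        x_obs = np.append(x_obs, x_next)                # augment observations
        y_obs = np.append(y_obs, objective(x_next))
    best = np.argmax(y_obs)
    return x_obs[best], y_obs[best]

x_best, y_best = bayes_opt()
print("best x:", x_best, "best f(x):", y_best)
```

A fixed iteration count plays the role of the budget; a convergence test on the change in the best observed value could replace it.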

Surrogate Models and Gaussian Processes

Surrogate models estimate the unknown function.
GPs are flexible and provide uncertainty estimates.
Update the surrogate model using Bayes’ rule after each evaluation.

Acquisition Functions

Probability of Improvement (PI)

Chooses the point with the highest probability of improvement over the current best.
Mathematically, we select the next point as
$\begin{aligned} x_{t+1} & = \mathrm{argmax}_x \, \alpha_{PI}(x)\\ & = \mathrm{argmax}_x \, P\big(f(x) \geq f(x^+) + \epsilon\big) \end{aligned}$
where $x^+ = \mathrm{argmax}_{x_i \in x_{1:t}} f(x_i)$ is the best point observed so far.
Here we are just finding the upper-tail probability (via the CDF) of the surrogate posterior. If we use a GP as the surrogate, the expression above becomes
$x_{t+1} = \mathrm{argmax}_x \, \Phi\left(\frac{\mu_t(x) - f(x^+) - \epsilon}{\sigma_t(x)}\right)$
where $\Phi$ is the standard normal CDF and $\mu_t$, $\sigma_t$ are the GP posterior mean and standard deviation.
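For a GP surrogate, PI at a candidate point is the standard normal CDF Φ evaluated at (μ − f(x⁺) − ϵ)/σ. The sketch below evaluates it for three hypothetical candidates; the posterior means, standard deviations, and current best value are made-up numbers, and the handling of σ = 0 is an assumed convention.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mu, sigma, f_best, eps=0.0):
    """PI at one candidate, given the GP posterior mean and std there."""
    if sigma <= 0.0:
        # No uncertainty: improvement is either certain or impossible.
        return 1.0 if mu >= f_best + eps else 0.0
    return normal_cdf((mu - f_best - eps) / sigma)

# Hypothetical (mean, std) pairs: A is confident, C is very uncertain.
candidates = {"A": (1.0, 0.1), "B": (0.8, 0.5), "C": (0.2, 1.0)}
f_best = 0.9

for eps in (0.0, 0.3):
    scores = {name: probability_of_improvement(m, s, f_best, eps)
              for name, (m, s) in candidates.items()}
    print("eps =", eps, "-> chosen candidate:", max(scores, key=scores.get))
```

With ϵ = 0 the confident candidate A wins; raising ϵ to 0.3 shifts the choice to the more uncertain candidate B, which is the exploration effect of ϵ.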

Intuition behind ϵ in PI: ϵ = 0.075

Looking at the graph above, we see that we reach the global maximum in a few iterations.
Our surrogate possesses a large uncertainty in x∈[2,4] in the first few iterations.
The acquisition function initially exploits regions with high promise, which leaves high uncertainty in the region x∈[2,4].
This observation also shows that we do not need to construct an accurate estimate of the black-box function to find its maximum.

Intuition behind ϵ in PI: ϵ = 3

We see that we made things worse!
Our model now uses ϵ=3, and we are unable to exploit when we land near the global maximum. Moreover, with high exploration, the setting becomes similar to active learning.
Our quick experiments above help us conclude that ϵ controls the degree of exploration in the PI acquisition function.

Expected Improvement: Introduction

Probability of improvement considers how likely an improvement is.
Expected Improvement (EI) considers how much we can improve.
Key Idea: Choose the next query point with the highest expected improvement over the current max $f(x^+)$.

EI: Mathematical Formulation

Equation: $x_{t+1} = \arg\min_x \mathbb{E} \left( ||h_{t+1}(x) - f(x^\star) || \ | \ \mathcal{D}_t \right)$
Components:
- $f$: Actual ground truth function.
- $h_{t+1}$: Posterior mean of the surrogate at the $(t+1)^{\text{th}}$ time step.
- $\mathcal{D}_t$: Training data.
- $x^\star$: Actual position where $f$ takes the maximum value.

EI: Mockus’ Acquisition Function

Equation: $x_{t+1} = \mathrm{argmax}_x \mathbb{E} \left( \max \{ 0, \ h_{t+1}(x) - f(x^+) \} \ | \ \mathcal{D}_t \right)$
Components:
- $ f(x^+) $: Maximum value encountered so far.

EI: Analytical Expression for GP Surrogate

Equation: $EI(x)=\begin{cases}(\mu_t(x) - f(x^+) - \epsilon)\Phi(Z) + \sigma_t(x)\phi(Z), & \text{if}\ \sigma_t(x) > 0\\ 0, & \text{if}\ \sigma_t(x) = 0 \end{cases}$

$Z= \frac{\mu_t(x) - f(x^+) - \epsilon}{\sigma_t(x)}$
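The closed-form expression above translates directly into code. This is a minimal scalar sketch using only the standard library; the numerical inputs in the usage lines are hypothetical.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_best, eps=0.0):
    """EI for a GP surrogate at one candidate point."""
    if sigma <= 0.0:
        return 0.0                      # second case of the formula
    delta = mu - f_best - eps
    z = delta / sigma
    return delta * normal_cdf(z) + sigma * normal_pdf(z)

f_best = 0.9
# Same mean below the current best, different uncertainty.
print(expected_improvement(0.5, 0.5, f_best))
print(expected_improvement(0.5, 2.0, f_best))
# Same uncertainty, higher mean.
print(expected_improvement(1.2, 0.5, f_best))
```

EI grows both when the posterior mean exceeds the current best and when the posterior standard deviation is large, so a point with a mediocre mean can still score well if it is uncertain enough.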

EI: When is EI High?

EI is high when:
- The expected value of $\mu_t(x) - f(x^+)$ is high.
- The uncertainty $\sigma_t(x)$ around a point is high.

EI: Moderating Exploration with $\epsilon$

Adjusting $\epsilon$ moderates exploration.
Examples:
- $\epsilon = 0.01$: Gets close to the global maximum in a few iterations.
- $\epsilon = 0.3$: More exploration, less exploitation near the global maximum.
- $\epsilon = 3$: Too much exploration. It reaches the vicinity of the global maximum quickly but exploits little once there.

EI: Visualizations

Intuition behind ϵ in EI: ϵ = 0.3

Intuition behind ϵ in PI: ϵ = 0.3

Intuition behind ϵ in PI: ϵ = 3

Thompson Sampling

Samples a function from the posterior and optimizes it.
Balances exploration and exploitation naturally.
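One Thompson sampling step can be sketched as: draw a single function from the GP posterior on a grid of candidates, then evaluate next at the maximizer of that draw. The observations, kernel, grid, and jitter values below are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def thompson_step(x_obs, y_obs, grid, rng, noise=1e-6):
    """Draw one function from the GP posterior and return its maximizer."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, grid)
    K_ss = rbf_kernel(grid, grid)
    mean = K_s.T @ np.linalg.solve(K, y_obs)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    # Small jitter keeps the covariance numerically positive semi-definite.
    cov = cov + 1e-8 * np.eye(len(grid))
    sample = rng.multivariate_normal(mean, cov)
    return grid[np.argmax(sample)], sample

rng = np.random.default_rng(0)
x_obs = np.array([0.5, 3.0, 5.5])       # hypothetical evaluated locations
y_obs = np.array([2.1, 3.4, 1.0])       # hypothetical function values
grid = np.linspace(0.0, 6.0, 61)

x_next, sample = thompson_step(x_obs, y_obs, grid, rng)
print("next evaluation at:", x_next)
```

Different random draws disagree most where the posterior is uncertain, which is what makes the method explore; where the posterior is tight, every draw agrees with the data and the method exploits.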

Intuition behind Thompson Sampling

Hyperparameter Tuning

Common use of Bayesian Optimization.
Examples: SVM, Random Forest, Neural Networks.
Bayesian Optimization efficiently searches hyperparameter space.

Example: SVM Hyperparameter Tuning

Optimize the SVM hyperparameters $\gamma$ and $C$ on a dataset.
Compare acquisition functions (PI, EI, UCB).

Summary

Bayesian Optimization is powerful for optimizing expensive black-box functions.
Key elements: Surrogate model (GP), Acquisition functions (PI, EI, UCB).
Applications in hyperparameter tuning for ML models.

Deep Kernel Learning - Combining Neural Networks and Gaussian Processes

The Big Question

MacKay’s Question (1998)

“How can Gaussian processes possibly replace neural networks? Have we thrown the baby out with the bathwater?”

Neural networks: Many design choices, lack of principled framework
Gaussian processes: Flexible, interpretable, principled learning
Can we combine the best of both worlds?

The Evolution of ML Paradigms

Neural Networks

Finite adaptive basis functions
Multiple layers of highly adaptive features
Automatic representation discovery
Inductive biases for specific domains

Key Insight

Neural networks can automatically discover meaningful representations in high-dimensional data

Gaussian Processes

Infinite fixed basis functions
Non-parametric flexibility
Automatic complexity calibration
Uncertainty quantification

Key Insight

GPs with expressive kernels can discover rich structure without human intervention

The Deep Kernel Learning Idea

Core Concept

Transform the inputs of a base kernel with a deep architecture to create scalable expressive closed-form kernels

Mathematical Formulation

\[k(\mathbf{x}_i, \mathbf{x}_j | \boldsymbol{\theta}) \rightarrow k(g(\mathbf{x}_i, \mathbf{w}), g(\mathbf{x}_j, \mathbf{w}) | \boldsymbol{\theta}, \mathbf{w})\]

Where:

$g(\mathbf{x}, \mathbf{w})$ = deep architecture (CNN, DNN)
$k(\cdot, \cdot | \boldsymbol{\theta})$ = base kernel (RBF, Spectral Mixture)
$\boldsymbol{\gamma} = \{\mathbf{w}, \boldsymbol{\theta}\}$ = all parameters
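A deep kernel is an ordinary base kernel applied to network outputs instead of raw inputs. The sketch below uses a small, randomly initialized tanh network as a stand-in for g(x, w) and an RBF base kernel; the layer widths, initialization, and input data are assumptions, and no training is performed.

```python
import numpy as np

def mlp_features(X, weights):
    """g(x, w): fully connected network with tanh hidden layers."""
    h = X
    for W, b in weights[:-1]:
        h = np.tanh(h @ W + b)
    W, b = weights[-1]
    return h @ W + b                     # linear output layer

def rbf_kernel(A, B, length_scale=1.0):
    """Base kernel k(., . | theta) on feature vectors."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.clip(sq, 0.0, None) / length_scale**2)

def deep_kernel(X1, X2, weights, length_scale=1.0):
    """k(g(x_i, w), g(x_j, w) | theta, w)."""
    return rbf_kernel(mlp_features(X1, weights),
                      mlp_features(X2, weights), length_scale)

rng = np.random.default_rng(0)
sizes = [8, 16, 2]                       # assumed architecture d-16-2
weights = [(rng.normal(scale=0.5, size=(m, n)), np.zeros(n))
           for m, n in zip(sizes[:-1], sizes[1:])]

X = rng.normal(size=(5, 8))              # five hypothetical 8-D inputs
K = deep_kernel(X, X, weights)
print(K.shape)
```

Because the base kernel is positive semi-definite for any inputs, composing it with a deterministic feature map still yields a valid kernel matrix.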

Key Benefits

Scalable: $\mathcal{O}(n)$ training, $\mathcal{O}(1)$ prediction
Expressive: Combines deep features with kernel flexibility
Non-parametric: Automatic complexity calibration
Unified: Joint learning through GP marginal likelihood

Deep Kernel Learning

Deep Kernel Architecture

Network Structure

Input Layer: Raw data $\mathbf{x}$
Hidden Layers: Deep transformation $g(\mathbf{x}, \mathbf{w})$
Kernel Layer: Infinite basis functions via GP
Output: Probabilistic predictions

Infinite Hidden Units

The GP with base kernel provides an infinite number of basis functions in the final layer

Learning Objective

Maximize the marginal likelihood:

\[\log p(\mathbf{y} | \boldsymbol{\gamma}, X) \propto -[\mathbf{y}^{\top}(K_{\boldsymbol{\gamma}}+\sigma^2 I)^{-1}\mathbf{y} + \log|K_{\boldsymbol{\gamma}} + \sigma^2 I|]\]

Gradients via chain rule: \[\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{\partial \mathcal{L}}{\partial K_{\boldsymbol{\gamma}}} \frac{\partial K_{\boldsymbol{\gamma}}}{\partial g(\mathbf{x},\mathbf{w})} \frac{\partial g(\mathbf{x},\mathbf{w})}{\partial \mathbf{w}}\]
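The bracketed term in the marginal likelihood can be evaluated stably with a Cholesky factorization. This sketch computes that term (to be minimized) for a plain RBF kernel on synthetic data; the data, noise variance, and length scales are assumptions, and in DKL the kernel matrix would come from the deep kernel instead.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale):
    diff = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def marginal_likelihood_objective(K, y, noise_var):
    """y^T (K + s^2 I)^{-1} y + log|K + s^2 I|; smaller is better."""
    n = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    # Two triangular systems give (K + s^2 I)^{-1} y.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = y @ alpha
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return data_fit + log_det

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 20)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)   # synthetic targets

for ell in (0.1, 1.0, 10.0):
    value = marginal_likelihood_objective(rbf_kernel(x, x, ell), y, noise_var=0.01)
    print("length scale", ell, "objective", value)
```

The first term rewards fitting the data and the log-determinant penalizes overly flexible kernels, which is the source of the automatic complexity calibration mentioned earlier.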

Base Kernels: RBF vs Spectral Mixture

RBF Kernel

\[k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{1}{2} \frac{\|\mathbf{x}-\mathbf{x}'\|^2}{\ell^2}\right)\]

Properties:

Single length-scale parameter $\ell$
Smooth, stationary
Limited expressiveness

Spectral Mixture Kernel

\[k_{\text{SM}}(\mathbf{x}, \mathbf{x}' | \boldsymbol{\theta}) = \sum_{q=1}^Q a_q \frac{|\Sigma_q|^{1/2}}{(2\pi)^{D/2}} \exp\left(-\frac{1}{2} \|\Sigma_q^{1/2} (\mathbf{x}-\mathbf{x}')\|^2\right) \cos\langle \mathbf{x}-\mathbf{x}', 2\pi \boldsymbol{\mu}_q \rangle\]

Properties:

Multiple components $Q$
Quasi-periodic structure
Much more expressive
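For one-dimensional inputs (D = 1) the spectral mixture kernel reduces to a sum of Gaussian envelopes multiplied by cosines. The sketch below implements that special case, with each Σ_q a scalar v_q; the component weights, variances, and frequencies are made-up values.

```python
import numpy as np

def spectral_mixture_kernel(x1, x2, weights, variances, means):
    """1-D spectral mixture kernel with Q components.

    weights   a_q  : component amplitudes
    variances v_q  : scalar Sigma_q (inverse squared length scales)
    means     mu_q : component frequencies
    """
    tau = x1[:, None] - x2[None, :]
    K = np.zeros_like(tau)
    for a, v, mu in zip(weights, variances, means):
        envelope = np.sqrt(v / (2.0 * np.pi)) * np.exp(-0.5 * v * tau**2)
        K += a * envelope * np.cos(2.0 * np.pi * mu * tau)
    return K

x = np.linspace(0.0, 4.0, 50)
# Hypothetical Q = 2 mixture: one slow trend, one oscillation at frequency 1.5.
K_sm = spectral_mixture_kernel(x, x,
                               weights=[1.0, 0.5],
                               variances=[0.25, 0.05],
                               means=[0.0, 1.5])
print(K_sm[0, :5])
```

A component with μ_q = 0 behaves like an RBF kernel, while nonzero frequencies add the quasi-periodic structure that a single RBF cannot express.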

Scalability: KISS-GP

The Scalability Challenge

Standard GPs: $\mathcal{O}(n^3)$ complexity
Goal: Linear scaling $\mathcal{O}(n)$

KISS-GP Approximation

\[K_{\boldsymbol{\gamma}} \approx M K^{\text{deep}}_{U,U} M^{\top} := K_{\text{KISS}}\]

Where:

$M$ = sparse interpolation matrix (4 non-zero entries per row)
$K^{\text{deep}}_{U,U}$ = deep kernel covariance over the inducing points $U$
Kronecker + Toeplitz structure for fast MVMs
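The interpolation idea can be checked numerically in one dimension. This sketch deliberately simplifies: it uses linear interpolation (2 non-zero entries per row rather than the 4 of cubic interpolation), dense matrices, a plain RBF kernel in place of the deep kernel, and no Kronecker or Toeplitz algebra. The data and grid are assumptions.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def interpolation_matrix(x, grid):
    """Linear interpolation weights from a regular inducing grid to the data."""
    h = grid[1] - grid[0]
    left = np.clip(np.floor((x - grid[0]) / h).astype(int), 0, len(grid) - 2)
    w = (x - grid[left]) / h
    M = np.zeros((len(x), len(grid)))
    rows = np.arange(len(x))
    M[rows, left] = 1.0 - w          # weight on the left grid neighbour
    M[rows, left + 1] = w            # weight on the right grid neighbour
    return M

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 6.0, size=200)  # hypothetical training inputs
U = np.linspace(0.0, 6.0, 61)        # regular grid of inducing points

M = interpolation_matrix(x, U)
K_exact = rbf_kernel(x, x)
K_kiss = M @ rbf_kernel(U, U) @ M.T  # K ~ M K_UU M^T

max_error = np.max(np.abs(K_exact - K_kiss))
print("largest entrywise error:", max_error)
```

In a real implementation M is stored as a sparse matrix and K_UU is never formed densely, so matrix-vector products with the approximate kernel cost far less than the dense products used here.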

Computational Benefits

Training: $\mathcal{O}(n + h(m))$ where $h(m) \approx \mathcal{O}(m)$
Prediction: $\mathcal{O}(1)$ per test point
Memory: $\mathcal{O}(n)$ instead of $\mathcal{O}(n^2)$

Experimental Results: UCI Datasets

Key Finding

Deep Kernel Learning consistently outperforms both:
- Standalone deep neural networks
- Gaussian processes with expressive kernels

Performance Comparison

16 UCI regression datasets
2M+ training examples (Electric dataset)
DKL-SM achieves best performance on most datasets
Minimal runtime overhead (~10% additional cost)

Architecture Details

Small datasets ($n < 6,000$): $[d\text{-}1000\text{-}500\text{-}50\text{-}2]$
Large datasets ($n > 6,000$): $[d\text{-}1000\text{-}1000\text{-}500\text{-}50\text{-}2]$
SM kernel: $Q=4$ (small), $Q=6$ (large)

Learned Representations

Spectral Density Analysis

SM kernel: Discovers two peaks in frequency domain
RBF kernel: Single Gaussian, misses important correlations
Result: SM captures quasi-periodic structure better

Covariance Matrix Analysis

DKL kernels: Strong correlation for similar orientations
Standard RBF: Diffuse correlations
Metric learning: Learns orientation-aware similarity

Visualization

The learned metric correlates faces with similar rotation angles, overcoming Euclidean distance limitations

Scalability Analysis

Training Time Scaling

Linear scaling with data size $n$
Slope ≈ 1 in log-log plot
KISS-GP enables large-scale training

Runtime Comparison

DNN: ~7-5000s (depending on dataset size)
DKL: ~10-5000s (minimal overhead)
Additional cost: ~10% of DNN runtime

Key Benefit

Scalability enables learning from large datasets where expressive representations matter most

Step Function Recovery

Challenge

Recover step function with sharp discontinuities

Problem Characteristics

Multiple discontinuities
Sharp changes in covariance structure
Difficult for smooth kernels

Results

GP-RBF: Smooth, misses discontinuities
GP-SM: Better, but still smooth
DKL-SM: Accurately captures discontinuities with reasonable uncertainty

Key Advantage

DKL provides posterior predictive distributions useful for:

Reinforcement learning
Bayesian optimization
Uncertainty quantification

Key Contributions

1. Scalable Deep Kernels

Linear scaling $\mathcal{O}(n)$ training
$\mathcal{O}(1)$ prediction time
Retains non-parametric representation

2. Expressive Power

Combines deep architectures with kernel flexibility
Spectral mixture base kernels
Automatic complexity calibration

3. Unified Learning

Joint optimization through GP marginal likelihood
No separate pre-training required
Drop-in replacement for standard kernels

Summary

Deep Kernel Learning Successfully Combines:

Neural Networks

Automatic representation discovery
Inductive biases for specific domains
Scalable training procedures

Gaussian Processes

Non-parametric flexibility
Uncertainty quantification
Principled learning framework

Result

Scalable, expressive, and principled machine learning approach that consistently outperforms both paradigms alone

Discussion Points

How does DKL compare to other kernel learning approaches?
What are the computational trade-offs?
When would you choose DKL over standalone DNNs or GPs?
How does the choice of base kernel affect performance?

Why automate 4D‑STEM?

High‑dimensional data cube → costly in dose & time
Human ROI selection = biased & inefficient
Active learning can steer the probe to where information content is highest

From Bayesian Optimisation to Autonomous STEM

We have already covered

Bayesian optimisation (BO)
Deep‑Kernel Learning (DKL)

Today we see both in action inside the microscope.

\[\operatorname{EI}(\mathbf{x}) = \big(\mu(\mathbf{x}) - y^{+} - \xi\big)\,\Phi\!\left(\dfrac{\mu(\mathbf{x})-y^{+}-\xi}{\sigma(\mathbf{x})}\right) + \sigma(\mathbf{x})\,\phi\!\left(\dfrac{\mu(\mathbf{x})-y^{+}-\xi}{\sigma(\mathbf{x})}\right)\]

4D‑STEM primer

DKL workflow for 4D‑STEM

Figure 1: DKL workflow for 4D-STEM: learning (a), prediction (b), and measurement (c). Features are HAADF-STEM image patches; targets are scalarized diffraction patterns from patch centers.

Key idea

CNN‑based embedding → GP kernel → BO acquisition

Ground Truth Data Examples

Ground truth 4D-STEM data: single-layer and bilayer graphene with defects. HAADF-STEM images show selected ronchigrams and local patches. CoM-based scalar quantities (CoMx, CoMy, angle, magnitude) encode local electric fields.

Two 4D datasets demonstrate the approach:

Single-layer and bilayer graphene with defects and dopant atoms
Diffraction patterns recorded at all pixel positions
Center of mass (CoM) calculation from central beam
Relative CoM shifts → local electric field computation
Derived quantities: charge density and electric potential
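The CoM scalarizers listed above can be sketched for a single diffraction pattern as an intensity-weighted centroid and its shift from a reference position. The synthetic Gaussian pattern, detector size, and reference position are assumptions; converting the shift into a calibrated electric field is not attempted here.

```python
import numpy as np

def center_of_mass(pattern):
    """Intensity-weighted centroid (row, col) of a 2-D diffraction pattern."""
    rows, cols = np.indices(pattern.shape)
    total = pattern.sum()
    return (rows * pattern).sum() / total, (cols * pattern).sum() / total

def com_scalarizers(pattern, reference):
    """CoM shift relative to a reference (row, col), in detector pixels."""
    cy, cx = center_of_mass(pattern)
    dy, dx = cy - reference[0], cx - reference[1]
    return {
        "CoMx": dx,
        "CoMy": dy,
        "magnitude": np.hypot(dx, dy),
        "angle": np.arctan2(dy, dx),     # radians
    }

# Synthetic central beam on a 64 x 64 detector, displaced from (32, 32).
rows, cols = np.indices((64, 64))
pattern = np.exp(-((rows - 33.5) ** 2 + (cols - 30.0) ** 2) / (2.0 * 4.0 ** 2))

scalars = com_scalarizers(pattern, reference=(32.0, 32.0))
print(scalars)
```

Applying this to the pattern recorded at each probe position gives one scalar map per quantity, and such a map is the kind of target a DKL model would be trained to predict from image patches.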

Damage Mitigation Strategy

Intelligent sampling reduces beam damage through:

Specimen protection: Between training steps, specimen is blocked from electron irradiation
Minimal sampling: Only a small fraction of available points are ever visited
HAADF-STEM pre-scan: Complete image space access for the model
Controlled dose: Small but unavoidable initial dose compared to full 4D acquisition

Reconstruction from 1 % of data

4D-STEM DKL experimental results on Nion UltraSTEM100. Shows CoM magnitude (a) and angle (b) scalarizers with HAADF images, acquisition functions, predictions, and uncertainties at key steps. Red points mark visited locations. Scale bars: 1 nm.

Tip

DKL recovers CoM‑magnitude map with nanometre detail from <1 % of pixels → 30‑fold dose reduction

Nanobeam strain mapping

Ground truth NBED strain mapping and DKL exploration pathway. Shows Bragg distance scalarizer results with acquisition function preferring boundary measurements.

Nanobeam electron diffraction (NBED) approach:

Smaller convergence angle → diffraction discs
Bragg disc centers scalarizer function
First-order diffraction only considered
Boundary preference in acquisition function
Vacuum measurements for uncertainty reduction

Note

Key insight: DKL learns to measure near boundaries where strain is highest, even without prior knowledge of material structure

Live autonomous experiment

DKL active learning on MnPS3 with DPC CoM scalarizer. Shows exploration pathway, predictions, and uncertainty at key steps. Periodic interference from sulfur vacancy generation. Scale bars: 5 and 2 nm.

MnPS₃ beam-sensitive material:

Layered van der Waals 2D material
Sulfur vacancy generation from electron beam
Hexagonal interference patterns in HAADF
Defocus imaging (−40 nm) for atomic contrast
3% sampling → 30× dose reduction

DKL autonomous exploration:

CoM-magnitude vs. CoM-angle scalarizer
Model retraining every measurement (~2 s GPU)
Boundary preference in acquisition function
Structure-property learning from minimal data
Real-time adaptation to specimen changes

Tip

Key insight: DKL discovers ordered vacancy superstructures while protecting beam-sensitive specimens

Conclusions

Active learning for 4D-STEM imaging:

Deep kernel learning enables discovery of internal field behaviors
Demonstrated on twisted bilayer graphene and MnPS₃ patterns
Physics-based discovery of unique phenomena in quantum materials
Extensible approach for strongly correlated materials

Future opportunities:

Pre-trained VAE weights for faster DKL training
Physical deconvolution knowledge for latent space structuring
Conditioning strategies for more directed physical search
Interventional strategies via bootstrapping approaches

Note

Key insight: Active learning transforms 4D-STEM into an autonomous discovery platform for quantum materials research

Take‑aways

DL‑guided BO turns the microscope into an autonomous scientist
Massive dose/time savings without losing resolution
Works on physics‑derived targets (E‑field, strain, etc.)
Ready for integration with Edge/Cloud inference & reinforcement control

References

Kevin M. Roccapriore, Ondrej Dyck, Mark P. Oxley, Maxim Ziatdinov, and Sergei V. Kalinin. Automated experiment in 4D-STEM: Exploring emergent physics and structural behaviors. ACS Nano. https://doi.org/10.1021/acsnano.1c11118

Distill Article on Bayesian Optimization
