Data Science for Electron Microscopy
Lecture 2: Optimization, Regression, Sensor Fusion

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg

Institute of Micro- and Nanostructure Research

Optimization and Deep Learning

  • Optimization and deep learning are closely related
  • Deep learning typically involves:
    • Defining a loss function
    • Using optimization to minimize the loss
  • Note: Most optimization algorithms minimize by convention
    • To maximize: simply flip the sign of the objective

Goals of Optimization vs Deep Learning

Optimization

  • Primary goal: Minimize objective function
  • Focus on training error
  • Direct mathematical approach

Deep Learning

  • Primary goal: Find suitable model
  • Focus on generalization error
  • Must handle finite data
  • Must prevent overfitting

Visualizing the Difference

Let’s examine the empirical risk (average loss on the training set) vs. the risk (expected loss over the underlying data distribution):

Code
%matplotlib inline
import d2l
import numpy as np
from mpl_toolkits import mplot3d
import torch

def f(x):
    return x * d2l.cos(np.pi * x)

def g(x):
    return f(x) + 0.2 * d2l.cos(5 * np.pi * x)

def annotate(text, xy, xytext):
    d2l.plt.gca().annotate(text, xy=xy, xytext=xytext,
                           arrowprops=dict(arrowstyle='->'))

x = d2l.arange(0.5, 1.5, 0.01)
d2l.set_figsize((4.5, 2.5))
d2l.plot(x, [f(x), g(x)], 'x', 'risk')
annotate('min of\nempirical risk', (1.0, -1.2), (0.5, -1.1))
annotate('min of risk', (1.1, -1.05), (0.95, -0.5))

Key Challenges in Deep Learning Optimization

  1. Local Minima
  2. Saddle Points
  3. Vanishing Gradients

Local Minima

  • Definition: Point where function value is smaller than nearby points
  • Global minimum: Smallest value over entire domain
  • Example function: \(f(x) = x \cdot \cos(\pi x)\)
Code
x = d2l.arange(-1.0, 2.0, 0.01)
d2l.plot(x, [f(x), ], 'x', 'f(x)')
annotate('local minimum', (-0.3, -0.25), (-0.77, -1.0))
annotate('global minimum', (1.1, -0.95), (0.6, 0.8))

Impact of Local Minima

  • Deep learning models often have many local optima
  • Gradient approaches zero near local minimum
  • Minibatch SGD can help escape local minima
    • Natural gradient variation provides “noise”
    • Can dislodge parameters from local minima

Saddle Points

  • Characteristics:
    • All gradients vanish
    • Neither global nor local minimum
  • Example: \(f(x) = x^3\)
    • First and second derivatives vanish at \(x=0\)
    • Optimization can stall here
Code
x = d2l.arange(-2.0, 2.0, 0.01)
d2l.plot(x, [x**3], 'x', 'f(x)')
annotate('saddle point', (0, -0.2), (-0.52, -5.0))

Higher-Dimensional Saddle Points

  • More complex in higher dimensions
  • Example: \(f(x,y) = x^2 - y^2\)
  • Has saddle point at \((0,0)\)
    • Maximum with respect to \(y\)
    • Minimum with respect to \(x\)
Code
x, y = d2l.meshgrid(
    d2l.linspace(-1.0, 1.0, 101), d2l.linspace(-1.0, 1.0, 101))
z = x**2 - y**2

ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.plot([0], [0], [0], 'rx')
ticks = [-1, 0, 1]
d2l.plt.xticks(ticks)
d2l.plt.yticks(ticks)
ax.set_zticks(ticks)
d2l.plt.xlabel('x')
d2l.plt.ylabel('y')

Hessian Matrix Analysis

For a \(k\)-dimensional input vector:

  • All positive eigenvalues → Local minimum
  • All negative eigenvalues → Local maximum
  • Mixed signs → Saddle point
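As a quick sanity check (plain PyTorch, not part of the lecture code), the saddle of \(f(x,y) = x^2 - y^2\) can be classified from the eigenvalues of its Hessian at the origin:

import torch

def f(xy):
    x, y = xy
    return x**2 - y**2

# Hessian of f at the critical point (0, 0)
H = torch.autograd.functional.hessian(f, torch.zeros(2))
print(torch.linalg.eigvalsh(H))  # tensor([-2., 2.]): mixed signs -> saddle point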

Vanishing Gradients

  • Most insidious optimization problem
  • Example: \(f(x) = \tanh(x)\)
    • At \(x = 4\): gradient ≈ 0.0013
    • Optimization stalls
  • Historical context:
    • Major challenge before ReLU activation
    • Made deep learning training difficult
Code
x = d2l.arange(-2.0, 5.0, 0.01)
d2l.plot(x, [d2l.tanh(x)], 'x', 'f(x)')
annotate('vanishing gradient', (4, 1), (2, 0.0))
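A one-line autograd check (separate from the plotting code above) reproduces the quoted gradient value:

import torch

x = torch.tensor(4.0, requires_grad=True)
torch.tanh(x).backward()
print(x.grad)  # tensor(0.0013), i.e. 1 - tanh(4)**2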

Summary

  • Key takeaways:
    • Training error minimization ≠ best generalization
    • Many local minima exist
    • Saddle points are common in non-convex problems
    • Vanishing gradients can stall optimization
  • Good news:
    • Robust algorithms exist
    • Perfect solutions not always necessary
    • Local optima can be useful
    • Many practical solutions available

Exercises

  1. Consider a simple MLP with a single hidden layer of \(d\) dimensions:

    • Show that for any local minimum there are at least \(d!\) equivalent solutions
    • Why does this happen?
  2. For a symmetric random matrix \(\mathbf{M}\):

    • Prove that eigenvalue distribution is symmetric
    • Why doesn’t this imply \(P(\lambda > 0) = 0.5\)?
  3. Additional challenges in deep learning optimization?

  4. Balancing a ball on a saddle:

    • Why is this hard?
    • How might this relate to optimization algorithms?

Convexity

  • Convexity is crucial for optimization algorithm design
  • Benefits:
    • Easier algorithm analysis and testing
    • Better understanding of deep learning optimization
    • Properties near local minima often resemble convex functions
  • Even nonconvex problems can benefit from convex analysis
Code
%matplotlib inline
import d2l
import numpy as np
from mpl_toolkits import mplot3d
import torch

Definitions

Convex Sets

  • A set \(\mathcal{X}\) is convex if:
    • For any \(a, b \in \mathcal{X}\)
    • Line segment connecting \(a\) and \(b\) is in \(\mathcal{X}\)
  • Mathematical definition: \[\lambda a + (1-\lambda) b \in \mathcal{X} \textrm{ whenever } a, b \in \mathcal{X}\] for all \(\lambda \in [0, 1]\)

Visual Examples

The first set is nonconvex and the other two are convex.

Set Operations

  • Intersections of convex sets are convex
  • Unions of convex sets need not be convex
  • Example: \(\mathbb{R}^d\) is convex
  • Bounded sets (e.g., balls) are often convex

The intersection between two convex sets is convex.

The union of two convex sets need not be convex.

Convex Functions

Definition

  • Function \(f: \mathcal{X} \to \mathbb{R}\) is convex if:
    • For all \(x, x' \in \mathcal{X}\)
    • For all \(\lambda \in [0, 1]\)
    • Satisfies: \(\lambda f(x) + (1-\lambda) f(x') \geq f(\lambda x + (1-\lambda) x')\)

Examples

Code
f = lambda x: 0.5 * x**2  # Convex
g = lambda x: d2l.cos(np.pi * x)  # Nonconvex
h = lambda x: d2l.exp(0.5 * x)  # Convex

x, segment = d2l.arange(-2, 2, 0.01), d2l.tensor([-1.5, 1])
d2l.use_svg_display()
_, axes = d2l.plt.subplots(1, 3, figsize=(9, 3))
for ax, func in zip(axes, [f, g, h]):
    d2l.plot([x, segment], [func(x), func(segment)], axes=ax)

Jensen’s Inequality

Definition

  • Generalization of convexity
  • For convex function \(f\): \[\sum_i \alpha_i f(x_i) \geq f\left(\sum_i \alpha_i x_i\right)\] \[E_X[f(X)] \geq f\left(E_X[X]\right)\]
  • Where \(\alpha_i \geq 0\) and \(\sum_i \alpha_i = 1\)
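A quick numerical sanity check of Jensen’s inequality with the convex function \(f = \exp\) (illustrative NumPy snippet):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100_000)
print(np.exp(X).mean())  # E[f(X)] ≈ e^{1/2} ≈ 1.65
print(np.exp(X.mean()))  # f(E[X]) ≈ 1.0, so E[f(X)] >= f(E[X])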

Applications

  • Bounding complex expressions
  • Log-likelihood of partially observed variables
  • Variational methods
  • Clustering algorithms

Properties

Local vs Global Minima

  • Local minima of convex functions are global minima
  • Proof by contradiction
  • Example: \(f(x) = (x-1)^2\)
Code
f = lambda x: (x - 1) ** 2
d2l.set_figsize((8,8))
d2l.plot([x, segment], [f(x), f(segment)], 'x', 'f(x)')

Below Sets

  • Given convex function \(f\) on convex set \(\mathcal{X}\)
  • Below set \(\mathcal{S}_b = \{x | x \in \mathcal{X} \textrm{ and } f(x) \leq b\}\) is convex
  • Proof uses definition of convexity

Second Derivatives

  • For twice-differentiable \(f: \mathbb{R}^n \rightarrow \mathbb{R}\)
  • Convex if and only if Hessian is positive semidefinite
  • One-dimensional case: \(f'' \geq 0\)
  • Multidimensional case: \(\nabla^2f \succeq 0\)

Constraints

Constrained Optimization

  • Form: \[\begin{aligned} \mathop{\textrm{minimize~}}_{\mathbf{x}} & f(\mathbf{x}) \\ \textrm{ subject to } & c_i(\mathbf{x}) \leq 0 \textrm{ for all } i \in \{1, \ldots, n\}\end{aligned}\]
  • Examples:
    • Unit ball constraint: \(c_1(\mathbf{x}) = \|\mathbf{x}\|_2 - 1\)
    • Half-space constraint: \(c_2(\mathbf{x}) = \mathbf{v}^\top \mathbf{x} + b\)

Lagrangian

  • Combines objective and constraints
  • Form: \(L(\mathbf{x}, \alpha_1, \ldots, \alpha_n) = f(\mathbf{x}) + \sum_{i=1}^n \alpha_i c_i(\mathbf{x})\)
  • Lagrange multipliers \(\alpha_i \geq 0\)
  • Saddle point optimization

Penalties

  • Alternative to exact constraint satisfaction
  • Add \(\alpha_i c_i(\mathbf{x})\) to objective
  • Example: weight decay
  • More robust than exact satisfaction
  • Works well for nonconvex problems

Projections

  • Projection on convex set \(\mathcal{X}\): \[\textrm{Proj}_\mathcal{X}(\mathbf{x}) = \mathop{\mathrm{argmin}}_{\mathbf{x}' \in \mathcal{X}} \|\mathbf{x} - \mathbf{x}'\|\]
  • Example: gradient clipping
  • Applications:
    • Sparse weight vectors
    • \(\ell_1\) ball projections
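As a minimal sketch (the helper name proj_l2_ball is illustrative), projection onto the \(\ell_2\) ball is exactly the operation behind gradient clipping:

import torch

def proj_l2_ball(x, radius=1.0):
    # Points inside the ball are unchanged; points outside are rescaled
    # onto the boundary, the closest point of the ball
    norm = torch.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

print(proj_l2_ball(torch.tensor([3.0, 4.0])))  # tensor([0.6000, 0.8000])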

Convex Projections.

Summary

  • Key properties:
    • Intersections of convex sets are convex
    • Jensen’s inequality for expectations
    • Hessian positive semidefinite for convex functions
    • Local minima are global minima
  • Constraint handling:
    • Lagrangian approach
    • Penalty methods
    • Projections
  • Applications in deep learning:
    • Algorithm motivation
    • Understanding optimization
    • Gradient descent analysis

Introduction

  • Gradient descent is fundamental to understanding optimization
  • Key concepts apply to more advanced algorithms
  • Important considerations:
    • Learning rate selection
    • Divergence issues
    • Preconditioning techniques

One-Dimensional Gradient Descent

Mathematical Foundation

  • For continuously differentiable \(f: \mathbb{R} \rightarrow \mathbb{R}\)
  • Taylor expansion: \[f(x + \epsilon) = f(x) + \epsilon f'(x) + \mathcal{O}(\epsilon^2)\]
  • Moving in negative gradient direction:
    • Choose \(\epsilon = -\eta f'(x)\)
    • Fixed step size \(\eta > 0\)
    • Results in: \(f(x - \eta f'(x)) \lessapprox f(x)\)

Implementation

%matplotlib inline
import d2l
import numpy as np
import torch

def f(x):  # Objective function
    return x ** 2

def f_grad(x):  # Gradient (derivative) of the objective function
    return 2 * x

Basic Gradient Descent

def gd(eta, f_grad):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x)
        results.append(float(x))
    print(f'epoch 10, x: {x:f}')
    return results

results = gd(0.2, f_grad)
epoch 10, x: 0.060466

Visualization

def show_trace(results, f):
    n = max(abs(min(results)), abs(max(results)))
    f_line = d2l.arange(-n, n, 0.01)
    d2l.set_figsize()
    d2l.plot([f_line, results], [[f(x) for x in f_line], [
        f(x) for x in results]], 'x', 'f(x)', fmts=['-', '-o'])

show_trace(results, f)

Learning Rate Effects

Too Small Learning Rate

  • Slow convergence
  • More iterations needed
  • Example with \(\eta = 0.05\):
show_trace(gd(0.05, f_grad), f)
epoch 10, x: 3.486784

Too Large Learning Rate

  • Solution oscillates
  • May diverge
  • Example with \(\eta = 1.1\):
show_trace(gd(1.1, f_grad), f)
epoch 10, x: 61.917364

Local Minima

  • Nonconvex functions have multiple minima
  • Example: \(f(x) = x \cdot \cos(cx)\)
  • High learning rates can lead to poor local minima
c = d2l.tensor(0.15 * np.pi)

def f(x):  # Objective function
    return x * d2l.cos(c * x)

def f_grad(x):  # Gradient of the objective function
    return d2l.cos(c * x) - c * x * d2l.sin(c * x)

show_trace(gd(2, f_grad), f)
epoch 10, x: -1.528166

Multivariate Gradient Descent

Mathematical Foundation

  • For \(f: \mathbb{R}^d \to \mathbb{R}\)
  • Gradient vector: \(\nabla f(\mathbf{x}) = [\frac{\partial f(\mathbf{x})}{\partial x_1}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_d}]^\top\)
  • Taylor expansion: \[f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \mathbf{\boldsymbol{\epsilon}}^\top \nabla f(\mathbf{x}) + \mathcal{O}(\|\boldsymbol{\epsilon}\|^2)\]
  • Update rule: \(\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f(\mathbf{x})\)

Implementation

def train_2d(trainer, steps=20, f_grad=None):
    """Optimize a 2D objective function with a customized trainer."""
    x1, x2, s1, s2 = -5, -2, 0, 0
    results = [(x1, x2)]
    for i in range(steps):
        if f_grad:
            x1, x2, s1, s2 = trainer(x1, x2, s1, s2, f_grad)
        else:
            x1, x2, s1, s2 = trainer(x1, x2, s1, s2)
        results.append((x1, x2))
    print(f'epoch {i + 1}, x1: {float(x1):f}, x2: {float(x2):f}')
    return results

def show_trace_2d(f, results):
    """Show the trace of 2D variables during optimization."""
    d2l.set_figsize()
    d2l.plt.plot(*zip(*results), '-o', color='#ff7f0e')
    x1, x2 = d2l.meshgrid(d2l.arange(-5.5, 1.0, 0.1),
                          d2l.arange(-3.0, 1.0, 0.1), indexing='ij')
    d2l.plt.contour(x1, x2, f(x1, x2), colors='#1f77b4')
    d2l.plt.xlabel('x1')
    d2l.plt.ylabel('x2')

Example: Quadratic Function

def f_2d(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_2d_grad(x1, x2):  # Gradient of the objective function
    return (2 * x1, 4 * x2)

def gd_2d(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    return (x1 - eta * g1, x2 - eta * g2, 0, 0)

eta = 0.1
show_trace_2d(f_2d, train_2d(gd_2d, f_grad=f_2d_grad))
epoch 20, x1: -0.057646, x2: -0.000073

Adaptive Methods

Newton’s Method

  • Uses second-order information
  • Taylor expansion with Hessian: \[f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + \frac{1}{2} \boldsymbol{\epsilon}^\top \nabla^2 f(\mathbf{x}) \boldsymbol{\epsilon} + \mathcal{O}(\|\boldsymbol{\epsilon}\|^3)\]
  • Update rule: \(\boldsymbol{\epsilon} = -\mathbf{H}^{-1} \nabla f(\mathbf{x})\)

Implementation

c = d2l.tensor(0.5)

def f(x):  # Objective function
    return d2l.cosh(c * x)

def f_grad(x):  # Gradient of the objective function
    return c * d2l.sinh(c * x)

def f_hess(x):  # Hessian of the objective function
    return c**2 * d2l.cosh(c * x)

def newton(eta=1):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x) / f_hess(x)
        results.append(float(x))
    print('epoch 10, x:', x)
    return results

show_trace(newton(), f)
epoch 10, x: tensor(0.)

Nonconvex Example

c = d2l.tensor(0.15 * np.pi)

def f(x):  # Objective function
    return x * d2l.cos(c * x)

def f_grad(x):  # Gradient of the objective function
    return d2l.cos(c * x) - c * x * d2l.sin(c * x)

def f_hess(x):  # Hessian of the objective function
    return - 2 * c * d2l.sin(c * x) - x * c**2 * d2l.cos(c * x)

show_trace(newton(0.5), f)
epoch 10, x: tensor(7.2699)

Preconditioning

Key Concepts

  • Avoid full Hessian computation
  • Use diagonal entries only
  • Update rule: \(\mathbf{x} \leftarrow \mathbf{x} - \eta \textrm{diag}(\mathbf{H})^{-1} \nabla f(\mathbf{x})\)
  • Benefits:
    • Different learning rates per variable
    • Handles scale mismatches
    • More efficient than full Newton’s method
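A minimal sketch of a diagonally preconditioned step on an ill-scaled quadratic (the function and values are illustrative, not from the lecture):

import torch

# f(x) = 0.5 * (x1^2 + 100 * x2^2): curvatures differ by a factor of 100
x = torch.tensor([10.0, 10.0])
hess_diag = torch.tensor([1.0, 100.0])  # diagonal of the (constant) Hessian
eta = 1.0
for _ in range(3):
    grad = hess_diag * x             # gradient of the quadratic
    x = x - eta * grad / hess_diag   # per-coordinate learning rates
print(x)  # tensor([0., 0.]): both coordinates converge despite the scale gap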

Summary

  • Learning rate selection is crucial
  • Local minima can trap gradient descent
  • High dimensions require careful learning rate adjustment
  • Preconditioning helps with scale issues
  • Newton’s method:
    • Fast convergence for convex problems
    • Requires careful handling for nonconvex problems
    • Computationally expensive for large problems

Exercises

  1. Experiment with different learning rates and objective functions
  2. Implement line search for convex optimization
  3. Design a slow-converging 2D objective function
  4. Implement lightweight Newton’s method with preconditioning
  5. Test algorithms on rotated coordinate systems

Stochastic Gradient Descent

  • Previously used SGD without detailed explanation
  • Now diving deeper into its principles
  • Building on gradient descent fundamentals
  • Understanding why and how it works
Code
%matplotlib inline
import d2l
import math
import torch

Stochastic Gradient Updates

Objective Function

  • Training dataset with \(n\) examples
  • Loss function \(f_i(\mathbf{x})\) for example \(i\)
  • Overall objective: \[f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n f_i(\mathbf{x})\]
  • Full gradient: \[\nabla f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x})\]

Computational Cost

  • Gradient descent: \(\mathcal{O}(n)\) per iteration
  • SGD: \(\mathcal{O}(1)\) per iteration
  • Update rule: \[\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x})\]
  • Unbiased estimate: \[\mathbb{E}_i \nabla f_i(\mathbf{x}) = \nabla f(\mathbf{x})\]

Implementation

def f(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_grad(x1, x2):  # Gradient of the objective function
    return 2 * x1, 4 * x2
def sgd(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    # Simulate noisy gradient
    g1 += torch.normal(0.0, 1, (1,)).item()
    g2 += torch.normal(0.0, 1, (1,)).item()
    eta_t = eta * lr()
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)
def constant_lr():
    return 1

eta = 0.1
lr = constant_lr  # Constant learning rate
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))
epoch 50, x1: 0.212983, x2: 0.064037

Dynamic Learning Rate

Learning Rate Strategies

  • Piecewise constant: \(\eta(t) = \eta_i \textrm{ if } t_i \leq t \leq t_{i+1}\)
  • Exponential decay: \(\eta(t) = \eta_0 \cdot e^{-\lambda t}\)
  • Polynomial decay: \(\eta(t) = \eta_0 \cdot (\beta t + 1)^{-\alpha}\)
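The exponential and polynomial schedules are implemented below; a piecewise constant schedule in the same style might look like this (boundaries and rates are illustrative):

def piecewise_lr(t, boundaries=(100, 500), rates=(1.0, 0.1, 0.01)):
    # Return eta_i for the interval t_i <= t < t_{i+1}
    for boundary, rate in zip(boundaries, rates):
        if t < boundary:
            return rate
    return rates[-1]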

Exponential Decay Implementation

def exponential_lr():
    global t
    t += 1
    return math.exp(-0.1 * t)

t = 1
lr = exponential_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=1000, f_grad=f_grad))
epoch 1000, x1: -0.717273, x2: -0.023917

Polynomial Decay Implementation

def polynomial_lr():
    global t
    t += 1
    return (1 + 0.1 * t) ** (-0.5)

t = 1
lr = polynomial_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))
epoch 50, x1: 0.060004, x2: -0.020514

Stochastic Gradients and Finite Samples

Sampling Strategies

  • With replacement:
    • Probability that a given example is chosen at least once in \(n\) draws: \(1 - (1 - 1/n)^n \approx 1 - e^{-1} \approx 0.63\) (see the simulation after this list)
    • Increased variance
    • Decreased data efficiency
  • Without replacement:
    • Better variance properties
    • More efficient data usage
    • Default choice in practice
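A quick simulation (illustrative) reproduces the with-replacement coverage figure:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Draw n indices with replacement and count the fraction of distinct examples
frac_seen = np.unique(rng.integers(0, n, size=n)).size / n
print(frac_seen)  # ≈ 0.632, matching 1 - e^{-1}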

Summary

  • Key points:
    • SGD reduces computational cost to \(\mathcal{O}(1)\)
    • Learning rate scheduling is crucial
    • Convergence guarantees for convex problems
    • Sampling without replacement preferred
  • Practical considerations:
    • Dynamic learning rates
    • Trade-offs in sampling strategies
    • Nonconvex optimization challenges

Exercises

  1. Experiment with learning rate schedules
  2. Analyze noise in gradient updates
  3. Compare sampling strategies
  4. Investigate gradient coordinate scaling
  5. Study local minima in nonconvex functions

Minibatch Stochastic Gradient Descent

  • Two extremes in gradient-based learning:
    • Full dataset (gradient descent)
    • Single examples (stochastic gradient descent)
  • Each approach has drawbacks:
    • Gradient descent: Not data efficient for similar data
    • SGD: Not computationally efficient (poor vectorization)
  • Minibatch SGD offers a middle ground

Vectorization and Caches

Hardware Considerations

  • Multiple GPUs and servers require larger minibatches
    • 8 GPUs × 16 servers = minimum batch size of 128
  • Single GPU/CPU considerations:
    • Multiple memory types (registers, L1/L2/L3 cache)
    • Different bandwidth constraints
    • Memory access patterns matter

Performance Metrics

  • Modern CPU capabilities:
    • 2GHz CPU with 16 cores and AVX-512
    • Can process up to 10¹² bytes/second
  • GPU capabilities:
    • 100× better than CPU
  • Memory bandwidth limitations:
    • Midrange server: ~100 GB/s
    • Memory access width: 64-384 bit

Matrix Multiplication Strategies

Different Approaches

  1. Element-wise computation
  2. Column-wise computation
  3. Full matrix multiplication
  4. Block-wise computation

Performance Comparison

import d2l
import torch
import time
import numpy as np

class Timer:
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the timer."""
        self.tik = time.time()

    def stop(self):
        """Stop the timer and record the time in a list."""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        """Return the average time."""
        return sum(self.times) / len(self.times)

    def sum(self):
        """Return the sum of time."""
        return sum(self.times)

    def cumsum(self):
        """Return the accumulated time."""
        return torch.tensor(self.times).cumsum(dim=0).tolist()

# Initialize matrices
A = torch.zeros(256, 256)
B = torch.randn(256, 256)
C = torch.randn(256, 256)
timer = Timer()

Element-wise Computation

# Compute A = BC one element at a time
timer.start()
for i in range(256):
    for j in range(256):
        A[i, j] = torch.dot(B[i, :], C[:, j])
timer.stop()
2.174051523208618

Column-wise Computation

# Compute A = BC one column at a time
timer.start()
for j in range(256):
    A[:, j] = torch.mv(B, C[:, j])
timer.stop()
0.016129732131958008

Full Matrix Multiplication

# Compute A = BC in one go
timer.start()
A = torch.mm(B, C)
timer.stop()

gigaflops = [0.03 / i for i in timer.times]
print(f'performance in Gigaflops: element {gigaflops[0]:.3f}, '
      f'column {gigaflops[1]:.3f}, full {gigaflops[2]:.3f}')
performance in Gigaflops: element 0.014, column 1.860, full 35.217

Minibatch Processing

Why Use Minibatches?

  • Computational efficiency
  • Statistical properties:
    • Maintains gradient expectation
    • Reduces gradient standard deviation by a factor of \(b^{-\frac{1}{2}}\) (see the quick simulation after this list)
    • \(b\) = batch size
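A quick simulation (illustrative) of how averaging \(b\) noisy gradients shrinks their spread:

import numpy as np

rng = np.random.default_rng(0)
noisy_grads = rng.normal(loc=1.0, scale=1.0, size=120_000)
for b in [1, 10, 100]:
    batch_means = noisy_grads.reshape(-1, b).mean(axis=1)
    print(b, batch_means.std())  # standard deviation falls roughly as b**-0.5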

Batch Size Trade-offs

  • Too small:
    • Poor computational efficiency
    • High variance
  • Too large:
    • Diminishing returns in variance reduction
    • Memory constraints
  • Optimal: Balance between:
    • Computational efficiency
    • Statistical efficiency
    • Available memory

Implementation

Data Loading

#@save
d2l.DATA_HUB['airfoil'] = (d2l.DATA_URL + 'airfoil_self_noise.dat',
                           '76e5be1548fd8222e5074cf0faae75edff8cf93f')

#@save
def get_data_ch11(batch_size=10, n=1500):
    data = np.genfromtxt(d2l.download('airfoil'),
                         dtype=np.float32, delimiter='\t')
    data = torch.from_numpy((data - data.mean(axis=0)) / data.std(axis=0))
    data_iter = d2l.load_array((data[:n, :-1], data[:n, -1]),
                               batch_size, is_train=True)
    return data_iter, data.shape[1]-1

Training Function

#@save
def sgd(params, states, hyperparams):
    for p in params:
        p.data.sub_(hyperparams['lr'] * p.grad)
        p.grad.data.zero_()

Training Function

def train_ch11(trainer_fn, states, hyperparams, data_iter,
               feature_dim, num_epochs=2):
    # Initialization
    w = torch.normal(mean=0.0, std=0.01, size=(feature_dim, 1),
                     requires_grad=True)
    b = torch.zeros((1), requires_grad=True)
    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
    # Train
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[0, num_epochs], ylim=[0.22, 0.35])
    n, timer = 0, d2l.Timer()
    for _ in range(num_epochs):
        for X, y in data_iter:
            l = loss(net(X), y).mean()
            l.backward()
            trainer_fn([w, b], states, hyperparams)
            n += X.shape[0]
            if n % 200 == 0:
                timer.stop()
                animator.add(n/X.shape[0]/len(data_iter),
                             (d2l.evaluate_loss(net, data_iter, loss),))
                timer.start()
    print(f'loss: {animator.Y[0][-1]:.3f}, {timer.sum()/num_epochs:.3f} sec/epoch')
    return timer.cumsum(), animator.Y[0]

Performance Comparison

Different Batch Sizes

def train_sgd(lr, batch_size, num_epochs=2):
    data_iter, feature_dim = get_data_ch11(batch_size)
    return train_ch11(
        sgd, None, {'lr': lr}, data_iter, feature_dim, num_epochs)

# Compare different approaches
gd_res = train_sgd(1, 1500, 10)  # Full batch
sgd_res = train_sgd(0.005, 1)    # Single example
d2l.set_figsize([6, 3])
d2l.plot(*list(map(list, zip(gd_res, sgd_res))),
         'time (sec)', 'loss', xlim=[1e-2, 10],
         legend=['gd', 'sgd'])
d2l.plt.gca().set_xscale('log')
loss: 0.243, 0.520 sec/epoch

mini1_res = train_sgd(.4, 100)   # Medium batch
mini2_res = train_sgd(.05, 10)   # Small batch

# Plot results
d2l.set_figsize([6, 3])
d2l.plot(*list(map(list, zip(mini1_res, mini2_res))),
         'time (sec)', 'loss', xlim=[1e-2, 10],
         legend=['batch size=100', 'batch size=10'])
d2l.plt.gca().set_xscale('log')
loss: 0.245, 0.072 sec/epoch

Summary

  • Vectorization benefits:
    • Reduced framework overhead
    • Better memory locality
    • Improved caching
  • Minibatch SGD advantages:
    • Computational efficiency
    • Statistical efficiency
    • Memory efficiency
  • Key considerations:
    • Batch size selection
    • Learning rate decay
    • Hardware constraints

Exercises

  1. Experiment with different batch sizes and learning rates
  2. Implement learning rate decay
  3. Compare with replacement sampling
  4. Analyze behavior with duplicated data

Momentum in Optimization

  • Momentum is a key optimization technique in deep learning
  • Addresses challenges in stochastic gradient descent:
    • Learning rate sensitivity
    • Convergence issues
    • Noise handling
  • Particularly effective for ill-conditioned problems

Basics of Momentum

Leaky Averages

  • Minibatch SGD averages gradients to reduce variance
  • Momentum extends this concept using “leaky averages”:
    • Accumulates past gradients
    • Weights recent gradients more heavily
    • Formula: \(\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \mathbf{g}_{t, t-1}\)
    • \(\beta \in (0, 1)\) controls the “memory” of past gradients
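Unrolling the recursion gives \(\mathbf{v}_t = \sum_{\tau=0}^{t-1} \beta^\tau \mathbf{g}_{t-\tau, t-\tau-1}\), an exponentially weighted sum of past gradients; a tiny numeric check (illustrative):

beta = 0.9
grads = [1.0, 1.0, 1.0, 1.0]  # a constant gradient, for simplicity
v = 0.0
for g in grads:
    v = beta * v + g
print(v)                               # 3.439
print(sum(beta**t for t in range(4)))  # identical closed-form sum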

Key Benefits

  • Accelerates convergence
  • Particularly effective for:
    • Ill-conditioned problems
    • Narrow canyons in optimization landscape
  • Provides more stable descent directions
  • Works well with both:
    • Noise-free convex problems
    • Stochastic gradient descent

Visualizing the Problem

Let’s examine an ill-conditioned problem:

%matplotlib inline
import d2l
import torch

eta = 0.4
def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2
def gd_2d(x1, x2, s1, s2):
    return (x1 - eta * 0.2 * x1, x2 - eta * 4 * x2, 0, 0)

d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))
epoch 20, x1: -0.943467, x2: -0.000073

The Challenge

  • Function \(f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2\) is very flat in \(x_1\) direction
  • Gradient in \(x_2\) direction:
    • Much higher
    • Changes more rapidly
  • Trade-off in learning rate:
    • Small rate: Slow convergence in \(x_1\)
    • Large rate: Divergence in \(x_2\)

Momentum Method

Update Equations

\[ \begin{aligned} \mathbf{v}_t &\leftarrow \beta \mathbf{v}_{t-1} + \mathbf{g}_{t, t-1}, \\ \mathbf{x}_t &\leftarrow \mathbf{x}_{t-1} - \eta_t \mathbf{v}_t. \end{aligned} \]

Implementation

def momentum_2d(x1, x2, v1, v2):
    v1 = beta * v1 + 0.2 * x1
    v2 = beta * v2 + 4 * x2
    return x1 - eta * v1, x2 - eta * v2, v1, v2

eta, beta = 0.6, 0.5
d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))
epoch 20, x1: 0.007188, x2: 0.002553

Effect of Momentum Parameter

  • \(\beta = 0.5\): Good convergence
  • \(\beta = 0.25\): Barely converges but better than no momentum
  • \(\beta = 0\): Reduces to regular gradient descent
eta, beta = 0.6, 0.25
d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))
epoch 20, x1: -0.126340, x2: -0.186632

Effective Sample Weight

  • Sum of weights: \(\sum_{\tau=0}^\infty \beta^\tau = \frac{1}{1-\beta}\)
  • Step size effectively becomes \(\frac{\eta}{1-\beta}\)
  • Better behaved descent direction
d2l.set_figsize()
betas = [0.95, 0.9, 0.6, 0]
for beta in betas:
    x = d2l.numpy(d2l.arange(40))
    d2l.plt.plot(x, beta ** x, label=f'beta = {beta:.2f}')
d2l.plt.xlabel('time')
d2l.plt.legend();

Practical Implementation

From Scratch

def init_momentum_states(feature_dim):
    v_w = d2l.zeros((feature_dim, 1))
    v_b = d2l.zeros(1)
    return (v_w, v_b)

def sgd_momentum(params, states, hyperparams):
    for p, v in zip(params, states):
        with torch.no_grad():
            v[:] = hyperparams['momentum'] * v + p.grad
            p[:] -= hyperparams['lr'] * v
        p.grad.data.zero_()

Training with Different Parameters

def train_momentum(lr, momentum, num_epochs=2):
    d2l.train_ch11(sgd_momentum, init_momentum_states(feature_dim),
                   {'lr': lr, 'momentum': momentum}, data_iter,
                   feature_dim, num_epochs)

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
train_momentum(0.02, 0.5)
loss: 0.244, 0.089 sec/epoch

Theoretical Analysis

Quadratic Convex Functions

  • General form: \(h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b\)
  • For positive definite \(\mathbf{Q}\):
    • Minimizer at \(\mathbf{x}^* = -\mathbf{Q}^{-1} \mathbf{c}\)
    • Minimum value: \(b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}\)
  • Gradient: \(\partial_{\mathbf{x}} h(\mathbf{x}) = \mathbf{Q}\mathbf{x} + \mathbf{c} = \mathbf{Q} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c})\)
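A quick numeric check of the minimizer and minimum value (\(\mathbf{Q}\) and \(\mathbf{c}\) are illustrative):

import torch

Q = torch.tensor([[2.0, 0.5], [0.5, 1.0]])  # positive definite
c = torch.tensor([1.0, -1.0])
b = 0.0

x_star = -torch.linalg.solve(Q, c)              # minimizer -Q^{-1} c
h_min = b - 0.5 * c @ torch.linalg.solve(Q, c)  # b - (1/2) c^T Q^{-1} c
print(x_star, h_min)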

Convergence Analysis

  • For scalar function \(f(x) = \frac{\lambda}{2} x^2\):
    • Gradient descent: \(x_{t+1} = (1 - \eta \lambda) x_t\)
    • Convergence when \(|1 - \eta \lambda| < 1\)
    • Exponential convergence rate
lambdas = [0.1, 1, 10, 19]
eta = 0.1
d2l.set_figsize((6, 4))
for lam in lambdas:
    t = d2l.numpy(d2l.arange(20))
    d2l.plt.plot(t, (1 - eta * lam) ** t, label=f'lambda = {lam:.2f}')
d2l.plt.xlabel('time')
d2l.plt.legend();

Summary

  • Momentum replaces gradients with leaky averages
  • Key benefits:
    • Accelerates convergence
    • Works for both noise-free and noisy gradients
    • Prevents optimization stalling
    • Effective sample size: \(\frac{1}{1-\beta}\)
  • Implementation requires:
    • Additional state vector (velocity)
    • Careful parameter tuning

Exercises

  1. Experiment with different momentum and learning rate combinations
  2. Analyze gradient descent and momentum for quadratic problems with multiple eigenvalues
  3. Derive minimum value and minimizer for \(h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b\)
  4. Investigate behavior with stochastic gradient descent and minibatch variants

Adam Optimization

  • Adam combines multiple optimization techniques:
    • Stochastic Gradient Descent (SGD)
    • Minibatch processing
    • Momentum
    • Per-coordinate scaling (AdaGrad)
    • Learning rate adjustment (RMSProp)
  • Popular in deep learning due to:
    • Robustness
    • Effectiveness
    • Computational efficiency

Previous Optimization Methods

  • SGD: Efficient for redundant data
  • Minibatch SGD: Enables parallel processing
  • Momentum: Accelerates convergence
  • AdaGrad: Efficient preconditioning
  • RMSProp: Decoupled scaling

The Algorithm

State Variables

  • Uses exponential weighted moving averages
  • Momentum estimate: \[\mathbf{v}_t \leftarrow \beta_1 \mathbf{v}_{t-1} + (1 - \beta_1) \mathbf{g}_t\]
  • Second moment estimate: \[\mathbf{s}_t \leftarrow \beta_2 \mathbf{s}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2\]
  • Typical values: \(\beta_1 = 0.9\), \(\beta_2 = 0.999\)

Bias Correction

  • Initial bias towards smaller values
  • Normalized state variables: \[\hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_1^t}\] \[\hat{\mathbf{s}}_t = \frac{\mathbf{s}_t}{1 - \beta_2^t}\]
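A tiny numeric illustration of the bias: with \(\mathbf{v}_0 = 0\) and a constant gradient of 1, the uncorrected average starts far too small, and dividing by \(1 - \beta_1^t\) recovers the true scale:

beta1 = 0.9
v = 0.0
for t in range(1, 4):
    v = beta1 * v + (1 - beta1) * 1.0  # uncorrected: 0.1, 0.19, 0.271
    print(t, v, v / (1 - beta1**t))    # bias-corrected value is 1.0 each step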

Update Rule

  • Rescaled gradient: \[\mathbf{g}_t' = \frac{\eta \hat{\mathbf{v}}_t}{\sqrt{\hat{\mathbf{s}}_t} + \epsilon}\]
  • Parameter update: \[\mathbf{x}_t \leftarrow \mathbf{x}_{t-1} - \mathbf{g}_t'\]
  • Typically \(\epsilon = 10^{-6}\)

Implementation

State Initialization

import d2l
import torch
def init_adam_states(feature_dim):
    v_w, v_b = d2l.zeros((feature_dim, 1)), d2l.zeros(1)
    s_w, s_b = d2l.zeros((feature_dim, 1)), d2l.zeros(1)
    return ((v_w, s_w), (v_b, s_b))

Adam Update

def adam(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s) in zip(params, states):
        with torch.no_grad():
            # Update momentum
            v[:] = beta1 * v + (1 - beta1) * p.grad
            # Update second moment
            s[:] = beta2 * s + (1 - beta2) * torch.square(p.grad)
            # Bias correction
            v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
            s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
            # Parameter update
            p[:] -= hyperparams['lr'] * v_bias_corr / (
                torch.sqrt(s_bias_corr) + eps)
        p.grad.data.zero_()
    hyperparams['t'] += 1

Training

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adam, init_adam_states(feature_dim),
               {'lr': 0.01, 't': 1}, data_iter, feature_dim)
loss: 0.242, 0.115 sec/epoch

Concise Implementation

trainer = torch.optim.Adam
d2l.train_concise_ch11(trainer, {'lr': 0.01}, data_iter)
loss: 0.243, 0.108 sec/epoch

Summary

  • Adam combines multiple optimization techniques
  • Key features:
    • Momentum (first moment) and RMSProp-style per-coordinate scaling (second moment)
    • Bias correction
    • Learning rate control
  • Yogi variant:
    • Addresses convergence issues
    • Modified second moment update
    • Better variance control

Exercises

  1. Experiment with learning rate adjustments
  2. Rewrite momentum updates without bias correction
  3. Analyze learning rate reduction during convergence
  4. Construct divergence cases for Adam vs Yogi

Sensor Fusion as a Regression Problem

  • High-resolution chemical imaging in STEM is limited by the weak inelastic scattering signal.
  • HAADF gives high SNR but lacks chemical specificity.
  • EDX/EELS gives chemistry but is noisy at low dose.
  • Goal: Fuse both signals for high-quality chemical maps.

Data Fusion as Inverse Problem

Reconstruction goal: \[ \hat{x} = \arg\min_{x \geq 0} \; \Psi_1(x) + \lambda_1 \Psi_2(x) + \lambda_2 \text{TV}(x) \]

Where:

  • \(\Psi_1\): HAADF model loss
  • \(\Psi_2\): spectroscopic data fidelity
  • \(\text{TV}(x)\): regularization term

First Term: HAADF Consistency

HAADF image model: \[ \Psi_1(x) = \frac{1}{2} \| b_H - A x^\gamma \|_2^2 \]

  • \(b_H\): measured HAADF signal
  • \(x^\gamma\): element-wise power (Z-contrast)
  • \(\gamma \approx 1.7\): approximates Z-contrast

Interpretation: Ensure the fused chemical map explains HAADF contrast.
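A minimal sketch of this term (NumPy; the helper name haadf_consistency is illustrative, with A a flattened measurement operator as in the tutorial below):

import numpy as np

def haadf_consistency(x, A, b_H, gamma=1.7):
    # Psi_1(x) = 0.5 * || b_H - A x^gamma ||_2^2
    r = b_H - A @ (x ** gamma)
    return 0.5 * float(r @ r)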

Second Term: Spectroscopic Fidelity

Poisson noise model for EDX/EELS: \[ \Psi_2(x) = \sum_i 1^T x_i - b_i^T \log(x_i + \varepsilon) \]

  • \(x_i\): reconstructed map of element \(i\)
  • \(b_i\): measured EDX/EELS signal for element \(i\)
  • \(\varepsilon\): small constant to avoid \(\log(0)\)

Interpretation: Match fused maps with noisy spectroscopic measurements.
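A minimal sketch of the per-element fidelity term (NumPy; names are illustrative):

import numpy as np

def poisson_fidelity(x_i, b_i, eps=1e-8):
    # Psi_2 contribution for element i: 1^T x_i - b_i^T log(x_i + eps)
    return float(np.sum(x_i) - b_i @ np.log(x_i + eps))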

Third Term: Total Variation (TV)

Channel-wise total variation: \[ \text{TV}(x) = \sum_i \|x_i\|_{TV} \]

Purpose:

  • Promote piecewise smooth maps
  • Reduce noise
  • Preserve edges

Popular in:

  • Compressed sensing
  • Image denoising
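A minimal anisotropic TV sketch for one channel (illustrative; the tutorial below uses the FGP-TV solver from fusion_utils instead):

import numpy as np

def tv_2d(img):
    # Anisotropic total variation: sum of absolute forward differences
    return np.abs(np.diff(img, axis=0)).sum() + np.abs(np.diff(img, axis=1)).sum()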

Summary of Loss Terms

Term              Meaning                 Benefit
\(\Psi_1\)        HAADF consistency       Uses high-SNR elastic signal
\(\Psi_2\)        Spectroscopy fidelity   Honors noisy chemical data
\(\text{TV}(x)\)  Regularization          Noise suppression and smoothness

All terms are necessary for accurate low-dose chemical recovery.

Practical Results

  • Improves SNR by 300–500%.
  • Reduces required dose by >10×.
  • Recovers stoichiometry with <15% error.

Takeaways

  • Multi-modal fusion = better signal, lower dose.
  • Expressed as interpretable optimization.
  • Each term plays a distinct role.

Future outlook: Combine with additional modalities (e.g., ABF, ptychography).

Overview

  • Tutorial on fusing EELS/X-EDS maps with HAADF for improved chemical resolution
  • Part 1 of 2: Atomic resolution HAADF and X-EDS dataset of DyScO\(_3\)
  • Python-based workflow with minimal user input (<10 tunable lines)
  • Quick transformation of datasets into resolution-enhanced chemical maps

Example Output

Raw vs Fused DyScO\(_3\)

Experimental Requirements

  • Need both elastic (e.g., HAADF) and inelastic (e.g., EELS/X-EDS) maps
  • Elastic signal must provide Z-contrast
  • Inelastic signal must map all chemistries
  • All maps must have same dimensionality
  • Recommendation: Use simultaneously collected HAADF signal

Step 1: Python Imports

import data.fusion_utils as utils
from scipy.sparse import spdiags
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm 
import numpy as np

Step 2: Data Loading

data = np.load('data/PTO_Trilayer_dataset.npz')
# Define element names and their atomic numbers (Z)
elem_names=['Sc', 'Dy', 'O']
elem_weights=[21,66,8]

# Parse elastic HAADF data and inelastic chemical maps
HAADF = data['HAADF']
xx = np.array([],dtype=np.float32)
for ee in elem_names:
    chemMap = data[ee]
    if chemMap.shape != HAADF.shape:
        raise ValueError(f"The dimensions of {ee} chemical map do not match HAADF dimensions.")
    chemMap -= np.min(chemMap); chemMap /= np.max(chemMap)
    xx = np.concatenate([xx,chemMap.flatten()])

Step 3: Data Reshaping

# Make Copy of Raw Measurements
xx0 = xx.copy()

# Parameters
gamma = 1.6  # Z-contrast scaling factor
(nx, ny) = chemMap.shape; nPix = nx * ny
nz = len(elem_names)

# Initialize TV Regularizers
reg = utils.tvlib(nx,ny)

# Normalize HAADF
HAADF -= np.min(HAADF); HAADF /= np.max(HAADF)
HAADF = HAADF.flatten()

# Create Measurement Matrix
A = utils.create_weighted_measurement_matrix(nx,ny,nz,elem_weights,gamma,1)

Step 4: Cost Function Parameters

# Convergence Parameters
lambdaHAADF = 1/nz # 1/nz (do not modify)
lambdaChem = 0.08 # 0.05-0.3 (data consistency)
lambdaTV = 0.15 # <0.2  Total Variation denoising
nIter = 30      # typically converges in 10-15
bkg = 2.4e-1    # background subtraction

# FGP TV Parameters
regularize = True
nIter_TV = 3

Step 5: Algorithm Execution

# Initialize
xx = xx0.copy()

# Cost functions: least-squares HAADF term and Poisson log-likelihood
# (the latter is the negative of the fidelity term Psi_2 defined earlier)
lsqFun = lambda inData : 0.5 * np.linalg.norm(A.dot(inData**gamma) - HAADF) ** 2
poissonFun = lambda inData : np.sum(xx0 * np.log(inData + 1e-8) - inData)

# Initialize Cost Tracking
costHAADF = np.zeros(nIter,dtype=np.float32)
costChem = np.zeros(nIter, dtype=np.float32)
costTV = np.zeros(nIter, dtype=np.float32)

# Main Loop
for kk in tqdm(range(nIter)):
    # Optimization
    xx -= gamma * spdiags(xx**(gamma - 1), [0], nz*nx*ny, nz*nx*ny) * \
          lambdaHAADF * A.transpose() * (A.dot(xx**gamma) - HAADF) + \
          lambdaChem * (1 - xx0 / (xx + bkg))
    
    # Positivity Constraint
    xx[xx<0] = 0
    
    # TV Regularization
    if regularize:
        for zz in range(nz):
            xx[zz*nPix:(zz+1)*nPix] = reg.fgp_tv(
                xx[zz*nPix:(zz+1)*nPix].reshape(nx,ny), 
                lambdaTV, 
                nIter_TV
            ).flatten()
            costTV[kk] += reg.tv(xx[zz*nPix:(zz+1)*nPix].reshape(nx,ny))
    
    # Track Costs
    costHAADF[kk] = lsqFun(xx)
    costChem[kk] = poissonFun(xx)

Step 6: Convergence Assessment

Convergence Criteria

  • All 3 cost functions should asymptotically approach low values
  • Look for:
    • Exponential decay
    • Brief overshooting (Lennard-Jones-like)
  • Avoid:
    • Incomplete convergence
    • Severe oscillations

Convergence Plot

TV Weighting Effects

TV Weighting Comparison

TV Weighting Guidelines

  • Under-weighting: Results in noisy reconstructions
  • Over-weighting: Causes blurring and feature loss
  • Best practice: Err on the side of under-weighting
    • Noise is a familiar artifact
    • Oversmoothing creates unphysical ones

Results Visualization

# Display Cost Functions
utils.plot_convergence(costHAADF, lambdaHAADF, 
                      costChem, lambdaChem, 
                      costTV, lambdaTV)
# Show Reconstructed Signal
fig, ax = plt.subplots(2,len(elem_names)+1,figsize=(12,6.5))
ax = ax.flatten()
ax[0].imshow((A.dot(xx**gamma)).reshape(nx,ny),cmap='gray'); ax[0].set_title('HAADF'); ax[0].axis('off')
ax[1+len(elem_names)].imshow((A.dot(xx**gamma)).reshape(nx,ny)[70:130,25:85],cmap='gray'); ax[1+len(elem_names)].set_title('HAADF Cropped'); ax[1+len(elem_names)].axis('off')

for ii in range(len(elem_names)):
    ax[ii+1].imshow(xx[ii*(nx*ny):(ii+1)*(nx*ny)].reshape(nx,ny),cmap='gray')
    ax[ii+2+len(elem_names)].imshow(xx[ii*(nx*ny):(ii+1)*(nx*ny)].reshape(nx,ny)[70:130,25:85],cmap='gray')
    
    ax[ii+1].set_title(elem_names[ii])
    ax[ii+1].axis('off')
    ax[ii+2+len(elem_names)].set_title(elem_names[ii]+' Cropped')
    ax[ii+2+len(elem_names)].axis('off')

fig.tight_layout()

Best Practices Summary

  1. Ensure proper data collection
  2. Verify dimensional consistency
  3. Start with recommended parameter ranges
  4. Monitor convergence carefully
  5. Validate results against physical expectations

References

Manassa, Jason, Miti Shah, Min Gee Cho, Zichao Wendy Di, Yi Jiang, Jeffrey A Fessler, Yu-Tsun Shao, Mary C Scott, Jonathan Schwartz, and Robert Hovden. 2024. “Fused Multi-Modal Electron Microscopy.” Elemental Microscopy. https://doi.org/10.69761/MXVR4353.
Schwartz, Jonathan, Zichao Wendy Di, Yi Jiang, Alyssa J. Fielitz, Don-Hyung Ha, Sanjaya D. Perera, Ismail El Baggari, et al. 2022. “Imaging Atomic-Scale Chemistry from Fused Multi-Modal Electron Microscopy.” Npj Computational Materials 8 (11): 1–8. https://doi.org/10.1038/s41524-021-00692-5.