Data Science for Electron Microscopy
Lecture 2: Optimization, Regression, Sensor Fusion

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg

Institute of Micro- and Nanostructure Research

Optimization and Deep Learning

  • Optimization and deep learning are closely related
  • Deep learning typically involves:
    • Defining a loss function
    • Using optimization to minimize the loss
  • Note: Most optimization algorithms minimize by convention
    • To maximize: simply flip the sign of the objective

Goals of Optimization vs Deep Learning

Optimization

  • Primary goal: Minimize objective function
  • Focus on training error
  • Direct mathematical approach

Deep Learning

  • Primary goal: Find suitable model
  • Focus on generalization error
  • Must handle finite data
  • Must prevent overfitting

Visualizing the Difference

Let’s examine the empirical risk (average loss on the training set) vs. the risk (expected loss over the underlying data distribution):

Code
%matplotlib inline
import d2l
import numpy as np
from mpl_toolkits import mplot3d
import torch

def f(x):
    return x * d2l.cos(np.pi * x)

def g(x):
    return f(x) + 0.2 * d2l.cos(5 * np.pi * x)

def annotate(text, xy, xytext):
    d2l.plt.gca().annotate(text, xy=xy, xytext=xytext,
                           arrowprops=dict(arrowstyle='->'))

x = d2l.arange(0.5, 1.5, 0.01)
d2l.set_figsize((4.5, 2.5))
d2l.plot(x, [f(x), g(x)], 'x', 'risk')
annotate('min of\nempirical risk', (1.0, -1.2), (0.5, -1.1))
annotate('min of risk', (1.1, -1.05), (0.95, -0.5))

Key Challenges in Deep Learning Optimization

  1. Local Minima
  2. Saddle Points
  3. Vanishing Gradients

Local Minima

  • Definition: Point where function value is smaller than nearby points
  • Global minimum: Smallest value over entire domain
  • Example function: \(f(x) = x \cdot \cos(\pi x)\)
Code
x = d2l.arange(-1.0, 2.0, 0.01)
d2l.plot(x, [f(x), ], 'x', 'f(x)')
annotate('local minimum', (-0.3, -0.25), (-0.77, -1.0))
annotate('global minimum', (1.1, -0.95), (0.6, 0.8))

Impact of Local Minima

  • Deep learning models often have many local optima
  • Gradient approaches zero near local minimum
  • Minibatch SGD can help escape local minima
    • Natural gradient variation provides “noise”
    • Can dislodge parameters from local minima

Saddle Points

  • Characteristics:
    • All gradients vanish
    • Neither global nor local minimum
  • Example: \(f(x) = x^3\)
    • First and second derivatives vanish at \(x=0\)
    • Optimization can stall here
Code
x = d2l.arange(-2.0, 2.0, 0.01)
d2l.plot(x, [x**3], 'x', 'f(x)')
annotate('saddle point', (0, -0.2), (-0.52, -5.0))

Higher-Dimensional Saddle Points

  • More complex in higher dimensions
  • Example: \(f(x,y) = x^2 - y^2\)
  • Has saddle point at \((0,0)\)
    • Maximum with respect to \(y\)
    • Minimum with respect to \(x\)
Code
x, y = d2l.meshgrid(
    d2l.linspace(-1.0, 1.0, 101), d2l.linspace(-1.0, 1.0, 101))
z = x**2 - y**2

ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.plot([0], [0], [0], 'rx')
ticks = [-1, 0, 1]
d2l.plt.xticks(ticks)
d2l.plt.yticks(ticks)
ax.set_zticks(ticks)
d2l.plt.xlabel('x')
d2l.plt.ylabel('y')

Hessian Matrix Analysis

For a \(k\)-dimensional input vector:

  • All positive eigenvalues → Local minimum
  • All negative eigenvalues → Local maximum
  • Mixed signs → Saddle point
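As a quick sanity check (plain PyTorch, not part of the lecture code), the saddle of \(f(x,y) = x^2 - y^2\) can be classified from the eigenvalues of its Hessian at the origin:

import torch

def f(xy):
    x, y = xy
    return x**2 - y**2

# Hessian of f at the critical point (0, 0)
H = torch.autograd.functional.hessian(f, torch.zeros(2))
print(torch.linalg.eigvalsh(H))  # tensor([-2., 2.]): mixed signs -> saddle point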

Vanishing Gradients

  • Most insidious optimization problem
  • Example: \(f(x) = \tanh(x)\)
    • At \(x = 4\): gradient ≈ 0.0013
    • Optimization stalls
  • Historical context:
    • Major challenge before ReLU activation
    • Made deep learning training difficult
Code
x = d2l.arange(-2.0, 5.0, 0.01)
d2l.plot(x, [d2l.tanh(x)], 'x', 'f(x)')
annotate('vanishing gradient', (4, 1), (2, 0.0))
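A one-line autograd check (separate from the plotting code above) reproduces the quoted gradient value:

import torch

x = torch.tensor(4.0, requires_grad=True)
torch.tanh(x).backward()
print(x.grad)  # tensor(0.0013), i.e. 1 - tanh(4)**2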

Summary

  • Key takeaways:
    • Training error minimization ≠ best generalization
    • Many local minima exist
    • Saddle points are common in non-convex problems
    • Vanishing gradients can stall optimization
  • Good news:
    • Robust algorithms exist
    • Perfect solutions not always necessary
    • Local optima can be useful
    • Many practical solutions available

Exercises

  1. Consider a simple MLP with a single hidden layer of \(d\) dimensions:

    • Show that for any local minimum there are at least \(d!\) equivalent solutions
    • Why does this happen?
  2. For a symmetric random matrix \(\mathbf{M}\):

    • Prove that eigenvalue distribution is symmetric
    • Why doesn’t this imply \(P(\lambda > 0) = 0.5\)?
  3. Additional challenges in deep learning optimization?

  4. Balancing a ball on a saddle:

    • Why is this hard?
    • How might this relate to optimization algorithms?

Convexity

  • Convexity is crucial for optimization algorithm design
  • Benefits:
    • Easier algorithm analysis and testing
    • Better understanding of deep learning optimization
    • Properties near local minima often resemble convex functions
  • Even nonconvex problems can benefit from convex analysis
Code
%matplotlib inline
import d2l
import numpy as np
from mpl_toolkits import mplot3d
import torch

Definitions

Convex Sets

  • A set \(\mathcal{X}\) is convex if:
    • For any \(a, b \in \mathcal{X}\)
    • Line segment connecting \(a\) and \(b\) is in \(\mathcal{X}\)
  • Mathematical definition: \[\lambda a + (1-\lambda) b \in \mathcal{X} \textrm{ whenever } a, b \in \mathcal{X}\] for all \(\lambda \in [0, 1]\)

Visual Examples

The first set is nonconvex and the other two are convex.

Set Operations

  • Intersections of convex sets are convex
  • Unions of convex sets need not be convex
  • Example: \(\mathbb{R}^d\) is convex
  • Bounded sets (e.g., balls) are often convex

The intersection between two convex sets is convex.

The union of two convex sets need not be convex.

Convex Functions

Definition

  • Function \(f: \mathcal{X} \to \mathbb{R}\) is convex if:
    • For all \(x, x' \in \mathcal{X}\)
    • For all \(\lambda \in [0, 1]\)
    • Satisfies: \(\lambda f(x) + (1-\lambda) f(x') \geq f(\lambda x + (1-\lambda) x')\)

Examples

Code
f = lambda x: 0.5 * x**2  # Convex
g = lambda x: d2l.cos(np.pi * x)  # Nonconvex
h = lambda x: d2l.exp(0.5 * x)  # Convex

x, segment = d2l.arange(-2, 2, 0.01), d2l.tensor([-1.5, 1])
d2l.use_svg_display()
_, axes = d2l.plt.subplots(1, 3, figsize=(9, 3))
for ax, func in zip(axes, [f, g, h]):
    d2l.plot([x, segment], [func(x), func(segment)], axes=ax)

Jensen’s Inequality

Definition

  • Generalization of convexity
  • For convex function \(f\): \[\sum_i \alpha_i f(x_i) \geq f\left(\sum_i \alpha_i x_i\right)\] \[E_X[f(X)] \geq f\left(E_X[X]\right)\]
  • Where \(\alpha_i \geq 0\) and \(\sum_i \alpha_i = 1\)
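A quick numerical sanity check of Jensen’s inequality with the convex function \(f = \exp\) (illustrative NumPy snippet):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100_000)
print(np.exp(X).mean())  # E[f(X)] ≈ e^{1/2} ≈ 1.65
print(np.exp(X.mean()))  # f(E[X]) ≈ 1.0, so E[f(X)] >= f(E[X])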

Applications

  • Bounding complex expressions
  • Log-likelihood of partially observed variables
  • Variational methods
  • Clustering algorithms

Properties

Local vs Global Minima

  • Local minima of convex functions are global minima
  • Proof by contradiction
  • Example: \(f(x) = (x-1)^2\)
Code
f = lambda x: (x - 1) ** 2
d2l.set_figsize((8,8))
d2l.plot([x, segment], [f(x), f(segment)], 'x', 'f(x)')

Below Sets

  • Given convex function \(f\) on convex set \(\mathcal{X}\)
  • Below set \(\mathcal{S}_b = \{x | x \in \mathcal{X} \textrm{ and } f(x) \leq b\}\) is convex
  • Proof uses definition of convexity

Second Derivatives

  • For twice-differentiable \(f: \mathbb{R}^n \rightarrow \mathbb{R}\)
  • Convex if and only if Hessian is positive semidefinite
  • One-dimensional case: \(f'' \geq 0\)
  • Multidimensional case: \(\nabla^2f \succeq 0\)

Constraints

Constrained Optimization

  • Form: \[\begin{aligned} \mathop{\textrm{minimize~}}_{\mathbf{x}} & f(\mathbf{x}) \\ \textrm{ subject to } & c_i(\mathbf{x}) \leq 0 \textrm{ for all } i \in \{1, \ldots, n\}\end{aligned}\]
  • Examples:
    • Unit ball constraint: \(c_1(\mathbf{x}) = \|\mathbf{x}\|_2 - 1\)
    • Half-space constraint: \(c_2(\mathbf{x}) = \mathbf{v}^\top \mathbf{x} + b\)

Lagrangian

  • Combines objective and constraints
  • Form: \(L(\mathbf{x}, \alpha_1, \ldots, \alpha_n) = f(\mathbf{x}) + \sum_{i=1}^n \alpha_i c_i(\mathbf{x})\)
  • Lagrange multipliers \(\alpha_i \geq 0\)
  • Saddle point optimization

Penalties

  • Alternative to exact constraint satisfaction
  • Add \(\alpha_i c_i(\mathbf{x})\) to objective
  • Example: weight decay
  • More robust than exact satisfaction
  • Works well for nonconvex problems

Projections

  • Projection on convex set \(\mathcal{X}\): \[\textrm{Proj}_\mathcal{X}(\mathbf{x}) = \mathop{\mathrm{argmin}}_{\mathbf{x}' \in \mathcal{X}} \|\mathbf{x} - \mathbf{x}'\|\]
  • Example: gradient clipping
  • Applications:
    • Sparse weight vectors
    • \(\ell_1\) ball projections
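As a minimal sketch (the helper name proj_l2_ball is illustrative), projection onto the \(\ell_2\) ball is exactly the operation behind gradient clipping:

import torch

def proj_l2_ball(x, radius=1.0):
    # Points inside the ball are unchanged; points outside are rescaled
    # onto the boundary, the closest point of the ball
    norm = torch.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

print(proj_l2_ball(torch.tensor([3.0, 4.0])))  # tensor([0.6000, 0.8000])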

Convex Projections.

Summary

  • Key properties:
    • Intersections of convex sets are convex
    • Jensen’s inequality for expectations
    • Hessian positive semidefinite for convex functions
    • Local minima are global minima
  • Constraint handling:
    • Lagrangian approach
    • Penalty methods
    • Projections
  • Applications in deep learning:
    • Algorithm motivation
    • Understanding optimization
    • Gradient descent analysis

Introduction

  • Gradient descent is fundamental to understanding optimization
  • Key concepts apply to more advanced algorithms
  • Important considerations:
    • Learning rate selection
    • Divergence issues
    • Preconditioning techniques

One-Dimensional Gradient Descent

Mathematical Foundation

  • For continuously differentiable \(f: \mathbb{R} \rightarrow \mathbb{R}\)
  • Taylor expansion: \[f(x + \epsilon) = f(x) + \epsilon f'(x) + \mathcal{O}(\epsilon^2)\]
  • Moving in negative gradient direction:
    • Choose \(\epsilon = -\eta f'(x)\)
    • Fixed step size \(\eta > 0\)
    • Results in: \(f(x - \eta f'(x)) \lessapprox f(x)\)

Implementation

%matplotlib inline
import d2l
import numpy as np
import torch

def f(x):  # Objective function
    return x ** 2

def f_grad(x):  # Gradient (derivative) of the objective function
    return 2 * x

Basic Gradient Descent

def gd(eta, f_grad):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x)
        results.append(float(x))
    print(f'epoch 10, x: {x:f}')
    return results

results = gd(0.2, f_grad)
epoch 10, x: 0.060466

Visualization

def show_trace(results, f):
    n = max(abs(min(results)), abs(max(results)))
    f_line = d2l.arange(-n, n, 0.01)
    d2l.set_figsize()
    d2l.plot([f_line, results], [[f(x) for x in f_line], [
        f(x) for x in results]], 'x', 'f(x)', fmts=['-', '-o'])

show_trace(results, f)

Learning Rate Effects

Too Small Learning Rate

  • Slow convergence
  • More iterations needed
  • Example with \(\eta = 0.05\):
show_trace(gd(0.05, f_grad), f)
epoch 10, x: 3.486784

Too Large Learning Rate

  • Solution oscillates
  • May diverge
  • Example with \(\eta = 1.1\):
show_trace(gd(1.1, f_grad), f)
epoch 10, x: 61.917364

Local Minima

  • Nonconvex functions have multiple minima
  • Example: \(f(x) = x \cdot \cos(cx)\)
  • High learning rates can lead to poor local minima
c = d2l.tensor(0.15 * np.pi)

def f(x):  # Objective function
    return x * d2l.cos(c * x)

def f_grad(x):  # Gradient of the objective function
    return d2l.cos(c * x) - c * x * d2l.sin(c * x)

show_trace(gd(2, f_grad), f)
epoch 10, x: -1.528166

Multivariate Gradient Descent

Mathematical Foundation

  • For \(f: \mathbb{R}^d \to \mathbb{R}\)
  • Gradient vector: \(\nabla f(\mathbf{x}) = [\frac{\partial f(\mathbf{x})}{\partial x_1}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_d}]^\top\)
  • Taylor expansion: \[f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \mathbf{\boldsymbol{\epsilon}}^\top \nabla f(\mathbf{x}) + \mathcal{O}(\|\boldsymbol{\epsilon}\|^2)\]
  • Update rule: \(\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f(\mathbf{x})\)

Implementation

def train_2d(trainer, steps=20, f_grad=None):
    """Optimize a 2D objective function with a customized trainer."""
    x1, x2, s1, s2 = -5, -2, 0, 0
    results = [(x1, x2)]
    for i in range(steps):
        if f_grad:
            x1, x2, s1, s2 = trainer(x1, x2, s1, s2, f_grad)
        else:
            x1, x2, s1, s2 = trainer(x1, x2, s1, s2)
        results.append((x1, x2))
    print(f'epoch {i + 1}, x1: {float(x1):f}, x2: {float(x2):f}')
    return results

def show_trace_2d(f, results):
    """Show the trace of 2D variables during optimization."""
    d2l.set_figsize()
    d2l.plt.plot(*zip(*results), '-o', color='#ff7f0e')
    x1, x2 = d2l.meshgrid(d2l.arange(-5.5, 1.0, 0.1),
                          d2l.arange(-3.0, 1.0, 0.1), indexing='ij')
    d2l.plt.contour(x1, x2, f(x1, x2), colors='#1f77b4')
    d2l.plt.xlabel('x1')
    d2l.plt.ylabel('x2')

Example: Quadratic Function

def f_2d(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_2d_grad(x1, x2):  # Gradient of the objective function
    return (2 * x1, 4 * x2)

def gd_2d(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    return (x1 - eta * g1, x2 - eta * g2, 0, 0)

eta = 0.1
show_trace_2d(f_2d, train_2d(gd_2d, f_grad=f_2d_grad))
epoch 20, x1: -0.057646, x2: -0.000073

Adaptive Methods

Newton’s Method

  • Uses second-order information
  • Taylor expansion with Hessian: \[f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + \frac{1}{2} \boldsymbol{\epsilon}^\top \nabla^2 f(\mathbf{x}) \boldsymbol{\epsilon} + \mathcal{O}(\|\boldsymbol{\epsilon}\|^3)\]
  • Update rule: \(\boldsymbol{\epsilon} = -\mathbf{H}^{-1} \nabla f(\mathbf{x})\)

Implementation

c = d2l.tensor(0.5)

def f(x):  # Objective function
    return d2l.cosh(c * x)

def f_grad(x):  # Gradient of the objective function
    return c * d2l.sinh(c * x)

def f_hess(x):  # Hessian of the objective function
    return c**2 * d2l.cosh(c * x)

def newton(eta=1):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x) / f_hess(x)
        results.append(float(x))
    print('epoch 10, x:', x)
    return results

show_trace(newton(), f)
epoch 10, x: tensor(0.)

Nonconvex Example

c = d2l.tensor(0.15 * np.pi)

def f(x):  # Objective function
    return x * d2l.cos(c * x)

def f_grad(x):  # Gradient of the objective function
    return d2l.cos(c * x) - c * x * d2l.sin(c * x)

def f_hess(x):  # Hessian of the objective function
    return - 2 * c * d2l.sin(c * x) - x * c**2 * d2l.cos(c * x)

show_trace(newton(0.5), f)
epoch 10, x: tensor(7.2699)

Preconditioning

Key Concepts

  • Avoid full Hessian computation
  • Use diagonal entries only
  • Update rule: \(\mathbf{x} \leftarrow \mathbf{x} - \eta \textrm{diag}(\mathbf{H})^{-1} \nabla f(\mathbf{x})\)
  • Benefits:
    • Different learning rates per variable
    • Handles scale mismatches
    • More efficient than full Newton’s method
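A minimal sketch of a diagonally preconditioned step on an ill-scaled quadratic (the function and values are illustrative, not from the lecture):

import torch

# f(x) = 0.5 * (x1^2 + 100 * x2^2): curvatures differ by a factor of 100
x = torch.tensor([10.0, 10.0])
hess_diag = torch.tensor([1.0, 100.0])  # diagonal of the (constant) Hessian
eta = 1.0
for _ in range(3):
    grad = hess_diag * x             # gradient of the quadratic
    x = x - eta * grad / hess_diag   # per-coordinate learning rates
print(x)  # tensor([0., 0.]): both coordinates converge despite the scale gap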

Summary

  • Learning rate selection is crucial
  • Local minima can trap gradient descent
  • High dimensions require careful learning rate adjustment
  • Preconditioning helps with scale issues
  • Newton’s method:
    • Fast convergence for convex problems
    • Requires careful handling for nonconvex problems
    • Computationally expensive for large problems

Exercises

  1. Experiment with different learning rates and objective functions
  2. Implement line search for convex optimization
  3. Design a slow-converging 2D objective function
  4. Implement lightweight Newton’s method with preconditioning
  5. Test algorithms on rotated coordinate systems

Stochastic Gradient Descent

  • Previously used SGD without detailed explanation
  • Now diving deeper into its principles
  • Building on gradient descent fundamentals
  • Understanding why and how it works
Code
%matplotlib inline
import d2l
import math
import torch

Stochastic Gradient Updates

Objective Function

  • Training dataset with \(n\) examples
  • Loss function \(f_i(\mathbf{x})\) for example \(i\)
  • Overall objective: \[f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n f_i(\mathbf{x})\]
  • Full gradient: \[\nabla f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x})\]

Computational Cost

  • Gradient descent: \(\mathcal{O}(n)\) per iteration
  • SGD: \(\mathcal{O}(1)\) per iteration
  • Update rule: \[\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x})\]
  • Unbiased estimate: \[\mathbb{E}_i \nabla f_i(\mathbf{x}) = \nabla f(\mathbf{x})\]

Implementation

def f(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_grad(x1, x2):  # Gradient of the objective function
    return 2 * x1, 4 * x2
def sgd(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    # Simulate noisy gradient
    g1 += torch.normal(0.0, 1, (1,)).item()
    g2 += torch.normal(0.0, 1, (1,)).item()
    eta_t = eta * lr()
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)
def constant_lr():
    return 1

eta = 0.1
lr = constant_lr  # Constant learning rate
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))
epoch 50, x1: 0.212983, x2: 0.064037

Dynamic Learning Rate

Learning Rate Strategies

  • Piecewise constant: \(\eta(t) = \eta_i \textrm{ if } t_i \leq t \leq t_{i+1}\)
  • Exponential decay: \(\eta(t) = \eta_0 \cdot e^{-\lambda t}\)
  • Polynomial decay: \(\eta(t) = \eta_0 \cdot (\beta t + 1)^{-\alpha}\)
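The exponential and polynomial schedules are implemented below; a piecewise constant schedule in the same style might look like this (boundaries and rates are illustrative):

def piecewise_lr(t, boundaries=(100, 500), rates=(1.0, 0.1, 0.01)):
    # Return eta_i for the interval t_i <= t < t_{i+1}
    for boundary, rate in zip(boundaries, rates):
        if t < boundary:
            return rate
    return rates[-1]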

Exponential Decay Implementation

def exponential_lr():
    global t
    t += 1
    return math.exp(-0.1 * t)

t = 1
lr = exponential_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=1000, f_grad=f_grad))
epoch 1000, x1: -0.717273, x2: -0.023917

Polynomial Decay Implementation

def polynomial_lr():
    global t
    t += 1
    return (1 + 0.1 * t) ** (-0.5)

t = 1
lr = polynomial_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))
epoch 50, x1: 0.060004, x2: -0.020514

Stochastic Gradients and Finite Samples

Sampling Strategies

  • With replacement:
    • Probability that a given example is chosen at least once in \(n\) draws: \(1 - (1 - 1/n)^n \approx 1 - e^{-1} \approx 0.63\) (see the simulation after this list)
    • Increased variance
    • Decreased data efficiency
  • Without replacement:
    • Better variance properties
    • More efficient data usage
    • Default choice in practice
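A quick simulation (illustrative) reproduces the with-replacement coverage figure:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Draw n indices with replacement and count the fraction of distinct examples
frac_seen = np.unique(rng.integers(0, n, size=n)).size / n
print(frac_seen)  # ≈ 0.632, matching 1 - e^{-1}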

Summary

  • Key points:
    • SGD reduces computational cost to \(\mathcal{O}(1)\)
    • Learning rate scheduling is crucial
    • Convergence guarantees for convex problems
    • Sampling without replacement preferred
  • Practical considerations:
    • Dynamic learning rates
    • Trade-offs in sampling strategies
    • Nonconvex optimization challenges

Exercises

  1. Experiment with learning rate schedules
  2. Analyze noise in gradient updates
  3. Compare sampling strategies
  4. Investigate gradient coordinate scaling
  5. Study local minima in nonconvex functions

Minibatch Stochastic Gradient Descent

  • Two extremes in gradient-based learning:
    • Full dataset (gradient descent)
    • Single examples (stochastic gradient descent)
  • Each approach has drawbacks:
    • Gradient descent: Not data efficient for similar data
    • SGD: Not computationally efficient (poor vectorization)
  • Minibatch SGD offers a middle ground

Vectorization and Caches

Hardware Considerations

  • Multiple GPUs and servers require larger minibatches
    • 8 GPUs × 16 servers = minimum batch size of 128
  • Single GPU/CPU considerations:
    • Multiple memory types (registers, L1/L2/L3 cache)
    • Different bandwidth constraints
    • Memory access patterns matter

Performance Metrics

  • Modern CPU capabilities:
    • 2GHz CPU with 16 cores and AVX-512
    • Can process up to 10¹² bytes/second
  • GPU capabilities:
    • 100× better than CPU
  • Memory bandwidth limitations:
    • Midrange server: ~100 GB/s
    • Memory access width: 64-384 bit

Matrix Multiplication Strategies

Different Approaches

  1. Element-wise computation
  2. Column-wise computation
  3. Full matrix multiplication
  4. Block-wise computation

Performance Comparison

import d2l
import torch
import time
import numpy as np

class Timer:
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the timer."""
        self.tik = time.time()

    def stop(self):
        """Stop the timer and record the time in a list."""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        """Return the average time."""
        return sum(self.times) / len(self.times)

    def sum(self):
        """Return the sum of time."""
        return sum(self.times)

    def cumsum(self):
        """Return the accumulated time."""
        return torch.tensor(self.times).cumsum(dim=0).tolist()

# Initialize matrices
A = torch.zeros(256, 256)
B = torch.randn(256, 256)
C = torch.randn(256, 256)
timer = Timer()

Element-wise Computation

# Compute A = BC one element at a time
timer.start()
for i in range(256):
    for j in range(256):
        A[i, j] = torch.dot(B[i, :], C[:, j])
timer.stop()
2.174051523208618

Column-wise Computation

# Compute A = BC one column at a time
timer.start()
for j in range(256):
    A[:, j] = torch.mv(B, C[:, j])
timer.stop()
0.016129732131958008

Full Matrix Multiplication

# Compute A = BC in one go
timer.start()
A = torch.mm(B, C)
timer.stop()

gigaflops = [0.03 / i for i in timer.times]
print(f'performance in Gigaflops: element {gigaflops[0]:.3f}, '
      f'column {gigaflops[1]:.3f}, full {gigaflops[2]:.3f}')
performance in Gigaflops: element 0.014, column 1.860, full 35.217

Minibatch Processing

Why Use Minibatches?

  • Computational efficiency
  • Statistical properties:
    • Maintains gradient expectation
    • Reduces gradient standard deviation by a factor of \(b^{-\frac{1}{2}}\) (see the quick simulation after this list)
    • \(b\) = batch size
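A quick simulation (illustrative) of how averaging \(b\) noisy gradients shrinks their spread:

import numpy as np

rng = np.random.default_rng(0)
noisy_grads = rng.normal(loc=1.0, scale=1.0, size=120_000)
for b in [1, 10, 100]:
    batch_means = noisy_grads.reshape(-1, b).mean(axis=1)
    print(b, batch_means.std())  # standard deviation falls roughly as b**-0.5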

Batch Size Trade-offs

  • Too small:
    • Poor computational efficiency
    • High variance
  • Too large:
    • Diminishing returns in variance reduction
    • Memory constraints
  • Optimal: Balance between:
    • Computational efficiency
    • Statistical efficiency
    • Available memory

Implementation

Data Loading

#@save
d2l.DATA_HUB['airfoil'] = (d2l.DATA_URL + 'airfoil_self_noise.dat',
                           '76e5be1548fd8222e5074cf0faae75edff8cf93f')

#@save
def get_data_ch11(batch_size=10, n=1500):
    data = np.genfromtxt(d2l.download('airfoil'),
                         dtype=np.float32, delimiter='\t')
    data = torch.from_numpy((data - data.mean(axis=0)) / data.std(axis=0))
    data_iter = d2l.load_array((data[:n, :-1], data[:n, -1]),
                               batch_size, is_train=True)
    return data_iter, data.shape[1]-1

Training Function

#@save
def sgd(params, states, hyperparams):
    for p in params:
        p.data.sub_(hyperparams['lr'] * p.grad)
        p.grad.data.zero_()

Training Function

def train_ch11(trainer_fn, states, hyperparams, data_iter,
               feature_dim, num_epochs=2):
    # Initialization
    w = torch.normal(mean=0.0, std=0.01, size=(feature_dim, 1),
                     requires_grad=True)
    b = torch.zeros((1), requires_grad=True)
    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
    # Train
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[0, num_epochs], ylim=[0.22, 0.35])
    n, timer = 0, d2l.Timer()
    for _ in range(num_epochs):
        for X, y in data_iter:
            l = loss(net(X), y).mean()
            l.backward()
            trainer_fn([w, b], states, hyperparams)
            n += X.shape[0]
            if n % 200 == 0:
                timer.stop()
                animator.add(n/X.shape[0]/len(data_iter),
                             (d2l.evaluate_loss(net, data_iter, loss),))
                timer.start()
    print(f'loss: {animator.Y[0][-1]:.3f}, {timer.sum()/num_epochs:.3f} sec/epoch')
    return timer.cumsum(), animator.Y[0]

Performance Comparison

Different Batch Sizes

def train_sgd(lr, batch_size, num_epochs=2):
    data_iter, feature_dim = get_data_ch11(batch_size)
    return train_ch11(
        sgd, None, {'lr': lr}, data_iter, feature_dim, num_epochs)

# Compare different approaches
gd_res = train_sgd(1, 1500, 10)  # Full batch
sgd_res = train_sgd(0.005, 1)    # Single example
d2l.set_figsize([6, 3])
d2l.plot(*list(map(list, zip(gd_res, sgd_res))),
         'time (sec)', 'loss', xlim=[1e-2, 10],
         legend=['gd', 'sgd'])
d2l.plt.gca().set_xscale('log')
loss: 0.243, 0.520 sec/epoch

mini1_res = train_sgd(.4, 100)   # Medium batch
mini2_res = train_sgd(.05, 10)   # Small batch

# Plot results
d2l.set_figsize([6, 3])
d2l.plot(*list(map(list, zip(mini1_res, mini2_res))),
         'time (sec)', 'loss', xlim=[1e-2, 10],
         legend=['batch size=100', 'batch size=10'])
d2l.plt.gca().set_xscale('log')
loss: 0.245, 0.072 sec/epoch

Summary

  • Vectorization benefits:
    • Reduced framework overhead
    • Better memory locality
    • Improved caching
  • Minibatch SGD advantages:
    • Computational efficiency
    • Statistical efficiency
    • Memory efficiency
  • Key considerations:
    • Batch size selection
    • Learning rate decay
    • Hardware constraints

Exercises

  1. Experiment with different batch sizes and learning rates
  2. Implement learning rate decay
  3. Compare with replacement sampling
  4. Analyze behavior with duplicated data

Momentum in Optimization

  • Momentum is a key optimization technique in deep learning
  • Addresses challenges in stochastic gradient descent:
    • Learning rate sensitivity
    • Convergence issues
    • Noise handling
  • Particularly effective for ill-conditioned problems

Basics of Momentum

Leaky Averages

  • Minibatch SGD averages gradients to reduce variance
  • Momentum extends this concept using “leaky averages”:
    • Accumulates past gradients
    • Weights recent gradients more heavily
    • Formula: \(\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \mathbf{g}_{t, t-1}\)
    • \(\beta \in (0, 1)\) controls the “memory” of past gradients
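Unrolling the recursion gives \(\mathbf{v}_t = \sum_{\tau=0}^{t-1} \beta^\tau \mathbf{g}_{t-\tau, t-\tau-1}\), an exponentially weighted sum of past gradients; a tiny numeric check (illustrative):

beta = 0.9
grads = [1.0, 1.0, 1.0, 1.0]  # a constant gradient, for simplicity
v = 0.0
for g in grads:
    v = beta * v + g
print(v)                               # 3.439
print(sum(beta**t for t in range(4)))  # identical closed-form sum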

Key Benefits

  • Accelerates convergence
  • Particularly effective for:
    • Ill-conditioned problems
    • Narrow canyons in optimization landscape
  • Provides more stable descent directions
  • Works well with both:
    • Noise-free convex problems
    • Stochastic gradient descent

Visualizing the Problem

Let’s examine an ill-conditioned problem:

%matplotlib inline
import d2l
import torch

eta = 0.4
def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2
def gd_2d(x1, x2, s1, s2):
    return (x1 - eta * 0.2 * x1, x2 - eta * 4 * x2, 0, 0)

d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))
epoch 20, x1: -0.943467, x2: -0.000073

The Challenge

  • Function \(f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2\) is very flat in \(x_1\) direction
  • Gradient in \(x_2\) direction:
    • Much higher
    • Changes more rapidly
  • Trade-off in learning rate:
    • Small rate: Slow convergence in \(x_1\)
    • Large rate: Divergence in \(x_2\)

Momentum Method

Update Equations

\[ \begin{aligned} \mathbf{v}_t &\leftarrow \beta \mathbf{v}_{t-1} + \mathbf{g}_{t, t-1}, \\ \mathbf{x}_t &\leftarrow \mathbf{x}_{t-1} - \eta_t \mathbf{v}_t. \end{aligned} \]

Implementation

def momentum_2d(x1, x2, v1, v2):
    v1 = beta * v1 + 0.2 * x1
    v2 = beta * v2 + 4 * x2
    return x1 - eta * v1, x2 - eta * v2, v1, v2

eta, beta = 0.6, 0.5
d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))
epoch 20, x1: 0.007188, x2: 0.002553

Effect of Momentum Parameter

  • \(\beta = 0.5\): Good convergence
  • \(\beta = 0.25\): Barely converges but better than no momentum
  • \(\beta = 0\): Reduces to regular gradient descent
eta, beta = 0.6, 0.25
d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))
epoch 20, x1: -0.126340, x2: -0.186632

Effective Sample Weight

  • Sum of weights: \(\sum_{\tau=0}^\infty \beta^\tau = \frac{1}{1-\beta}\)
  • Step size effectively becomes \(\frac{\eta}{1-\beta}\)
  • Better behaved descent direction
d2l.set_figsize()
betas = [0.95, 0.9, 0.6, 0]
for beta in betas:
    x = d2l.numpy(d2l.arange(40))
    d2l.plt.plot(x, beta ** x, label=f'beta = {beta:.2f}')
d2l.plt.xlabel('time')
d2l.plt.legend();

Practical Implementation

From Scratch

def init_momentum_states(feature_dim):
    v_w = d2l.zeros((feature_dim, 1))
    v_b = d2l.zeros(1)
    return (v_w, v_b)

def sgd_momentum(params, states, hyperparams):
    for p, v in zip(params, states):
        with torch.no_grad():
            v[:] = hyperparams['momentum'] * v + p.grad
            p[:] -= hyperparams['lr'] * v
        p.grad.data.zero_()

Training with Different Parameters

def train_momentum(lr, momentum, num_epochs=2):
    d2l.train_ch11(sgd_momentum, init_momentum_states(feature_dim),
                   {'lr': lr, 'momentum': momentum}, data_iter,
                   feature_dim, num_epochs)

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
train_momentum(0.02, 0.5)
loss: 0.244, 0.089 sec/epoch

Theoretical Analysis

Quadratic Convex Functions

  • General form: \(h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b\)
  • For positive definite \(\mathbf{Q}\):
    • Minimizer at \(\mathbf{x}^* = -\mathbf{Q}^{-1} \mathbf{c}\)
    • Minimum value: \(b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}\)
  • Gradient: \(\partial_{\mathbf{x}} h(\mathbf{x}) = \mathbf{Q}\mathbf{x} + \mathbf{c} = \mathbf{Q} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c})\)
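A quick numeric check of the minimizer and minimum value (\(\mathbf{Q}\) and \(\mathbf{c}\) are illustrative):

import torch

Q = torch.tensor([[2.0, 0.5], [0.5, 1.0]])  # positive definite
c = torch.tensor([1.0, -1.0])
b = 0.0

x_star = -torch.linalg.solve(Q, c)              # minimizer -Q^{-1} c
h_min = b - 0.5 * c @ torch.linalg.solve(Q, c)  # b - (1/2) c^T Q^{-1} c
print(x_star, h_min)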

Convergence Analysis

  • For scalar function \(f(x) = \frac{\lambda}{2} x^2\):
    • Gradient descent: \(x_{t+1} = (1 - \eta \lambda) x_t\)
    • Convergence when \(|1 - \eta \lambda| < 1\)
    • Exponential convergence rate
lambdas = [0.1, 1, 10, 19]
eta = 0.1
d2l.set_figsize((6, 4))
for lam in lambdas:
    t = d2l.numpy(d2l.arange(20))
    d2l.plt.plot(t, (1 - eta * lam) ** t, label=f'lambda = {lam:.2f}')
d2l.plt.xlabel('time')
d2l.plt.legend();

Summary

  • Momentum replaces gradients with leaky averages
  • Key benefits:
    • Accelerates convergence
    • Works for both noise-free and noisy gradients
    • Prevents optimization stalling
    • Effective sample size: \(\frac{1}{1-\beta}\)
  • Implementation requires:
    • Additional state vector (velocity)
    • Careful parameter tuning

Exercises

  1. Experiment with different momentum and learning rate combinations
  2. Analyze gradient descent and momentum for quadratic problems with multiple eigenvalues
  3. Derive minimum value and minimizer for \(h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b\)
  4. Investigate behavior with stochastic gradient descent and minibatch variants

Adam Optimization

  • Adam combines multiple optimization techniques:
    • Stochastic Gradient Descent (SGD)
    • Minibatch processing
    • Momentum
    • Per-coordinate scaling (AdaGrad)
    • Learning rate adjustment (RMSProp)
  • Popular in deep learning due to:
    • Robustness
    • Effectiveness
    • Computational efficiency

Previous Optimization Methods

  • SGD: Efficient for redundant data
  • Minibatch SGD: Enables parallel processing
  • Momentum: Accelerates convergence
  • AdaGrad: Efficient preconditioning
  • RMSProp: Decoupled scaling

The Algorithm

State Variables

  • Uses exponential weighted moving averages
  • Momentum estimate: \[\mathbf{v}_t \leftarrow \beta_1 \mathbf{v}_{t-1} + (1 - \beta_1) \mathbf{g}_t\]
  • Second moment estimate: \[\mathbf{s}_t \leftarrow \beta_2 \mathbf{s}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2\]
  • Typical values: \(\beta_1 = 0.9\), \(\beta_2 = 0.999\)

Bias Correction

  • Initial bias towards smaller values
  • Normalized state variables: \[\hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_1^t}\] \[\hat{\mathbf{s}}_t = \frac{\mathbf{s}_t}{1 - \beta_2^t}\]
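A tiny numeric illustration of the bias: with \(\mathbf{v}_0 = 0\) and a constant gradient of 1, the uncorrected average starts far too small, and dividing by \(1 - \beta_1^t\) recovers the true scale:

beta1 = 0.9
v = 0.0
for t in range(1, 4):
    v = beta1 * v + (1 - beta1) * 1.0  # uncorrected: 0.1, 0.19, 0.271
    print(t, v, v / (1 - beta1**t))    # bias-corrected value is 1.0 each step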

Update Rule

  • Rescaled gradient: \[\mathbf{g}_t' = \frac{\eta \hat{\mathbf{v}}_t}{\sqrt{\hat{\mathbf{s}}_t} + \epsilon}\]
  • Parameter update: \[\mathbf{x}_t \leftarrow \mathbf{x}_{t-1} - \mathbf{g}_t'\]
  • Typically \(\epsilon = 10^{-6}\)

Implementation

State Initialization

import d2l
import torch
def init_adam_states(feature_dim):
    v_w, v_b = d2l.zeros((feature_dim, 1)), d2l.zeros(1)
    s_w, s_b = d2l.zeros((feature_dim, 1)), d2l.zeros(1)
    return ((v_w, s_w), (v_b, s_b))

Adam Update

def adam(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s) in zip(params, states):
        with torch.no_grad():
            # Update momentum
            v[:] = beta1 * v + (1 - beta1) * p.grad
            # Update second moment
            s[:] = beta2 * s + (1 - beta2) * torch.square(p.grad)
            # Bias correction
            v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
            s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
            # Parameter update
            p[:] -= hyperparams['lr'] * v_bias_corr / (
                torch.sqrt(s_bias_corr) + eps)
        p.grad.data.zero_()
    hyperparams['t'] += 1

Training

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adam, init_adam_states(feature_dim),
               {'lr': 0.01, 't': 1}, data_iter, feature_dim)
loss: 0.242, 0.115 sec/epoch

Concise Implementation

trainer = torch.optim.Adam
d2l.train_concise_ch11(trainer, {'lr': 0.01}, data_iter)
loss: 0.243, 0.108 sec/epoch

Summary

  • Adam combines multiple optimization techniques
  • Key features:
    • Momentum (first moment) and RMSProp-style per-coordinate scaling (second moment)
    • Bias correction
    • Learning rate control
  • Yogi variant:
    • Addresses convergence issues
    • Modified second moment update
    • Better variance control

Exercises

  1. Experiment with learning rate adjustments
  2. Rewrite momentum updates without bias correction
  3. Analyze learning rate reduction during convergence
  4. Construct divergence cases for Adam vs Yogi

Sensor Fusion as a Regression Problem

  • High-resolution chemical imaging in STEM is limited by the weak inelastic scattering signal.
  • HAADF gives high SNR but lacks chemical specificity.
  • EDX/EELS gives chemistry but is noisy at low dose.
  • Goal: Fuse both signals for high-quality chemical maps.

Data Fusion as Inverse Problem

Reconstruction goal: \[ \hat{x} = \arg\min_{x \geq 0} \; \Psi_1(x) + \lambda_1 \Psi_2(x) + \lambda_2 \text{TV}(x) \]

Where:

  • \(\Psi_1\): HAADF model loss
  • \(\Psi_2\): spectroscopic data fidelity
  • \(\text{TV}(x)\): regularization term

First Term: HAADF Consistency

HAADF image model: \[ \Psi_1(x) = \frac{1}{2} \| b_H - A x^\gamma \|_2^2 \]

  • \(b_H\): measured HAADF signal
  • \(x^\gamma\): element-wise power (Z-contrast)
  • \(\gamma \approx 1.7\): approximates Z-contrast

Interpretation: Ensure the fused chemical map explains HAADF contrast.
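A minimal sketch of this term (NumPy; the helper name haadf_consistency is illustrative, with A a flattened measurement operator as in the tutorial below):

import numpy as np

def haadf_consistency(x, A, b_H, gamma=1.7):
    # Psi_1(x) = 0.5 * || b_H - A x^gamma ||_2^2
    r = b_H - A @ (x ** gamma)
    return 0.5 * float(r @ r)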

Second Term: Spectroscopic Fidelity

Poisson noise model for EDX/EELS: \[ \Psi_2(x) = \sum_i 1^T x_i - b_i^T \log(x_i + \varepsilon) \]

  • \(x_i\): reconstructed map of element \(i\)
  • \(b_i\): measured EDX/EELS signal for element \(i\)
  • \(\varepsilon\): small constant to avoid \(\log(0)\)

Interpretation: Match fused maps with noisy spectroscopic measurements.
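A minimal sketch of the per-element fidelity term (NumPy; names are illustrative):

import numpy as np

def poisson_fidelity(x_i, b_i, eps=1e-8):
    # Psi_2 contribution for element i: 1^T x_i - b_i^T log(x_i + eps)
    return float(np.sum(x_i) - b_i @ np.log(x_i + eps))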

Third Term: Total Variation (TV)

Channel-wise total variation: \[ \text{TV}(x) = \sum_i \|x_i\|_{TV} \]

Purpose:

  • Promote piecewise smooth maps
  • Reduce noise
  • Preserve edges

Popular in:

  • Compressed sensing
  • Image denoising
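A minimal anisotropic TV sketch for one channel (illustrative; the tutorial below uses the FGP-TV solver from fusion_utils instead):

import numpy as np

def tv_2d(img):
    # Anisotropic total variation: sum of absolute forward differences
    return np.abs(np.diff(img, axis=0)).sum() + np.abs(np.diff(img, axis=1)).sum()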

Summary of Loss Terms

Term              Meaning                 Benefit
\(\Psi_1\)        HAADF consistency       Uses high-SNR elastic signal
\(\Psi_2\)        Spectroscopy fidelity   Honors noisy chemical data
\(\text{TV}(x)\)  Regularization          Noise suppression and smoothness

All terms are necessary for accurate low-dose chemical recovery.

Practical Results

  • Improves SNR by 300–500%.
  • Reduces required dose by >10×.
  • Recovers stoichiometry with <15% error.

Takeaways

  • Multi-modal fusion = better signal, lower dose.
  • Expressed as interpretable optimization.
  • Each term plays a distinct role.

Future outlook: Combine with additional modalities (e.g., ABF, ptychography).

Overview

  • Tutorial on fusing EELS/X-EDS maps with HAADF for improved chemical resolution
  • Part 1 of 2: Atomic resolution HAADF and X-EDS dataset of DyScO\(_3\)
  • Python-based workflow with minimal user input (<10 tunable lines)
  • Quick transformation of datasets into resolution-enhanced chemical maps

Example Output

Raw vs Fused DyScO\(_3\)

Experimental Requirements

  • Need both elastic (e.g., HAADF) and inelastic (e.g., EELS/X-EDS) maps
  • Elastic signal must provide Z-contrast
  • Inelastic signal must map all chemistries
  • All maps must have same dimensionality
  • Recommendation: Use simultaneously collected HAADF signal

Step 1: Python Imports

import data.fusion_utils as utils
from scipy.sparse import spdiags
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm 
import numpy as np

Step 2: Data Loading

data = np.load('data/PTO_Trilayer_dataset.npz')
# Define element names and their atomic numbers (Z)
elem_names=['Sc', 'Dy', 'O']
elem_weights=[21,66,8]

# Parse elastic HAADF data and inelastic chemical maps
HAADF = data['HAADF']
xx = np.array([],dtype=np.float32)
for ee in elem_names:
    chemMap = data[ee]
    if chemMap.shape != HAADF.shape:
        raise ValueError(f"The dimensions of {ee} chemical map do not match HAADF dimensions.")
    chemMap -= np.min(chemMap); chemMap /= np.max(chemMap)
    xx = np.concatenate([xx,chemMap.flatten()])

Step 3: Data Reshaping

# Make Copy of Raw Measurements
xx0 = xx.copy()

# Parameters
gamma = 1.6  # Z-contrast scaling factor
(nx, ny) = chemMap.shape; nPix = nx * ny
nz = len(elem_names)

# Initialize TV Regularizers
reg = utils.tvlib(nx,ny)

# Normalize HAADF
HAADF -= np.min(HAADF); HAADF /= np.max(HAADF)
HAADF = HAADF.flatten()

# Create Measurement Matrix
A = utils.create_weighted_measurement_matrix(nx,ny,nz,elem_weights,gamma,1)

Step 4: Cost Function Parameters

# Convergence Parameters
lambdaHAADF = 1/nz # 1/nz (do not modify)
lambdaChem = 0.08 # 0.05-0.3 (data consistency)
lambdaTV = 0.15 # <0.2  Total Variation denoising
nIter = 30      # typically converges in 10-15
bkg = 2.4e-1    # background subtraction

# FGP TV Parameters
regularize = True
nIter_TV = 3

Step 5: Algorithm Execution

# Initialize
xx = xx0.copy()

# Cost functions: least-squares HAADF term and Poisson log-likelihood
# (the latter is the negative of the fidelity term Psi_2 defined earlier)
lsqFun = lambda inData : 0.5 * np.linalg.norm(A.dot(inData**gamma) - HAADF) ** 2
poissonFun = lambda inData : np.sum(xx0 * np.log(inData + 1e-8) - inData)

# Initialize Cost Tracking
costHAADF = np.zeros(nIter,dtype=np.float32)
costChem = np.zeros(nIter, dtype=np.float32)
costTV = np.zeros(nIter, dtype=np.float32)

# Main Loop
for kk in tqdm(range(nIter)):
    # Optimization
    xx -= gamma * spdiags(xx**(gamma - 1), [0], nz*nx*ny, nz*nx*ny) * \
          lambdaHAADF * A.transpose() * (A.dot(xx**gamma) - HAADF) + \
          lambdaChem * (1 - xx0 / (xx + bkg))
    
    # Positivity Constraint
    xx[xx<0] = 0
    
    # TV Regularization
    if regularize:
        for zz in range(nz):
            xx[zz*nPix:(zz+1)*nPix] = reg.fgp_tv(
                xx[zz*nPix:(zz+1)*nPix].reshape(nx,ny), 
                lambdaTV, 
                nIter_TV
            ).flatten()
            costTV[kk] += reg.tv(xx[zz*nPix:(zz+1)*nPix].reshape(nx,ny))
    
    # Track Costs
    costHAADF[kk] = lsqFun(xx)
    costChem[kk] = poissonFun(xx)

Step 6: Convergence Assessment

Convergence Criteria

  • All 3 cost functions should asymptotically approach low values
  • Look for:
    • Exponential decay
    • Brief overshooting (Lennard-Jones-like)
  • Avoid:
    • Incomplete convergence
    • Severe oscillations

Convergence Plot

TV Weighting Effects

TV Weighting Comparison

TV Weighting Guidelines

  • Under-weighting: Results in noisy reconstructions
  • Over-weighting: Causes blurring and feature loss
  • Best practice: Err on the side of under-weighting
    • Noise is a familiar artifact
    • Oversmoothing creates unphysical ones

Results Visualization

# Display Cost Functions
utils.plot_convergence(costHAADF, lambdaHAADF, 
                      costChem, lambdaChem, 
                      costTV, lambdaTV)
# Show Reconstructed Signal
fig, ax = plt.subplots(2,len(elem_names)+1,figsize=(12,6.5))
ax = ax.flatten()
ax[0].imshow((A.dot(xx**gamma)).reshape(nx,ny),cmap='gray'); ax[0].set_title('HAADF'); ax[0].axis('off')
ax[1+len(elem_names)].imshow((A.dot(xx**gamma)).reshape(nx,ny)[70:130,25:85],cmap='gray'); ax[1+len(elem_names)].set_title('HAADF Cropped'); ax[1+len(elem_names)].axis('off')

for ii in range(len(elem_names)):
    ax[ii+1].imshow(xx[ii*(nx*ny):(ii+1)*(nx*ny)].reshape(nx,ny),cmap='gray')
    ax[ii+2+len(elem_names)].imshow(xx[ii*(nx*ny):(ii+1)*(nx*ny)].reshape(nx,ny)[70:130,25:85],cmap='gray')
    
    ax[ii+1].set_title(elem_names[ii])
    ax[ii+1].axis('off')
    ax[ii+2+len(elem_names)].set_title(elem_names[ii]+' Cropped')
    ax[ii+2+len(elem_names)].axis('off')

fig.tight_layout()

Best Practices Summary

  1. Ensure proper data collection
  2. Verify dimensional consistency
  3. Start with recommended parameter ranges
  4. Monitor convergence carefully
  5. Validate results against physical expectations

References

Manassa, Jason, Miti Shah, Min Gee Cho, Zichao Wendy Di, Yi Jiang, Jeffrey A Fessler, Yu-Tsun Shao, Mary C Scott, Jonathan Schwartz, and Robert Hovden. 2024. “Fused Multi-Modal Electron Microscopy.” Elemental Microscopy. https://doi.org/10.69761/MXVR4353.
Schwartz, Jonathan, Zichao Wendy Di, Yi Jiang, Alyssa J. Fielitz, Don-Hyung Ha, Sanjaya D. Perera, Ismail El Baggari, et al. 2022. “Imaging Atomic-Scale Chemistry from Fused Multi-Modal Electron Microscopy.” Npj Computational Materials 8 (11): 1–8. https://doi.org/10.1038/s41524-021-00692-5.