Data Science for Electron Microscopy
Lecture 6: Gaussian Processes 1

Philipp Pelz

FAU Erlangen-Nürnberg

Learning Objectives

What You Will Learn Today

By the end of this lecture, you should be able to:

  1. Understand the intuition behind Gaussian Processes as distributions over functions
  2. Explain the key concepts of GP priors, kernels, and posterior inference
  3. Implement basic GP regression from scratch using Python
  4. Interpret GP predictions including uncertainty quantification
  5. Apply GPs to real-world problems in electron microscopy and beyond
  6. Compare GPs with other methods and understand when to use them

Course Outline

Part 1: Introduction

  • Intuitive understanding of GPs
  • From data to functions
  • Prior and posterior distributions
  • Uncertainty quantification

Part 2: GP Priors

  • Mathematical foundations
  • Kernel functions (RBF, etc.)
  • From weight space to function space
  • Sampling from GP priors

Part 3: GP Inference

  • Posterior inference for regression
  • Hyperparameter learning
  • Implementation from scratch
  • Advanced topics and applications

Why Gaussian Processes? Key Advantages for Electron Microscopy

  • Uncertainty Quantification: GPs provide principled uncertainty estimates, crucial for scientific applications

  • Small Data Performance: Excel with limited training data common in EM experiments

  • Interpretability: Kernel parameters have clear physical meaning

  • Flexibility: Can incorporate domain knowledge through kernel design

  • Bayesian Framework: Natural handling of experimental uncertainties

Introduction to Gaussian Processes 1

  • Gaussian processes provide a mechanism for directly reasoning about the high-level properties of functions that could fit our data.
  • We may have a sense of whether these functions are quickly varying, periodic, involve conditional independencies, or are translation invariant.
  • Gaussian processes let us easily incorporate these properties into our model by directly specifying a Gaussian distribution over the function values that could fit our data.

Introduction to Gaussian Processes 2

  • Suppose we observe the following dataset of regression targets (outputs) \(y\), indexed by inputs \(x\).

Observed data.

  • example: targets could be changes in carbon dioxide concentrations, inputs could be the times at which these targets have been recorded

  • What are some features of the data? How quickly does it seem to vary? Do we have data points collected at regular intervals, or are there missing inputs? How would you imagine filling in the missing regions, or forecasting up until \(x=25\)?

Introduction to Gaussian Processes 3

  • start by specifying a prior distribution over what types of functions we might believe to be reasonable.
  • show several sample functions from a Gaussian process. Does this prior look reasonable? We are not looking for functions that fit our dataset, but instead specifying reasonable high-level properties of the solutions, such as how quickly they vary with inputs. Note that we will see code for reproducing all of the plots in this notebook in the next notebooks on priors and inference.

Sample prior functions that we may want to represent with our model.

Introduction to Gaussian Processes 3.5

Bayes' theorem: \(p(f \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid f)\, p(f)}{p(\mathcal{D})}\), i.e. the posterior over functions is proportional to the likelihood times the prior.

Introduction to Gaussian Processes 4

Once we condition on data, we can use this prior to infer a posterior distribution over functions that could fit the data. Here we show sample posterior functions.

Sample posterior functions, once we have observed the data.

  • each of these functions is entirely consistent with our data, perfectly running through each observation.
  • In order to use these posterior samples to make predictions, we can average the values of every possible sample function from the posterior, to create the curve below, in thick blue.
  • Note that we do not actually have to take an infinite number of samples to compute this expectation; as we will see later, we can compute the expectation in closed form.

Introduction to Gaussian Processes 5

Posterior samples, alongside posterior mean, which can be used for point predictions, in blue.
  • may also want a representation of uncertainty, so we know how confident we should be in our predictions.

  • Intuitively: more variability in the sample posterior functions –> more uncertainty

  • epistemic uncertainty, which is the reducible uncertainty associated with lack of information.

  • acquire more data –> this type of uncertainty disappears, as there will be increasingly fewer solutions consistent with what we observe.

  • Like with the posterior mean, we can compute the posterior variance (the variability of these functions in the posterior) in closed form.

Introduction to Gaussian Processes 6

Posterior samples, including 95% credible set.
  • shade: two times the posterior standard deviation on either side of the mean, creating a credible interval that has a 95% probability of containing the true value of the function for any input \(x\).
  • plot looks somewhat cleaner if we remove the posterior samples, simply visualizing the data, posterior mean, and 95% credible set.
  • Notice how the uncertainty grows away from the data, a property of epistemic uncertainty.

Introduction to Gaussian Processes 7

Point predictions, and credible set.
  • properties of the Gaussian process that we used to fit the data are strongly controlled by what’s called a covariance function, also known as a kernel.

  • covariance function we used is called the RBF (Radial Basis Function) kernel, which has the form \[ k_{\text{RBF}}(x,x') = \mathrm{Cov}(f(x),f(x')) = a^2 \exp\left(-\frac{1}{2\ell^2}\|x-x'\|^2\right) \]

  • The hyperparameters of this kernel are interpretable. The amplitude parameter \(a\) controls the vertical scale over which the function is varying, and the length-scale parameter \(\ell\) controls the rate of variation (the wiggliness) of the function.

  • Larger \(a\) means larger function values, and larger \(\ell\) means more slowly varying functions. Let’s see what happens to our sample prior and posterior functions as we vary \(a\) and \(\ell\).

Introduction to Gaussian Processes 8

\[ k_{\text{RBF}}(x,x') = \mathrm{Cov}(f(x),f(x')) = a^2 \exp\left(-\frac{1}{2\ell^2}\|x-x'\|^2\right) \]

  • The length-scale has a particularly pronounced effect on the predictions and uncertainty of a GP. At \(\|x-x'\| = \ell\) , the covariance between a pair of function values is \(a^2\exp(-0.5)\).

  • At distances larger than \(\ell\), the function values become nearly uncorrelated. This means that if we want to make a prediction at a point \(x_*\), then function values at inputs \(x\) such that \(\|x-x_*\|>\ell\) will not have a strong effect on our predictions.
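To make these numbers concrete, here is a minimal sketch (plain NumPy; the amplitude and length-scale values are arbitrary illustrations) that evaluates the RBF covariance at a few separations. At \(\|x-x'\| = \ell\) it returns \(a^2 e^{-0.5} \approx 0.61\,a^2\), and it decays rapidly beyond that.

import numpy as np

def k_rbf(x, x_prime, a=1.0, ell=2.0):
    # RBF covariance between two scalar inputs
    return a ** 2 * np.exp(-0.5 * (x - x_prime) ** 2 / ell ** 2)

# covariance of f(0) with f(d) for increasing separation d (here ell = 2)
for d in [0.0, 1.0, 2.0, 4.0, 10.0]:
    print(f"|x - x'| = {d:4.1f}  ->  k = {k_rbf(0.0, d):.4f}")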

Introduction to Gaussian Processes 9

  • how changing the lengthscale affects sample prior and posterior functions, and credible sets. The above fits use a length-scale of \(2\). Let’s now consider \(\ell = 0.1, 0.5, 2, 5, 10\) .

Figure: sample prior and posterior functions with length-scales \(0.1\) and \(0.5\).

  • A length-scale of \(0.1\) is very small relative to the range of the input domain we are considering, \(25\). For example, the values of the function at \(x=5\) and \(x=10\) will have essentially no correlation at such a length-scale.
  • On the other hand, for a length-scale of \(10\), the function values at these inputs will be highly correlated.
  • Note that the vertical scale changes in the following figures.

Introduction to Gaussian Processes 10

Figure: sample prior and posterior functions with length-scales \(2\) and \(5\).

  • as the length-scale increases, the ‘wiggliness’ of the functions decreases, and our uncertainty decreases.
  • If the length-scale is small, the uncertainty will quickly increase as we move away from the data, as the datapoints become less informative about the function values.

Introduction to Gaussian Processes 11

Figure: sample prior and posterior functions with amplitudes \(0.1\), \(2\), and \(8\) (length-scale fixed at \(2\)).

  • now vary the amplitude parameter, holding the length-scale fixed at \(2\).
  • Note the vertical scale is held fixed for the prior samples, and varies for the posterior samples, so you can clearly see both the increasing scale of the function, and the fits to the data.

Introduction to Gaussian Processes 12

\[ k_{\text{RBF}}(x,x') = \mathrm{Cov}(f(x),f(x')) = a^2 \exp\left(-\frac{1}{2\ell^2}\|x-x'\|^2\right) \]

  • amplitude parameter affects the scale of the function, but not the rate of variation.

  • generalization performance of our procedure will depend on having reasonable values for these hyperparameters.

  • Values of \(\ell=2\) and \(a=1\) appeared to provide reasonable fits, while some of the other values did not.

  • Fortunately, there is a robust and automatic way to specify these hyperparameters, using what is called the marginal likelihood, which we will return to in the notebook on inference.


So what is a GP, really?

  • GP: a distribution over functions such that any collection of function values \(f(x_1),\dots,f(x_n)\), indexed by any collection of inputs \(x_1,\dots,x_n\), has a joint multivariate Gaussian distribution.
  • mean vector \(\boldsymbol{\mu}\) of this distribution is given by a mean function, which is typically taken to be a constant or zero.

  • covariance matrix of this distribution is given by the kernel evaluated at all pairs of the inputs \(x\).

  • \[\begin{bmatrix}f(x) \\f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix}\sim \mathcal{N}\left(\boldsymbol{\mu}, \begin{bmatrix}k(x,x) & k(x, x_1) & \dots & k(x,x_n) \\ k(x_1,x) & k(x_1,x_1) & \dots & k(x_1,x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n,x) & k(x_n,x_1) & \dots & k(x_n,x_n) \end{bmatrix}\right)\]

  • The equation above specifies a GP prior. We can compute the conditional distribution of \(f(x)\) for any \(x\) given \(f(x_1), \dots, f(x_n)\), the function values we have observed.

  • This conditional distribution is also Gaussian, with mean \(m\) and variance \(s^2\):

\[f(x) | f(x_1), \dots, f(x_n) \sim \mathcal{N}(m,s^2)\]

  • where \(m\) and \(s^2\) are functions of \(x\) and the observed function values.
  • if we want to create an interval with a 95% probability that \(f(x)\) is in the interval, we would use \(m \pm 2s\).
  • This is the essence of Gaussian process inference: we specify a prior over functions, and then we condition on observed function values to get a posterior over functions.
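For reference, \(m\) and \(s^2\) follow from the standard conditioning formula for a partitioned zero-mean Gaussian. Writing \(\mathbf{f} = (f(x_1),\dots,f(x_n))^{\top}\), \(K_{ij} = k(x_i,x_j)\), and \(\mathbf{k} = (k(x_1,x),\dots,k(x_n,x))^{\top}\),

\[f(x) \mid \mathbf{f} \sim \mathcal{N}\left(\mathbf{k}^{\top} K^{-1} \mathbf{f},\; k(x,x) - \mathbf{k}^{\top} K^{-1} \mathbf{k}\right),\]

so \(m = \mathbf{k}^{\top} K^{-1} \mathbf{f}\) and \(s^2 = k(x,x) - \mathbf{k}^{\top} K^{-1} \mathbf{k}\) (the GP mean function is taken to be zero here).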

A Simple Example

  • suppose we observe a single datapoint, \(f(x_1)\), and we want to determine the value of \(f(x)\) at some \(x\).
  • \(f(x)\) described by Gaussian process –> joint distribution over \((f(x), f(x_1))\) is Gaussian:

\[\begin{bmatrix} f(x) \\ f(x_1) \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} k(x,x) & k(x,x_1) \\ k(x_1,x) & k(x_1,x_1) \end{bmatrix}\right)\]

  • The covariance \(k(x,x_1)\) tells us how correlated the function values will be — how strongly determined \(f(x)\) is by \(f(x_1)\).
  • visualize the process of determining \(f(x)\) from \(f(x_1)\) both in the space of functions, and in the joint distribution over \(f(x_1), f(x)\).
  • initially consider an \(x\) such that \(k(x,x_1) = 0.9\), and \(k(x,x)=1\), meaning that the value of \(f(x)\) is moderately correlated with the value of \(f(x_1)\).
  • If we observe \(f(x_1) = 1.2\), then we can draw a horizontal line at \(1.2\) on our plot of the density, and see that the value of \(f(x)\) is likely to be around \(1.08\).

  • The observed point \(f(x_1)\) is shown in orange, and 1 standard deviation of the Gaussian process predictive distribution for \(f(x)\) is shown in blue.

Contours of constant probability of a bivariate Gaussian density over \(f(x_1)\) and \(f(x)\) with \(k(x,x_1) = 0.9\).

Gaussian process predictive distribution in function space at \(f(x)\), with \(k(x,x_1) = 0.9\).
  • If we increase the correlation to \(k(x,x_1) = 0.95\), then the ellipses have narrowed further, and the value of \(f(x)\) is more strongly determined by \(f(x_1)\).

  • Drawing a horizontal line at \(1.2\), we see the contours for \(f(x)\) are more concentrated around \(1.14\).

Contours of constant probability of a bivariate Gaussian density over \(f(x_1)\) and \(f(x)\) with \(k(x,x_1) = 0.95\).

Gaussian process predictive distribution in function space at \(f(x)\), with \(k(x,x_1) = 0.95\).

  • This procedure can give us a posterior on \(f(x)\) for any \(x\), for any number of points we have observed.
  • visualize the posterior for \(f(x)\) at a particular \(x=x'\) in function space.
  • The exact distribution for \(f(x)\) follows from the joint distribution above: \(f(x)\) is Gaussian distributed, with mean

\[m = k(x,x_1) k(x_1,x_1)^{-1} f(x_1)\]

and variance

\[s^2 = k(x,x) - k(x,x_1) k(x_1,x_1)^{-1} k(x_1,x)\]
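Plugging the numbers from this example into these formulas (a quick sketch; we also take \(k(x_1,x_1)=1\), which is implicit in the quoted value of \(1.08\)) recovers the values mentioned above:

k_xx, k_xx1, k_x1x1 = 1.0, 0.9, 1.0   # kernel values from the example
f_x1 = 1.2                            # observed function value

m = k_xx1 / k_x1x1 * f_x1             # 0.9 * 1.2 = 1.08
s2 = k_xx - k_xx1 ** 2 / k_x1x1       # 1 - 0.81 = 0.19
print(m, s2 ** 0.5)                   # mean 1.08, standard deviation ~ 0.44

Repeating this with \(k(x,x_1) = 0.95\) gives a mean of \(1.14\) and a smaller standard deviation (about \(0.31\)), matching the narrower contours above.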

  • easy to include observation noise. If we assume that the data are generated from a latent noise-free function \(f(x)\) plus iid Gaussian noise \(\epsilon(x) \sim \mathcal{N}(0,\sigma^2)\), then we can write

\[y(x) = f(x) + \epsilon(x)\]

  • The joint distribution over \(y(x_1)\) and \(f(x)\) is then

\[\begin{bmatrix} y(x_1) \\ f(x) \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} k(x_1,x_1) + \sigma^2 & k(x_1,x) \\ k(x,x_1) & k(x,x) \end{bmatrix}\right)\]

  • The conditional distribution of \(f(x)\) given \(y(x_1)\) is then

\[f(x) | y(x_1) \sim \mathcal{N}(m,s^2)\]

where

\[m = k(x,x_1) (k(x_1,x_1) + \sigma^2)^{-1} y(x_1)\]

and

\[s^2 = k(x,x) - k(x,x_1) (k(x_1,x_1) + \sigma^2)^{-1} k(x_1,x)\]

  • This is the basic form of Gaussian process regression. We can extend this to multiple observations, and we can also learn the hyperparameters of the kernel from the data.
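As a sketch of that extension to several noisy observations (the training inputs, targets, noise level, and RBF length-scale below are illustrative assumptions, not values from the text):

import numpy as np

def k_rbf(x1, x2, a=1.0, ell=1.0):
    # RBF kernel matrix between two arrays of scalar inputs
    return a ** 2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell ** 2)

# illustrative training data, noise level, and test input
x_train = np.array([0.0, 1.0, 2.5])
y_train = np.array([1.2, 0.8, -0.3])
sigma2 = 0.1
x_test = np.array([1.5])

K = k_rbf(x_train, x_train) + sigma2 * np.eye(len(x_train))
k_star = k_rbf(x_train, x_test)

m = k_star.T @ np.linalg.solve(K, y_train)                          # posterior mean of f(x_test)
s2 = k_rbf(x_test, x_test) - k_star.T @ np.linalg.solve(K, k_star)  # posterior variance
print(m, np.sqrt(np.diag(s2)))

The same computation with two observations and \(\sigma^2 = 0.1\) is exactly the setting of the exercise below.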

Exercise

  • Suppose we observe two points, \(y(x_1) = 1.2\) and \(y(x_2) = 0.8\), with noise variance \(\sigma^2 = 0.1\).
  • Are we more or less certain about the value of \(f(x)\), than when we had only observed \(f(x_1)\)? What is the mean and 95% credible set for our value of \(f(x)\) now?

Gaussian Process Priors

  • Understanding GPs is important for reasoning about model construction and generalization, and for achieving state-of-the-art performance in a variety of applications, including active learning, and hyperparameter tuning in deep learning.

  • GPs are everywhere, and it is in our interests to know what they are and how we can use them.

  • this section: Gaussian process priors over functions.

import numpy as np
from scipy.spatial import distance_matrix
from d2l import torch as d2l
import torch

d2l.set_figsize()

Definition 1

  • GP is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution.

  • If a function \(f(x)\) is a Gaussian process, with mean function \(m(x)\) and covariance function or kernel \(k(x,x')\), \(f(x) \sim \mathcal{GP}(m, k)\),

  • –> any collection of function values queried at any collection of input points \(x\) (times, spatial locations, image pixels, etc.), has a joint multivariate Gaussian distribution with mean vector \(\mu\) and covariance matrix \(K\): \(f(x_1),\dots,f(x_n) \sim \mathcal{N}(\mu, K)\), where \(\mu_i = E[f(x_i)] = m(x_i)\) and \(K_{ij} = \mathrm{Cov}(f(x_i),f(x_j)) = k(x_i,x_j)\).

Definition 2

  • Any function

\[f(x) = w^{\top} \phi(x) = \langle w, \phi(x) \rangle, \tag{1}\]

with \(w\) drawn from a Gaussian (normal) distribution, and \(\phi\) being any vector of basis functions, for example \(\phi(x) = (1, x, x^2, ..., x^d)^{\top}\), is a Gaussian process.

  • Moreover, any Gaussian process \(f(x)\) can be expressed in the form of equation (1).

A Simple Gaussian Process 1

  • consider a few concrete examples

  • Suppose \(f(x) = w_0 + w_1 x\), and \(w_0, w_1 \sim \mathcal{N}(0,1)\), with \(w_0, w_1, x\) all in one dimension.

  • can equivalently write this function as the inner product \(f(x) = (w_0, w_1)(1, x)^{\top}\). In (1) above, \(w = (w_0, w_1)^{\top}\) and \(\phi(x) = (1,x)^{\top}\).

  • For any \(x\), \(f(x)\) is a sum of two Gaussian random variables.

  • Gaussians are closed under addition –> \(f(x)\) is also a Gaussian random variable for any \(x\).

  • In fact, we can compute for any particular \(x\) that \(f(x)\) is \(\mathcal{N}(0,1+x^2)\).

  • Similarly, the joint distribution for any collection of function values, \((f(x_1),\dots,f(x_n))\), for any collection of inputs \(x_1,\dots,x_n\), is a multivariate Gaussian distribution. Therefore \(f(x)\) is a Gaussian process.

  • \(f(x)\) is a random function, or a distribution over functions.
  • gain some insights into this distribution by repeatedly sampling values for \(w_0, w_1\), and visualizing the corresponding functions \(f(x)\), which are straight lines with different slopes and intercepts, as follows:

A Simple Gaussian Process 2

def lin_func(x, n_sample):
    # draw n_sample random functions f(x) = w0 + w1 * x with w0, w1 ~ N(0, 1)
    preds = np.zeros((n_sample, x.shape[0]))
    for ii in range(n_sample):
        w = np.random.normal(0, 1, 2)
        y = w[0] + w[1] * x
        preds[ii, :] = y
    return preds

x_points = np.linspace(-5, 5, 50)
outs = lin_func(x_points, 10)
# +/- 2 standard deviations of the marginal f(x) ~ N(0, 1 + x^2)
lw_bd = -2 * np.sqrt((1 + x_points ** 2))
up_bd = 2 * np.sqrt((1 + x_points ** 2))

d2l.set_figsize((12,5))
d2l.plt.fill_between(x_points, lw_bd, up_bd, alpha=0.25)
d2l.plt.plot(x_points, np.zeros(len(x_points)), linewidth=4, color='black')
d2l.plt.plot(x_points, outs.T)
d2l.plt.xlabel("x", fontsize=20)
d2l.plt.ylabel("f(x)", fontsize=20)
d2l.plt.show()

  • If \(w_0\) and \(w_1\) are instead drawn from \(\mathcal{N}(0,\alpha^2)\), how do you imagine varying \(\alpha\) affects the distribution over functions?

From Weight Space to Function Space 1

  • we saw how a distribution over parameters in a model induces a distribution over functions.

  • we often have ideas about the functions we want to model — whether they’re smooth, periodic, quickly varying, etc. — but it is relatively tedious to reason about the parameters, which are largely uninterpretable.

  • GPs provide an easy mechanism to reason directly about functions.

  • Just as a Gaussian distribution is entirely defined by its first two moments, its mean and covariance matrix, a Gaussian process is by extension defined by its mean function and covariance function.

In the above example, the mean function

\[m(x) = E[f(x)] = E[w_0 + w_1x] = E[w_0] + E[w_1]x = 0+0 = 0.\]

Similarly, the covariance function is

\[k(x,x') = \mathrm{Cov}(f(x),f(x')) = E[f(x)f(x')]-E[f(x)]E[f(x')] = E[w_0^2 + w_0w_1x' + w_1w_0x + w_1^2xx'] = 1 + xx'.\]
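A quick Monte Carlo sanity check of this derivation (a minimal sketch; the inputs \(x = 1.5\), \(x' = -0.5\) and the number of samples are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=(100_000, 2))   # columns: w0, w1 ~ N(0, 1)

x, x_prime = 1.5, -0.5
f_x = w[:, 0] + w[:, 1] * x
f_xp = w[:, 0] + w[:, 1] * x_prime

print(f_x.mean())                  # close to m(x) = 0
print(np.cov(f_x, f_xp)[0, 1])     # close to k(x, x') = 1 + x * x' = 0.25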

From Weight Space to Function Space 2

  • distribution over functions can now be directly specified and sampled from, without needing to sample from the distribution over parameters.

  • For example, to draw from \(f(x)\), we can simply form our multivariate Gaussian distribution associated with any collection of \(x\) we want to query, and sample from it directly.

  • very advantageous

  • same derivation for the simple straight line model above can be applied to find the mean and covariance function for any model of the form \(f(x) = w^{\top} \phi(x)\), with \(w \sim \mathcal{N}(u,S)\).

  • In this case, the mean function is \(m(x) = u^{\top}\phi(x)\), and the covariance function is \(k(x,x') = \phi(x)^{\top}S\phi(x')\). Since \(\phi(x)\) can represent a vector of any non-linear basis functions, we are considering a very general model class, including models with even an infinite number of parameters.
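As a short sketch of sampling directly in function space for such a model (here a cubic polynomial basis with an assumed diagonal weight covariance \(S\); all specific values are illustrative):

import numpy as np

def phi(x):
    # basis functions (1, x, x^2, x^3), one row per input point
    return np.stack([np.ones_like(x), x, x ** 2, x ** 3], axis=1)

x = np.linspace(-2, 2, 50)
u = np.zeros(4)                       # weight mean
S = np.diag([1.0, 1.0, 0.25, 0.05])   # assumed weight covariance

mean = phi(x) @ u                     # m(x) = u^T phi(x)
cov = phi(x) @ S @ phi(x).T           # k(x, x') = phi(x)^T S phi(x')

# small jitter on the diagonal, since this rank-4 covariance matrix is singular
samples = np.random.multivariate_normal(mean, cov + 1e-8 * np.eye(len(x)), size=5)

Each sample is a random cubic polynomial, drawn without ever sampling the weights themselves.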

The Radial Basis Function (RBF) Kernel 1

  • radial basis function (RBF) kernel is the most popular covariance function for Gaussian processes
  • kernel has the form \(k_{\text{RBF}}(x,x') = a^2\exp\left(-\frac{1}{2\ell^2}||x-x'||^2\right)\), where \(a\) is an amplitude parameter, and \(\ell\) is a lengthscale hyperparameter.

Let’s derive this kernel starting from weight space. Consider the function

\[f(x) = \sum_{i=1}^J w_i \phi_i(x), \quad w_i \sim \mathcal{N}\left(0,\frac{\sigma^2}{J}\right), \quad \phi_i(x) = \exp\left(-\frac{(x-c_i)^2}{2\ell^2}\right).\]

\(f(x)\) is a sum of radial basis functions, with width \(\ell\), centred at the points \(c_i\), as shown in the following figure.

  • We can recognize \(f(x)\) as having the form \(w^{\top} \phi(x)\), where \(w = (w_1,\dots,w_J)^{\top}\) and \(\phi(x)\) is a vector containing each of the radial basis functions. The covariance function of this Gaussian process is then

\[k(x,x') = \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x)\phi_i(x').\]

The Radial Basis Function (RBF) Kernel 2

  • what happens as we take the number of parameters (and basis functions) to infinity. Let \(c_J = \log J\), \(c_1 = -\log J\), and \(c_{i+1}-c_{i} = \Delta c = 2\frac{\log J}{J}\), and \(J \to \infty\). The covariance function becomes the Riemann sum:

\[k(x,x') = \lim_{J \to \infty} \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x)\phi_i(x') = \int_{c_0}^{c_\infty} \phi_c(x)\phi_c(x') dc.\]

By setting \(c_0 = -\infty\) and \(c_\infty = \infty\), we spread the infinitely many basis functions across the whole real line, each a distance \(\Delta c \to 0\) apart:

\[k(x,x') = \int_{-\infty}^{\infty} \exp(-\frac{(x-c)^2}{2\ell^2}) \exp(-\frac{(x'-c)^2}{2\ell^2 }) dc = \sqrt{\pi}\ell \sigma^2 \exp(-\frac{(x-x')^2}{2(\sqrt{2} \ell)^2}) \propto k_{\text{RBF}}(x,x').\]

  • By moving into the function space representation, we have derived how to represent a model with an infinite number of parameters, using a finite amount of computation.

  • GP with an RBF kernel is a universal approximator, capable of representing any continuous function to arbitrary precision.
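We can also check the limiting step in this derivation numerically: with many evenly spaced radial basis functions, the Riemann sum over centres closely matches the closed-form Gaussian integral, which is proportional to an RBF kernel in \(x - x'\) (a minimal sketch; the grid of centres and the width are illustrative):

import numpy as np

ell, sigma2, J = 1.0, 1.0, 2000
c = np.linspace(-10, 10, J)        # evenly spaced basis-function centres
dc = c[1] - c[0]

def phi(x):
    # radial basis functions of width ell, evaluated at a scalar x
    return np.exp(-(x - c) ** 2 / (2 * ell ** 2))

def k_sum(x, x_prime):
    # Riemann-sum approximation of the integral over centres
    return sigma2 * dc * np.sum(phi(x) * phi(x_prime))

def k_limit(x, x_prime):
    # closed form: sqrt(pi) * ell * sigma^2 * exp(-(x - x')^2 / (4 ell^2))
    return np.sqrt(np.pi) * ell * sigma2 * np.exp(-(x - x_prime) ** 2 / (4 * ell ** 2))

print(k_sum(0.3, 1.1), k_limit(0.3, 1.1))   # the two values should closely agree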

The Radial Basis Function (RBF) Kernel 3

  • We can intuitively see why from the above derivation.

  • We can collapse each radial basis function to a point mass taking \(\ell \to 0\), and give each point mass any height we wish.

  • GP with an RBF kernel is a model with an infinite number of parameters and much more flexibility than any finite neural network

  • all the fuss about overparametrized neural networks is misplaced?

  • GPs with RBF kernels do not overfit, and in fact provide especially compelling generalization performance on small datasets.

  • The examples in Zhang 2021, such as the ability to fit images with random labels perfectly while still generalizing well on structured problems, can be perfectly reproduced using Gaussian processes (Wilson 2020).

  • Neural networks are not as distinct as we make them out to be.

The Radial Basis Function (RBF) Kernel 4

  • build further intuition about GPs with RBF kernels, and hyperparameters such as length-scale, by sampling directly from the distribution over functions.

  • simple procedure:

  1. Choose the input \(x\) points we want to query the GP: \(x_1,\dots,x_n\).
  2. Evaluate \(m(x_i)\), \(i = 1,\dots,n\), and \(k(x_i,x_j)\) for \(i,j = 1,\dots,n\) to respectively form the mean vector and covariance matrix \(\mu\) and \(K\), where \((f(x_1),\dots,f(x_n)) \sim \mathcal{N}(\mu, K)\).
  3. Sample from this multivariate Gaussian distribution to obtain the sample function values.
  4. Sample more times to visualize more sample functions queried at those points.

We illustrate this process in the figure below.

The Radial Basis Function (RBF) Kernel 5

def rbfkernel(x1, x2, ls=4.):  #@save
    # pairwise squared distances between the 1-D input arrays x1 and x2
    dist = distance_matrix(np.expand_dims(x1, 1), np.expand_dims(x2, 1))
    # covariance exp(-dist^2 / (2 * ls)), with the amplitude fixed at 1
    return np.exp(-(1. / ls / 2) * (dist ** 2))

x_points = np.linspace(0, 5, 50)
meanvec = np.zeros(len(x_points))
covmat = rbfkernel(x_points, x_points, 1)

# draw 5 sample functions from the GP prior at the queried points
prior_samples = np.random.multivariate_normal(meanvec, covmat, size=5)
d2l.plt.plot(x_points, prior_samples.T, alpha=0.5)
d2l.plt.show()

The Neural Network Kernel 1

  • Research on Gaussian processes in machine learning was triggered by research on neural networks.
  • We can derive the neural network kernel as follows.

Consider a neural network function \(f(x)\) with one hidden layer:

\[f(x) = b + \sum_{i=1}^{J} v_i h(x; u_i).\]

\(b\) is a bias, \(v_i\) are the hidden to output weights, \(h\) is any bounded hidden unit transfer function, \(u_i\) are the input to hidden weights, and \(J\) is the number of hidden units.

  • Let \(b\) and \(v_i\) be independent with zero mean and variances \(\sigma_b^2\) and \(\sigma_v^2/J\), respectively, and let the \(u_i\) have independent identical distributions.

  • use the central limit theorem to show that any collection of function values \(f(x_1),\dots,f(x_n)\) has a joint multivariate Gaussian distribution.

The Neural Network Kernel 2

The mean and covariance function of the corresponding Gaussian process are:

\[m(x) = E[f(x)] = 0\]

\[k(x,x') = \text{cov}[f(x),f(x')] = E[f(x)f(x')] = \sigma_b^2 + \frac{1}{J} \sum_{i=1}^{J} \sigma_v^2 E[h_i(x; u_i)h_i(x'; u_i)]\]

In some cases, we can essentially evaluate this covariance function in closed form. Let \(h(x; u) = \text{erf}(u_0 + \sum_{j=1}^{P} u_j x_j)\), where \(\text{erf}(z) = \frac{2}{\sqrt{\pi}} \int_{0}^{z} e^{-t^2} dt\), and \(u \sim \mathcal{N}(0,\Sigma)\). Then, writing \(\tilde{x} = (1, x_1, \dots, x_P)^{\top}\) for the input augmented with a bias entry, \(k(x,x') = \frac{2}{\pi} \arcsin\left(\frac{2 \tilde{x}^{\top} \Sigma \tilde{x}'}{\sqrt{(1 + 2 \tilde{x}^{\top} \Sigma \tilde{x})(1 + 2 \tilde{x}'^{\top} \Sigma \tilde{x}')}}\right)\).

  • RBF kernel is stationary, meaning that it is translation invariant, and therefore can be written as a function of \(\tau = x-x'\).
  • Intuitively, stationarity means that the high-level properties of the function, such as rate of variation, do not change as we move in input space.
  • The neural network kernel, however, is non-stationary.
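As a sketch, the closed-form erf-network kernel above can be evaluated directly; assuming scalar inputs with \(\tilde{x} = (1, x)^{\top}\) and an identity prior covariance \(\Sigma\) (both illustrative choices), we can also confirm numerically that it is not a function of \(x - x'\) alone:

import numpy as np

def nn_kernel(x, x_prime, Sigma=np.eye(2)):
    # arcsin (erf) neural network kernel for scalar inputs, with x_tilde = (1, x)
    xt = np.array([1.0, x])
    xpt = np.array([1.0, x_prime])
    num = 2 * xt @ Sigma @ xpt
    den = np.sqrt((1 + 2 * xt @ Sigma @ xt) * (1 + 2 * xpt @ Sigma @ xpt))
    return (2 / np.pi) * np.arcsin(num / den)

# same separation |x - x'| = 1, but different covariance values: non-stationary
print(nn_kernel(0.0, 1.0), nn_kernel(3.0, 4.0))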

Summary

  • first step in performing Bayesian inference involves specifying a prior

  • GPs can be used to specify a whole prior over functions.

  • Starting from a traditional “weight space” view of modelling, we can induce a prior over functions by taking the functional form of a model and introducing a distribution over its parameters.

  • alternatively specify a prior distribution directly in function space, with properties controlled by a kernel.

  • function-space approach has many advantages. We can build models that actually correspond to an infinite number of parameters, but use a finite amount of computation!

  • models have a great amount of flexibility, but also make strong assumptions about what types of functions are a priori likely, leading to relatively good generalization on small datasets.

  • assumptions of models in function space controlled by kernels: encode higher level properties of functions, such as smoothness and periodicity

  • Many kernels are stationary: they are translation invariant.

  • Functions drawn from GP with a stationary kernel have roughly the same high-level properties regardless of where we look in the input space.

  • GPs are a relatively general model class, including polynomials, Fourier series, and so on, as long as we have a Gaussian prior over the parameters.

  • also include neural networks with an infinite number of parameters, even without Gaussian distributions over the parameters.

Gaussian Process Inference

  • Posterior inference & predictions using GP priors
  • Focus on regression with closed-form inference
  • Implementation from scratch
  • Introduction to GPyTorch for SOTA GPs
  • Advanced topics (approximate inference) in next section

Posterior Inference for Regression 1

  • Observation model: relates \(f(x)\) to observations \(y(x)\)

    • \(x\): input (e.g., image pixels)
    • \(y\): output (e.g., class label, temperature, concentration)
  • Regression model: \[y(x) = f(x) + \epsilon(x), \quad \epsilon(x) \sim \mathcal{N}(0,\sigma^2)\]

  • Notation:

    • \(\mathbf{y} = (y(x_1),\dots,y(x_n))^{\top}\): training observations
    • \(\textbf{f} = (f(x_1),\dots,f(x_n))^{\top}\): latent function values
    • \(X = \{x_1, \dots, x_n\}\): training inputs

Posterior Inference for Regression 2

  • GP prior: \(f(x) \sim \mathcal{GP}(m,k)\)
    • Mean vector: \(\mu_i = m(x_i)\)
    • Covariance matrix: \(K_{ij} = k(x_i,x_j)\)
  • Standard choices:
    • RBF kernel: \(k(x_i,x_j) = a^2 \exp\left(-\frac{1}{2\ell^2}||x_i-x_j||^2\right)\)
    • Mean function: \(m(x)=0\) (for simplicity)
  • Goal: Predict at test inputs \(X_* = \{x_{*1},x_{*2},\dots,x_{*m}\}\)
    • Find \(p(\mathbf{f}_* | \mathbf{y}, X)\)
    • Use Gaussian identities for joint distribution

Posterior Inference for Regression 3

  • Training data: \(\mathbf{y} = \mathbf{f} + \mathbf{\epsilon}\)
    • \(\mathbf{f} \sim \mathcal{N}(0,K(X,X))\)
    • \(\mathbf{\epsilon} \sim \mathcal{N}(0,\sigma^2I)\)
    • \(\mathbf{y} \sim \mathcal{N}(0, K(X,X) + \sigma^2I)\)
  • Covariance structure:
    • \(\mathrm{cov}(\mathbf{f}_*, \mathbf{y}) = K(X_*,X)\)
    • Joint distribution: \[ \begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X,X)+\sigma^2I & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{bmatrix} \right) \]

Posterior Inference for Regression 4

  • Kernel parameters \(\theta\) (e.g., \(a\), \(\ell\) in RBF)
  • Use marginal likelihood \(p(\textbf{y} | \theta, X)\) for learning
  • Marginal likelihood properties: automatically balances data fit against model complexity.

Equations for Making Predictions and Learning Kernel Hyperparameters in GP Regression

  • Two-step procedure:

    1. Learn \(\hat{\theta}\) via marginal likelihood maximization
    2. Use predictive mean & 2×std for 95% credible set
  • Log marginal likelihood: \[\log p(\textbf{y} | \theta, X) = -\frac{1}{2}\textbf{y}^{\top}[K_{\theta}(X,X) + \sigma^2I]^{-1}\textbf{y} - \frac{1}{2}\log|K_{\theta}(X,X)| + c\]

  • Predictive distribution: \[p(y_* | x_*, \textbf{y}, \theta) = \mathcal{N}(a_*,v_*)\] \[a_* = k_{\theta}(x_*,X)[K_{\theta}(X,X)+\sigma^2I]^{-1}(\textbf{y}-\mu) + \mu\] \[v_* = k_{\theta}(x_*,x_*) - K_{\theta}(x_*,X)[K_{\theta}(X,X)+\sigma^2I]^{-1}k_{\theta}(X,x_*)\]

Interpreting Equations for Learning and Predictions 1

  • Key advantages:
    • Exact Bayesian inference in closed form
    • No training beyond hyperparameter learning
    • Explicit predictive equations
    • Exceptional convenience & versatility
  • Predictive mean:
    • Linear combination of training targets
    • Weights determined by kernel
    • Kernel crucial for generalization
  • Predictive variance:
    • Independent of target values
    • Grows with distance from training points
    • Implicitly depends on \(\theta\) learned from data

Interpreting Equations for Learning and Predictions 2

  • Computational considerations:
    • Bottleneck: \(n \times n\) matrix operations
    • Naive complexity: \(\mathcal{O}(n^3)\) compute, \(\mathcal{O}(n^2)\) storage
    • Historical limit: ~10,000 points
    • Modern scaling: millions of points possible
  • Numerical stability:
    • \(K(X,X)\) often near-singular
    • \(\sigma^2I\) term improves conditioning
    • “Jitter” (\(\sim 10^{-6}\)) for noise-free cases
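A common pattern here (a minimal sketch; the jitter value is a typical default, and the function name is ours) is to add a small diagonal term and solve linear systems with a Cholesky factorization rather than an explicit inverse:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def stable_solve(K, y, noise_var=0.0, jitter=1e-6):
    # add the noise variance (or jitter, in the noise-free case) to the diagonal,
    # then solve (K + noise_var * I) alpha = y via a Cholesky factorization
    A = K + (noise_var + jitter) * np.eye(K.shape[0])
    c, low = cho_factor(A)
    return cho_solve((c, low), y)

# usage sketch: alpha = stable_solve(K_train, y_train, noise_var=sigma2)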

Worked Example: GP Regression from Scratch

1. Data Generation

  • True function: \(f(x) = \sin(x) + \frac{1}{2}\sin(4x)\)
  • Observations: \(y(x) = f(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0,0.25^2)\)
  • Goal: Recover \(f(x)\) from noisy observations

2. GP Prior Specification

  • Mean function: \(m(x) = 0\)
  • Kernel: RBF \[k(x_i,x_j) = a^2\exp\left(-\frac{1}{2\ell^2}||x-x'||^2\right)\]
  • Initial length-scale: 0.2
import numpy as np
from d2l import torch as d2l
import torch
import gpytorch
import math
import os
import matplotlib.pyplot as plt
from scipy import optimize
 

d2l.set_figsize()
def data_maker1(x, sig):
    return np.sin(x) + 0.5 * np.sin(4 * x) + np.random.randn(x.shape[0]) * sig

sig = 0.25
train_x, test_x = np.linspace(0, 5, 50), np.linspace(0, 5, 500)
train_y, test_y = data_maker1(train_x, sig=sig), data_maker1(test_x, sig=0.)

d2l.plt.scatter(train_x, train_y)
d2l.plt.plot(test_x, test_y)
d2l.plt.xlabel("x", fontsize=20)
d2l.plt.ylabel("Observations y", fontsize=20)
d2l.plt.show()
mean = np.zeros(test_x.shape[0])
cov = d2l.rbfkernel(test_x, test_x, ls=0.2)
  • Visualize prior:
    • Sample functions
    • 95% credible set
    • Assess reasonableness
prior_samples = np.random.multivariate_normal(mean=mean, cov=cov, size=5)
d2l.plt.plot(test_x, prior_samples.T, color='black', alpha=0.5)
d2l.plt.plot(test_x, mean, linewidth=2.)
# 95% credible set: the prior mean +/- 2 standard deviations
d2l.plt.fill_between(test_x, mean - 2 * np.sqrt(np.diag(cov)),
                     mean + 2 * np.sqrt(np.diag(cov)), alpha=0.25)
d2l.plt.show()

3. Hyperparameter Learning

  • Initial values:
    • Length-scale: 0.4
    • Noise std: 0.5
  • Optimize via marginal likelihood: \[\log p(y | X) = -\frac{1}{2}y^T(K + \sigma^2 I)^{-1}y - \frac{1}{2}\log |K + \sigma^2 I| - \frac{n}{2}\log 2\pi\]
ell_est = 0.4
post_sig_est = 0.5

def neg_MLL(pars):
    # negative log marginal likelihood, pars = [kernel length-scale, noise std]
    K = d2l.rbfkernel(train_x, train_x, ls=pars[0])
    kernel_term = -0.5 * train_y @ \
        np.linalg.inv(K + pars[1] ** 2 * np.eye(train_x.shape[0])) @ train_y
    logdet = -0.5 * np.log(np.linalg.det(K + pars[1] ** 2 * \
                                         np.eye(train_x.shape[0])))
    const = -train_x.shape[0] / 2. * np.log(2 * np.pi)

    return -(kernel_term + logdet + const)


learned_hypers = optimize.minimize(neg_MLL, x0=np.array([ell_est,post_sig_est]), 
                                   bounds=((0.01, 10.), (0.01, 10.)))
ell = learned_hypers.x[0]
post_sig_est = learned_hypers.x[1]
  • Learned parameters:
    • Length-scale: 0.299
    • Noise std: 0.24
    • Close to true noise → well-specified model

4. Posterior Inference

  • Predictive distribution:
    • Mean: \(\bar{f}_{*} = K(x, x_*)^T (K + \sigma^2 I)^{-1}y\)
    • Variance: \(V(f_{*}) = K(x_*, x_*) - K(x, x_*)^T (K + \sigma^2 I)^{-1}K(x, x_*)\)
K_x_xstar = d2l.rbfkernel(train_x, test_x, ls=ell)
K_x_x = d2l.rbfkernel(train_x, train_x, ls=ell)
K_xstar_xstar = d2l.rbfkernel(test_x, test_x, ls=ell)

post_mean = K_x_xstar.T @ np.linalg.inv((K_x_x + \
                post_sig_est ** 2 * np.eye(train_x.shape[0]))) @ train_y
post_cov = K_xstar_xstar - K_x_xstar.T @ np.linalg.inv((K_x_x + \
                post_sig_est ** 2 * np.eye(train_x.shape[0]))) @ K_x_xstar

5. Uncertainty Analysis

  • Two types of uncertainty:
    • Epistemic (reducible):
      • Uncertainty about true function
      • Captured by np.diag(post_cov)
      • Grows away from data
    • Aleatoric (irreducible):
      • Observation noise
      • Captured by post_sig_est**2
  • 95% credible sets:
    • For true function:

      lw_bd = post_mean - 2 * np.sqrt(np.diag(post_cov))
      up_bd = post_mean + 2 * np.sqrt(np.diag(post_cov))
    • For observations:

      lw_bd_observed = post_mean - 2 * np.sqrt(np.diag(post_cov) + post_sig_est ** 2)
      up_bd_observed = post_mean + 2 * np.sqrt(np.diag(post_cov) + post_sig_est ** 2)

6. Posterior Samples

  • Visualize uncertainty:
    • 20 posterior samples
    • Show function space consistent with data
    • Helps understand model fit
post_samples = np.random.multivariate_normal(post_mean, post_cov, size=20)
d2l.plt.scatter(train_x, train_y)
d2l.plt.plot(test_x, test_y, linewidth=2.)
d2l.plt.plot(test_x, post_mean, linewidth=2.)
d2l.plt.plot(test_x, post_samples.T, color='gray', alpha=0.25)
d2l.plt.fill_between(test_x, lw_bd, up_bd, alpha=0.25)
plt.legend(['Observed Data', 'True Function', 'Predictive Mean', 'Posterior Samples'])
d2l.plt.show()

GPyTorch: Modern GP Implementation

Why GPyTorch?

Advanced Features

  • Multiple kernel choices
  • Approximate inference
  • Neural network integration
  • Scalability (>10k points)
  • Advanced methods (SKI/KISS-GP)

Implementation Benefits

  • No manual implementation
  • Efficient numerical routines
  • GPU acceleration
  • Modern PyTorch ecosystem

Model Definition

  • Exact GP Implementation:
    • Zero mean function
    • RBF kernel
    • Gaussian likelihood
# Data preparation
train_x = torch.tensor(train_x)
train_y = torch.tensor(train_y)

# Model definition
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel())
    
    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

Training Setup

  • Key Components:
    • Gaussian likelihood
    • Exact marginal log likelihood
    • Adam optimizer
    • Full-batch training
# Initialize components
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

# Training configuration
model.train()
likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

Training Process

  • Important Notes:
    • Full-batch optimization required
    • No mini-batches (marginal likelihood)
    • L-BFGS recommended for final optimization
    • Good optimization → good generalization
training_iter = 50
for i in range(training_iter):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    
    if i % 10 == 0:
        print(f'Iter {i+1:d}/{training_iter:d} - Loss: {loss.item():.3f}')
    
    optimizer.step()
Iter 1/50 - Loss: 0.986
Iter 11/50 - Loss: 0.708
Iter 21/50 - Loss: 0.467
Iter 31/50 - Loss: 0.379
Iter 41/50 - Loss: 0.394
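After training, predictions follow GPyTorch's standard evaluation pattern (a short sketch, reusing the test_x array defined earlier):

# Switch to evaluation (posterior predictive) mode
model.eval()
likelihood.eval()

test_x_t = torch.tensor(test_x)
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    observed_pred = likelihood(model(test_x_t))        # predictive distribution for y at test_x
    pred_mean = observed_pred.mean
    lower, upper = observed_pred.confidence_region()   # approximately a 95% credible region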

Key Advantages

Implementation

  • Clean, modular code
  • Easy kernel switching
  • Automatic differentiation
  • GPU support

Performance

  • Efficient matrix operations
  • Modern optimization methods
  • Scalable to large datasets
  • State-of-the-art inference

Extensibility

  • Custom kernels
  • Custom likelihoods
  • Neural network integration
  • Advanced inference methods

Mathematical Notation Guide

Consistent Notation Throughout This Lecture

This guide establishes the mathematical notation used consistently across all sections.

Functions and Variables

Symbol Type Meaning Example
\(f(x)\) Function Latent function (noise-free) \(f(x) \sim \mathcal{GP}(m, k)\)
\(y(x)\) Function Observed function (with noise) \(y(x) = f(x) + \epsilon(x)\)
\(m(x)\) Function Mean function of GP \(m(x) = \mathbb{E}[f(x)]\)
\(k(x,x')\) Function Covariance/kernel function \(k(x,x') = \text{Cov}(f(x), f(x'))\)
\(\epsilon(x)\) Function Observation noise \(\epsilon(x) \sim \mathcal{N}(0, \sigma^2)\)

Vectors and Matrices

Symbol Type Meaning Example
\(\mathbf{f}\) Vector Function values at training points \(\mathbf{f} = [f(x_1), \ldots, f(x_n)]^\top\)
\(\mathbf{y}\) Vector Observations at training points \(\mathbf{y} = [y(x_1), \ldots, y(x_n)]^\top\)
\(\mathbf{f}_*\) Vector Function values at test points \(\mathbf{f}_* = [f(x_{*1}), \ldots, f(x_{*m})]^\top\)
\(\mathbf{\epsilon}\) Vector Noise vector \(\mathbf{\epsilon} = [\epsilon(x_1), \ldots, \epsilon(x_n)]^\top\)
\(\mathbf{w}\) Vector Weight vector \(\mathbf{w} \sim \mathcal{N}(0, I)\)
\(\mathbf{\phi}(x)\) Vector Feature vector \(\mathbf{\phi}(x) = [\phi_1(x), \ldots, \phi_d(x)]^\top\)
\(K(X,X)\) Matrix Kernel matrix \(K_{ij} = k(x_i, x_j)\)
\(K(X,X_*)\) Matrix Cross-covariance matrix \(K_{ij} = k(x_i, x_{*j})\)
\(K(X_*,X_*)\) Matrix Test covariance matrix \(K_{ij} = k(x_{*i}, x_{*j})\)

Sets and Collections

Symbol Type Meaning Example
\(X\) Set Training inputs \(X = \{x_1, \ldots, x_n\}\)
\(X_*\) Set Test inputs \(X_* = \{x_{*1}, \ldots, x_{*m}\}\)
\(\mathcal{D}\) Set Training dataset \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n\)

Parameters and Hyperparameters

Symbol Type Meaning Example
\(\ell\) Scalar Length-scale parameter \(k(x,x') = \exp(-\frac{\|x-x'\|^2}{2\ell^2})\)
\(a\) Scalar Amplitude parameter \(k(x,x') = a^2 \exp(-\frac{\|x-x'\|^2}{2\ell^2})\)
\(\sigma^2\) Scalar Observation noise variance \(\epsilon(x) \sim \mathcal{N}(0, \sigma^2)\)
\(\boldsymbol{\theta}\) Vector Kernel hyperparameters \(\boldsymbol{\theta} = [\ell, a, \sigma^2]^\top\)

Distributions

Symbol Type Meaning Example
\(\mathcal{GP}(m, k)\) Distribution Gaussian Process \(f(x) \sim \mathcal{GP}(m, k)\)
\(\mathcal{N}(\mu, \Sigma)\) Distribution Multivariate Normal \(\mathbf{f} \sim \mathcal{N}(\mathbf{0}, K)\)
\(\mathcal{N}(\mu, \sigma^2)\) Distribution Univariate Normal \(\epsilon(x) \sim \mathcal{N}(0, \sigma^2)\)

Key Equations

GP Prior: \[f(x) \sim \mathcal{GP}(m, k)\]

Observation Model: \[y(x) = f(x) + \epsilon(x), \quad \epsilon(x) \sim \mathcal{N}(0, \sigma^2)\]

Joint Distribution: \[\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K(X,X) + \sigma^2I & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{bmatrix}\right)\]

Predictive Distribution: \[p(\mathbf{f}_* | \mathbf{y}, X) = \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)\]

Notation Rules

  1. Functions use regular font: \(f(x)\), \(k(x,x')\)
  2. Vectors use bold font: \(\mathbf{f}\), \(\mathbf{y}\)
  3. Matrices use capital letters: \(K\), \(\boldsymbol{\Sigma}\)
  4. Scalars use regular font: \(\ell\), \(a\), \(\sigma^2\)
  5. Sets use capital letters: \(X\), \(X_*\)
  6. Distributions use calligraphic font: \(\mathcal{GP}\), \(\mathcal{N}\)

References