Data Science for Electron Microscopy
Lecture 6: Gaussian Processes 1
FAU Erlangen-Nürnberg
By the end of this lecture, you should understand the following key strengths of Gaussian processes (GPs) for electron microscopy:
Uncertainty Quantification: GPs provide principled uncertainty estimates, crucial for scientific applications
Small Data Performance: Excel with limited training data common in EM experiments
Interpretability: Kernel parameters have clear physical meaning
Flexibility: Can incorporate domain knowledge through kernel design
Bayesian Framework: Natural handling of experimental uncertainties
Suppose we observe the following dataset of regression targets (outputs) \(y\), indexed by inputs \(x\).
For example, the targets could be changes in carbon dioxide concentration, and the inputs could be the times at which these targets were recorded.
Bayes' theorem: once we condition on data, we can update a prior over functions to infer a posterior distribution over functions that could fit the data. Here we show sample posterior functions.
Sample posterior functions, once we have observed the data.
We may also want a representation of uncertainty, so we know how confident we should be in our predictions.
Intuitively, more variability in the sample posterior functions corresponds to more uncertainty.
This is epistemic uncertainty, the reducible uncertainty associated with a lack of information. As we acquire more data, this type of uncertainty disappears, since there will be fewer and fewer solutions consistent with what we observe.
Like with the posterior mean, we can compute the posterior variance (the variability of these functions in the posterior) in closed form.
The properties of the Gaussian process that we used to fit the data are strongly controlled by what's called a covariance function, also known as a kernel.
The covariance function we used is called the RBF (Radial Basis Function) kernel, which has the form \[ k_{\text{RBF}}(x,x') = \mathrm{Cov}(f(x),f(x')) = a^2 \exp\left(-\frac{1}{2\ell^2}\|x-x'\|^2\right) \]
The hyperparameters of this kernel are interpretable. The amplitude parameter \(a\) controls the vertical scale over which the function is varying, and the length-scale parameter \(\ell\) controls the rate of variation (the wiggliness) of the function.
Larger \(a\) means larger function values, and larger \(\ell\) means more slowly varying functions. Let’s see what happens to our sample prior and posterior functions as we vary \(a\) and \(\ell\).
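To make this concrete, here is a minimal sketch (not part of the lecture code; the helper rbf_cov and the particular \((a,\ell)\) settings are illustrative choices) that draws zero-mean prior samples for a few hyperparameter settings. Larger \(a\) stretches the vertical scale of the samples, while larger \(\ell\) makes them smoother and more slowly varying.
import numpy as np
import matplotlib.pyplot as plt

def rbf_cov(x, a, ell):
    # k(x, x') = a^2 exp(-||x - x'||^2 / (2 ell^2)), evaluated on a 1-D grid
    d = x[:, None] - x[None, :]
    return a ** 2 * np.exp(-0.5 * (d / ell) ** 2)

x = np.linspace(0, 5, 100)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, (a, ell) in zip(axes, [(1.0, 0.5), (1.0, 2.0), (2.0, 2.0)]):
    K = rbf_cov(x, a, ell) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability
    samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)
    ax.plot(x, samples.T)
    ax.set_title(f"a = {a}, ell = {ell}")
plt.show()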
The length-scale has a particularly pronounced effect on the predictions and uncertainty of a GP. At \(\|x-x'\| = \ell\), the covariance between a pair of function values is \(a^2\exp(-0.5)\).
At distances larger than \(\ell\), the function values become nearly uncorrelated. This means that if we want to make a prediction at a point \(x_*\), then function values at inputs \(x\) with \(\|x-x_*\|>\ell\) will not have a strong effect on the prediction.
The amplitude parameter affects the scale of the function, but not its rate of variation.
The generalization performance of our procedure will depend on having reasonable values for these hyperparameters.
Values of \(\ell=2\) and \(a=1\) appeared to provide reasonable fits, while some of the other values did not.
More formally, the mean vector \(\boldsymbol{\mu}\) of the joint distribution over function values is given by a mean function, which is typically taken to be a constant or zero.
The covariance matrix of this distribution is given by the kernel evaluated at all pairs of the inputs \(x\):
\[\begin{bmatrix}f(x) \\f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix}\sim \mathcal{N}\left(\boldsymbol{\mu}, \begin{bmatrix}k(x,x) & k(x, x_1) & \dots & k(x,x_n) \\ k(x_1,x) & k(x_1,x_1) & \dots & k(x_1,x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n,x) & k(x_n,x_1) & \dots & k(x_n,x_n) \end{bmatrix}\right)\]
This equation specifies a GP prior. We can compute the conditional distribution of \(f(x)\) for any \(x\) given \(f(x_1), \dots, f(x_n)\), the function values we have observed:
\[f(x) | f(x_1), \dots, f(x_n) \sim \mathcal{N}(m,s^2)\]
To build intuition, consider the case of a single observed point \(f(x_1)\), where the joint distribution is \[\begin{bmatrix} f(x) \\ f(x_1) \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} k(x,x) & k(x,x_1) \\ k(x_1,x) & k(x_1,x_1) \end{bmatrix}\right)\]
If we observe \(f(x_1) = 1.2\), then we can draw a horizontal line at \(1.2\) on our plot of the density, and see that the value of \(f(x)\) is likely to be around \(1.08\).
The observed point \(f(x_1)\) is shown in orange, and 1 standard deviation of the Gaussian process predictive distribution for \(f(x)\) is shown in blue.
If we increase the correlation to \(k(x,x_1) = 0.95\), then the ellipses have narrowed further, and the value of \(f(x)\) is more strongly determined by \(f(x_1)\).
Drawing a horizontal line at \(1.2\), we see the contours for \(f(x)\) are more concentrated around \(1.14\).
In this single-observation case, the conditional distribution has mean
\[m = k(x,x_1) k(x_1,x_1)^{-1} f(x_1)\]
and variance
\[s^2 = k(x,x) - k(x,x_1) k(x_1,x_1)^{-1} k(x_1,x)\]
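As a quick check of these formulas, a short sketch (the helper condition_1pt is hypothetical, and we assume \(k(x,x) = k(x_1,x_1) = 1\), i.e. unit amplitude) reproduces the numbers quoted above for correlations of 0.9 and 0.95:
def condition_1pt(k_xx, k_xx1, k_x1x1, f_x1):
    # Conditional mean and variance of f(x) given f(x_1), from the formulas above
    m = k_xx1 / k_x1x1 * f_x1
    s2 = k_xx - k_xx1 ** 2 / k_x1x1
    return m, s2

print(condition_1pt(1.0, 0.9, 1.0, 1.2))    # mean 1.08, variance 0.19
print(condition_1pt(1.0, 0.95, 1.0, 1.2))   # mean 1.14, variance ~0.10 (tighter)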
In practice, we typically observe noisy values \[y(x) = f(x) + \epsilon(x),\] where \(\epsilon(x) \sim \mathcal{N}(0,\sigma^2)\) is observation noise. The joint distribution of a noisy observation and a test function value is then
\[\begin{bmatrix} y(x_1) \\ f(x) \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} k(x_1,x_1) + \sigma^2 & k(x_1,x) \\ k(x,x_1) & k(x,x) \end{bmatrix}\right)\]
\[f(x) | y(x_1) \sim \mathcal{N}(m,s^2)\]
where
\[m = k(x,x_1) (k(x_1,x_1) + \sigma^2)^{-1} y(x_1)\]
and
\[s^2 = k(x,x) - k(x,x_1) (k(x_1,x_1) + \sigma^2)^{-1} k(x_1,x)\]
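The effect of the noise variance can be seen with the same kind of sketch (the helper condition_noisy is hypothetical): adding \(\sigma^2\) to \(k(x_1,x_1)\) shrinks the conditional mean toward the prior mean of zero and inflates the conditional variance.
def condition_noisy(k_xx, k_xx1, k_x1x1, sigma2, y_x1):
    # Conditional mean and variance of f(x) given a noisy observation y(x_1)
    m = k_xx1 / (k_x1x1 + sigma2) * y_x1
    s2 = k_xx - k_xx1 ** 2 / (k_x1x1 + sigma2)
    return m, s2

print(condition_noisy(1.0, 0.9, 1.0, 0.0, 1.2))    # noise-free: (1.08, 0.19), as before
print(condition_noisy(1.0, 0.9, 1.0, 0.25, 1.2))   # with noise: mean 0.864, variance 0.352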
Understanding GPs is important for reasoning about model construction and generalization, and for achieving state-of-the-art performance in a variety of applications, including active learning, and hyperparameter tuning in deep learning.
GPs are everywhere, and it is in our interests to know what they are and how we can use them.
In this section, we introduce Gaussian process priors over functions.
A GP is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution.
If a function \(f(x)\) is a Gaussian process with mean function \(m(x)\) and covariance function (kernel) \(k(x,x')\), written \(f(x) \sim \mathcal{GP}(m, k)\), then any collection of function values queried at any collection of input points \(x\) (times, spatial locations, image pixels, etc.) has a joint multivariate Gaussian distribution with mean vector \(\mu\) and covariance matrix \(K\): \(f(x_1),\dots,f(x_n) \sim \mathcal{N}(\mu, K)\), where \(\mu_i = E[f(x_i)] = m(x_i)\) and \(K_{ij} = \mathrm{Cov}(f(x_i),f(x_j)) = k(x_i,x_j)\).
Any function of the form \[f(x) = w^{\top} \phi(x) = \langle w, \phi(x) \rangle, \tag{1}\]
with \(w\) drawn from a Gaussian (normal) distribution, and \(\phi\) being any vector of basis functions, for example \(\phi(x) = (1, x, x^2, ..., x^d)^{\top}\), is a Gaussian process.
Let's consider a few concrete examples.
Suppose \(f(x) = w_0 + w_1 x\), and \(w_0, w_1 \sim \mathcal{N}(0,1)\), with \(w_0, w_1, x\) all in one dimension.
We can equivalently write this function as the inner product \(f(x) = (w_0, w_1)(1, x)^{\top}\). In (1)
above, \(w = (w_0, w_1)^{\top}\) and \(\phi(x) = (1,x)^{\top}\).
For any \(x\), \(f(x)\) is a sum of two Gaussian random variables.
Since Gaussians are closed under addition, \(f(x)\) is also a Gaussian random variable for any \(x\).
In fact, we can compute for any particular \(x\) that \(f(x)\) is \(\mathcal{N}(0,1+x^2)\), since \(\mathrm{Var}[w_0 + w_1 x] = \mathrm{Var}[w_0] + x^2\,\mathrm{Var}[w_1] = 1 + x^2\).
Similarly, the joint distribution for any collection of function values, \((f(x_1),\dots,f(x_n))\), for any collection of inputs \(x_1,\dots,x_n\), is a multivariate Gaussian distribution. Therefore \(f(x)\) is a Gaussian process.
def lin_func(x, n_sample):
    # Draw n_sample random functions f(x) = w_0 + w_1 * x with w_0, w_1 ~ N(0, 1)
    preds = np.zeros((n_sample, x.shape[0]))
    for ii in range(n_sample):
        w = np.random.normal(0, 1, 2)
        y = w[0] + w[1] * x
        preds[ii, :] = y
    return preds
x_points = np.linspace(-5, 5, 50)
outs = lin_func(x_points, 10)
# +/- 2 standard deviations of the pointwise prior N(0, 1 + x^2)
lw_bd = -2 * np.sqrt((1 + x_points ** 2))
up_bd = 2 * np.sqrt((1 + x_points ** 2))
d2l.set_figsize((12,5))
d2l.plt.fill_between(x_points, lw_bd, up_bd, alpha=0.25)
d2l.plt.plot(x_points, np.zeros(len(x_points)), linewidth=4, color='black')
d2l.plt.plot(x_points, outs.T)
d2l.plt.xlabel("x", fontsize=20)
d2l.plt.ylabel("f(x)", fontsize=20)
d2l.plt.show()
We saw how a distribution over parameters in a model induces a distribution over functions.
We often have ideas about the functions we want to model (whether they are smooth, periodic, quickly varying, etc.), but it is relatively tedious to reason about the parameters, which are largely uninterpretable.
GPs provide an easy mechanism to reason directly about functions.
Just as a Gaussian distribution is entirely defined by its first two moments, its mean and covariance matrix, a Gaussian process is by extension defined by its mean function and covariance function.
In the above example, the mean function
\[m(x) = E[f(x)] = E[w_0 + w_1x] = E[w_0] + E[w_1]x = 0+0 = 0.\]
Similarly, the covariance function is
\[k(x,x') = \mathrm{Cov}(f(x),f(x')) = E[f(x)f(x')]-E[f(x)]E[f(x')] = \\ E[w_0^2 + w_0w_1x' + w_1w_0x + w_1^2xx'] = 1 + xx'.\]
The distribution over functions can now be directly specified and sampled from, without needing to sample from the distribution over parameters.
For example, to draw from \(f(x)\), we can simply form our multivariate Gaussian distribution associated with any collection of \(x\) we want to query, and sample from it directly.
This is very advantageous.
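As a small illustration, here is a sketch (using plain NumPy and matplotlib rather than the d2l plotting helpers used above) that samples the straight-line model directly in function space, using the mean and covariance functions derived above; no weights \(w_0, w_1\) are ever drawn.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 50)
K = 1.0 + np.outer(x, x) + 1e-8 * np.eye(len(x))   # K_ij = k(x_i, x_j) = 1 + x_i x_j, plus jitter
f_samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=10)
plt.plot(x, f_samples.T)   # statistically equivalent to the weight-space samples above
plt.show()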
The same derivation for the simple straight-line model above can be applied to find the mean and covariance function for any model of the form \(f(x) = w^{\top} \phi(x)\), with \(w \sim \mathcal{N}(u,S)\).
In this case, the mean function is \(m(x) = u^{\top}\phi(x)\), and the covariance function is \(k(x,x') = \phi(x)^{\top}S\phi(x')\). Since \(\phi(x)\) can represent a vector of any non-linear basis functions, we are considering a very general model class, including models with even an infinite number of parameters.
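As a quick sanity check of these general formulas, the following sketch (with an arbitrary quadratic basis and hypothetical choices of \(u\) and \(S\)) compares the empirical mean and covariance of many sampled functions against \(u^{\top}\phi(x)\) and \(\phi(x)^{\top}S\phi(x')\).
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # Feature map phi(x) = (1, x, x^2)^T evaluated at each input
    return np.stack([np.ones_like(x), x, x ** 2])

u = np.array([0.5, -1.0, 0.2])                       # hypothetical prior mean of w
S = np.diag([1.0, 0.5, 0.1])                         # hypothetical prior covariance of w

x = np.linspace(-2, 2, 5)
Phi = phi(x)                                         # shape (3, 5): columns are phi(x_i)
W = rng.multivariate_normal(u, S, size=100_000)      # many samples of w ~ N(u, S)
F = W @ Phi                                          # sampled function values, shape (100000, 5)

print(np.abs(F.mean(axis=0) - u @ Phi).max())        # small: matches m(x) = u^T phi(x)
print(np.abs(np.cov(F.T) - Phi.T @ S @ Phi).max())   # small: matches k(x,x') = phi(x)^T S phi(x')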
Let’s derive this kernel starting from weight space. Consider the function
\[f(x) = \sum_{i=1}^J w_i \phi_i(x), w_i \sim \mathcal{N}\left(0,\frac{\sigma^2}{J}\right), \phi_i(x) = \exp\left(-\frac{(x-c_i)^2}{2\ell^2 }\right).\]
\(f(x)\) is a sum of radial basis functions, with width \(\ell\), centred at the points \(c_i\), as shown in the following figure.
The covariance function of this model is \[k(x,x') = \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x)\phi_i(x').\]
Taking the number of basis functions to infinity, the sum becomes an integral: \[k(x,x') = \lim_{J \to \infty} \frac{\sigma^2}{J} \sum_{i=1}^{J} \phi_i(x)\phi_i(x') = \int_{c_0}^{c_\infty} \phi_c(x)\phi_c(x') dc.\]
By setting \(c_0 = -\infty\) and \(c_\infty = \infty\), we spread the infinitely many basis functions across the whole real line, each a distance \(\Delta c \to 0\) apart:
\[k(x,x') = \int_{-\infty}^{\infty} \exp(-\frac{(x-c)^2}{2\ell^2}) \exp(-\frac{(x'-c)^2}{2\ell^2 }) dc = \sqrt{\pi}\ell \sigma^2 \exp(-\frac{(x-x')^2}{2(\sqrt{2} \ell)^2}) \propto k_{\text{RBF}}(x,x').\]
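As a numerical sanity check of the integral above (dropping the \(\sigma^2\) prefactor, since the result is only stated up to proportionality), integrating the product of two basis functions over \(c\) on a dense grid reproduces the RBF form on the right-hand side:
import numpy as np

ell = 1.3
x, xp = 0.4, 2.0
c = np.linspace(-50, 50, 200001)    # dense grid standing in for the real line
integrand = np.exp(-(x - c) ** 2 / (2 * ell ** 2)) * np.exp(-(xp - c) ** 2 / (2 * ell ** 2))
numeric = np.trapz(integrand, c)
closed = np.sqrt(np.pi) * ell * np.exp(-(x - xp) ** 2 / (2 * (np.sqrt(2) * ell) ** 2))
print(numeric, closed)              # the two values agree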
By moving into the function space representation, we have derived how to represent a model with an infinite number of parameters, using a finite amount of computation.
GP with an RBF kernel is a universal approximator, capable of representing any continuous function to arbitrary precision.
We can intuitively see why from the above derivation.
We can collapse each radial basis function to a point mass by taking \(\ell \to 0\), and give each point mass any height we wish.
A GP with an RBF kernel is a model with an infinite number of parameters and much more flexibility than any finite neural network.
Is all the fuss about overparametrized neural networks misplaced?
GPs with RBF kernels do not overfit, and in fact provide especially compelling generalization performance on small datasets.
The examples in Zhang 2021, such as the ability to fit images with random labels perfectly yet still generalize well on structured problems, can be perfectly reproduced using Gaussian processes (Wilson 2020).
Neural networks are not as distinct as we make them out to be.
Let's build further intuition about GPs with RBF kernels, and hyperparameters such as the length-scale, by sampling directly from the distribution over functions.
The procedure is simple: (1) choose a set of input points \(x_1,\dots,x_n\); (2) evaluate the kernel at all pairs of these points to form the mean vector and covariance matrix; (3) sample from the resulting multivariate Gaussian distribution.
We illustrate this process in the figure below.
from scipy.spatial import distance_matrix

def rbfkernel(x1, x2, ls=4.):  #@save
    # Pairwise distances; note that ls plays the role of the squared length-scale
    # here, i.e. k = exp(-d^2 / (2 * ls))
    dist = distance_matrix(np.expand_dims(x1, 1), np.expand_dims(x2, 1))
    return np.exp(-(1. / ls / 2) * (dist ** 2))
x_points = np.linspace(0, 5, 50)
meanvec = np.zeros(len(x_points))
covmat = rbfkernel(x_points,x_points, 1)
prior_samples = np.random.multivariate_normal(meanvec, covmat, size=5)
d2l.plt.plot(x_points, prior_samples.T, alpha=0.5)
d2l.plt.show()
Consider a neural network function \(f(x)\) with one hidden layer:
\[f(x) = b + \sum_{i=1}^{J} v_i h(x; u_i).\]
\(b\) is a bias, \(v_i\) are the hidden to output weights, \(h\) is any bounded hidden unit transfer function, \(u_i\) are the input to hidden weights, and \(J\) is the number of hidden units.
Let \(b\) and \(v_i\) be independent with zero mean and variances \(\sigma_b^2\) and \(\sigma_v^2/J\), respectively, and let the \(u_i\) have independent identical distributions.
We can use the central limit theorem to show that, as the number of hidden units \(J \to \infty\), any collection of function values \(f(x_1),\dots,f(x_n)\) has a joint multivariate Gaussian distribution.
The mean and covariance function of the corresponding Gaussian process are:
\[m(x) = E[f(x)] = 0\]
\[k(x,x') = \text{cov}[f(x),f(x')] = E[f(x)f(x')] = \sigma_b^2 + \frac{1}{J} \sum_{i=1}^{J} \sigma_v^2 E[h_i(x; u_i)h_i(x'; u_i)]\]
In some cases, we can essentially evaluate this covariance function in closed form. Let \(h(x; u) = \text{erf}(u_0 + \sum_{j=1}^{P} u_j x_j)\), where \(\text{erf}(z) = \frac{2}{\sqrt{\pi}} \int_{0}^{z} e^{-t^2} dt\), and \(u \sim \mathcal{N}(0,\Sigma)\). Then \(k(x,x') = \frac{2}{\pi} \arcsin\left(\frac{2 \tilde{x}^{\top} \Sigma \tilde{x}'}{\sqrt{(1 + 2 \tilde{x}^{\top} \Sigma \tilde{x})(1 + 2 \tilde{x}'^{\top} \Sigma \tilde{x}')}}\right)\), where \(\tilde{x} = (1, x_1, \dots, x_P)^{\top}\) is the augmented input vector.
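This closed form can be checked by Monte Carlo. The sketch below (for a one-dimensional input augmented as \(\tilde{x} = (1, x)\), with an arbitrary, hypothetical choice of \(\Sigma\)) compares the empirical average of \(h(x;u)h(x';u)\) over many draws of \(u\) with the arcsine expression:
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])                             # covariance of u = (u_0, u_1)
xt, xtp = np.array([1.0, 0.7]), np.array([1.0, -1.2])      # augmented inputs (1, x), (1, x')

U = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
mc = np.mean(erf(U @ xt) * erf(U @ xtp))                   # Monte Carlo estimate of E[h(x)h(x')]

num = 2 * xt @ Sigma @ xtp
den = np.sqrt((1 + 2 * xt @ Sigma @ xt) * (1 + 2 * xtp @ Sigma @ xtp))
closed = (2 / np.pi) * np.arcsin(num / den)
print(mc, closed)                                          # should agree to ~2 decimal places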
The first step in performing Bayesian inference is to specify a prior.
GPs can be used to specify a whole prior over functions.
Starting from a traditional “weight space” view of modelling, we can induce a prior over functions by starting with the functional form of a model and introducing a distribution over its parameters.
We can alternatively specify a prior distribution directly in function space, with properties controlled by a kernel.
The function-space approach has many advantages. We can build models that actually correspond to an infinite number of parameters, but use a finite amount of computation!
These models have a great amount of flexibility, but also make strong assumptions about what types of functions are a priori likely, leading to relatively good generalization on small datasets.
The assumptions of models in function space are controlled by kernels, which encode higher-level properties of functions, such as smoothness and periodicity.
Many kernels are stationary: they are translation invariant.
Functions drawn from a GP with a stationary kernel have roughly the same high-level properties regardless of where we look in the input space.
GPs are a relatively general model class, including polynomials, Fourier series, and so on, as long as we have a Gaussian prior over the parameters.
They also include neural networks with an infinite number of parameters, even without Gaussian distributions over the parameters.
Observation model: relates \(f(x)\) to observations \(y(x)\)
Regression model: \[y(x) = f(x) + \epsilon(x), \quad \epsilon(x) \sim \mathcal{N}(0,\sigma^2)\]
Notation: \(\textbf{y}\) denotes the vector of training observations, \(X\) the training inputs, \(x_*\) a test input, \(K_{\theta}(X,X)\) the kernel matrix on the training inputs, \(\mu\) the (constant) prior mean, and \(\theta\) the kernel hyperparameters.
Two-step procedure: (1) learn the hyperparameters \(\theta\) and noise variance \(\sigma^2\) by maximizing the log marginal likelihood below; (2) condition on the training data to form the predictive distribution at test inputs.
Log marginal likelihood: \[\log p(\textbf{y} | \theta, X) = -\frac{1}{2}\textbf{y}^{\top}[K_{\theta}(X,X) + \sigma^2I]^{-1}\textbf{y} - \frac{1}{2}\log|K_{\theta}(X,X) + \sigma^2I| + c\]
Predictive distribution: \[p(y_* | x_*, \textbf{y}, \theta) = \mathcal{N}(a_*,v_*)\] \[a_* = k_{\theta}(x_*,X)[K_{\theta}(X,X)+\sigma^2I]^{-1}(\textbf{y}-\mu) + \mu\] \[v_* = k_{\theta}(x_*,x_*) - k_{\theta}(x_*,X)[K_{\theta}(X,X)+\sigma^2I]^{-1}k_{\theta}(X,x_*)\]
import math
import os

import numpy as np
import matplotlib.pyplot as plt
import torch
import gpytorch
from scipy import optimize
import d2l
d2l.set_figsize()
def data_maker1(x, sig):
    # Latent function sin(x) + 0.5*sin(4x) plus i.i.d. Gaussian noise with std sig
    return np.sin(x) + 0.5 * np.sin(4 * x) + np.random.randn(x.shape[0]) * sig
sig = 0.25
train_x, test_x = np.linspace(0, 5, 50), np.linspace(0, 5, 500)
train_y, test_y = data_maker1(train_x, sig=sig), data_maker1(test_x, sig=0.)
d2l.plt.scatter(train_x, train_y)
d2l.plt.plot(test_x, test_y)
d2l.plt.xlabel("x", fontsize=20)
d2l.plt.ylabel("Observations y", fontsize=20)
d2l.plt.show()
ell_est = 0.4        # initial guess for the length-scale
post_sig_est = 0.5   # initial guess for the observation noise standard deviation

def neg_MLL(pars):
    # Negative log marginal likelihood as a function of [length-scale, noise std]
    K = d2l.rbfkernel(train_x, train_x, ls=pars[0])
    kernel_term = -0.5 * train_y @ \
        np.linalg.inv(K + pars[1] ** 2 * np.eye(train_x.shape[0])) @ train_y
    logdet = -0.5 * np.log(np.linalg.det(K + pars[1] ** 2 * \
        np.eye(train_x.shape[0])))
    const = -train_x.shape[0] / 2. * np.log(2 * np.pi)
    return -(kernel_term + logdet + const)
learned_hypers = optimize.minimize(neg_MLL, x0=np.array([ell_est,post_sig_est]),
bounds=((0.01, 10.), (0.01, 10.)))
ell = learned_hypers.x[0]
post_sig_est = learned_hypers.x[1]
K_x_xstar = d2l.rbfkernel(train_x, test_x, ls=ell)
K_x_x = d2l.rbfkernel(train_x, train_x, ls=ell)
K_xstar_xstar = d2l.rbfkernel(test_x, test_x, ls=ell)
post_mean = K_x_xstar.T @ np.linalg.inv((K_x_x + \
post_sig_est ** 2 * np.eye(train_x.shape[0]))) @ train_y
post_cov = K_xstar_xstar - K_x_xstar.T @ np.linalg.inv((K_x_x + \
post_sig_est ** 2 * np.eye(train_x.shape[0]))) @ K_x_xstar
A 95% credible set for the true (noise-free) function uses the posterior variance np.diag(post_cov); a credible set for new observations would additionally include the noise variance post_sig_est**2. Here we plot the band for the true function:
lw_bd = post_mean - 2 * np.sqrt(np.diag(post_cov))
up_bd = post_mean + 2 * np.sqrt(np.diag(post_cov))
post_samples = np.random.multivariate_normal(post_mean, post_cov, size=20)
d2l.plt.scatter(train_x, train_y)
d2l.plt.plot(test_x, test_y, linewidth=2.)
d2l.plt.plot(test_x, post_mean, linewidth=2.)
d2l.plt.plot(test_x, post_samples.T, color='gray', alpha=0.25)
d2l.plt.fill_between(test_x, lw_bd, up_bd, alpha=0.25)
plt.legend(['Observed Data', 'True Function', 'Predictive Mean', 'Posterior Samples'])
d2l.plt.show()
Advanced features: multiple kernel choices, approximate inference, neural network integration, scalability beyond 10k points, and advanced methods such as SKI/KISS-GP.
Implementation benefits: no manual implementation, efficient numerical routines, GPU acceleration, and a modern PyTorch ecosystem.
# Data preparation
train_x = torch.tensor(train_x)
train_y = torch.tensor(train_y)
# Model definition
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
# Initialize components
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)
# Training configuration
model.train()
likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
training_iter = 50
for i in range(training_iter):
    optimizer.zero_grad()           # reset gradients from the previous iteration
    output = model(train_x)         # GP output at the training inputs
    loss = -mll(output, train_y)    # negative marginal log likelihood
    loss.backward()
    if i % 10 == 0:
        print(f'Iter {i+1:d}/{training_iter:d} - Loss: {loss.item():.3f}')
    optimizer.step()
Iter 1/50 - Loss: 0.986
Iter 11/50 - Loss: 0.708
Iter 21/50 - Loss: 0.467
Iter 31/50 - Loss: 0.379
Iter 41/50 - Loss: 0.394
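After training, predictions are typically obtained by putting the model and likelihood into evaluation mode and querying the posterior predictive at the test inputs. The following is a minimal sketch (not part of the printed output above), reusing test_x from the NumPy example:
model.eval()
likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    pred = likelihood(model(torch.tensor(test_x)))   # predictive over noisy observations
    pred_mean = pred.mean                            # posterior predictive mean
    lower, upper = pred.confidence_region()          # +/- 2 standard deviation band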
Implementation: clean, modular code; easy kernel switching; automatic differentiation; GPU support.
Performance: efficient matrix operations; modern optimization methods; scalable to large datasets; state-of-the-art inference.
Extensibility: custom kernels; custom likelihoods; neural network integration; advanced inference methods.
This guide establishes the mathematical notation used consistently across all sections.
Symbol | Type | Meaning | Example |
---|---|---|---|
\(f(x)\) | Function | Latent function (noise-free) | \(f(x) \sim \mathcal{GP}(m, k)\) |
\(y(x)\) | Function | Observed function (with noise) | \(y(x) = f(x) + \epsilon(x)\) |
\(m(x)\) | Function | Mean function of GP | \(m(x) = \mathbb{E}[f(x)]\) |
\(k(x,x')\) | Function | Covariance/kernel function | \(k(x,x') = \text{Cov}(f(x), f(x'))\) |
\(\epsilon(x)\) | Function | Observation noise | \(\epsilon(x) \sim \mathcal{N}(0, \sigma^2)\) |
Symbol | Type | Meaning | Example |
---|---|---|---|
\(\mathbf{f}\) | Vector | Function values at training points | \(\mathbf{f} = [f(x_1), \ldots, f(x_n)]^\top\) |
\(\mathbf{y}\) | Vector | Observations at training points | \(\mathbf{y} = [y(x_1), \ldots, y(x_n)]^\top\) |
\(\mathbf{f}_*\) | Vector | Function values at test points | \(\mathbf{f}_* = [f(x_{*1}), \ldots, f(x_{*m})]^\top\) |
\(\boldsymbol{\epsilon}\) | Vector | Noise vector | \(\boldsymbol{\epsilon} = [\epsilon(x_1), \ldots, \epsilon(x_n)]^\top\) |
\(\mathbf{w}\) | Vector | Weight vector | \(\mathbf{w} \sim \mathcal{N}(0, I)\) |
\(\boldsymbol{\phi}(x)\) | Vector | Feature vector | \(\boldsymbol{\phi}(x) = [\phi_1(x), \ldots, \phi_d(x)]^\top\) |
\(K(X,X)\) | Matrix | Kernel matrix | \(K_{ij} = k(x_i, x_j)\) |
\(K(X,X_*)\) | Matrix | Cross-covariance matrix | \(K_{ij} = k(x_i, x_{*j})\) |
\(K(X_*,X_*)\) | Matrix | Test covariance matrix | \(K_{ij} = k(x_{*i}, x_{*j})\) |
Symbol | Type | Meaning | Example |
---|---|---|---|
\(X\) | Set | Training inputs | \(X = \{x_1, \ldots, x_n\}\) |
\(X_*\) | Set | Test inputs | \(X_* = \{x_{*1}, \ldots, x_{*m}\}\) |
\(\mathcal{D}\) | Set | Training dataset | \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n\) |
Symbol | Type | Meaning | Example |
---|---|---|---|
\(\ell\) | Scalar | Length-scale parameter | \(k(x,x') = \exp(-\frac{\|x-x'\|^2}{2\ell^2})\) |
\(a\) | Scalar | Amplitude parameter | \(k(x,x') = a^2 \exp(-\frac{\|x-x'\|^2}{2\ell^2})\) |
\(\sigma^2\) | Scalar | Observation noise variance | \(\epsilon(x) \sim \mathcal{N}(0, \sigma^2)\) |
\(\boldsymbol{\theta}\) | Vector | Kernel hyperparameters | \(\boldsymbol{\theta} = [\ell, a, \sigma^2]^\top\) |
Symbol | Type | Meaning | Example |
---|---|---|---|
\(\mathcal{GP}(m, k)\) | Distribution | Gaussian Process | \(f(x) \sim \mathcal{GP}(m, k)\) |
\(\mathcal{N}(\mu, \Sigma)\) | Distribution | Multivariate Normal | \(\mathbf{f} \sim \mathcal{N}(\mathbf{0}, K)\) |
\(\mathcal{N}(\mu, \sigma^2)\) | Distribution | Univariate Normal | \(\epsilon(x) \sim \mathcal{N}(0, \sigma^2)\) |
GP Prior: \[f(x) \sim \mathcal{GP}(m, k)\]
Observation Model: \[y(x) = f(x) + \epsilon(x), \quad \epsilon(x) \sim \mathcal{N}(0, \sigma^2)\]
Joint Distribution: \[\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K(X,X) + \sigma^2I & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{bmatrix}\right)\]
Predictive Distribution: \[p(\mathbf{f}_* | \mathbf{y}, X) = \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)\]
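With the zero prior mean used above, the predictive moments implied by this joint distribution are (consistent with the predictive equations given earlier):
\[\boldsymbol{\mu}_* = K(X,X_*)^\top [K(X,X) + \sigma^2 I]^{-1}\mathbf{y}\]
\[\boldsymbol{\Sigma}_* = K(X_*,X_*) - K(X,X_*)^\top [K(X,X) + \sigma^2 I]^{-1} K(X,X_*)\]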
©Philipp Pelz - FAU Erlangen-Nürnberg - Data Science for Electron Microscopy