FAU Erlangen-Nürnberg
Behind us:
Today (Unit 11):
Missing piece: a controllable distribution over \(z\) — and a tractable way to sample from it.
A generative model represents (or approximates) the data distribution \(p(x)\). It must support:
Today: VAEs (sampling + approximate likelihood) and diffusion (sampling, no exact likelihood, but extraordinary quality).

By the end of this unit, students can:
Vanilla AE:
\[ x \xrightarrow{f_\phi} z \xrightarrow{g_\theta} \hat x \]
Encoder produces a single point \(z\).
VAE:
\[ x \xrightarrow{f_\phi} (\mu, \sigma) \xrightarrow{\text{sample}} z \xrightarrow{g_\theta} \hat x \]
Encoder produces a distribution \(\mathcal{N}(\mu, \sigma^2 I)\). Sample \(z\) from it.
The latent \(z\) is now a random variable, not a deterministic function of \(x\).

This is the core trick: train the encoder to push latents toward a known prior, so we can sample from that prior at test time.
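A minimal sketch of what this buys us at test time, assuming a trained decoder \(g_\theta\) (the small `decoder` network and latent dimension `k` below are illustrative stand-ins):

```python
import torch
import torch.nn as nn

k = 8                                    # latent dimension (illustrative)
decoder = nn.Sequential(                 # stand-in for a trained decoder g_theta
    nn.Linear(k, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid()
)

z = torch.randn(16, k)                   # 16 draws from the prior N(0, I)
x_new = decoder(z)                       # decode into data space: 16 "new" samples
```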

For each training point \(x\), the VAE loss has two terms:
\[ \mathcal{L}(\theta, \phi; x) = \underbrace{-\mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x \mid z)]}_{\text{reconstruction}} + \underbrace{\mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z))}_{\text{prior-matching}} \]
The actual quantity we want to maximize is \(\log p_\theta(x)\) — but it is intractable (an integral over \(z\)).
Trick: derive a tractable lower bound. \[ \log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x \mid z)] - \mathrm{KL}(q_\phi(z \mid x) \,\|\, p(z)) = \mathrm{ELBO}. \]
Maximizing the ELBO maximizes a lower bound on the log-likelihood. We are almost doing maximum likelihood.
Start with \(\log p(x) = \log \int p(x, z) dz\). Multiply and divide by \(q_\phi(z|x)\):
\[ \log p(x) = \log \int q_\phi(z|x) \frac{p(x, z)}{q_\phi(z|x)} dz \geq \int q_\phi(z|x) \log \frac{p(x, z)}{q_\phi(z|x)} dz \quad \text{(Jensen)}. \]
Expanding \(p(x, z) = p(x|z) p(z)\) and rearranging: \[ \log p(x) \geq \mathbb{E}_q[\log p(x|z)] - \mathrm{KL}(q(z|x) \| p(z)). \]
Reparameterization trick: instead of sampling \(z \sim \mathcal{N}(\mu, \sigma^2 I)\) directly, write \(z = \mu + \sigma \odot \epsilon\) with \(\epsilon \sim \mathcal{N}(0, I)\). The randomness sits in \(\epsilon\) (no parameters); \(\mu, \sigma\) are deterministic functions of \(\phi\). Now gradients flow from the loss through \(z\) back to \(\phi\).

When \(q_\phi(z \mid x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))\) and \(p(z) = \mathcal{N}(0, I)\):
\[ \mathrm{KL}(q \,\|\, p) = \frac{1}{2}\sum_{j=1}^{k}\!\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right). \]
~10 lines. Encoder predicts \(\log \sigma\) for numerical stability. Sample once per training step (more samples = lower-variance gradient estimate but more compute).
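A minimal PyTorch sketch of that loss, assuming the encoder returns \((\mu, \log\sigma)\) and the decoder returns logits for a Bernoulli reconstruction (other likelihoods, e.g. Gaussian/MSE, work the same way):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder):
    """One ELBO evaluation with a single Monte Carlo sample of z."""
    mu, log_sigma = encoder(x)                 # encoder predicts log(sigma) for stability
    eps = torch.randn_like(mu)                 # epsilon ~ N(0, I)
    z = mu + log_sigma.exp() * eps             # reparameterization: z = mu + sigma * eps
    x_logits = decoder(z)                      # logits of p_theta(x | z), Bernoulli likelihood
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) = 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)
    kl = 0.5 * torch.sum(mu**2 + (2 * log_sigma).exp() - 2 * log_sigma - 1)
    return recon + kl                          # minimizing this maximizes the ELBO
```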
That’s it. No labels needed; no special procedure. The trained encoder/decoder pair is now also a sampler.

For materials: interpolate between two micrographs to see how phases transition; interpolate between two compositions to traverse phase space.
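A sketch of that interpolation, assuming a trained `encoder`/`decoder` pair as above (linear interpolation between posterior means; spherical interpolation is another common choice):

```python
import torch

def interpolate(x_a, x_b, encoder, decoder, steps=8):
    """Decode points on the straight line between two latents."""
    mu_a, _ = encoder(x_a)                     # posterior means as endpoints
    mu_b, _ = encoder(x_b)
    alphas = torch.linspace(0.0, 1.0, steps)
    frames = [decoder((1 - a) * mu_a + a * mu_b) for a in alphas]
    return torch.stack(frames)                 # `steps` intermediate reconstructions
```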
This is the diffusion idea: start from pure noise, gradually denoise to produce a sample.

Analogy: diffusion also iterates many small denoising steps. Unlike MCMC, the denoiser is a learned neural network.

A Markov chain that progressively adds Gaussian noise:
\[ q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t I\right). \]
The composition of \(t\) Gaussian steps is itself Gaussian:
\[ q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\; \sqrt{\bar\alpha_t}\, x_0,\; (1 - \bar\alpha_t) I\right), \]
where \(\alpha_t = 1 - \beta_t\) and \(\bar\alpha_t = \prod_{s=1}^{t} \alpha_s\).
Equivalently: \[ x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I). \]
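A short sketch of the closed-form forward process; the linear \(\beta\) schedule below is one common design choice, not the only option:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear beta schedule (one common choice)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # \bar{alpha}_t = prod_{s<=t} alpha_s

def noise_at_step(x0, t):
    """Sample x_t ~ q(x_t | x_0) in one shot, without simulating t individual steps."""
    eps = torch.randn_like(x0)
    ab = alpha_bars[t].reshape(-1, *([1] * (x0.dim() - 1)))  # broadcast over the batch
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    return x_t, eps                            # eps is returned: it becomes the training target
```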

Ideally we would run the chain backwards:
\[ p(x_{t-1} \mid x_t) = ? \]
But this true reverse conditional is intractable, so we learn an approximation \(p_\theta(x_{t-1} \mid x_t)\) instead.
The cleanest parameterization: train a network \(\epsilon_\theta(x_t, t)\) to predict the noise that was added:
\[ \epsilon_\theta(x_t, t) \approx \epsilon \quad \text{where} \quad x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon. \]
Why predict noise (instead of \(x_0\) or \(x_{t-1}\))? Empirically: noise prediction trains more stably and produces better samples. Also: noise has unit variance everywhere, so the network output is well-scaled.
The simplified loss is just MSE on noise:
\[ \mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\!\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon,\; t\right) \right\|^2\right]. \]
Algorithm (training):
1. Sample \(x_0\) from data, \(t \sim \mathrm{Uniform}\{1, \ldots, T\}\), \(\epsilon \sim \mathcal{N}(0, I)\).
2. Compute \(x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon\).
3. Predict \(\hat\epsilon = \epsilon_\theta(x_t, t)\).
4. Loss: \(\|\epsilon - \hat\epsilon\|^2\).
5. Backpropagate.
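A minimal sketch of one such training step, reusing the `alpha_bars` schedule and `noise_at_step` helper from the forward-process sketch above and assuming some noise-prediction network `eps_model(x_t, t)`:

```python
import torch

def ddpm_training_step(x0, eps_model, optimizer, T=1000):
    """One DDPM training step: predict the noise added at a random timestep."""
    t = torch.randint(0, T, (x0.shape[0],))        # t ~ Uniform over timesteps (0-indexed here)
    x_t, eps = noise_at_step(x0, t)                # closed-form forward process
    eps_hat = eps_model(x_t, t)                    # network predicts the injected noise
    loss = torch.mean((eps - eps_hat) ** 2)        # simple MSE on noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```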

Input: trained ε_θ
1. x_T ← sample from N(0, I)
2. for t = T, T-1, ..., 1:
3.     ε̂ ← ε_θ(x_t, t)
4.     μ_t ← (x_t − (β_t / √(1−ᾱ_t)) ε̂) / √α_t ;  σ_t² ← β_t (one common choice)
5.     x_{t-1} ← μ_t + σ_t · z,  z ~ N(0, I)  (z = 0 at t = 1)
6. return x_0
\(T\) network calls per sample. With \(T = 1000\), this is slow compared to a VAE (1 call) — the dominant practical limitation.
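A sketch of that sampling loop, again reusing the `betas`, `alphas`, `alpha_bars` schedule from above together with a trained `eps_model`:

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, T=1000):
    """Ancestral DDPM sampling: start from pure noise, denoise for T steps."""
    x = torch.randn(shape)                                    # x_T ~ N(0, I)
    for t in reversed(range(T)):                              # t = T-1, ..., 0 (0-indexed)
        eps_hat = eps_model(x, torch.full((shape[0],), t))
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * z                        # sigma_t^2 = beta_t (one common choice)
    return x
```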

For text-to-image: \(c\) is a text embedding (often from a frozen CLIP encoder; Unit 9).
The trick that makes conditional diffusion actually work well:

From SDE to ODE:
Three flavours, one idea:
“In 2026, training a new image generator from scratch: start with flow matching, not DDPM. Same neural network shape, simpler loss, faster inference.”
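To make "simpler loss" concrete, here is a sketch of the conditional flow-matching objective under the simplest linear interpolation path, assuming a velocity network `v_model(x_t, t)` (this follows the standard rectified-flow formulation, not a specific course implementation):

```python
import torch

def flow_matching_loss(x1, v_model):
    """Conditional flow matching with linear paths: regress the constant velocity x1 - x0."""
    x0 = torch.randn_like(x1)                                # noise endpoint
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))     # t ~ Uniform(0, 1), broadcastable
    x_t = (1 - t) * x0 + t * x1                              # point on the straight path
    v_target = x1 - x0                                       # target velocity along that path
    v_hat = v_model(x_t, t)
    return torch.mean((v_hat - v_target) ** 2)               # plain MSE, no noise schedule needed
```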
The cost of diffusion is the step count:
Two training routes:
“Default in 2026 for real-time / interactive generation: a consistency-distilled student of a flow-matching teacher. The flow-matching teacher itself is the high-quality reference.”
| | VAE | Diffusion | Flow |
|---|---|---|---|
| Sampling speed | fast | slow (fast w/ consistency) | fast |
| Sample quality | low (blurry) | very high | medium |
| Training stability | good | very good | good |
| Exact likelihood | no (lower bound) | no | yes |
| Best for | fast prototyping, latent geometry | high-quality generation, conditioning | exact likelihood |
Historical footnote: GANs (Goodfellow et al. 2014) dominated image generation from 2015 to 2021 but have since been superseded by diffusion and flow matching.
This is the materials-design loop in 2026: generative + simulator + experiment, in a closed loop.
Generative + physics is one of the most active research frontiers in materials ML.
Note
Reading for Unit 12 (Uncertainty Quantification). Skim Bishop Ch. 6 (kernel methods, Gaussian processes) and Murphy 2nd ed. Ch. 17 (Bayesian deep learning). Background reading: Rasmussen & Williams “Gaussian Processes for Machine Learning.”
Unit 12: today we learned to generate. Next, we learn to say what we don’t know. Gaussian processes give us calibrated uncertainty bands; deep ensembles and conformal prediction approximate this for neural networks.
Week 11 notebooks (in example_notebooks/ once added)
Strongly recommended: Lilian Weng’s blog post “What are diffusion models?” — the best free overview, with derivations and intuition.

© Philipp Pelz - Mathematical Foundations of AI & ML