FAU Erlangen-Nürnberg
Behind us:
Today (Unit 9):
The reconstruction loss says only one thing: encode enough information to reconstruct the input. It says nothing about whether similar inputs end up close, whether categories cluster, or whether interpolation is meaningful.

Four desiderata, with examples that fail each:

By the end of this unit, students can:
We have \(N\) high-dimensional points \(\{x_1, \ldots, x_N\} \subset \mathbb{R}^d\). We want low-dimensional points \(\{y_1, \ldots, y_N\} \subset \mathbb{R}^2\) such that:
points that are close in \(\mathbb{R}^d\) stay close in \(\mathbb{R}^2\).
This is harder than it sounds — distances in \(\mathbb{R}^d\) generally cannot all be preserved in \(\mathbb{R}^2\). t-SNE makes a specific trade-off: preserve local structure, sacrifice global structure.

For each pair \((i, j)\) in the high-dim space, define a conditional probability:
\[ p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}. \]
Read this as: “if you’re \(x_i\) and you pick a neighbor according to a Gaussian, with what probability would you pick \(x_j\)?” Symmetrize: \(p_{ij} = (p_{j|i} + p_{i|j}) / (2N)\).
Perplexity \(= 2^{H(P_i)}\) with \(H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}\): the effective number of neighbors of \(x_i\). The user picks perplexity (typical 5-50); t-SNE binary-searches \(\sigma_i\) per point to match it.
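A minimal NumPy sketch of this step (illustrative, not the Barnes-Hut implementation): given the squared distances from \(x_i\) to all other points, binary-search \(\sigma_i\) until \(2^{H(P_i)}\) hits the chosen perplexity.

```python
import numpy as np

def cond_probs(sq_dists_i, sigma):
    """p_{j|i} from squared distances of x_i to all other points (self excluded)."""
    logits = -sq_dists_i / (2.0 * sigma**2)
    logits -= logits.max()                    # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sigma_for_perplexity(sq_dists_i, target_perplexity, iters=50):
    lo, hi = 1e-10, 1e10
    for _ in range(iters):
        sigma = 0.5 * (lo + hi)
        p = cond_probs(sq_dists_i, sigma)
        H = -np.sum(p * np.log2(p + 1e-12))   # entropy in bits
        if 2.0**H > target_perplexity:        # too many effective neighbors
            hi = sigma                        #   -> shrink the Gaussian
        else:
            lo = sigma
    return sigma
```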
A naive low-dim Gaussian \(q_{ij}^{\text{Gauss}} \propto \exp(-\|y_i - y_j\|^2)\) matched to \(P\) via \(\mathrm{KL}(P\|Q)\) breaks: in \(\mathbb{R}^2\) there is no room for the many moderate-distance neighbors that a high-dim shell affords (only ~6 fit at the same distance). The Gaussian's thin tail then forces an attractive collapse (the crowding problem).
Fix: a heavy-tailed Student-t (1 d.o.f. = Cauchy) in low-dim,
\[ q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}. \]
Moderate distances stay possible without infinite force — the “t” in t-SNE.
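A sketch of the resulting low-dimensional affinities, assuming `Y` is the \(N \times 2\) embedding:

```python
import numpy as np

def student_t_affinities(Y):
    """q_{ij} with a Cauchy (Student-t, 1 d.o.f.) kernel; q_{ii} = 0."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=-1)  # pairwise ||y_i - y_j||^2
    num = 1.0 / (1.0 + sq)                                    # heavy-tailed kernel
    np.fill_diagonal(num, 0.0)                                # exclude i = j
    return num / num.sum()                                    # normalise over all pairs k != l
```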

Optimize the embedding \(\{y_i\}\) to minimize:
\[ C(\{y_i\}) = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}. \]
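The gradient has the closed form \(\partial C / \partial y_i = 4 \sum_j (p_{ij} - q_{ij})\,(1 + \|y_i - y_j\|^2)^{-1} (y_i - y_j)\). A sketch of one plain gradient step (real implementations add momentum and early exaggeration), reusing `student_t_affinities` from above and assuming `P` is the symmetrized \(N \times N\) matrix with zero diagonal:

```python
import numpy as np

def tsne_gradient(P, Y):
    sq = np.sum((Y[:, None, :] - Y[None, :, :])**2, axis=-1)
    inv = 1.0 / (1.0 + sq)                   # (1 + ||y_i - y_j||^2)^{-1}
    Q = student_t_affinities(Y)
    PQ = (P - Q) * inv                       # symmetric attraction/repulsion weights
    return 4.0 * (np.diag(PQ.sum(axis=1)) @ Y - PQ @ Y)

# one update of the embedding (repeat ~1000 times):
# Y -= learning_rate * tsne_gradient(P, Y)
```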
Read t-SNE plots qualitatively. Never quantitatively.

Analogous benchmark: Fashion MNIST (10 clothing classes, 6000 images, 784 dims → 2 dims, perplexity 40)
For materials: same logic applies to alloy-composition or spectral data.
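A sketch of running this benchmark with scikit-learn; the OpenML loader is an assumption, and any \((N, 784)\) array of flattened images works the same way:

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.manifold import TSNE

X, y = fetch_openml("Fashion-MNIST", version=1, return_X_y=True, as_frame=False)
idx = np.random.default_rng(0).choice(len(X), 6000, replace=False)   # subsample for speed
Y2d = TSNE(n_components=2, perplexity=40, init="pca",
           random_state=0).fit_transform(X[idx])
# Y2d has shape (6000, 2); colour the scatter by y[idx] to see the clothing classes
```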

Contrast with t-SNE: same spirit (match high- and low-dim affinities), but UMAP uses cross-entropy on a graph rather than KL on distributions, plus a spectral initial layout — which is why it preserves global structure better and runs faster.
| | t-SNE | UMAP |
|---|---|---|
| Local structure | excellent | excellent |
| Global structure | poor | better |
| Speed | slow (\(O(N \log N)\) Barnes-Hut) | fast (linear in edges) |
| Stability | stochastic, random init | more stable, spectral init |
| Theoretical grounding | KL on distributions | cross-entropy on fuzzy graph |
| Scales to | \(\sim 10^4\) points | \(\sim 10^6\)+ points |
2026 default: reach for UMAP first for any embedding/visualisation of materials data. Use t-SNE only when you specifically want to inspect local cluster shape (e.g. is this island one phase or two?) — its local exaggeration is then a feature, not a bug.
Never use either method to make quantitative distance claims.
- n_neighbors (analog of perplexity): local vs global trade-off. Low → local; high → global.
- min_dist: how tight clusters can be in the output.
- metric: cosine for normalised embeddings, Euclidean for raw features.
- The defaults (n_neighbors=15, min_dist=0.1) usually work; see the sketch below.
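A minimal usage sketch with umap-learn showing those knobs (`X` stands for any \((N, d)\) feature matrix):

```python
import umap

reducer = umap.UMAP(n_neighbors=15,    # local vs global trade-off
                    min_dist=0.1,      # how tightly points may pack in 2-D
                    metric="cosine")   # cosine for normalised embeddings
Y2d = reducer.fit_transform(X)         # X: (N, d) -> Y2d: (N, 2)
```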
Setup. ~10 k SEM micrographs from a public steel-microstructure corpus; ResNet-50 (ImageNet) embeddings \(\in \mathbb{R}^{2048}\); UMAP to \(\mathbb{R}^2\) with n_neighbors=30, min_dist=0.05, cosine metric.
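A sketch of that pipeline, with data loading left as a placeholder (`sem_loader` is a hypothetical DataLoader yielding preprocessed image batches):

```python
import torch
import umap
from torchvision.models import resnet50, ResNet50_Weights

# ImageNet ResNet-50 as a frozen feature extractor: drop the final fc layer,
# keep the 2048-d pooled features.
backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
encoder = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

feats = []
with torch.no_grad():
    for batch in sem_loader:                       # hypothetical: (B, 3, H, W) tensors
        feats.append(encoder(batch).flatten(1))    # (B, 2048)
feats = torch.cat(feats).numpy()

Y2d = umap.UMAP(n_neighbors=30, min_dist=0.05, metric="cosine").fit_transform(feats)
```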

Insight: instead of asking “can you reconstruct?”, ask “can you tell that this and that are the same thing?”
The augmentation choice defines what invariances the latent learns.
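For concreteness, one common (SimCLR-style) choice of augmentations, sketched with torchvision; the exact recipe is a design decision, not a fixed part of the method:

```python
from torchvision import transforms

# The latent becomes invariant to whatever these transforms change:
# crops, colour shifts, flips, greyscale conversion and blur.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])
# two independent draws of `augment` on the same image give one positive pair
```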

For a positive pair \((i, j)\):
\[ \mathcal{L}_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j) / \tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k) / \tau)}, \]
where \(\mathrm{sim}(u, v) = u^T v / (\|u\| \|v\|)\) (cosine similarity), and \(\tau\) is a temperature hyperparameter.
Read this as a \((2N-1)\)-way classification: given \(z_i\), identify its augmented partner \(z_j\) among all other candidates. The loss is minimized when the positive's similarity exceeds every negative's.
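A minimal PyTorch sketch of this loss (often called NT-Xent) for a batch of \(N\) positive pairs; `z1`, `z2` are the projections of the two augmented views:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """SimCLR-style contrastive loss over 2N embeddings."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N unit vectors
    sim = z @ z.T / tau                                   # cosine similarities / temperature
    sim.fill_diagonal_(float("-inf"))                     # exclude k = i from each softmax
    N = z1.shape[0]
    # the positive of row i is its augmented partner: i+N (first half) or i-N (second half)
    targets = torch.cat([torch.arange(N, 2 * N), torch.arange(0, N)])
    return F.cross_entropy(sim, targets)                  # (2N-1)-way classification per row
```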
This is the modern self-supervised pre-training story: 2018-2022 saw a string of contrastive methods (MoCo, SimCLR, BYOL, DINO) compete with and eventually surpass supervised pre-training on ImageNet.
| | AE (Unit 5) | Contrastive |
|---|---|---|
| Supervisory signal | reconstruct \(x\) from \(z\) | \(z\)-similarity for augmented pairs |
| Invariances learned | none enforced | augmentation-defined |
| Latent organization | reconstruction-driven | semantic |
| Typical use | compression, anomaly | features for downstream tasks |
For materials: contrastive embeddings often beat AE embeddings on downstream classification, especially when labels are scarce.
Two families.
Where the field is in 2026.
Default in 2026 for vision SSL: DINOv2 if compute allows; MAE if simplicity matters; SimCLR for teaching the contrastive idea.
This is the dominant paradigm in modern ML. GPT, BERT, ViT, DINO, CLIP — all foundation models.

This is often the strongest baseline in 2026 for any classification task with limited labels. Try it first.
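A minimal sketch of what that baseline looks like in practice, assuming it means a frozen pre-trained encoder plus a linear probe; `encoder`, `X_train`, etc. are placeholders:

```python
import torch
from sklearn.linear_model import LogisticRegression

with torch.no_grad():                                  # encoder stays frozen
    Z_train = encoder(X_train).cpu().numpy()           # hypothetical pre-trained encoder
    Z_test = encoder(X_test).cpu().numpy()

probe = LogisticRegression(max_iter=2000).fit(Z_train, y_train)
print("linear-probe accuracy:", probe.score(Z_test, y_test))
```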
Setup: pre-train on (image, caption) pairs.
Result: a shared embedding space where text and images are comparable.
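A sketch of the training objective behind that shared space: a symmetric InfoNCE loss over a batch of matched pairs (the encoders producing `img_emb` and `txt_emb` are placeholders here):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, tau=0.07):
    """img_emb, txt_emb: (N, d) embeddings of paired images and captions."""
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.T / tau                   # (N, N): entry (i, i) is the true pair
    targets = torch.arange(len(logits))
    return 0.5 * (F.cross_entropy(logits, targets)        # image -> matching caption
                  + F.cross_entropy(logits.T, targets))   # caption -> matching image
```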

This works because low-level visual features (edges, textures) transfer across domains. The encoder doesn’t know what “martensite” is; it knows what “high-contrast textured region” is, and that turns out to be enough.
Missing from all three: an explicit, controllable distribution over \(z\) — necessary for sampling new data.


Note
Reading for Unit 10 (Attention & Transformers). Skim Vaswani et al. (2017) “Attention is All You Need” — at minimum the abstract and Section 3 (Model Architecture). Read Bishop 2nd ed. Ch. 12 if available.
Unit 10: today we praised “the encoder” as if it were a black box. Tomorrow we open it up. The architecture behind every modern foundation model — text, image, audio, materials — is the transformer. Self-attention as content-based addressing.
Week 9 notebooks (in example_notebooks/ once added)

© Philipp Pelz - Mathematical Foundations of AI & ML