FAU Erlangen-Nürnberg
Behind us:
Today (Unit 10):
For all of these, locality is the wrong inductive bias. We need a layer where any position can attend to any other.
The sequence problem:

“Unrolling” repeats the RNN block for each step — a chain \(z_0 \to z_1 \to \cdots \to z_T\) with the same recurrent weight \(v\) reused every step.
\[ \frac{\partial \mathcal{L}}{\partial \theta}\;\propto\;\prod_{t=1}^{T}\frac{\partial z_t}{\partial z_{t-1}}\;\sim\;v^{\,T} \]
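A quick numeric illustration of the product above, as a minimal sketch. It assumes a scalar toy RNN in which every per-step factor \(\partial z_t / \partial z_{t-1}\) equals the recurrent weight \(v\) exactly (activation derivatives are ignored); the function name `gradient_scale` is hypothetical.

```python
def gradient_scale(v: float, T: int) -> float:
    """Product of T identical per-step factors, i.e. v**T (toy assumption)."""
    scale = 1.0
    for _ in range(T):
        scale *= v  # one factor dz_t/dz_{t-1} per unrolled step
    return scale

for v in (0.9, 1.0, 1.1):
    print(f"v = {v}: v^50 = {gradient_scale(v, 50):.3e}")
```

With \(T = 50\), \(v = 0.9\) shrinks the gradient to roughly 0.005, while \(v = 1.1\) inflates it to roughly 117: vanishing and exploding gradients from the same mechanism.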


LSTM adds a separate “state” track \(C_t\) — a long-term memory highway:


This is why the transformer title “Attention is All You Need” is provocative: attention replaces the recurrence entirely.
By the end of this unit, students can:
d["key"] → value. Match must be exact.
For each query, return \(\sum_i \mathrm{similarity}(\text{query}, \text{key}_i) \cdot \text{value}_i\).
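The contrast between the two lookups can be sketched in plain Python. This is an illustration under stated assumptions, not a definitive implementation: similarity is taken to be a dot product normalized with a softmax, values are scalars, and the names `hard_lookup` and `soft_lookup` are hypothetical.

```python
import math

def hard_lookup(table: dict, key):
    # Exact match only: raises KeyError if the key is not present.
    return table[key]

def soft_lookup(query, keys, values):
    # Assumed similarity: dot product, turned into weights by a softmax.
    sims = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(sims)                                # subtract max for stability
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Every value contributes, weighted by how well its key matches.
    return sum(w * v for w, v in zip(weights, values))

print(hard_lookup({"key": "value"}, "key"))
keys = [(1.0, 0.0), (0.0, 1.0)]
values = [10.0, 0.0]
print(soft_lookup((1.0, 0.0), keys, values))
```

The soft lookup never fails on a missing key; it returns a blend dominated by the best-matching entries.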
For each position \(i\) in a sequence of length \(n\) with embeddings \(x_i \in \mathbb{R}^{d_{\text{model}}}\), compute three vectors:
\(W^Q, W^K, W^V \in \mathbb{R}^{d_k \times d_{\text{model}}}\) are learned projection matrices.
For each query \(q_i\):
The output \(z_i\) is a content-weighted blend of all values in the sequence.
Sequence: tokens \(A, B, C, D\) with \(d_{\text{model}} = 2\).
After applying \(W^Q, W^K, W^V\) (small toy values):
| Token | Q | K | V |
|---|---|---|---|
| A | (1, 0) | (1, 0) | (10, 0) |
| B | (0, 1) | (0, 1) | (0, 10) |
| C | (1, 1) | (1, 1) | (5, 5) |
| D | (1, -1) | (1, -1) | (3, -3) |
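Compute \(s_{A, j} = q_A^T k_j\) for each key: \(s_{A,A} = 1\), \(s_{A,B} = 0\), \(s_{A,C} = 1\), \(s_{A,D} = 1\).
Softmax of \((1, 0, 1, 1)\) ≈ \((0.30, 0.11, 0.30, 0.30)\). So \(A\) attends mostly to \(A\), \(C\), and \(D\), and much less to \(B\).
\(z_A = 0.30 \cdot v_A + 0.11 \cdot v_B + 0.30 \cdot v_C + 0.30 \cdot v_D \approx (5.3, 1.7)\).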
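The row for token \(A\) can be recomputed numerically. A minimal NumPy sketch using the toy \(Q\), \(K\), \(V\) values from the table; the variable names are illustrative.

```python
import numpy as np

# Rows are tokens A, B, C, D (toy values from the table).
Q = np.array([[1, 0], [0, 1], [1, 1], [1, -1]], dtype=float)
K = Q.copy()
V = np.array([[10, 0], [0, 10], [5, 5], [3, -3]], dtype=float)

def softmax(s):
    e = np.exp(s - s.max())   # subtract max for numerical stability
    return e / e.sum()

scores_A = K @ Q[0]           # q_A^T k_j for j = A, B, C, D
weights_A = softmax(scores_A) # attention weights of query A
z_A = weights_A @ V           # weighted blend of the value vectors

print("scores :", scores_A)
print("weights:", weights_A.round(2))
print("z_A    :", z_A.round(2))
```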
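Stack queries, keys, values into matrices. The rows of \(X \in \mathbb{R}^{n \times d_{\text{model}}}\) are the embeddings \(x_i^T\), so in this row-vector convention the projections multiply from the right and have shape \(d_{\text{model}} \times d_k\) (\(d_{\text{model}} \times d_v\) for \(W^V\)), i.e. the transposes of the per-token matrices above: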
\[ Q = X W^Q \in \mathbb{R}^{n \times d_k}, \quad K = X W^K \in \mathbb{R}^{n \times d_k}, \quad V = X W^V \in \mathbb{R}^{n \times d_v}. \]
All scores at once: \(S = Q K^T \in \mathbb{R}^{n \times n}\). Each row is one query’s scores against all keys.
Output: \(Z = \mathrm{softmax}(S) V \in \mathbb{R}^{n \times d_v}\).
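The matrix form maps directly onto a few lines of NumPy. This sketch uses random stand-ins for \(X\) and the projection matrices; the sizes and the helper name `softmax_rows` are assumptions for illustration, and the weights are stored as \(d_{\text{model}} \times d_k\) so that `X @ W_Q` is well defined.

```python
import numpy as np

def softmax_rows(S):
    # Softmax over each row: one probability distribution per query.
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 4, 8, 3, 5          # illustrative sizes
X = rng.normal(size=(n, d_model))          # one embedding per row
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
S = Q @ K.T                                # (n, n): all scores at once
A = softmax_rows(S)                        # row i = weights of query i
Z = A @ V                                  # (n, d_v): blended values

print(S.shape, Z.shape, A.sum(axis=1))
```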

The full self-attention layer:
\[ \boxed{\,\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V.\,} \]
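The only difference from what we just derived: the scaling factor \(1/\sqrt{d_k}\). Without it, dot products of \(d_k\)-dimensional vectors grow in magnitude with \(d_k\), which pushes the softmax into saturated regions where gradients are tiny.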
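A small experiment makes the role of \(\sqrt{d_k}\) concrete. The sketch assumes queries and keys with independent, zero-mean, unit-variance entries (an idealization); under that assumption the raw dot product has standard deviation close to \(\sqrt{d_k}\), and dividing by \(\sqrt{d_k}\) brings it back to about 1. The function name `score_std` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_std(d_k, scaled, n_samples=5000):
    """Empirical std of q.k for random q, k with unit-variance entries."""
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    s = np.sum(q * k, axis=1)          # one dot product per sample
    if scaled:
        s = s / np.sqrt(d_k)           # the attention scaling factor
    return s.std()

for d_k in (4, 64, 512):
    print(f"d_k={d_k:4d}  raw std={score_std(d_k, False):6.2f}  "
          f"scaled std={score_std(d_k, True):5.2f}")
```

Scores with a standard deviation of 20 or more would make the softmax nearly one-hot; the scaling keeps the score distribution comparable across head sizes.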

For each head \(i = 1, \ldots, h\):
\[ \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V), \]
with each head’s projection matrices independent. Concatenate and project:
\[ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O. \]
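The two formulas above can be sketched for self-attention, where \(Q = K = V = X\). Assumptions in this sketch: each head projects to \(d_{\text{model}} / h\) dimensions (a common choice, not required by the formulas), weights are random stand-ins, and the function names are hypothetical.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for one head.
    return softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V: lists with one (d_model, d_head) matrix per head.
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # Concatenate along the feature axis, then mix heads with W_O.
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2
d_head = d_model // h                      # assumed per-head size
W_Q = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_O = rng.normal(size=(h * d_head, d_model))

X = rng.normal(size=(n, d_model))
out = multi_head(X, W_Q, W_K, W_V, W_O)
print(out.shape)
```

The output has the same shape as the input, so the layer can be stacked or wrapped in a residual connection.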

Standard recipe (Vaswani et al. 2017):
We need to inject positional information explicitly.
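The reason can be checked numerically: without positional information, self-attention is permutation-equivariant, so shuffling the input tokens only shuffles the outputs in the same way. A minimal sketch with random stand-in weights (all names are illustrative):

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

perm = rng.permutation(n)                  # shuffle the token order
Z = self_attention(X, W_Q, W_K, W_V)
Z_perm = self_attention(X[perm], W_Q, W_K, W_V)

# Same outputs, just reordered: the layer cannot tell positions apart.
print(np.allclose(Z[perm], Z_perm))
```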
For position \(\mathrm{pos}\) and dimension-pair index \(i = 0, \ldots, d_{\text{model}}/2 - 1\) (so \(2i\) and \(2i+1\) are the even and odd embedding dimensions):
\[ \mathrm{PE}(\mathrm{pos}, 2i) = \sin(\mathrm{pos} / 10000^{2i/d_{\text{model}}}), \]
\[ \mathrm{PE}(\mathrm{pos}, 2i+1) = \cos(\mathrm{pos} / 10000^{2i/d_{\text{model}}}). \]
Add \(\mathrm{PE}\) to the token embeddings before the first attention layer: \(\tilde x_i = x_i + \mathrm{PE}(i, \cdot)\).
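The sinusoidal formulas translate into a short NumPy function. A minimal sketch, assuming an even \(d_{\text{model}}\); the function name `sinusoidal_pe` and the random stand-in embeddings are illustrative.

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model):
    """PE matrix of shape (n_pos, d_model); d_model assumed even."""
    assert d_model % 2 == 0
    pos = np.arange(n_pos)[:, None]              # (n_pos, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # even indices 0, 2, 4, ...
    angles = pos / 10000 ** (two_i / d_model)    # (n_pos, d_model / 2)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

pe = sinusoidal_pe(50, 128)

# Add to (stand-in) token embeddings before the first attention layer.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 128))
x_tilde = x + pe
print(pe.shape, x_tilde.shape)
```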
Figure: Sinusoidal positional encoding matrix for 50 positions and 128 dimensions. Low dimensions oscillate rapidly (fine-grained position); high dimensions change slowly (coarse-grained). Each row is a unique positional fingerprint.

For this course: assume sinusoidal or learned. RoPE is for the curious.
    x ──┐
        ├─► LayerNorm ─► Multi-Head Attn ──► (+) ──┐
        └─────────────────────────────────────┘    │
                                                   │
        ┌──────────────────────────────────────────┘
        │
        ├─► LayerNorm ─► MLP (2-layer FF) ─────► (+) ─► output
        └─────────────────────────────────────────┘
Two sublayers: attention (mixes positions) and MLP (transforms each position independently). Each wrapped in a residual connection and LayerNorm.
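The block can be sketched end to end in NumPy. Assumptions, chosen for brevity rather than fidelity: a single attention head stands in for multi-head attention, LayerNorm has no learnable gain or bias, the MLP uses ReLU, and all weights are random stand-ins. LayerNorm is applied before each sublayer, as in the diagram.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position (row) to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(x, W_Q, W_K, W_V, W_O):
    # Mixes information across positions (single head for brevity).
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    A = softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V @ W_O

def mlp(x, W1, b1, W2, b2):
    # Two-layer feed-forward net, applied to each position independently.
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def transformer_block(x, p):
    x = x + self_attention(layer_norm(x), p["W_Q"], p["W_K"], p["W_V"], p["W_O"])
    x = x + mlp(layer_norm(x), p["W1"], p["b1"], p["W2"], p["b2"])
    return x

rng = np.random.default_rng(0)
n, d_model, d_ff = 6, 16, 64               # illustrative sizes
p = {name: rng.normal(size=(d_model, d_model)) * 0.1
     for name in ("W_Q", "W_K", "W_V", "W_O")}
p.update(W1=rng.normal(size=(d_model, d_ff)) * 0.1, b1=np.zeros(d_ff),
         W2=rng.normal(size=(d_ff, d_model)) * 0.1, b2=np.zeros(d_model))

x = rng.normal(size=(n, d_model))
y = transformer_block(x, p)
print(y.shape)
```

Because both sublayers are added to their input, the block reduces to the identity when all weights are zero, which is part of why deep stacks of such blocks train stably.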

That’s it. ViT is “transformer on patches.” No convolutions, no spatial inductive bias.

For a \(224 \times 224\) image with \(16 \times 16\) patches:
The [CLS] token is a learned vector prepended to the sequence; after all blocks its final state is the image embedding — the same trick BERT uses for sentence classification.
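How an image becomes a token sequence can be sketched as follows. Assumptions: an RGB image (3 channels), a hypothetical embedding width `d_model = 64`, a random stand-in for the learned patch projection `W_E`, and a zero vector standing in for the learned [CLS] token; position embeddings are omitted.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    gh, gw = H // patch, W // patch
    x = img.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)         # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))            # RGB input assumed
patches = patchify(img, 16)                # one row per 16x16 patch

d_model = 64                               # hypothetical embedding width
W_E = rng.normal(size=(patches.shape[1], d_model)) * 0.02
cls = np.zeros((1, d_model))               # stand-in for the learned [CLS]
tokens = np.concatenate([cls, patches @ W_E], axis=0)

print(patches.shape, tokens.shape)
```

For a \(224 \times 224\) input and \(16 \times 16\) patches this gives \(14 \times 14 = 196\) patch tokens, or 197 tokens including [CLS]; the transformer blocks then operate on this sequence exactly as on text.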

For materials work: a pre-trained ViT (e.g., DINOv2) frozen + linear probe is often the strongest baseline. (Unit 9 told you this; now you know what’s inside the encoder.)



Three directions that move past 2017-vanilla attention:
What if attention isn’t the right operator at all?
Defaults in 2026: FlashAttention always; MoE (mixture of experts) for frontier LLMs; SSMs (state-space models) increasingly for very long sequences (genomes, audio, time series). Dense attention with quadratic memory is now a teaching reference, not a production target.
This is the dominant paradigm in 2026. For materials, true domain-specific foundation models are emerging (Materials Project’s MatBench Discovery, MoLFormer for molecules, MaterialsAtlas).
Unit 11: today’s transformer can be an encoder. What if we want a generator — something that produces new molecules, microstructures, designs? VAEs add a probabilistic latent; diffusion learns to denoise. The U-Net inside Stable Diffusion is, increasingly, a transformer.
Warning
Reading for Unit 11 (Generative Models: VAE & Diffusion). Skim Kingma & Welling (2013) “Auto-Encoding Variational Bayes” for VAE, and Lilian Weng’s blog post “What are diffusion models?” for diffusion intuition. Murphy (2023) Ch. 28 is the textbook reference.
Strongly recommended take-home: Karpathy’s “Let’s build GPT” video walkthrough — code: karpathy/ng-video-lecture (the 200-line nanoGPT). Demystifies decoder-only transformers entirely.
Week 10 notebooks (in example_notebooks/ once added)
timm or HuggingFace): attention map visualization on micrographs.

© Philipp Pelz - Mathematical Foundations of AI & ML