FAU Erlangen-Nürnberg
Behind us:
Today (Unit 10):
For all of these, locality is the wrong inductive bias. We need a layer where any position can attend to any other.
The sequence problem:



LSTM adds a separate “state” track \(C_t\) — a long-term memory highway:
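In the standard formulation (gates \(f_t, i_t, o_t\) and candidate state \(\tilde C_t\), all computed from \(x_t\) and \(h_{t-1}\)), the cell state is updated additively, so gradients can flow along it with little attenuation:
\[ C_t = f_t \odot C_{t-1} + i_t \odot \tilde C_t, \qquad h_t = o_t \odot \tanh(C_t). \]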


This is why the transformer title “Attention is All You Need” is provocative: attention replaces the recurrence entirely.
By the end of this unit, students can:
Dictionary lookup: `d["key"]` → value; the match must be exact. Attention is a soft lookup: for each query, return \(\sum_i \mathrm{similarity}(\text{query}, \text{key}_i) \cdot \text{value}_i\).
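A minimal numerical contrast in numpy (toy values; `soft_lookup` is a name chosen here for illustration, not a library function):

```python
import numpy as np

# Hard lookup: the key must match exactly, otherwise a KeyError.
d = {"cat": 1.0, "dog": 2.0}
print(d["cat"])                                      # 1.0

# Soft lookup: every stored value contributes, weighted by query-key similarity.
def soft_lookup(query, keys, values):
    scores = keys @ query                            # dot-product similarity per key
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over similarities
    return weights @ values                          # weighted blend of all values

keys = np.array([[1.0, 0.0], [0.0, 1.0]])
values = np.array([10.0, 0.0])
print(soft_lookup(np.array([0.9, 0.1]), keys, values))  # ~6.9: mostly the first value
```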
For each position \(i\) in a sequence of length \(n\) with embeddings \(x_i \in \mathbb{R}^{d_{\text{model}}}\), compute three vectors:
\(W^Q, W^K \in \mathbb{R}^{d_k \times d_{\text{model}}}\) and \(W^V \in \mathbb{R}^{d_v \times d_{\text{model}}}\) are learned projection matrices.
For each query \(q_i\):
The output \(z_i\) is a content-weighted blend of all values in the sequence.
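In symbols (same notation as the matrix form below; the \(\sqrt{d_k}\) scaling is added later):
\[ s_{ij} = q_i^\top k_j, \qquad \alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{j'} \exp(s_{ij'})}, \qquad z_i = \sum_j \alpha_{ij}\, v_j. \]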
Sequence: tokens \(A, B, C, D\) with \(d_{\text{model}} = 2\).
After applying \(W^Q, W^K, W^V\) (small toy values):
| Token | Q | K | V |
|---|---|---|---|
| A | (1, 0) | (1, 0) | (10, 0) |
| B | (0, 1) | (0, 1) | (0, 10) |
| C | (1, 1) | (1, 1) | (5, 5) |
| D | (1,-1) | (1,-1) | (3,-3) |
Compute the scores \(s_{A,j} = q_A^\top k_j\) for \(j \in \{A, B, C, D\}\): with \(q_A = (1, 0)\), this gives \(s_{A,\cdot} = (1, 0, 1, 1)\).
Softmax of \((1, 0, 1, 1)\) ≈ \((0.30, 0.11, 0.30, 0.30)\). So \(A\) attends mostly to \(A\), \(C\), \(D\).
\(z_A = 0.30 \cdot v_A + 0.11 \cdot v_B + 0.30 \cdot v_C + 0.30 \cdot v_D\).
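A few lines of numpy reproduce the example (values taken straight from the table above):

```python
import numpy as np

K = np.array([[1, 0], [0, 1], [1, 1], [1, -1]], dtype=float)    # keys of A, B, C, D
V = np.array([[10, 0], [0, 10], [5, 5], [3, -3]], dtype=float)  # values of A, B, C, D
q_A = np.array([1.0, 0.0])                                      # query of token A

scores = K @ q_A                                     # (1, 0, 1, 1)
weights = np.exp(scores) / np.exp(scores).sum()      # ~(0.30, 0.11, 0.30, 0.30)
z_A = weights @ V                                    # ~(5.3, 1.7)
print(scores, weights, z_A)
```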
Stack queries, keys, values into matrices:
\[ Q = X W^Q \in \mathbb{R}^{n \times d_k}, \quad K = X W^K \in \mathbb{R}^{n \times d_k}, \quad V = X W^V \in \mathbb{R}^{n \times d_v}. \]
All scores at once: \(S = Q K^T \in \mathbb{R}^{n \times n}\). Each row is one query’s scores against all keys.
Output: \(Z = \mathrm{softmax}(S) V \in \mathbb{R}^{n \times d_v}\).

The full self-attention layer:
\[ \boxed{\,\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V.\,} \]
The only difference from what we just derived is the division by \(\sqrt{d_k}\): for large \(d_k\), dot products between random vectors grow like \(\sqrt{d_k}\) in magnitude, pushing the softmax into saturated regions with tiny gradients; the scaling keeps the scores at a moderate size.
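A minimal sketch of the matrix form in numpy (single sequence, no masking, no batch dimension):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one sequence of length n."""
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                              # (n, n) scaled scores
    S = S - S.max(axis=-1, keepdims=True)                   # stabilize the softmax
    A = np.exp(S) / np.exp(S).sum(axis=-1, keepdims=True)   # row-wise softmax
    return A @ V                                            # (n, d_v) outputs
```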

For each head \(i = 1, \ldots, h\):
\[ \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V), \]
with each head’s projection matrices independent. Concatenate and project:
\[ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O. \]
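A compact sketch (numpy, reusing `scaled_dot_product_attention` from above). Splitting one large projection into \(h\) chunks is the usual implementation trick and is equivalent to \(h\) independent per-head matrices:

```python
import numpy as np

n, d_model, h = 6, 16, 4
d_k = d_model // h
rng = np.random.default_rng(0)
W_Q, W_K, W_V, W_O = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4))

def split_heads(M):                                   # (n, d_model) -> (h, n, d_k)
    return M.reshape(n, h, d_k).transpose(1, 0, 2)

def multi_head_attention(X):
    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    heads = [scaled_dot_product_attention(Q[i], K[i], V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O       # concat heads, then project

X = rng.normal(size=(n, d_model))
print(multi_head_attention(X).shape)                  # (6, 16)
```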

Standard recipe (Vaswani et al. 2017): \(h = 8\) heads with \(d_k = d_v = d_{\text{model}} / h = 64\) for \(d_{\text{model}} = 512\), so the total computation is comparable to single-head attention at full dimensionality.
Self-attention is permutation-equivariant: shuffle the input tokens and the outputs shuffle identically, with no notion of order. We need to inject positional information explicitly.
For position \(\mathrm{pos}\) and embedding dimension \(i\):
\[ \mathrm{PE}(\mathrm{pos}, 2i) = \sin(\mathrm{pos} / 10000^{2i/d_{\text{model}}}), \] \[ \mathrm{PE}(\mathrm{pos}, 2i+1) = \cos(\mathrm{pos} / 10000^{2i/d_{\text{model}}}). \]
Add \(\mathrm{PE}\) to the token embeddings before the first attention layer: \(\tilde x_i = x_i + \mathrm{PE}(i, \cdot)\).
Figure: sinusoidal positional encoding matrix for 50 positions and 128 dimensions. Low dimensions oscillate rapidly (fine-grained position); high dimensions change slowly (coarse-grained). Each row is a unique positional fingerprint. (Generated locally.)
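The matrix in the figure can be generated in a few lines (assumes \(d_{\text{model}}\) is even):

```python
import numpy as np

def sinusoidal_pe(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n_positions)[:, None]              # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model // 2)
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

PE = sinusoidal_pe(50, 128)                            # the 50 x 128 matrix in the figure
```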

For this course: assume sinusoidal or learned. RoPE is for the curious.
```
x ──┐
    ├─► LayerNorm ─► Multi-Head Attn ──► (+) ──┐
    └─────────────────────────────────────┘    │
    ┌──────────────────────────────────────────┘
    ├─► LayerNorm ─► MLP (2-layer FF) ──► (+) ─► output
    └──────────────────────────────────────┘
```
Two sublayers: attention (mixes positions) and MLP (transforms each position independently). Each wrapped in a residual connection and LayerNorm.
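A minimal pre-LN block sketch in PyTorch, mirroring the diagram above (`nn.MultiheadAttention` handles the head splitting internally; dropout omitted):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-LN block: x + Attn(LN(x)), then x + MLP(LN(x))."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):                                    # x: (batch, n, d_model)
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]    # attention sublayer + residual
        x = x + self.mlp(self.ln2(x))                        # MLP sublayer + residual
        return x

x = torch.randn(2, 10, 512)
print(TransformerBlock()(x).shape)                           # torch.Size([2, 10, 512])
```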

That’s it. ViT is “transformer on patches.” No convolutions, no spatial inductive bias.

For a \(224 \times 224\) image with \(16 \times 16\) patches: \((224/16)^2 = 196\) patches, each flattened to \(16 \cdot 16 \cdot 3 = 768\) values and linearly projected to \(d_{\text{model}}\).
The [CLS] token is a learned vector prepended to the sequence. After all transformer blocks, its final representation is used as the image embedding.
This is the same trick BERT uses for sentence classification.
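A sketch of the patch-embedding front end in PyTorch (positional embeddings omitted for brevity; the numbers match the \(224 \times 224\), \(16 \times 16\) example above):

```python
import torch
import torch.nn as nn

img, patch, d_model = 224, 16, 768
n_patches = (img // patch) ** 2                               # 196

proj = nn.Linear(3 * patch * patch, d_model)                  # flatten each patch, then project
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))          # learned [CLS] vector

x = torch.randn(1, 3, img, img)                               # one RGB image
patches = x.unfold(2, patch, patch).unfold(3, patch, patch)   # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, n_patches, -1)
tokens = torch.cat([cls_token.expand(1, -1, -1), proj(patches)], dim=1)
print(tokens.shape)                                           # (1, 197, 768): [CLS] + 196 patches
```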
For materials work: a pre-trained ViT (e.g., DINOv2) frozen + linear probe is often the strongest baseline. (Unit 9 told you this; now you know what’s inside the encoder.)
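A sketch of that recipe, assuming `timm` and scikit-learn are available; the model name is just one example, and the random tensors stand in for a real micrograph loader:

```python
import numpy as np
import timm
import torch
from sklearn.linear_model import LogisticRegression

# Frozen pre-trained ViT as a feature extractor; num_classes=0 returns pooled features.
backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
backbone.eval()

@torch.no_grad()
def embed(images):                                   # (B, 3, 224, 224), already normalized
    return backbone(images).numpy()                  # (B, 768)

# Hypothetical stand-ins for real micrograph tensors and labels.
images, labels = torch.randn(32, 3, 224, 224), np.array([0, 1] * 16)
probe = LogisticRegression(max_iter=1000).fit(embed(images), labels)
print(probe.score(embed(images), labels))
```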


Three directions that move past 2017-vanilla attention:
What if attention isn’t the right operator at all?
Defaults in 2026: Flash Attention always; MoE for frontier LLMs; SSMs increasingly for very long sequences (genomes, audio, time series). Dense attention with quadratic memory is now a teaching reference, not a production target.
This is the dominant paradigm in 2026. For materials, true domain-specific foundation models are emerging (Materials Project’s MatBench Discovery, MoLFormer for molecules, MaterialsAtlas).
Note
Reading for Unit 11 (Generative Models: VAE & Diffusion). Skim Kingma & Welling (2013) “Auto-Encoding Variational Bayes” for VAE, and Lilian Weng’s blog post “What are diffusion models?” for diffusion intuition. Murphy (2023) Ch. 28 is the textbook reference.
Unit 11: today’s transformer can be an encoder. What if we want a generator — something that produces new molecules, microstructures, designs? VAEs add a probabilistic latent; diffusion learns to denoise. The U-Net inside Stable Diffusion is, increasingly, a transformer.
Week 10 notebooks (in example_notebooks/ once added)
Pre-trained ViT (timm or HuggingFace): attention map visualization on micrographs.
Strongly recommended take-home: Karpathy’s “Let’s build GPT” notebook (nanoGPT). ~200 lines; demystifies decoder-only transformers entirely.

© Philipp Pelz - Mathematical Foundations of AI & ML