Materials Genomics

Philipp Pelz; Stefan Hiemer

Abstract

This course introduces students to materials genomics, treating the periodic table and the space of known crystal structures as a searchable, computable design space. Students learn how materials databases are built, how simulation methods generate materials data, how atomic structure is represented numerically, how structure–property relationships are learned using machine learning, and how uncertainty-aware models enable accelerated materials discovery.

ECLIPSE Lab Teaching

Computational materials discovery through databases, simulation, structure representations, and machine learning.

Semester

Summer Semester 2026

Format

2h lecture

Credits

5 ECTS

Audience

Students interested in materials discovery, simulation, and AI-driven design

Prerequisites

Helpful: Mathematical Foundations of AI & ML and basic materials science

How to use this course site. Use this page as the central hub for syllabus, lecture structure, reading, notebooks, and course materials. Formal announcements and enrollment remain on StudOn; code and openly shared resources live in the linked GitHub repository.

1 Instructors

Philipp Pelz
Stefan Hiemer

2 Course Information

5th Semester – 5 ECTS · 2h lecture
Coordinated with “Mathematical Foundations of AI & ML” (MFML) and “ML for Materials Processing & Characterization” (ML-PC)

3 Course Philosophy

Materials genomics views the periodic table and all known crystal structures as a high-dimensional design space.

In this course, students learn to:

understand how materials data is generated by simulations and experiments,
treat materials data as a structured, learnable representation space,
move beyond classical descriptors toward learned representations,
use ML models as surrogates for quantum-mechanical and continuum simulations,
reason about uncertainty, stability, and discovery,
understand how computational screening integrates with experiments.

The course explicitly builds on MFML:

PCA and regression are assumed background,
neural networks, representation learning, and uncertainty are used, not re-derived.

4 Week-by-Week Curriculum (14 weeks)

Teaching responsibility

Weeks 2, 3, 4, and 6 of Materials Genomics will be taught by Stefan Hiemer.

4.1 Unit I — Where Materials Data Comes From (Weeks 1–5)

4.1.1 Week 1 – What is Materials Genomics? (14.04.2026)

Slides: Open

Why the data backbone of materials genomics is ultimately quantum: bulk properties originate at atomic scales and are encoded in simulation methods (DFT, MD) that presuppose QM.
Failures of classical physics: blackbody radiation and the UV catastrophe, photoelectric effect, Compton scattering, Taylor’s feeble-light double slit.
Evolution of atomic models: Thomson → Rutherford → Bohr; de Broglie matter waves; Davisson–Germer diffraction; Stern–Gerlach and electron spin.
Schrödinger’s equation, the hydrogen atom, Born’s probabilistic interpretation of \(|\psi|^2\).
Postulates of QM: Hilbert spaces, Hermitian operators, measurement outcomes as eigenvalues, bra-ket notation.

Summary: The opening lecture builds the quantum-mechanical foundation that all later simulation methods in this course (DFT, MD, electronic-structure descriptors) presuppose. It traces the historical arc from the hubris of late-19th-century classical physics through the experiments that forced quantization (blackbody radiation, the photoelectric effect, matter-wave diffraction, Stern–Gerlach) to Schrödinger’s wave-mechanical eigenvalue problem and Born’s probabilistic reading. The lecture closes with the operator/Hilbert-space postulates of QM that underpin every electronic-structure dataset students will later encounter in materials-genomics databases.

Exercise: Explore Materials Project; query bandgaps, formation energies, symmetries.
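
As a small illustration of the ultraviolet catastrophe listed among this week's topics, the following sketch compares the classical Rayleigh–Jeans radiance with Planck's law. The temperature and the sample frequencies are arbitrary choices made for illustration, not values taken from the lecture.

```python
import math

H = 6.62607015e-34   # Planck constant, J s (exact SI value)
K_B = 1.380649e-23   # Boltzmann constant, J/K (exact SI value)
C = 2.99792458e8     # speed of light, m/s (exact SI value)


def rayleigh_jeans(nu, temperature):
    """Classical spectral radiance: 2 nu^2 k_B T / c^2."""
    return 2.0 * nu**2 * K_B * temperature / C**2


def planck(nu, temperature):
    """Planck spectral radiance: (2 h nu^3 / c^2) / (exp(h nu / k_B T) - 1)."""
    x = H * nu / (K_B * temperature)
    return 2.0 * H * nu**3 / C**2 / math.expm1(x)


T = 5000.0  # illustrative temperature in kelvin (an assumption)
for nu in (1e11, 1e13, 1e15, 3e15):
    ratio = rayleigh_jeans(nu, T) / planck(nu, T)
    print(f"nu = {nu:.0e} Hz  Rayleigh-Jeans / Planck = {ratio:.3e}")
```

At low frequency the two laws agree, while at high frequency the classical expression overshoots by many orders of magnitude, which is the failure that quantization resolves.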

4.1.2 Week 2 – QM postulates, solvable systems, multi-electron atoms (21.04.2026)

Slides: Open

Postulates 5–6: state collapse and Schrödinger time evolution
Operators and orthogonal decomposition; expectation values and variance
Approximation methods: variational principle, perturbation theory
Analytically solvable systems: free particle, harmonic oscillator (1D + \(d\)-dim), infinite / finite well
Multi-electron atoms: helium, indistinguishability, Pauli antisymmetry, exchange
Molecular Hamiltonian → Born–Oppenheimer → Hartree product → Slater determinants → LCAO

Summary:

Closes the QM theory introduced in Week 1 (operators, Hamiltonian formalism, postulates 5–6)
Approximation toolkit that every later quantum-chemistry method relies on
Walks through the standard analytically solvable systems (free particle, harmonic oscillator, wells) as sanity checks
Pivots from one-electron to many-electron physics: exchange, Slater determinants
Sets up the molecular electronic-structure problem and the LCAO ansatz used in Week 3

Exercise: Solve the 1D harmonic oscillator with ladder operators; compute expectation values and verify orthogonality.
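
One possible starting point for this exercise is a matrix representation of the ladder operators in a truncated number basis. The sketch below is a minimal version under stated assumptions: energies are in units of \(\hbar\omega\), positions in units of \(\sqrt{\hbar/(m\omega)}\), and the basis size `N` is an arbitrary illustrative choice.

```python
import math

N = 8  # truncated number basis |0>, ..., |N-1> (size chosen for illustration)


def matmul(a, b):
    """Multiply two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Annihilation operator: <n-1| a |n> = sqrt(n).
destroy = [[math.sqrt(j) if j == i + 1 else 0.0 for j in range(N)] for i in range(N)]
# Creation operator: transpose of the (real) annihilation matrix.
create = [list(row) for row in zip(*destroy)]

# Hamiltonian in units of hbar*omega: H = a^dagger a + 1/2.
number_op = matmul(create, destroy)
hamiltonian = [[number_op[i][j] + (0.5 if i == j else 0.0) for j in range(N)]
               for i in range(N)]

# Dimensionless position operator x = (a + a^dagger) / sqrt(2).
x = [[(destroy[i][j] + create[i][j]) / math.sqrt(2.0) for j in range(N)]
     for i in range(N)]
x2 = matmul(x, x)

# Commutator [a, a^dagger]; it equals the identity except in the last
# diagonal entry, which is an artefact of truncating the basis.
a_adag = matmul(destroy, create)
commutator = [[a_adag[i][j] - number_op[i][j] for j in range(N)] for i in range(N)]

energies = [hamiltonian[n][n] for n in range(N)]   # n + 1/2
x_mean = [x[n][n] for n in range(N)]               # <n|x|n> = 0
x2_mean = [x2[n][n] for n in range(N)]             # n + 1/2, except the last state
print(energies)
```

The Hamiltonian comes out diagonal, which reflects that the number states are mutually orthogonal eigenstates. The last entry of `x2_mean` and of the commutator deviates from the exact result because the basis is cut off at `N` states.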

4.1.3 Week 3 – Quantum chemistry methods (HF, MP, CC, DFT) (28.04.2026)

Slides: Open

Basis sets: Slater- vs Gaussian-type orbitals; STO-nG, 6-31G(d), cc-pVnZ; basis-set superposition error
Hartree–Fock: SCF cycle, Roothaan–Hall equations, restricted vs unrestricted
What HF misses: electron correlation
Møller–Plesset perturbation theory (MP2, MPn)
Coupled cluster (CCSD, CCSD(T)) — the gold standard for small molecules
Density Functional Theory: Hohenberg–Kohn theorems, Kohn–Sham equations
Exchange-correlation functionals: LDA → GGA → hybrid → double-hybrid (Jacob’s ladder)
Cost vs accuracy hierarchy; choosing a method for materials databases

Summary:

Numerical methods for solving the multi-electron Schrödinger equation set up in Week 2
Orbital basis sets and what makes a basis “good”
Hartree–Fock as the entry-level method; correlation as the missing ingredient
Post-HF (MP, CC) as systematic correlation hierarchies
DFT as the workhorse for periodic / large systems; the XC-functional ladder
Method-vs-cost decision logic — Materials Project and OQMD use mostly GGA-DFT

Exercise: Compare HF, MP2, and DFT (LDA/GGA/B3LYP) energies on H\(_2\) / H\(_2\)O; observe correlation contribution and basis-set convergence.
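
To make the cost side of the cost-vs-accuracy hierarchy concrete, the sketch below uses the commonly quoted formal scaling exponents of each method with system size. These exponents are textbook rules of thumb and an assumption here; real codes deviate through prefactors, integral screening, and density fitting.

```python
# Commonly quoted formal scaling exponents with system size N (basis functions).
# Real implementations differ through prefactors, screening, and density fitting.
FORMAL_SCALING = {
    "DFT (LDA/GGA)": 3,
    "HF": 4,
    "MP2": 5,
    "CCSD": 6,
    "CCSD(T)": 7,
}


def relative_cost(method, size_factor):
    """Cost multiplier when the system size grows by size_factor."""
    return size_factor ** FORMAL_SCALING[method]


for method in FORMAL_SCALING:
    factor = relative_cost(method, 2)
    print(f"{method:>14}: doubling the system costs about {factor}x more")
```

Under these assumptions, doubling the system size multiplies the cost of a semi-local DFT calculation by 8 but that of CCSD(T) by 128, which is one reason high-throughput databases rely on GGA-level DFT.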

4.1.4 Week 4 – Thermodynamics, statistical mechanics & classical atomistic simulation (05.05.2026)

Slides: Open

Macroscopic thermodynamics: state variables, four laws, thermodynamic potentials \(U\), \(H\), \(F\), \(G\)
Ideal gas: empirical (Boyle, Charles, Avogadro) → equipartition → Maxwell–Boltzmann distribution
Statistical mechanics: microstates, ensembles (NVE, NVT, NPT, \(\mu\)VT), Boltzmann distribution, partition function \(Z\)
Interatomic potentials: pair (LJ, Morse), EAM for metals, bond-order (Tersoff, REBO, ReaxFF)
Brief preview: ML force fields (Behler–Parrinello, GAP, NequIP, MACE)
Static simulations: energy minimization (steepest descent, conjugate gradient, FIRE)
Molecular Dynamics: Verlet integrator, thermostats (Nosé–Hoover, Langevin), barostats, periodic BCs, RDF / MSD / diffusion

Summary:

Pivots from electronic-structure (Weeks 1–3) to finite-temperature physics
Thermodynamics framework: state variables, free energies, equilibrium concepts
Statistical mechanics: ensembles, partition function, Boltzmann distribution
Classical atomistic simulation as the bridge from QM to materials-scale data
MD for time-dependent observables; the Monte Carlo counterpart follows in Week 5

Exercise: Run a small MD simulation in an NVT ensemble; compute the radial distribution function and self-diffusion coefficient.
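
Before adding a thermostat, it can help to verify the integrator alone. The following sketch implements the velocity form of the Verlet integrator for a single 1D particle in an NVE setting. The harmonic toy potential, the reduced units, and the time step are assumptions made for illustration only.

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate one particle in 1D; returns lists of positions and velocities."""
    xs, vs = [x], [v]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # first half kick
        x = x + dt * v_half                # drift
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half kick
        xs.append(x)
        vs.append(v)
    return xs, vs


# Toy system (an assumption): harmonic oscillator with k = m = 1 in reduced units.
K, M = 1.0, 1.0
xs, vs = velocity_verlet(x=1.0, v=0.0, force=lambda x: -K * x,
                         mass=M, dt=0.01, n_steps=10_000)

energies = [0.5 * M * v * v + 0.5 * K * x * x for x, v in zip(xs, vs)]
drift = max(abs(e - energies[0]) for e in energies)
print(f"initial energy = {energies[0]:.6f}, max deviation = {drift:.2e}")
print(f"x(t=100) = {xs[-1]:.4f}, exact cos(100) = {math.cos(100.0):.4f}")
```

The total energy fluctuates but does not drift systematically, which is the property that makes Verlet-type integrators the default choice in MD codes.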

4.1.5 Week 5 – Monte Carlo sampling & continuum mechanics (12.05.2026)

Slides: Open

Monte Carlo: importance sampling, Markov chains, detailed balance, Metropolis algorithm, ergodicity diagnostics
MC move catalogue (displacement, swap, volume, insertion/removal, hybrid MC); ensembles beyond NVT
MC vs MD comparison; kinetic Monte Carlo as a brief outlook
Continuum mechanics: 1D balance law, transient continuity equation, constitutive relations (Fick, Fourier, Darcy), isotropy/anisotropy
Finite Difference Method: Taylor stencils, forward/backward/central differences, 1D Laplace example, BCs, CFL stability
Finite Element / Finite Volume methods: weak form, shape functions, divergence-theorem flux balance — when and why to use them
Where ML plugs in: neural samplers, neural PDE solvers, MLIP-powered MD/MC

Summary: This lecture closes the simulation half of the course. Monte Carlo joins MD as the second standard sampler of the Boltzmann distribution and enables moves that MD cannot perform: species swaps, particle insertion and removal in the \(\mu\)VT ensemble, and non-physical moves that bypass energy barriers. Continuum mechanics then moves up in length scale: balance laws become PDEs, which we solve numerically with finite differences (FEM and FV are sketched as the production-grade alternatives). Together with Weeks 1–4, this completes Unit I; the rest of the course builds on its data generators.
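
The Metropolis acceptance rule can be sketched in a few lines. The example below samples a 1D harmonic potential at \(\beta = 1\) rather than a Lennard-Jones system; the potential, the step sizes, the seed, and the number of steps are all illustrative assumptions.

```python
import math
import random


def metropolis(potential, x0, beta, step, n_steps, rng):
    """Sample exp(-beta * U(x)) with symmetric uniform displacement moves."""
    x, u = x0, potential(x0)
    samples, accepted = [], 0
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)
        u_new = potential(x_new)
        # Metropolis criterion: accept with probability min(1, exp(-beta * dU)).
        if u_new <= u or rng.random() < math.exp(-beta * (u_new - u)):
            x, u = x_new, u_new
            accepted += 1
        samples.append(x)  # a rejected move repeats the old state
    return samples, accepted / n_steps


rng = random.Random(0)  # fixed seed so the sketch is reproducible
rates, x2_means = {}, {}
for step in (0.1, 2.0, 10.0):
    samples, rates[step] = metropolis(lambda x: 0.5 * x * x, 0.0, 1.0,
                                      step, 100_000, rng)
    x2_means[step] = sum(s * s for s in samples) / len(samples)
    print(f"step = {step:4.1f}  acceptance = {rates[step]:.2f}  "
          f"<x^2> = {x2_means[step]:.3f}")
```

For this potential the exact result is \(\langle x^2 \rangle = 1/\beta = 1\). Very small steps are accepted almost always but explore slowly, while very large steps are mostly rejected, so the acceptance rate alone is not a sufficient diagnostic.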

Exercise: Implement a 1D Lennard-Jones Metropolis MC sampler; compare step sizes via acceptance rate and autocorrelation. Solve 1D steady-state diffusion with FDM and check against the analytic solution.
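
For the finite-difference part, a minimal sketch is given below. It solves steady-state diffusion with a uniform source, \(D\,c'' + s = 0\), on \([0, L]\) with \(c(0) = c(L) = 0\), using central differences and a tridiagonal (Thomas) solver. The source term, boundary values, grid size, and reduced units are assumptions chosen so that the analytic solution \(c(x) = s\,x\,(L - x)/(2D)\) is available for comparison.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system (no pivoting).

    lower[i] multiplies x[i-1] and upper[i] multiplies x[i+1] in row i;
    lower[0] and upper[-1] are unused.
    """
    n = len(diag)
    c_prime, d_prime = [0.0] * n, [0.0] * n
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


# Toy problem in reduced units (all values are illustrative assumptions).
n, length, diffusivity, source = 49, 1.0, 1.0, 1.0
h = length / (n + 1)  # grid spacing between the n interior nodes

# Central stencil: (c[i-1] - 2 c[i] + c[i+1]) / h^2 = -s / D.
lower = [0.0] + [1.0] * (n - 1)
diag = [-2.0] * n
upper = [1.0] * (n - 1) + [0.0]
rhs = [-source * h * h / diffusivity] * n

numeric = solve_tridiagonal(lower, diag, upper, rhs)
xs = [(i + 1) * h for i in range(n)]
analytic = [source * x * (length - x) / (2.0 * diffusivity) for x in xs]
max_error = max(abs(a - b) for a, b in zip(numeric, analytic))
print(f"max |numeric - analytic| = {max_error:.2e}")
```

Because the exact solution is a quadratic, the second-order central stencil reproduces it up to round-off; a non-polynomial source would show the expected \(O(h^2)\) discretization error instead.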

4.2 Unit II — Representations of Materials (Weeks 6, 8)

(Aligned with early neural networks in MFML)

4.2.1 Week 6 – Graph-based crystal representations (19.05.2026)

Slides: Open

Brief recap of classical descriptors (Magpie, matminer, RDF) as the historical front-end — 10–15 min motivation only
Crystals as graphs: nodes, edges, periodic boundary conditions
Message passing intuition; CGCNN, MEGNet, ALIGNN
Invariance and equivariance — the conceptual leap toward MACE / M3GNet (treated in Week 8)

Summary: Students learn why graph representations are the dominant ML interface to crystalline materials in 2025 and how the message-passing pattern turns local atomic neighbourhoods into property-relevant embeddings. Classical descriptors are revisited only as background — the lecture focuses on representational choices (cutoffs, node/edge features, equivariance) that determine downstream model behaviour.

Exercise: Construct, visualise, and benchmark graph representations of crystal structures against a Magpie/random-forest baseline.
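
The step from a periodic crystal to a graph can be sketched as a neighbour search over periodic images. The function below is a simplified illustration, not the construction used by any particular library: it searches only the 27 nearest images, which assumes the cutoff does not exceed the cell dimensions. The CsCl-type toy cell with unit lattice parameter is likewise an assumption.

```python
import itertools
import math


def periodic_edges(lattice, frac_coords, cutoff):
    """Return edges (i, j, image, distance) for all atom pairs within cutoff.

    lattice holds the three lattice vectors as rows; frac_coords holds
    fractional atomic positions. Only the 27 nearest images are searched.
    """
    def to_cartesian(frac):
        return [sum(frac[k] * lattice[k][d] for k in range(3)) for d in range(3)]

    edges = []
    for i, fi in enumerate(frac_coords):
        ri = to_cartesian(fi)
        for j, fj in enumerate(frac_coords):
            for image in itertools.product((-1, 0, 1), repeat=3):
                if i == j and image == (0, 0, 0):
                    continue  # skip the atom itself
                rj = to_cartesian([fj[k] + image[k] for k in range(3)])
                dist = math.dist(ri, rj)
                if dist <= cutoff:
                    edges.append((i, j, image, dist))
    return edges


# CsCl-type toy cell with lattice parameter 1 (reduced units, an assumption).
lattice = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frac_coords = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
edges = periodic_edges(lattice, frac_coords, cutoff=0.9)
degree = [sum(1 for e in edges if e[0] == i) for i in range(len(frac_coords))]
print(f"{len(edges)} directed edges, node degrees = {degree}")
```

Each atom ends up with eight neighbours at distance \(\sqrt{3}/2\), all of them reached through periodic images. Changing the cutoff changes the graph, which is why the cutoff is one of the representational choices discussed in this lecture.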

4.2.2 Week 7 – No lecture (26.05.2026, public holiday)

Cancelled — no MG lecture takes place on 26.05.2026. Week 7’s planned content (local atomic environments) is consolidated into Week 8.

4.2.3 Week 8 – Local environments & universal ML force fields (02.06.2026)

Slides: Open

Local atomic environments: coordination motifs, Voronoi tessellations, ACSF / SOAP fingerprints
Equivariance under rotations, translations, permutations — why it matters
From Behler–Parrinello and GAP to MACE, M3GNet, CHGNet, EquiformerV2
Universal MLIPs as foundation models for materials: MACE-MP-0, MatterSim, ORB
Plugging these surrogates back into the MD/MC engines from Week 5

Summary: Local-environment descriptors and universal ML interatomic potentials are treated as one continuous arc: SOAP / ACSF as the historical bridge, MACE-family equivariant message passing as the current state of the art. By the end students can describe how a single trained model now serves as a near-DFT-accuracy energy/force engine across most of the periodic table, and how that capability reshapes the high-throughput simulation pipeline.

Exercise: Compute SOAP vectors on a small dataset; run a MACE-MP-0 single-point energy and force evaluation and contrast accuracy / runtime with DFT.
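
Why invariance matters can be checked on the simplest local descriptor. The sketch below implements a single radial symmetry function of the Behler–Parrinello (ACSF) type and verifies that it is unchanged under rotation, translation, and permutation of the neighbours. The neighbour coordinates and the parameters `eta`, `r_shift`, and `r_cut` are arbitrary illustrative values.

```python
import math


def cutoff_function(r, r_cut):
    """Cosine cutoff: decays smoothly to zero at r_cut."""
    return 0.5 * (math.cos(math.pi * r / r_cut) + 1.0) if r < r_cut else 0.0


def radial_acsf(center, neighbours, eta, r_shift, r_cut):
    """Radial symmetry function: sum_j exp(-eta (r_ij - r_s)^2) f_c(r_ij)."""
    total = 0.0
    for pos in neighbours:
        r = math.dist(center, pos)
        total += math.exp(-eta * (r - r_shift) ** 2) * cutoff_function(r, r_cut)
    return total


def rotate_z(pos, angle):
    """Rotate a 3D point about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = pos
    return (c * x - s * y, s * x + c * y, z)


def translate(pos, shift):
    return tuple(p + s for p, s in zip(pos, shift))


# Arbitrary toy environment (coordinates are made up for illustration).
center = (0.0, 0.0, 0.0)
neighbours = [(1.0, 0.2, -0.3), (-0.5, 1.1, 0.4), (0.3, -0.8, 1.5), (2.0, 2.0, 2.0)]
params = dict(eta=1.0, r_shift=1.0, r_cut=3.0)
shift = (0.5, -1.0, 2.0)

g_original = radial_acsf(center, neighbours, **params)
g_rotated = radial_acsf(center, [rotate_z(p, 0.7) for p in neighbours], **params)
g_permuted = radial_acsf(center, neighbours[::-1], **params)
g_translated = radial_acsf(translate(center, shift),
                           [translate(p, shift) for p in neighbours], **params)
print(g_original, g_rotated, g_permuted, g_translated)
```

All four values agree to round-off because the descriptor depends only on interatomic distances and sums over neighbours. Equivariant models such as MACE generalise this idea to features that transform predictably under rotation instead of staying constant.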

4.3 Unit III — Learning Structure–Property Relations (Weeks 9–10)

4.3.1 Week 9 – Regression and generalization in materials data (09.06.2026)

Slides: Open

Predicting bandgaps, elastic moduli, formation energies
Bias–variance trade-off; composition vs structure splits; chemical-space leakage
Out-of-distribution detection and OOD-aware metrics
Dataset size vs model complexity; matbench-style protocols

Summary: This week reframes materials-property prediction as a generalization problem rather than a leaderboard problem. Students compare baseline regressors, grouped chemistry-aware validation schemes, and error metrics, and they learn why split design matters more than a small gain in test accuracy.

Exercise: Compare linear, random-forest, and graph-neural-network regressors under composition-aware splits.
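
A composition-aware split can be sketched by grouping entries by chemical system before splitting. The helper names `chemical_system` and `grouped_split` are hypothetical, the formula list is a toy example, and the regular expression assumes plain formulas without parentheses or charges.

```python
import random
import re
from collections import defaultdict


def chemical_system(formula):
    """Map a formula such as 'Fe2O3' to its sorted element set, e.g. 'Fe-O'."""
    return "-".join(sorted(set(re.findall(r"[A-Z][a-z]?", formula))))


def grouped_split(formulas, test_fraction, seed=0):
    """Split indices so that no chemical system appears in both train and test.

    test_fraction applies to the number of groups, not the number of samples.
    """
    groups = defaultdict(list)
    for idx, formula in enumerate(formulas):
        groups[chemical_system(formula)].append(idx)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(test_fraction * len(keys)))
    test = [i for k in keys[:n_test] for i in groups[k]]
    train = [i for k in keys[n_test:] for i in groups[k]]
    return train, test


# Toy list of formulas (illustrative only, no property values attached).
formulas = ["Fe2O3", "FeO", "Fe3O4", "TiO2", "Ti2O3",
            "NaCl", "MgO", "Al2O3", "SiO2", "LiF"]
train, test = grouped_split(formulas, test_fraction=0.3)
train_systems = {chemical_system(formulas[i]) for i in train}
test_systems = {chemical_system(formulas[i]) for i in test}
print("train systems:", sorted(train_systems))
print("test systems: ", sorted(test_systems))
```

With a purely random split, the three iron oxides could land on both sides and inflate the test score; the grouped split keeps each chemical system entirely on one side.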

4.3.2 Week 10 – Neural networks for materials properties (16.06.2026)

Slides: Open

Neural networks as surrogates for DFT-level properties
Training pitfalls: data leakage, imbalance, extrapolation
Transfer learning and scaling laws in materials NNs
Interpretability challenges

Summary: Neural networks are introduced as flexible surrogate models for non-linear structure–property relationships under realistic data constraints. The lecture emphasises optimization stability, regularization, and extrapolation failure modes — the main challenge is trustworthy use rather than raw expressive power.

Exercise: Train a small neural network and analyse its generalization behaviour, with and without pre-training transfer.

4.4 Unit IV — Latent Spaces, Generative Models, Uncertainty (Weeks 11–13)

4.4.1 Week 11 – Representation learning & latent spaces (23.06.2026)

Slides: Open · also incorporates content from Latent Spaces deck

Learned vs engineered features; transferability across chemical systems
Autoencoders, VAEs, contrastive embeddings
Latent geometry, anomaly detection, interpolation; what networks learn about chemistry and structure

Summary: This merged session covers representation learning end-to-end: from autoencoders and contrastive losses to the latent-space use cases (interpolation, anomaly detection, transferability auditing) that motivate them. Students examine when smooth latent maps are scientifically meaningful and when they are merely a visualization artefact.

Exercise: Train an autoencoder; compare raw-descriptor vs learned-embedding downstream regression performance and inspect latent structure.

4.4.2 Week 12 – Generative models & inverse design (30.06.2026)

Slides: Open

Why generative models for crystals: from forward prediction to inverse design
Diffusion-based methods (MatterGen, DiffCSP, CDVAE), flow matching (FlowMM), autoregressive (CrystaLLM)
Conditioning on properties, symmetries, and synthesis constraints
Evaluation: validity, novelty, uniqueness, S.U.N. (stable, unique, novel) metric, downstream DFT screening

Summary: Generative crystal models are the headline development of 2023–2025 in materials genomics. The lecture introduces the dominant families (diffusion, flow matching, autoregressive), how property and constraint conditioning are imposed, and how candidate structures are filtered downstream by force-field relaxation and DFT screening.

Exercise: Sample candidate structures from a pretrained generative model conditioned on a target property; rank with a MACE-MP-0 relaxation and an uncertainty estimate.

4.4.3 Week 13 – Uncertainty-aware discovery & active learning (07.07.2026)

Slides: Open

Aleatoric vs epistemic uncertainty; calibration
Gaussian Processes as the small-data gold standard; deep ensembles and evidential learning at scale
Active learning loops for materials screening; exploration vs exploitation
Cluster-based attention and outlier triage (the old “clustering” thread, compressed)

Summary: Students learn how uncertainty quantification turns a static surrogate into an active discovery engine. GPs remain the small-data reference; deep ensembles and evidential heads dominate at production scale. The lecture closes the discovery loop started in Weeks 6–12: representation → model → uncertainty → next experiment.

Exercise: Run a small active-learning loop on a screening task. Compare GP, deep-ensemble, and random-baseline acquisition strategies.
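
The exploration-vs-exploitation trade-off can be illustrated with a single acquisition function. The sketch below uses the upper confidence bound (UCB), one common choice among several, on placeholder predictions. The candidate names, means, and standard deviations are made-up numbers and do not come from any model or dataset.

```python
def upper_confidence_bound(mean, std, kappa):
    """UCB acquisition score for a maximisation problem: mean + kappa * std."""
    return mean + kappa * std


def select_next(candidates, kappa):
    """Pick the candidate with the highest UCB score."""
    return max(candidates,
               key=lambda c: upper_confidence_bound(c["mean"], c["std"], kappa))


# Placeholder surrogate predictions (made-up numbers, not real materials data).
candidates = [
    {"name": "A", "mean": 1.00, "std": 0.05},
    {"name": "B", "mean": 0.80, "std": 0.40},
    {"name": "C", "mean": 0.20, "std": 0.90},
]

for kappa in (0.0, 1.0, 3.0):
    choice = select_next(candidates, kappa)
    print(f"kappa = {kappa}: next candidate {choice['name']}")
```

With `kappa = 0` the loop exploits the best predicted mean; increasing `kappa` shifts the choice toward candidates whose prediction is uncertain. In a full active-learning loop, the mean and standard deviation would come from a GP or a deep ensemble and be updated after each new measurement.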

4.5 Unit V — Constraints, Trust, and Synthesis (Week 14)

4.5.1 Week 14 – Physical constraints, limits, and outlook (14.07.2026)

Slides: Open

Stability, charge neutrality, and symmetry constraints
Physics-informed learning in materials discovery
What ML can and cannot discover
Integration with experimental workflows

Summary: The final lecture consolidates the course around scientific trust: physical constraints, explainability, reproducibility, and the limits of data-driven discovery. Students leave with a realistic view of how ML can accelerate materials research when it is embedded in experimental and simulation workflows rather than treated as a replacement for them.

Exercise: Mini-project synthesis and presentation.
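
As one concrete instance of the physical constraints discussed in this lecture, a charge-neutrality filter can be sketched as an enumeration over allowed oxidation states. The table below is a small illustrative subset and an assumption; the function name `neutral_assignments` is hypothetical, and each element is assumed to take a single oxidation state.

```python
import itertools

# Small illustrative subset of common oxidation states (an assumption; real
# screening tools use much larger curated tables).
OXIDATION_STATES = {
    "Na": (1,),
    "Cl": (-1,),
    "O": (-2,),
    "Fe": (2, 3),
    "Ti": (2, 3, 4),
}


def neutral_assignments(composition):
    """Return all oxidation-state assignments that make the formula neutral.

    composition maps element symbols to stoichiometric counts; each element
    is assumed to take a single oxidation state (no mixed valence).
    """
    elements = list(composition)
    solutions = []
    for states in itertools.product(*(OXIDATION_STATES[el] for el in elements)):
        charge = sum(q * composition[el] for q, el in zip(states, elements))
        if charge == 0:
            solutions.append(dict(zip(elements, states)))
    return solutions


for comp in ({"Fe": 2, "O": 3}, {"Ti": 1, "O": 2}, {"Na": 1, "Cl": 2}, {"Fe": 3, "O": 4}):
    print(comp, "->", neutral_assignments(comp))
```

The hypothetical NaCl\(_2\) is rejected, as intended. Fe\(_3\)O\(_4\) is rejected as well, although it is a real mixed-valence compound, which shows how a simplified constraint can wrongly exclude valid materials.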

5 Learning Outcomes

Students completing this course will be able to:

Explain how simulation methods generate materials data and introduce bias.
Navigate and interrogate major materials databases.
Represent crystal structures using descriptors, graphs, and learned embeddings.
Train and evaluate ML models for predicting materials properties.
Understand latent spaces and their role in materials discovery.
Quantify and interpret uncertainty in materials predictions.
Apply ML responsibly to accelerate materials screening.
Critically assess the limits of data-driven materials discovery.