Materials Genomics

Computational Materials Discovery

Authors

Philipp Pelz

Stefan Hiemer

Published

May 13, 2026

Abstract

This course introduces students to materials genomics, treating the periodic table and the space of known crystal structures as a searchable, computable design space. Students learn how materials databases are built, how simulation methods generate materials data, how atomic structure is represented numerically, how structure–property relationships are learned using machine learning, and how uncertainty-aware models enable accelerated materials discovery.

Keywords

Materials Science, Machine Learning, Computational Materials Discovery, Materials Databases, Crystal Structure

ECLIPSE Lab Teaching

Computational materials discovery through databases, simulation, structure representations, and machine learning.

Semester: Summer Semester 2026
Format: 2h lecture
Credits: 5 ECTS
Audience: Students interested in materials discovery, simulation, and AI-driven design
Prerequisites (helpful): Mathematical Foundations of AI & ML and basic materials science
How to use this course site. Use this page as the central hub for syllabus, lecture structure, reading, notebooks, and course materials. Formal announcements and enrollment remain on StudOn; code and openly shared resources live in the linked GitHub repository.

1 Instructors

  • Philipp Pelz
  • Stefan Hiemer

2 Course Information

5th Semester – 5 ECTS · 2h lecture
Coordinated with “Mathematical Foundations of AI & ML” (MFML) and “ML for Materials Processing & Characterization” (ML-PC)


3 Course Philosophy

Materials genomics views the periodic table and all known crystal structures as a high-dimensional design space.

In this course, students learn to:

  • understand how materials data is generated by simulations and experiments,
  • treat materials data as a structured, learnable representation space,
  • move beyond classical descriptors toward learned representations,
  • use ML models as surrogates for quantum-mechanical and continuum simulations,
  • reason about uncertainty, stability, and discovery,
  • understand how computational screening integrates with experiments.

The course explicitly builds on MFML:

  • PCA and regression are assumed background,
  • neural networks, representation learning, and uncertainty are used, not re-derived.

4 Week-by-Week Curriculum (14 weeks)

Note: Teaching responsibility

Weeks 2, 3, 4, and 6 of Materials Genomics will be taught by Stefan Hiemer.

4.1 Unit I — Where Materials Data Comes From (Weeks 1–5)

4.1.1 Week 1 – What is Materials Genomics? (14.04.2026)

Slides: Open

  • Why the data backbone of materials genomics is ultimately quantum: bulk properties originate at atomic scales and are encoded in simulation methods (DFT, MD) that presuppose QM.
  • Failures of classical physics: blackbody radiation and the UV catastrophe, photoelectric effect, Compton scattering, Taylor’s feeble-light double slit.
  • Evolution of atomic models: Thomson → Rutherford → Bohr; de Broglie matter waves; Davisson–Germer diffraction; Stern–Gerlach and electron spin.
  • Schrödinger’s equation, the hydrogen atom, Born’s probabilistic interpretation of \(|\psi|^2\).
  • Postulates of QM: Hilbert spaces, Hermitian operators, measurement outcomes as eigenvalues, bra-ket notation.

Summary: The opening lecture builds the quantum-mechanical foundation that all later simulation methods in this course (DFT, MD, electronic-structure descriptors) presuppose. It traces the historical arc from the hubris of late-19th-century classical physics through the experiments that forced quantization (blackbody radiation, the photoelectric effect, matter-wave diffraction, Stern–Gerlach) to Schrödinger’s wave-mechanical eigenvalue problem and Born’s probabilistic reading. The lecture closes with the operator/Hilbert-space postulates of QM that underpin every electronic-structure dataset students will later encounter in materials-genomics databases.

Exercise: Explore Materials Project; query bandgaps, formation energies, symmetries.


4.1.2 Week 2 – QM postulates, solvable systems, multi-electron atoms (21.04.2026)

Slides: Open

  • Postulates 5–6: state collapse and Schrödinger time evolution
  • Operators and orthogonal decomposition; expectation values and variance
  • Approximation methods: variational principle, perturbation theory
  • Analytically solvable systems: free particle, harmonic oscillator (1D + \(d\)-dim), infinite / finite well
  • Multi-electron atoms: helium, indistinguishability, Pauli antisymmetry, exchange
  • Molecular Hamiltonian → Born–Oppenheimer → Hartree product → Slater determinants → LCAO

Summary:

  • Closes the QM theory introduced in Week 1 (operators, Hamiltonian formalism, postulates 5–6)
  • Approximation toolkit that every later quantum-chemistry method relies on
  • Walks through every analytically solvable system as a sanity check
  • Pivots from one-electron to many-electron physics: exchange, Slater determinants
  • Sets up the molecular electronic-structure problem and the LCAO ansatz used in Week 3

Exercise: Solve the 1D harmonic oscillator with ladder operators; compute expectation values and verify orthogonality.
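
The ladder-operator approach in this exercise can be sketched numerically. The snippet below is a minimal illustration, not a reference solution: it assumes dimensionless units (\(\hbar = m = \omega = 1\)), a truncated number basis whose size `n_levels` is a hypothetical choice, and helper names (`ladder_operators`, `expectation`) introduced only for this example.

```python
import numpy as np

def ladder_operators(n_levels):
    """Return (a, a_dag) in the truncated number basis |0>, ..., |n_levels-1>."""
    # a|n> = sqrt(n)|n-1>, so sqrt(1), sqrt(2), ... sit on the first superdiagonal.
    a = np.diag(np.sqrt(np.arange(1, n_levels)), k=1)
    return a, a.T

def expectation(operator, state):
    """Expectation value <state|operator|state> for a real, normalised state."""
    return float(state @ operator @ state)

n_levels = 12  # hypothetical truncation size
a, a_dag = ladder_operators(n_levels)

# Dimensionless Hamiltonian H = a_dag a + 1/2 (energies in units of hbar * omega).
hamiltonian = a_dag @ a + 0.5 * np.eye(n_levels)
energies, states = np.linalg.eigh(hamiltonian)

# Dimensionless position operator x = (a + a_dag) / sqrt(2).
x = (a + a_dag) / np.sqrt(2.0)
x_squared = x @ x

# Orthogonality check: the overlap matrix of the eigenvectors.
overlaps = states.T @ states

ground = states[:, 0]
mean_x = expectation(x, ground)
variance_x = expectation(x_squared, ground) - mean_x**2

print("lowest energies:", energies[:4])
print("ground-state <x>, Var(x):", mean_x, variance_x)
```

In this basis the energies come out as \(n + 1/2\) and the ground-state variance of \(x\) as \(1/2\). Truncation breaks the commutator \([a, a^\dagger] = 1\) in the highest level, so expectation values of \(x^2\) in the last retained state should not be trusted.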


4.1.3 Week 3 – Quantum chemistry methods (HF, MP, CC, DFT) (28.04.2026)

Slides: Open

  • Basis sets: Slater- vs Gaussian-type orbitals; STO-nG, 6-31G(d), cc-pVnZ; basis-set superposition error
  • Hartree–Fock: SCF cycle, Roothaan–Hall equations, restricted vs unrestricted
  • What HF misses: electron correlation
  • Møller–Plesset perturbation theory (MP2, MPn)
  • Coupled cluster (CCSD, CCSD(T)) — the gold standard for small molecules
  • Density Functional Theory: Hohenberg–Kohn theorems, Kohn–Sham equations
  • Exchange-correlation functionals: LDA → GGA → hybrid → double-hybrid (Jacob’s ladder)
  • Cost vs accuracy hierarchy; choosing a method for materials databases

Summary:

  • Numerical methods for solving the multi-electron Schrödinger equation set up in Week 2
  • Orbital basis sets and what makes a basis “good”
  • Hartree–Fock as the entry-level method; correlation as the missing ingredient
  • Post-HF (MP, CC) as systematic correlation hierarchies
  • DFT as the workhorse for periodic / large systems; the XC-functional ladder
  • Method-vs-cost decision logic — Materials Project and OQMD use mostly GGA-DFT

Exercise: Compare HF, MP2, and DFT (LDA/GGA/B3LYP) energies on H\(_2\) / H\(_2\)O; observe correlation contribution and basis-set convergence.


4.1.4 Week 4 – Thermodynamics, statistical mechanics & classical atomistic simulation (05.05.2026)

Slides: Open

  • Macroscopic thermodynamics: state variables, four laws, thermodynamic potentials \(U\), \(H\), \(F\), \(G\)
  • Ideal gas: empirical (Boyle, Charles, Avogadro) → equipartition → Maxwell–Boltzmann distribution
  • Statistical mechanics: microstates, ensembles (NVE, NVT, NPT, \(\mu\)VT), Boltzmann distribution, partition function \(Z\)
  • Interatomic potentials: pair (LJ, Morse), EAM for metals, bond-order (Tersoff, REBO, ReaxFF)
  • Brief preview: ML force fields (Behler–Parrinello, GAP, NequIP, MACE)
  • Static simulations: energy minimization (steepest descent, conjugate gradient, FIRE)
  • Molecular Dynamics: Verlet integrator, thermostats (Nosé–Hoover, Langevin), barostats, periodic BCs, RDF / MSD / diffusion

Summary:

  • Pivots from electronic-structure (Weeks 1–3) to finite-temperature physics
  • Thermodynamics framework: state variables, free energies, equilibrium concepts
  • Statistical mechanics: ensembles, partition function, Boltzmann distribution
  • Classical atomistic simulation as the bridge from QM to materials-scale data
  • MD for time-dependent observables; the Monte Carlo counterpart follows in Week 5

Exercise: Run a small MD simulation in an NVT ensemble; compute the radial distribution function and self-diffusion coefficient.
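
As a warm-up for the MD exercise, the Verlet integrator from the topic list can be sketched on the simplest possible system. This is a minimal illustration under stated assumptions: a single 1D harmonic oscillator with unit mass and spring constant, the velocity-Verlet variant, and no thermostat (so it samples NVE, not the NVT ensemble the exercise asks for). The function names and parameter values are hypothetical.

```python
import numpy as np

def force(x, k=1.0):
    """Harmonic restoring force F = -k x."""
    return -k * x

def velocity_verlet(x0, v0, dt, n_steps, mass=1.0, k=1.0):
    """Integrate a 1D harmonic oscillator; return positions and total energies."""
    x, v = x0, v0
    f = force(x, k)
    positions = np.empty(n_steps + 1)
    energies = np.empty(n_steps + 1)
    positions[0] = x
    energies[0] = 0.5 * mass * v**2 + 0.5 * k * x**2
    for step in range(1, n_steps + 1):
        v_half = v + 0.5 * dt * f / mass   # half kick with the old force
        x = x + dt * v_half                # drift
        f = force(x, k)                    # force at the new position
        v = v_half + 0.5 * dt * f / mass   # half kick with the new force
        positions[step] = x
        energies[step] = 0.5 * mass * v**2 + 0.5 * k * x**2
    return positions, energies

dt, n_steps = 0.01, 10_000  # hypothetical time step and trajectory length
positions, energies = velocity_verlet(x0=1.0, v0=0.0, dt=dt, n_steps=n_steps)
relative_drift = np.max(np.abs(energies - energies[0])) / energies[0]
print("max relative energy deviation:", relative_drift)
```

The total energy oscillates within a small bounded band rather than drifting, which is the practical reason symplectic integrators of this family are the default in MD codes. Repeating the run with a larger `dt` shows how the band widens.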


4.1.5 Week 5 – Monte Carlo sampling & continuum mechanics (12.05.2026)

Slides: Open

  • Monte Carlo: importance sampling, Markov chains, detailed balance, Metropolis algorithm, ergodicity diagnostics
  • MC move catalogue (displacement, swap, volume, insertion/removal, hybrid MC); ensembles beyond NVT
  • MC vs MD comparison; kinetic Monte Carlo as a brief outlook
  • Continuum mechanics: 1D balance law, transient continuity equation, constitutive relations (Fick, Fourier, Darcy), isotropy/anisotropy
  • Finite Difference Method: Taylor stencils, forward/backward/central differences, 1D Laplace example, BCs, CFL stability
  • Finite Element / Finite Volume methods: weak form, shape functions, divergence-theorem flux balance — when and why to use them
  • Where ML plugs in: neural samplers, neural PDE solvers, MLIP-powered MD/MC

Summary: This lecture closes the simulation half of MG. Monte Carlo joins MD as the second standard sampler of the Boltzmann distribution and unlocks species swaps, \(\mu\)VT, and barrier-crossing moves that MD cannot do. Continuum mechanics then jumps two scale rungs up: balance laws turn into PDEs, which we solve numerically through finite differences (with FEM and FV sketched as the production-grade alternatives). Together with Weeks 1–4, Unit I is now complete and the rest of the course builds on its data generators.
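
The Metropolis acceptance rule behind this summary can be sketched in a few lines. The example is a minimal illustration, not the exercise solution: it assumes a 1D harmonic potential instead of Lennard-Jones, symmetric uniform trial displacements, and hypothetical values for `step_size`, `kT`, and the chain length.

```python
import numpy as np

def harmonic(x):
    """Potential energy U(x) = x^2 / 2 (unit spring constant)."""
    return 0.5 * x**2

def metropolis(potential, x0, n_steps, step_size, kT, rng):
    """Sample exp(-U/kT) with symmetric uniform trial displacements."""
    x = x0
    energy = potential(x)
    samples = np.empty(n_steps)
    n_accepted = 0
    for i in range(n_steps):
        trial = x + rng.uniform(-step_size, step_size)
        delta = potential(trial) - energy
        # Accept with probability min(1, exp(-delta / kT)).
        if delta <= 0.0 or rng.random() < np.exp(-delta / kT):
            x = trial
            energy += delta
            n_accepted += 1
        samples[i] = x  # a rejected move repeats the old state
    return samples, n_accepted / n_steps

rng = np.random.default_rng(0)
samples, acceptance = metropolis(harmonic, x0=0.0, n_steps=50_000,
                                 step_size=2.0, kT=1.0, rng=rng)
burn_in = 1_000
mean_x2 = float(np.mean(samples[burn_in:] ** 2))
print("acceptance rate:", acceptance)
print("<x^2> (equipartition predicts kT / k = 1):", mean_x2)
```

Because the trial move is symmetric, the acceptance rule alone enforces detailed balance. Counting rejected moves as repeated samples is essential: dropping them biases the estimate.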

Exercise: Implement a 1D Lennard-Jones Metropolis MC sampler; compare step sizes via acceptance rate and autocorrelation. Solve 1D steady-state diffusion with FDM and check against the analytic solution.
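
For the finite-difference part, one possible setup is sketched below. It assumes steady-state diffusion \(-D\,u'' = s\) on \([0, L]\) with a constant source \(s\) and Dirichlet boundary values; the function name, the dense-matrix solve, and all parameter values are hypothetical choices made for readability rather than efficiency.

```python
import numpy as np

def solve_steady_diffusion(n_nodes, length, diffusivity, source, left_value, right_value):
    """Solve -D u'' = s with Dirichlet BCs using second-order central differences."""
    x = np.linspace(0.0, length, n_nodes)
    h = x[1] - x[0]
    n_inner = n_nodes - 2
    # Tridiagonal stencil (-1, 2, -1) * D / h^2 acting on the interior nodes.
    matrix = (np.diag(2.0 * np.ones(n_inner))
              - np.diag(np.ones(n_inner - 1), k=1)
              - np.diag(np.ones(n_inner - 1), k=-1)) * diffusivity / h**2
    rhs = np.full(n_inner, float(source))
    # Known boundary values move to the right-hand side.
    rhs[0] += diffusivity * left_value / h**2
    rhs[-1] += diffusivity * right_value / h**2
    u = np.empty(n_nodes)
    u[0], u[-1] = left_value, right_value
    u[1:-1] = np.linalg.solve(matrix, rhs)
    return x, u

length, diffusivity, source = 1.0, 0.5, 2.0   # hypothetical values
left_value, right_value = 1.0, 0.0
x, u = solve_steady_diffusion(51, length, diffusivity, source, left_value, right_value)

# Analytic solution: linear profile plus a parabola from the constant source.
analytic = (left_value + (right_value - left_value) * x / length
            + source / (2.0 * diffusivity) * x * (length - x))
max_error = float(np.max(np.abs(u - analytic)))
print("max |u_fdm - u_analytic|:", max_error)
```

The central-difference stencil is exact for quadratic profiles, so the error here sits at round-off level. A position-dependent source is a better test of the expected second-order convergence.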


4.2 Unit II — Representations of Materials (Weeks 6, 8)

(Aligned with early neural networks in MFML)

4.2.1 Week 6 – Graph-based crystal representations (19.05.2026)

Slides: Open

  • Brief recap of classical descriptors (Magpie, matminer, RDF) as the historical front-end — 10–15 min motivation only
  • Crystals as graphs: nodes, edges, periodic boundary conditions
  • Message passing intuition; CGCNN, MEGNet, ALIGNN
  • Invariance and equivariance — the conceptual leap toward MACE / M3GNet (treated in Week 8)

Summary: Students learn why graph representations are the dominant ML interface to crystalline materials in 2025 and how the message-passing pattern turns local atomic neighbourhoods into property-relevant embeddings. Classical descriptors are revisited only as background — the lecture focuses on representational choices (cutoffs, node/edge features, equivariance) that determine downstream model behaviour.

Exercise: Construct, visualise, and benchmark graph representations of crystal structures against a Magpie/random-forest baseline.
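
A crystal graph with periodic boundary conditions can be sketched without any materials library. The snippet is a minimal illustration, assuming a distance-cutoff edge rule, a cutoff small enough that only the 27 nearest cell images matter, and a CsCl-type cell with a hypothetical lattice parameter of 4.0 Å. The function name `periodic_edges` is introduced for this example only.

```python
import itertools
from collections import Counter

import numpy as np

def periodic_edges(lattice, frac_coords, cutoff):
    """Directed edges (i, j, image, distance) under periodic boundary conditions.

    Rows of `lattice` are the lattice vectors. Assumes `cutoff` does not exceed
    the smallest cell height, so images in {-1, 0, 1}^3 are sufficient.
    """
    cart = frac_coords @ lattice
    edges = []
    for image in itertools.product((-1, 0, 1), repeat=3):
        shift = np.array(image) @ lattice
        for i, r_i in enumerate(cart):
            for j, r_j in enumerate(cart):
                if i == j and image == (0, 0, 0):
                    continue  # an atom is not its own neighbour
                distance = float(np.linalg.norm(r_j + shift - r_i))
                if distance <= cutoff:
                    edges.append((i, j, image, distance))
    return edges

a = 4.0  # hypothetical lattice parameter in Angstrom
lattice = a * np.eye(3)
frac_coords = np.array([[0.0, 0.0, 0.0],    # corner site
                        [0.5, 0.5, 0.5]])   # body-centre site
edges = periodic_edges(lattice, frac_coords, cutoff=3.6)
degree = dict(Counter(i for i, _, _, _ in edges))
print("edges:", len(edges), "degree per node:", degree)
```

Each site ends up with eight neighbours of the other species, and most of these edges cross a cell boundary. This is why the image offset has to be stored on the edge: the node indices alone do not identify which periodic copy is meant.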


4.2.2 Week 7 – No lecture (26.05.2026, public holiday)

Cancelled — no MG lecture takes place on 26.05.2026. Week 7’s planned content (local atomic environments) is consolidated into Week 8.


4.2.3 Week 8 – Local environments & universal ML force fields (02.06.2026)

Slides: Open

  • Local atomic environments: coordination motifs, Voronoi tessellations, ACSF / SOAP fingerprints
  • Equivariance under rotations, translations, permutations — why it matters
  • From Behler–Parrinello and GAP to MACE, M3GNet, CHGNet, EquiformerV2
  • Universal MLIPs as foundation models for materials: MACE-MP-0, MatterSim, ORB
  • Plugging these surrogates back into the MD/MC engines from Week 5

Summary: Local-environment descriptors and universal ML interatomic potentials are treated as one continuous arc: SOAP / ACSF as the historical bridge, MACE-family equivariant message passing as the current state of the art. By the end students can describe how a single trained model now serves as a near-DFT-accuracy energy/force engine across most of the periodic table, and how that capability reshapes the high-throughput simulation pipeline.

Exercise: Compute SOAP vectors on a small dataset; run a MACE-MP-0 single-point energy and force evaluation and contrast accuracy / runtime with DFT.
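
Before reaching for a SOAP library, the invariance idea can be checked on a much simpler descriptor. The sketch below assumes a Behler–Parrinello-style radial symmetry function with a cosine cutoff, an isolated cluster without periodic images, and hypothetical values for the widths `etas`, the shift `r_shift`, and the cutoff `r_cut`.

```python
import numpy as np

def cosine_cutoff(r, r_cut):
    """Smoothly switch neighbour contributions off at r_cut."""
    return np.where(r < r_cut, 0.5 * (np.cos(np.pi * r / r_cut) + 1.0), 0.0)

def radial_fingerprint(positions, center, etas, r_shift, r_cut):
    """Radial symmetry functions of atom `center` in an isolated cluster."""
    neighbours = np.delete(positions, center, axis=0)
    distances = np.linalg.norm(neighbours - positions[center], axis=1)
    weights = cosine_cutoff(distances, r_cut)
    return np.array([np.sum(np.exp(-eta * (distances - r_shift) ** 2) * weights)
                     for eta in etas])

rng = np.random.default_rng(0)
positions = rng.uniform(-3.0, 3.0, size=(8, 3))  # hypothetical 8-atom cluster
etas = (0.5, 1.0, 4.0)

# Random proper rotation from a QR decomposition, plus a random translation.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(q) < 0:
    q[:, 0] *= -1.0
shift = rng.normal(size=3)
# Permute the neighbours while keeping the central atom at index 0.
order = np.concatenate(([0], 1 + rng.permutation(7)))
transformed = (positions @ q.T + shift)[order]

reference = radial_fingerprint(positions, 0, etas, r_shift=2.0, r_cut=5.0)
moved = radial_fingerprint(transformed, 0, etas, r_shift=2.0, r_cut=5.0)
print("fingerprint before:", reference)
print("fingerprint after: ", moved)
```

The two fingerprints agree because the descriptor depends only on interatomic distances and sums over neighbours. Invariance is built in by construction; equivariant models instead carry directional features that rotate with the structure.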


4.3 Unit III — Learning Structure–Property Relations (Weeks 9–10)

4.3.1 Week 9 – Regression and generalization in materials data (09.06.2026)

Slides: Open

  • Predicting bandgaps, elastic moduli, formation energies
  • Bias–variance trade-off; composition vs structure splits; chemical-space leakage
  • Out-of-distribution detection and OOD-aware metrics
  • Dataset size vs model complexity; matbench-style protocols

Summary: This week reframes materials-property prediction as a generalization problem rather than a leaderboard problem. Students compare baseline regressors, grouped chemistry-aware validation schemes, and error metrics, and they learn why split design matters more than a small gain in test accuracy.

Exercise: Compare linear, random-forest, and graph-neural-network regressors under composition-aware splits.
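
A composition-aware split can be sketched with the standard library alone. The example assumes that "group" means chemical system (the set of elements in a formula), that formulas are simple strings without nested notation, and that holding out whole groups is the intended protocol. The helper names and the toy formula list are hypothetical.

```python
import random
import re

def chemical_system(formula):
    """Map a formula such as 'Fe2O3' to its chemical system 'Fe-O'."""
    return "-".join(sorted(set(re.findall(r"[A-Z][a-z]?", formula))))

def grouped_split(formulas, test_fraction=0.25, seed=0):
    """Hold out whole chemical systems so none appears in both train and test."""
    groups = sorted({chemical_system(f) for f in formulas})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(test_fraction * len(groups)))
    test_groups = set(groups[:n_test])
    train_idx = [i for i, f in enumerate(formulas)
                 if chemical_system(f) not in test_groups]
    test_idx = [i for i, f in enumerate(formulas)
                if chemical_system(f) in test_groups]
    return train_idx, test_idx

# Toy dataset: several Fe-O and Ti-O entries would leak under a random split.
formulas = ["Fe2O3", "FeO", "Fe3O4", "NaCl", "TiO2",
            "Ti2O3", "MgO", "SiO2", "GaAs", "LiF"]
train_idx, test_idx = grouped_split(formulas)
print("train:", [formulas[i] for i in train_idx])
print("test: ", [formulas[i] for i in test_idx])
```

With a purely random split, FeO could land in training and Fe2O3 in test, so the reported error would partly measure interpolation within one chemical system. Grouping removes that leakage at the cost of a noisier, usually higher, test error.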


4.3.2 Week 10 – Neural networks for materials properties (16.06.2026)

Slides: Open

  • Neural networks as surrogates for DFT-level properties
  • Training pitfalls: data leakage, imbalance, extrapolation
  • Transfer learning and scaling laws in materials NNs
  • Interpretability challenges

Summary: Neural networks are introduced as flexible surrogate models for non-linear structure–property relationships under realistic data constraints. The lecture emphasises optimization stability, regularization, and extrapolation failure modes — the main challenge is trustworthy use rather than raw expressive power.

Exercise: Train a small neural network and analyse its generalization behaviour, with and without pre-training transfer.


4.4 Unit IV — Latent Spaces, Generative Models, Uncertainty (Weeks 11–13)

4.4.1 Week 11 – Representation learning & latent spaces (23.06.2026)

Slides: Open · also incorporates content from Latent Spaces deck

  • Learned vs engineered features; transferability across chemical systems
  • Autoencoders, VAEs, contrastive embeddings
  • Latent geometry, anomaly detection, interpolation; what networks learn about chemistry and structure

Summary: This merged session covers representation learning end-to-end: from autoencoders and contrastive losses to the latent-space use cases (interpolation, anomaly detection, transferability auditing) that motivate them. Students examine when smooth latent maps are scientifically meaningful and when they are just a UX artefact.

Exercise: Train an autoencoder; compare raw-descriptor vs learned-embedding downstream regression performance and inspect latent structure.


4.4.2 Week 12 – Generative models & inverse design (30.06.2026)

Slides: Open

  • Why generative models for crystals: from forward prediction to inverse design
  • Diffusion-based methods (MatterGen, DiffCSP, CDVAE), flow matching (FlowMM), autoregressive (CrystaLLM)
  • Conditioning on properties, symmetries, and synthesis constraints
  • Evaluation: validity, novelty, uniqueness, S.U.N. (stable, unique, novel) metric, downstream DFT screening

Summary: Generative crystal models are the headline development of 2023–2025 in materials genomics. The lecture introduces the dominant families (diffusion, flow matching, autoregressive), how property and constraint conditioning are imposed, and how candidate structures are filtered downstream by force-field relaxation and DFT screening.

Exercise: Sample candidate structures from a pretrained generative model conditioned on a target property; rank with a MACE-MP-0 relaxation and an uncertainty estimate.


4.4.3 Week 13 – Uncertainty-aware discovery & active learning (07.07.2026)

Slides: Open

  • Aleatoric vs epistemic uncertainty; calibration
  • Gaussian Processes as the small-data gold standard; deep ensembles and evidential learning at scale
  • Active learning loops for materials screening; exploration vs exploitation
  • Cluster-based attention and outlier triage (condensed from the earlier clustering material)

Summary: Students learn how uncertainty quantification turns a static surrogate into an active discovery engine. GPs remain the small-data reference; deep ensembles and evidential heads dominate at production scale. The lecture closes the discovery loop started in Weeks 6–12: representation → model → uncertainty → next experiment.

Exercise: Run a small active-learning loop on a screening task. Compare GP, deep-ensemble, and random-baseline acquisition strategies.
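
The structure of such a loop can be sketched with a hand-written Gaussian process. This is a minimal illustration under several assumptions: a zero-mean GP with a unit-variance RBF kernel, fixed (not fitted) hyperparameters, pure maximum-variance acquisition, and a cheap 1D function `expensive_property` standing in for a real screening target. All names and values are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, length_scale):
    """Unit-variance squared-exponential kernel between 1D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)

def gp_posterior(x_train, y_train, x_query, length_scale=0.3, noise_var=1e-6):
    """Posterior mean and variance of a zero-mean GP at the query points."""
    k_train = rbf_kernel(x_train, x_train, length_scale) + noise_var * np.eye(len(x_train))
    k_cross = rbf_kernel(x_train, x_query, length_scale)
    chol = np.linalg.cholesky(k_train)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_cross.T @ alpha
    v = np.linalg.solve(chol, k_cross)
    variance = np.clip(1.0 - np.sum(v**2, axis=0), 0.0, None)
    return mean, variance

def expensive_property(x):
    """Hypothetical stand-in for a costly simulation or measurement."""
    return np.sin(3.0 * x)

pool = np.linspace(0.0, 2.0, 41)          # candidate "materials"
labelled = np.zeros(len(pool), dtype=bool)
labelled[[0, -1]] = True                   # start from the two end points

max_std_history = []
for _ in range(5):
    x_train = pool[labelled]
    mean, variance = gp_posterior(x_train, expensive_property(x_train), pool)
    max_std_history.append(float(np.sqrt(variance.max())))
    variance[labelled] = -np.inf           # never re-query a labelled candidate
    labelled[np.argmax(variance)] = True   # pure exploration: maximum variance

x_train = pool[labelled]
mean, variance = gp_posterior(x_train, expensive_property(x_train), pool)
final_max_std = float(np.sqrt(variance.max()))
print("max predictive std per round:", max_std_history, "->", final_max_std)
```

Maximum-variance acquisition only explores. Swapping the `argmax` target for an upper confidence bound or expected improvement adds the exploitation side of the trade-off, and replacing `np.argmax` with a random choice gives the random baseline for comparison.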


4.5 Unit V — Constraints, Trust, and Synthesis (Week 14)

4.5.1 Week 14 – Physical constraints, limits, and outlook (14.07.2026)

Slides: Open

  • Stability, charge neutrality, and symmetry constraints.
  • Physics-informed learning in materials discovery.
  • What ML can and cannot discover.
  • Integration with experimental workflows.

Summary: The final lecture consolidates the course around scientific trust: physical constraints, explainability, reproducibility, and the limits of data-driven discovery. Students leave with a realistic view of how ML can accelerate materials research when it is embedded in experimental and simulation workflows rather than treated as a replacement for them.

Exercise: Mini-project synthesis and presentation.
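
As one concrete instance of the constraints discussed this week, a charge-neutrality filter can be sketched as follows. The example assumes that every element takes a single oxidation state per compound and uses a small illustrative table of common oxidation states; the table and the function name are assumptions for this sketch, not a curated reference.

```python
import itertools

# Illustrative, deliberately incomplete table of common oxidation states.
COMMON_OXIDATION_STATES = {
    "Na": (1,), "Mg": (2,), "Fe": (2, 3), "Ti": (2, 3, 4),
    "O": (-2,), "Cl": (-1,),
}

def neutral_assignments(composition):
    """Return all oxidation-state assignments with zero net charge.

    `composition` maps element symbols to integer counts, e.g. {"Fe": 2, "O": 3}.
    """
    elements = list(composition)
    options = [COMMON_OXIDATION_STATES[element] for element in elements]
    results = []
    for states in itertools.product(*options):
        charge = sum(q * composition[element] for q, element in zip(states, elements))
        if charge == 0:
            results.append(dict(zip(elements, states)))
    return results

print("Fe2O3:", neutral_assignments({"Fe": 2, "O": 3}))
print("TiO2: ", neutral_assignments({"Ti": 1, "O": 2}))
print("NaCl2:", neutral_assignments({"Na": 1, "Cl": 2}))
```

An empty list flags a candidate that cannot be charge balanced under these assumptions. The single-state assumption also rejects genuine mixed-valence compounds such as Fe3O4, which shows why hard-coded constraints need to be audited as carefully as learned models.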


5 Learning Outcomes

Students completing this course will be able to:

  • Explain how simulation methods generate materials data and introduce bias.
  • Navigate and interrogate major materials databases.
  • Represent crystal structures using descriptors, graphs, and learned embeddings.
  • Train and evaluate ML models for predicting materials properties.
  • Understand latent spaces and their role in materials discovery.
  • Quantify and interpret uncertainty in materials predictions.
  • Apply ML responsibly to accelerate materials screening.
  • Critically assess the limits of data-driven materials discovery.