Materials Genomics

Computational Materials Discovery

Authors

Philipp Pelz

Stefan Hiemer

Published

March 23, 2026

Abstract

This course introduces students to materials genomics, treating the periodic table and the space of known crystal structures as a searchable, computable design space. Students learn how materials databases are built, how simulation methods generate materials data, how atomic structure is represented numerically, how structure–property relationships are learned using machine learning, and how uncertainty-aware models enable accelerated materials discovery.

Keywords

Materials Science, Machine Learning, Computational Materials Discovery, Materials Databases, Crystal Structure

ECLIPSE Lab Teaching

Computational materials discovery through databases, simulation, structure representations, and machine learning.

Semester
Summer Semester 2026
Format
2h lecture
Credits
5 ECTS
Audience
Students interested in materials discovery, simulation, and AI-driven design
Prerequisites
Helpful: Mathematical Foundations of AI & ML and basic materials science
How to use this course site. Use this page as the central hub for syllabus, lecture structure, reading, notebooks, and course materials. Formal announcements and enrollment remain on StudOn; code and openly shared resources live in the linked GitHub repository.

1 Instructors

  • Philipp Pelz
  • Stefan Hiemer

2 Course Information

5th Semester – 5 ECTS · 2h lecture
Coordinated with “Mathematical Foundations of AI & ML” (MFML) and
“ML for Materials Processing & Characterization” (ML-PC)


3 Course Philosophy

Materials genomics views the periodic table and all known crystal structures as a high-dimensional design space.

In this course, students learn to:

  • understand how materials data is generated by simulations and experiments,
  • treat materials data as a structured, learnable representation space,
  • move beyond classical descriptors toward learned representations,
  • use ML models as surrogates for quantum-mechanical and continuum simulations,
  • reason about uncertainty, stability, and discovery,
  • understand how computational screening integrates with experiments.

The course explicitly builds on MFML:

  • PCA and regression are assumed background,
  • neural networks, representation learning, and uncertainty are used, not re-derived.

4 Week-by-Week Curriculum (14 weeks)

Note: Teaching responsibility

Weeks 2, 3, 4, and 6 of Materials Genomics will be taught by Stefan Hiemer.

4.1 Unit I — Where Materials Data Comes From (Weeks 1–4)

4.1.1 Week 1 – What is Materials Genomics? (14.04.2026)

  • Genomics analogy: genes → functions vs atoms → properties.
  • Structure–property–processing paradigm from a structure-first viewpoint.
  • Materials databases as design spaces: Materials Project, OQMD, AFLOW, NOMAD.

Summary: This opening lecture frames materials genomics as a search problem over composition and crystal-structure space rather than a static catalog of known materials. Students learn how databases, simulation outputs, and screening logic combine into a discovery workflow, and why uncertainty, provenance, and scientific validation are necessary from the start.

Exercise:
Explore Materials Project; query bandgaps, formation energies, symmetries.


4.1.2 Week 2 – Simulation methods as data generators (21.04.2026)

  • Why simulations dominate materials data generation.
  • Simulation methods as mappings from assumptions to data.
  • Overview of scales and outputs:
    • FEM: continuum fields (stress, strain).
    • MD: trajectories, forces, diffusion.
    • MC: thermodynamic sampling.
    • DFT: energies, electronic structure.
  • Accuracy–cost–scale trade-offs and systematic biases.

Summary: This week introduces simulation methods as controlled data generators whose assumptions determine both the quality and the bias of downstream ML datasets. The core message is that FEM, MD, MC, and DFT do not just produce numbers; they define the representation, scale, and reliability of the materials-learning problem.

Exercise:
For selected materials properties, identify suitable simulation methods and expected biases.


4.1.3 Week 3 – Atomistic and electronic simulations (DFT, MD, MC) (28.04.2026)

  • Density Functional Theory: ground-state bias, exchange–correlation functionals, consistency vs accuracy.
  • Molecular Dynamics: force fields, time averaging, limitations of timescales.
  • Monte Carlo: phase-space sampling and thermodynamic averages.
  • What quantities in materials databases come directly from simulations.

Summary: Students examine how DFT, MD, and MC contribute complementary atomistic and electronic observables to materials databases. The lecture emphasizes that energies, forces, band-related quantities, and thermodynamic averages are meaningful only together with the approximations and metadata that produced them.

Exercise:
Inspect Materials Project entries; identify simulation assumptions and derived quantities.
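
As a warm-up for the sampling side of this week, the Metropolis Monte Carlo idea fits in a short sketch (a toy system, not a materials simulation: a 1-D harmonic potential U(x) = x²/2 at kT = 1, for which the exact thermal average is ⟨x²⟩ = 1):

```python
import numpy as np

rng = np.random.default_rng(2)

# Metropolis Monte Carlo for U(x) = x^2 / 2 at kT = 1.
# Thermodynamic averages emerge from accepted/rejected random moves.
x, kT, step = 0.0, 1.0, 1.0
samples = []
for _ in range(200_000):
    x_new = x + rng.uniform(-step, step)
    dU = 0.5 * (x_new**2 - x**2)
    if dU <= 0 or rng.random() < np.exp(-dU / kT):
        x = x_new                      # accept the move
    samples.append(x)                  # rejected moves repeat the old state

# Discard the first quarter as equilibration, then average.
x2 = float(np.mean(np.array(samples[50_000:]) ** 2))
```

The same accept/reject logic, with U replaced by a real interatomic energy model, is what produces the thermodynamic averages stored in databases.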


4.1.4 Week 4 – Continuum simulations, thermodynamics, and stability (05.05.2026)

  • FEM as a structure–property mapping at the continuum scale.
  • Constitutive models as implicit surrogates.
  • Formation energies, convex hulls, metastability.
  • Why “stable” does not imply “synthesizable”.

Summary: This lecture connects thermodynamic stability and continuum modeling to the ML representations used later in the course. Students learn why interpretable quantities such as formation energy, energy above hull, and constitutive response are useful targets, but also why stability, metastability, and synthesizability must not be conflated.

Exercise:
Analyze stability and simulated properties for a small materials system.
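
The convex-hull construction behind "energy above hull" can be sketched in a few lines for a binary system (illustrative formation energies, not real data): stability is distance to the lower convex hull of all competing phases.

```python
import numpy as np

def lower_hull(points):
    """Lower convex hull of (composition, formation-energy) points
    via the monotone-chain algorithm."""
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    hull = []
    for p in sorted(points):
        # pop points that would lie above the new lower-hull segment
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

def energy_above_hull(x, e_f, hull):
    """Vertical distance from (x, e_f) to the piecewise-linear hull."""
    xs, ys = zip(*hull)
    return float(e_f - np.interp(x, xs, ys))

# Toy A-B system: pure elements at x = 0 and 1, one stable compound at
# x = 0.5, and one metastable phase at x = 0.25.
phases = [(0.0, 0.0), (0.25, -0.2), (0.5, -1.0), (1.0, 0.0)]
hull = lower_hull(phases)
```

Here the metastable phase at x = 0.25 sits 0.3 (in these toy units) above the hull even though its formation energy is negative, which is exactly why formation energy alone is not a stability criterion.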


4.2 Unit II — Representations of Materials (Weeks 5–7)

(Aligned with early neural networks in MFML)

4.2.1 Week 5 – From classical descriptors to learned representations (12.05.2026)

  • Classical descriptors: Magpie, matminer.
  • Limits of hand-crafted features.
  • Motivation for representation learning.

Summary: This week introduces descriptor engineering as the first practical strategy for translating chemistry and structure into ML-ready feature vectors. Students see both the strengths of hand-crafted descriptors, especially interpretability, and the limits that motivate later learned representations.

Exercise:
Build a simple property predictor using classical descriptors.


4.2.2 Week 6 – Graph-based crystal representations (19.05.2026)

  • Crystals as graphs: nodes, edges, periodic boundary conditions.
  • Intuition behind CGCNN and MEGNet.
  • Relation to neural network concepts from MFML.

Summary: Students learn why graph representations are a natural match for periodic crystal data and how message passing turns local neighborhoods into property-relevant embeddings. The focus is on representational choices such as cutoffs, node and edge features, and invariances, not on architectural hype.

Exercise:
Construct and visualize graph representations of crystal structures.
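
The crystal-to-graph step can be sketched directly (a brute-force numpy version for small cells; it assumes the cutoff is smaller than the cell dimensions, so the 27 neighbouring images suffice):

```python
import itertools
import numpy as np

def pbc_edges(frac_coords, cell, cutoff):
    """Edge list (i, j, image_shift, distance) for a periodic structure.
    frac_coords: (N, 3) fractional coordinates; cell: (3, 3) lattice vectors."""
    cart = np.asarray(frac_coords) @ cell
    edges = []
    for shift in itertools.product((-1, 0, 1), repeat=3):
        offset = np.asarray(shift, dtype=float) @ cell
        for i, ri in enumerate(cart):
            for j, rj in enumerate(cart):
                if shift == (0, 0, 0) and i == j:
                    continue  # skip self-loops in the home cell
                d = float(np.linalg.norm(rj + offset - ri))
                if d < cutoff:
                    edges.append((i, j, shift, d))
    return edges

# One atom in a simple-cubic cell of side 1: six nearest neighbours at d = 1,
# all of them periodic images of the atom itself.
edges = pbc_edges([[0.0, 0.0, 0.0]], np.eye(3), cutoff=1.1)
```

Graph networks like CGCNN start from exactly this kind of edge list; the cutoff choice here is the representational decision discussed above.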


4.2.3 Week 7 – Local atomic environments (26.05.2026)

  • Local vs global representations.
  • Coordination environments, Voronoi tessellations.
  • SOAP descriptors as a bridge to learned representations.

Summary: This lecture focuses on local atomic environments as compact and physically meaningful descriptors of crystal structure. Coordination motifs, Voronoi constructions, and SOAP-like fingerprints are used to show how local geometry becomes a bridge between classical descriptors and richer learned representations.

Exercise:
Compute SOAP vectors and explore similarity in descriptor space.
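
As a toy stand-in for SOAP (real SOAP fingerprints also encode angular information and chemical species), a Gaussian-smeared radial fingerprint already supports similarity search in descriptor space; the neighbour-distance lists below are hypothetical:

```python
import numpy as np

def radial_fingerprint(distances, r_grid, sigma=0.2):
    """Gaussian-smeared histogram of neighbour distances: a smooth,
    fixed-length vector describing one local atomic environment."""
    d = np.asarray(distances, dtype=float)[:, None]
    return np.exp(-((r_grid[None, :] - d) ** 2) / (2 * sigma**2)).sum(axis=0)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

r = np.linspace(0.5, 3.0, 100)
env_a = radial_fingerprint([1.0, 1.0, 1.4], r)   # hypothetical neighbour shells
env_b = radial_fingerprint([1.0, 1.0, 1.4], r)   # identical environment
env_c = radial_fingerprint([1.2, 1.2, 2.0], r)   # distorted environment
```

Identical environments score similarity 1; distorted ones score lower, which is the basic mechanism behind descriptor-space neighbour searches.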


4.3 Unit III — Learning Structure–Property Relations (Weeks 8–10)

4.3.1 Week 8 – Regression and generalization in materials data (02.06.2026)

  • Predicting bandgaps, elastic moduli, formation energies.
  • Bias–variance trade-off and overfitting.
  • Dataset size vs model complexity.

Summary: This week reframes materials-property prediction as a generalization problem rather than a leaderboard problem. Students compare baseline regressors, grouped chemistry-aware validation schemes, and error metrics, and they learn why split design matters more than a small gain in test accuracy.

Exercise:
Compare linear, random forest, and neural network regressors.
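
The chemistry-aware split design discussed above can be sketched as leave-one-group-out splitting, where all entries of a chemical system leave the training set together (group labels below are hypothetical):

```python
import numpy as np

def grouped_splits(groups):
    """Leave-one-group-out index splits: polymorphs and near-duplicates of
    one chemical system are held out together, preventing leakage."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test = np.flatnonzero(groups == g)
        train = np.flatnonzero(groups != g)
        yield train, test

systems = ["Fe-O", "Fe-O", "Ti-O", "Ti-O", "Na-Cl"]
splits = list(grouped_splits(systems))
```

A random split would leave one Fe-O polymorph in training and one in test, rewarding memorisation; the grouped split measures the generalization the course actually cares about.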


4.3.2 Week 9 – Neural networks for materials properties (09.06.2026)

  • Neural networks as surrogates for DFT-level properties.
  • Training pitfalls: data leakage, imbalance, extrapolation.
  • Interpretability challenges.

Summary: Here neural networks are introduced as flexible surrogate models for nonlinear structure–property relationships under realistic data constraints. The lecture emphasizes optimization stability, regularization, and extrapolation failure modes, making clear that the main challenge is trustworthy use rather than raw expressive power.

Exercise:
Train a small neural network and analyze generalization behavior.
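
A one-hidden-layer network written in plain numpy is enough to see the surrogate idea end to end (a sketch on a synthetic "property"; real property models are graph networks trained on far larger datasets):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 2))
y = np.sin(3 * X[:, 0]) * X[:, 1]            # synthetic target to learn

# One hidden tanh layer, full-batch gradient descent on squared error.
W1 = rng.normal(0, 0.5, (2, 32)); b1 = np.zeros(32)
W2 = rng.normal(0, 0.1, (32, 1)); b2 = np.zeros(1)
lr, history = 0.1, []
for _ in range(2000):
    H = np.tanh(X @ W1 + b1)                 # hidden activations
    pred = (H @ W2 + b2).ravel()
    err = pred - y
    history.append(float(np.mean(err ** 2)))
    dH = (err[:, None] @ W2.T) * (1 - H ** 2)   # backprop through tanh
    W2 -= lr * (H.T @ err[:, None]) / len(X)
    b2 -= lr * err.mean(keepdims=True)
    W1 -= lr * (X.T @ dH) / len(X)
    b1 -= lr * dH.mean(axis=0)
```

Evaluating the trained network outside the sampled cube [-1, 1]² reproduces the extrapolation failure mode discussed in the lecture.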


4.3.3 Week 10 – Representation learning and feature discovery (16.06.2026)

  • Learned vs engineered features.
  • Transferability across chemical systems.
  • What networks learn about chemistry and structure.

Summary: Students examine how hidden representations can move beyond fixed descriptors and discover structure in materials datasets that is useful across tasks. The key question is not whether embeddings look clean, but whether they transfer, remain scientifically interpretable, and improve downstream prediction.

Exercise:
Compare model performance using raw descriptors vs learned embeddings.


4.4 Unit IV — Latent Spaces, Uncertainty, and Discovery (Weeks 11–13)

4.4.1 Week 11 – Latent spaces of materials (23.06.2026)

  • Autoencoders and embeddings for crystal data.
  • Interpreting latent dimensions.
  • Structure families and chemical intuition.

Summary: This lecture presents latent spaces as compressed coordinate systems for materials data, typically learned through autoencoders or related embedding models. Students learn how reconstruction, interpolation, and anomaly detection interact, and why visually smooth latent maps are not automatically scientifically meaningful.

Exercise:
Train an autoencoder; visualize latent materials space.
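
The simplest latent space is a linear one: a rank-k PCA projection is exactly the optimal linear autoencoder, which makes it a useful baseline before training a nonlinear one (numpy sketch on synthetic low-rank "descriptor" data):

```python
import numpy as np

def linear_autoencoder(X, k):
    """Encode to a k-dimensional latent space and decode back.
    The top-k principal axes give the optimal *linear* encoder/decoder pair."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:k].T                     # decoder columns = principal axes
    Z = (X - mu) @ W                 # latent coordinates of each material
    X_rec = Z @ W.T + mu             # reconstruction from the latent code
    return Z, X_rec

# Synthetic descriptor table with intrinsic dimension 2 embedded in 5-D.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 5))
Z, X_rec = linear_autoencoder(X, k=2)
```

If a nonlinear autoencoder cannot beat this baseline's reconstruction error, its extra latent structure deserves suspicion, which is the point of the "visually smooth is not meaningful" warning above.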


4.4.2 Week 12 – Clustering, uncertainty, and discovery logic (30.06.2026)


  • Why clustering is not discovery.
  • Outliers, anomalies, and candidate identification.
  • Aleatoric vs epistemic uncertainty.

Summary: This week distinguishes exploratory structure in data from actual discovery logic. Clusters, outliers, and anomalies can guide attention, but students learn that credible candidate identification requires uncertainty estimates, validation logic, and a clear distinction between novelty, artifact, and noise.

Exercise:
Contrast clustering results with latent-space exploration.
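
Epistemic uncertainty can be probed without special machinery: refit a model on bootstrap resamples and watch the ensemble disagree under extrapolation (a numpy sketch with polynomial surrogates on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)   # noisy observations

# Ensemble of degree-5 polynomial fits, each on a bootstrap resample.
preds = []
for _ in range(50):
    idx = rng.integers(0, x.size, x.size)                # bootstrap sample
    coeffs = np.polyfit(x[idx], y[idx], deg=5)
    preds.append(np.polyval(coeffs, [0.5, 2.0]))         # interior vs far point
preds = np.asarray(preds)
std_interior, std_extrapolated = preds.std(axis=0)
```

The ensemble spread at x = 0.5 reflects mostly the noise in the data (aleatoric), while the much larger spread at x = 2.0 is model disagreement (epistemic): exactly the quantity a discovery loop should not ignore.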


4.4.3 Week 13 – Uncertainty-aware discovery and Gaussian Processes (07.07.2026)

  • Gaussian Process regression as a gold standard for uncertainty.
  • Exploration vs exploitation.
  • Relevance to materials acceleration platforms.

Summary: Students are introduced to Gaussian Processes as uncertainty-aware surrogate models that are especially valuable in small-data screening settings. The lecture connects posterior uncertainty to exploration-versus-exploitation decisions and compares GP-based screening logic with neural-network ensemble alternatives.

Exercise:
Compare GP regression and neural network ensembles for screening tasks.
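
A minimal Gaussian Process regressor fits in a dozen lines of numpy (RBF kernel with fixed hyperparameters; real use optimises them via the marginal likelihood): the posterior variance collapses near observed points and grows away from them, which is the signal acquisition functions exploit.

```python
import numpy as np

def gp_posterior(x_train, y_train, x_test, length=0.7, noise=1e-4):
    """Posterior mean and variance of a GP with an RBF kernel (1-D inputs)."""
    def k(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)
    K = k(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = k(x_train, x_test)
    mean = Ks.T @ np.linalg.solve(K, y_train)       # posterior mean
    cov = k(x_test, x_test) - Ks.T @ np.linalg.solve(K, Ks)
    return mean, np.diag(cov)                       # pointwise variance

x_train = np.array([0.0, 1.0, 2.0])
y_train = np.sin(x_train)
# Query one point on top of the data and one far outside it.
mean, var = gp_posterior(x_train, y_train, np.array([1.0, 6.0]))
```

At x = 1.0 the posterior pins down the function; at x = 6.0 the variance reverts to the prior, flagging the point as worth measuring in an exploration-driven campaign.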


4.5 Unit V — Constraints, Trust, and Synthesis (Week 14)

4.5.1 Week 14 – Physical constraints, limits, and outlook (14.07.2026)

  • Stability, charge neutrality, and symmetry constraints.
  • Physics-informed learning in materials discovery.
  • What ML can and cannot discover.
  • Integration with experimental workflows.

Summary: The final lecture consolidates the course around scientific trust: physical constraints, explainability, reproducibility, and the limits of data-driven discovery. Students leave with a realistic view of how ML can accelerate materials research when it is embedded in experimental and simulation workflows rather than treated as a replacement for them.

Exercise:
Mini-project synthesis and presentation.
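
A constraint like charge neutrality makes a natural first screening filter for the mini-project; a minimal sketch (user-supplied fixed oxidation states; production filters enumerate all plausible states per element):

```python
def is_charge_neutral(composition, oxidation_states):
    """True if the formal charges of a composition sum to zero.
    composition: {element: count}; oxidation_states: {element: formal charge}."""
    return sum(n * oxidation_states[el] for el, n in composition.items()) == 0

# Fe2O3 with Fe(+3)/O(-2) passes; the same stoichiometry with Fe(+2) fails.
```

Cheap physical filters like this run before any ML model and remove candidates no amount of predicted performance can rescue.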


5 Learning Outcomes

Students completing this course will be able to:

  • Explain how simulation methods generate materials data and introduce bias.
  • Navigate and interrogate major materials databases.
  • Represent crystal structures using descriptors, graphs, and learned embeddings.
  • Train and evaluate ML models for predicting materials properties.
  • Understand latent spaces and their role in materials discovery.
  • Quantify and interpret uncertainty in materials predictions.
  • Apply ML responsibly to accelerate materials screening.
  • Critically assess the limits of data-driven materials discovery.