Materials Genomics
Unit 5: Graph-Based Crystal Representations

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg

  • Goal: Move beyond fixed descriptors to learn inductive biases from connectivity.
  • Workflow Position: Post-database query, pre-regression/discovery.

02. Learning objectives

By the end of this unit, students can:

  • explain why graphs are the natural language for crystal structures,
  • implement periodic boundary conditions (PBC) in graph construction,
  • describe the message passing mechanism in GNNs like CGCNN and MEGNet,
  • analyze common failure modes like shortcut learning and cutoff sensitivity.

03. Recap: Prerequisite map

  • Unit 2: Simulation as data generation.
  • Unit 3/4: Local environments and descriptors.
  • MFML: Neural network foundations (Unit 4) and Backprop (Unit 5).

Today’s Step

From hand-crafted local features to learned graph representations.

04. Why this matters: The failure of descriptors

  • Traditional descriptors (e.g., SOAP, PRDF) often require fixed-size vectors.
  • Large unit cells or varying stoichiometry cause dimension mismatch.
  • Graph inductive bias: learn features that are invariant to permutation and cell size by design.

05. Reading map

  • Core Theory: Sandfeld Ch. 2.2 (Structure encoding).
  • ML Architecture: Neuer Ch. 4.5.1–4.5.4 (GNNs for engineering).
  • Deep Learning: Murphy Ch. 35 (GNN section).

06. Crystals as periodic graphs

  • Nodes: Atoms (\(Z\), position, oxidation state).
  • Edges: Bonds or proximity relationships.
  • Attributes: Distance, bond type, cell parameters.
  • Global: Crystal system, space group, density.
```mermaid
graph TD
    A((Atom i)) -- rij --> B((Atom j))
    B -- rjk --> C((Atom k))
    A -- rik --> C
    subgraph Crystal Graph
    A; B; C;
    end
    style A fill:#f9f,stroke:#333
```

07. Periodic Boundary Conditions (PBC)

  • Crystals are infinite lattices, not finite molecules.
  • An atom at the cell edge is connected to an atom in the neighboring image.
  • Graph construction requires searching within a supercell or using a distance cutoff across images.
  • Failure to account for PBC leads to disconnected graphs and physically wrong features.

08. Graph construction workflow

```mermaid
graph LR
    A[CIF File] --> B[Atom Positions]
    B --> C[Distance Matrix]
    C --> D[Neighbor Search]
    D --> E[Edge Creation]
    E --> F[Feature Assignment]
    style D fill:#ff9,stroke:#333
```
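
A minimal Python sketch of this pipeline, assuming a recent pymatgen and a local file example.cif (the cutoff value is illustrative):

```python
# Sketch: CIF -> periodic crystal graph with pymatgen.
from pymatgen.core import Structure

structure = Structure.from_file("example.cif")  # assumed local CIF
r_cut = 5.0  # Angstrom; illustrative, see slide 09 for trade-offs

edges = []  # (i, j, distance), periodic images included
for i, neighbors in enumerate(structure.get_all_neighbors(r_cut)):
    for nb in neighbors:
        # nb.index maps the neighbor back to a site in the unit cell,
        # even when the neighbor itself sits in a periodic image.
        edges.append((i, nb.index, nb.nn_distance))

# Node features: atomic numbers (assumes an ordered structure)
atomic_numbers = [site.specie.Z for site in structure]
print(f"{len(structure)} atoms, {len(edges)} directed edges")
```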

09. Neighbor cutoff choices

Fixed Cutoff (\(r_c\))

  • Pros: Simple, captures bonding shell.
  • Cons: Sensitive to density changes.

Fixed Neighbors (\(k\)-NN)

  • Pros: Constant degree graph.
  • Cons: May skip relevant bonds in sparse regions.

10. Distance encoding: Radial Basis Functions (RBF)

  • Raw distance \(r_{ij}\) is often expanded into a vector using Gaussians:
  • \(e_{ij} = [ \exp(-\beta(r_{ij} - \mu_1)^2), \dots, \exp(-\beta(r_{ij} - \mu_K)^2) ]\)
  • This allows the network to distinguish small distance variations precisely.
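
A minimal numpy sketch of this expansion; the centers \(\mu_k\) and width \(\beta\) are illustrative hyperparameters, not canonical values:

```python
# Sketch of the Gaussian expansion above.
import numpy as np

def rbf_expand(r_ij, r_max=6.0, K=32, beta=10.0):
    """Expand a scalar distance into a K-dimensional Gaussian fingerprint."""
    mu = np.linspace(0.0, r_max, K)          # evenly spaced centers
    return np.exp(-beta * (r_ij - mu) ** 2)  # shape (K,)

e_ij = rbf_expand(2.35)  # e.g. a typical bond length in Angstrom
```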

11. Invariance and Equivariance

  • Permutation Invariance: Changing atom order doesn’t change property.
  • Translation Invariance: Shifting the cell doesn’t change property.
  • Rotation Invariance: Rotating the crystal doesn’t change property.
  • GNNs achieve these by using pooling (sum/mean) and relative distances.

12. Message Passing: The intuition

  1. Send: Neighbors send messages based on their state and edge features.
  2. Aggregate: Atom \(i\) collects messages (invariant sum/mean).
  3. Update: Atom \(i\) updates its hidden state.
  4. Repeat: Information spreads across the crystal.
graph BT
    N1[Neighbor 1] --> Agg[Σ]
    N2[Neighbor 2] --> Agg
    N3[Neighbor 3] --> Agg
    Agg --> Upd[Update Atom i]
    style Agg fill:#bfb
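
A toy numpy version of one such step; W_msg and W_upd stand in for learned weight matrices, and tanh is an arbitrary nonlinearity choice:

```python
# Toy Send/Aggregate/Update step (sketch, not an optimized implementation).
import numpy as np

def mp_step(h, edges, edge_feat, W_msg, W_upd):
    """h: (N, F) node states; edges: list of (i, j); edge_feat: (E, F_e)."""
    agg = np.zeros_like(h)
    for (i, j), e in zip(edges, edge_feat):
        msg = np.tanh(np.concatenate([h[j], e]) @ W_msg)  # 1. Send
        agg[i] += msg                                     # 2. Aggregate (sum)
    return np.tanh(np.concatenate([h, agg], axis=1) @ W_upd)  # 3. Update
```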

13. CGCNN: Crystal Graph Convolutional NN

  • Key Innovation: Gated convolution for crystal graphs.
  • Edge features (\(e_{ij}\), the RBF-expanded distance) and both node vectors are concatenated: \(z_{ij} = v_i^{(t)} \oplus v_j^{(t)} \oplus e_{ij}\).
  • \(v_i^{(t+1)} = v_i^{(t)} + \sum_{j \in N(i)} \sigma(W_z z_{ij} + b_z) \odot g(W_s z_{ij} + b_s)\)
  • Successfully used for formation energy and bandgap prediction (Sandfeld et al. 2024).
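
A hedged PyTorch sketch of this gated update (softplus for \(g\) follows the CGCNN paper; layer sizes and wiring are otherwise illustrative, not the reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CGConvSketch(nn.Module):
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        z_dim = 2 * node_dim + edge_dim         # z_ij = v_i ⊕ v_j ⊕ e_ij
        self.gate = nn.Linear(z_dim, node_dim)  # W_z, b_z (sigma branch)
        self.core = nn.Linear(z_dim, node_dim)  # W_s, b_s (g branch)

    def forward(self, v, edge_index, e):
        src, dst = edge_index                      # message j -> i
        z = torch.cat([v[dst], v[src], e], dim=1)  # (E, z_dim)
        msg = torch.sigmoid(self.gate(z)) * F.softplus(self.core(z))
        out = v.clone()
        out.index_add_(0, dst, msg)                # residual sum over N(i)
        return out
```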

14. MEGNet: Adding Global State

  • Extends CGCNN by adding a Global State Variable (\(u\)).
  • \(u\) captures properties like temperature, pressure, or overall density.
  • Update order per block: bond features (from their end atoms and \(u\)) \(\to\) atom features \(\to\) global state.
  • Enables multi-property learning and state-aware predictions.

15. Continuous-filter convolution (SchNet)

  • Filter weights vary continuously with the interatomic distance instead of being tied to discrete edge types.
  • The kernel \(W(r_{ij})\) is itself a small neural network (a filter-generating network).
  • Particularly effective for modeling potential energy surfaces and forces.
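
A hedged PyTorch sketch: the filter-generating network maps the RBF-expanded distance to per-edge filters (all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class CFConvSketch(nn.Module):
    def __init__(self, feat_dim, rbf_dim):
        super().__init__()
        self.filter_net = nn.Sequential(          # W(r_ij) as a network
            nn.Linear(rbf_dim, feat_dim), nn.Softplus(),
            nn.Linear(feat_dim, feat_dim))

    def forward(self, h, edge_index, rbf):
        src, dst = edge_index
        W = self.filter_net(rbf)                  # (E, feat_dim) filters
        out = torch.zeros_like(h)
        out.index_add_(0, dst, h[src] * W)        # element-wise filtering
        return out
```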

16. Readout functions (Pooling)

  • How do we go from atom-level features to a single crystal property?
  • Global Sum/Mean: Invariant to cell size but may lose local detail.
  • Set2Set: Uses attention to aggregate information.
  • Choice depends on property (Extensive: Energy \(\to\) Sum; Intensive: Bandgap \(\to\) Mean).
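
In code, the choice is a single line; this sketch handles one crystal at a time (slide 23 shows the batched version):

```python
import torch

def readout(h: torch.Tensor, extensive: bool) -> torch.Tensor:
    """h: (N_atoms, F) node features of one crystal.
    Sum for extensive targets (energy), mean for intensive ones (bandgap)."""
    return h.sum(dim=0) if extensive else h.mean(dim=0)
```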

17. Graph Depth and Over-smoothing

  • Each message passing step increases the “receptive field”.
  • Too few layers: Cannot see long-range interactions (e.g. ionic bonds).
  • Too many layers: Features become identical for all atoms (over-smoothing).
  • Typical depth for materials: 3–6 layers.

18. Data efficiency: Descriptors vs Graphs

Descriptors (MLP)

  • Fast to train.
  • Needs more data to generalize.
  • “Rigid” features.

Graphs (GNN)

  • Slower training.
  • Learns features directly from connectivity (“on the fly”).
  • High data efficiency for novel structures.

19. Handling variable-size unit cells

  • Standard NNs need a fixed input dimension \(D\).
  • GNNs process \(N\) atoms, where \(N\) can vary per sample.
  • The weight matrices are shared across all atoms, regardless of cell size.
  • This is critical for scaling from primitive to supercells.

20. Reproducibility in graph construction

  • Small changes in \(r_c\) (cutoff) can change graph topology.
  • Always document: Cutoff radius, RBF parameters, and neighbor search algorithm.
  • Deterministic construction is a prerequisite for scientific trust.

21. Beyond distances: Edge engineering

  • Beyond \(r_{ij}\), edges can encode:
    • Bond angles (triplets).
    • Dihedral angles.
    • Coordination numbers.
  • Adding higher-order geometric features improves force prediction.

22. Incorporating Composition Priors

  • Initialize node features with atomic properties (Electronegativity, Ionization energy).
  • This “primes” the graph with chemical knowledge before the first message is passed.
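
A short pymatgen-based sketch; attribute availability per element (e.g. electronegativity for noble gases) is an assumption to verify for your element set:

```python
from pymatgen.core.periodic_table import Element
import numpy as np

def init_node_features(symbols):
    feats = []
    for s in symbols:
        el = Element(s)
        # Z, Pauling electronegativity, atomic radius (0.0 if untabulated)
        feats.append([el.Z, el.X, el.atomic_radius or 0.0])
    return np.array(feats)  # (N, 3) chemistry-informed start

x0 = init_node_features(["Ba", "Ti", "O", "O", "O"])  # e.g. BaTiO3
```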

23. Training Stability and Mini-batches

  • Crystal graphs vary in size \(\to\) “jagged” batches.
  • Solution: Pack multiple graphs into one large disconnected graph.
  • Requires careful index tracking for pooling and gradients.
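
A minimal PyTorch sketch of the packing trick; the batch vector does the index tracking, so pooling (slide 16) stays strictly per-crystal:

```python
import torch

h = torch.randn(7, 16)                       # atoms of two packed crystals
batch = torch.tensor([0, 0, 0, 1, 1, 1, 1])  # graph id per atom

pooled_sum = torch.zeros(2, 16)
pooled_sum.index_add_(0, batch, h)           # per-graph sum readout
counts = torch.bincount(batch).float().unsqueeze(1)
pooled_mean = pooled_sum / counts            # per-graph mean readout
```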

24. Computational cost scaling

  • Graph construction: \(O(N \log N)\) or \(O(N^2)\) depending on search.
  • Message Passing: \(O(E)\) where \(E\) is the number of edges.
  • Scaling is linear with cell size, making it much faster than DFT (\(O(N^3)\)).

25. Interpretability: Attention maps

  • Which atoms contribute most to the predicted bandgap?
  • Use attention weights to visualize “importance” on the crystal structure.
  • Helps identify active sites in catalysts or structural defects.

26. Failure mode: Shortcut learning

  • If unit cell volume correlates with energy, the GNN might only learn volume.
  • Check: Perform an ablation study with randomized atom types.
  • If accuracy remains high, the model is cheating!

27. Failure mode: Cutoff artifacts

  • Sharp cutoffs cause “jumps” in energy as atoms cross the boundary.
  • Fix: Use a smoothing envelope function \(f_{cut}(r_{ij})\).

Figure: Effect of cutoff on force continuity.
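
A common choice is a smooth cosine envelope, sketched below (one of several functional forms used in the literature):

```python
import numpy as np

def f_cut(r, r_c):
    """Decays smoothly from 1 at r = 0 to 0 at r = r_c (and stays 0 beyond).
    Multiplying edge messages by f_cut removes the jump at the cutoff."""
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)
```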

28. Transferability: Across Chemistries

  • Can a GNN trained on Oxides predict Nitrides?
  • Success depends on the diversity of the training set and node feature initialization.
  • Representation learning helps capture universal chemical patterns.

29. OOD behavior: Unseen Prototypes

  • Predicting properties for a new crystal system (e.g. Perovskite \(\to\) Spinel).
  • GNNs are better than descriptors but still fragile.
  • Use uncertainty quantification (Unit 12) to flag these cases.

30. Baseline comparison protocol

  • Always compare GNN performance against:
    1. Mean target baseline.
    2. Simple composition-only MLP.
    3. Classical descriptor baseline (SOAP/PRDF).

31. Evaluation metrics for screening

  • MSE/MAE measure average accuracy, not discovery utility.
  • Top-k Recall: Did we find the best materials?
  • Spearman Rank Correlation: Is the ordering correct?
  • For discovery, ranking matters more than absolute values.
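
A short sketch of both metrics; scipy.stats.spearmanr is a real API, while the top-k helper and placeholder data are illustrative:

```python
import numpy as np
from scipy.stats import spearmanr

def top_k_recall(y_true, y_pred, k=50):
    """Fraction of the truly best k materials recovered in the predicted
    top k (assumes higher target = better)."""
    best = set(np.argsort(y_true)[-k:])
    found = set(np.argsort(y_pred)[-k:])
    return len(best & found) / k

y_true = np.random.rand(1000)                  # placeholder targets
y_pred = y_true + 0.2 * np.random.randn(1000)  # placeholder predictions
rho, _ = spearmanr(y_true, y_pred)
print(f"Spearman rho={rho:.2f}, recall@50={top_k_recall(y_true, y_pred):.2f}")
```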

32. Uncertainty in GNNs

  • Ensembles of GNNs provide variance estimates.
  • High variance \(\to\) Structural novelty.
  • Key for active learning loops in Materials Genomics.
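
A minimal sketch with placeholder predictions from M = 5 hypothetical ensemble members:

```python
import numpy as np

preds = np.random.rand(5, 1000)          # placeholder: (M, N) predictions
mean, std = preds.mean(axis=0), preds.std(axis=0)
novel = std > np.quantile(std, 0.95)     # flag the most uncertain 5%
```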

33. Case: Bandgap prediction

  • Challenge: Bandgaps depend on global symmetries.
  • GNNs capture local chemistry but need sufficient depth for global electronic structure.

34. Case: Formation Energy

  • Standard benchmark for CGCNN/MEGNet.
  • GNNs achieve MAE < 0.03 eV/atom, approaching DFT precision for many systems.

35. Case: Elasticity under limited data

  • Elastic constants are sparse in databases.
  • GNN transfer learning (from Energy \(\to\) Moduli) significantly improves accuracy.

36. Feeding Representation Learning (Unit 9)

  • Unit 9 will show how to use GNN encoders as feature extractors.
  • Discard the property head \(\to\) Latent vector \(\mathbf{z}\) represents the material structure.

37. Connection to Unit 6 (Local Environments)

  • GNN message passing is an iterative refinement of the local environments discussed in Unit 6.
  • Each layer “looks” one bond further away.

38. Consolidation: Multi-modal crystal graphs

  • Future direction: Combining crystal graphs with text (literature) and images (micrographs).
  • Nodes can represent not just atoms, but entire phases or grains.

39. Advanced: Equivariant GNNs (Tensor Field Networks)

  • Predict vectors (forces) and tensors (elasticity) that rotate correctly with the input structure.
  • Higher mathematical complexity, higher physical fidelity.

40. Scalability to complex microstructures

  • Can we graph an entire polycrystal?
  • Use hierarchical graphs: Atomic scale \(\to\) Grain scale \(\to\) Component scale.

41. Interpreting GNN “Chemical Intuition”

  • Do the learned embeddings \(\mathbf{z}\) match the periodic table?
  • Plotting t-SNE of learned atomic features usually recovers Mendeleev’s structure.
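
A short scikit-learn sketch with a placeholder embedding matrix:

```python
import numpy as np
from sklearn.manifold import TSNE

Z = np.random.rand(89, 64)    # placeholder: one row per element
Z2 = TSNE(n_components=2, perplexity=15).fit_transform(Z)
# Scatter-plot Z2 colored by group/period to check for periodic-table trends.
```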

42. Active Learning with GNNs

  • Step 1: Predict on 100k candidates.
  • Step 2: Select top candidates with high uncertainty.
  • Step 3: Run DFT validation.
  • Step 4: Retrain GNN.
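
A hedged sketch of one round; run_dft_stub and the prediction functions are stand-ins for project-specific components:

```python
import numpy as np

def run_dft_stub(x):
    return float(np.sum(x))  # stands in for an expensive DFT call

def active_learning_round(predict_fns, pool, budget=5):
    preds = np.stack([f(pool) for f in predict_fns])  # Step 1: predict
    mean, std = preds.mean(axis=0), preds.std(axis=0)
    picks = np.argsort(mean + std)[-budget:]          # Step 2: value + uncertainty
    labels = [run_dft_stub(pool[i]) for i in picks]   # Step 3: validate
    return picks, labels                              # Step 4: retrain on these

pool = np.random.rand(1000, 8)                        # candidate features
fns = [lambda X, w=w: X @ w for w in np.random.rand(3, 8)]
picks, labels = active_learning_round(fns, pool)
```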

43. Deployment in Automated Labs

  • GNNs provide the “brain” for self-driving laboratories.
  • Real-time structure-property feedback loop.

44. The limit of GNNs

  • GNNs are not physics solvers; they are interpolators.
  • They cannot predict truly new physics (e.g. superconductivity) without training data.

45. Exercise setup: Graph Construction

  • Goal: Build a crystal graph from a CIF file using pymatgen or ase.
  • Dataset: A subset of the Materials Project (e.g. ABX3 perovskites).

46. Exercise Task 1: The Pipeline

  • Load CIF \(\to\) Build Graph \(\to\) Visualize Neighbors.
  • Verify PBC implementation.
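
A starting-point sketch using ase (the file name and cutoff are assumptions); neighbor_list handles periodic images, so edges crossing the cell boundary appear automatically:

```python
import numpy as np
from ase.io import read
from ase.neighborlist import neighbor_list

atoms = read("example.cif")                        # assumed local CIF
i, j, d = neighbor_list("ijd", atoms, cutoff=5.0)  # PBC-aware neighbors

# PBC sanity check: no atom should end up disconnected, not even those
# sitting directly on the cell boundary.
assert len(np.unique(i)) == len(atoms), "disconnected atoms -> check PBC"
print(f"{len(atoms)} atoms, {len(i)} directed edges")
```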

47. Exercise Task 2: Ablation

  • Compare regression accuracy with \(r_c = 4.0\) Å vs \(r_c = 8.0\) Å.
  • Plot error vs coordination number.

48. Exercise Task 3: Failure Analysis

  • Identify one sample with very high error.
  • Is it a rare chemistry? A strange unit cell? A shortcut learning victim?

49. Unit summary: 10 exam statements

  1. Crystals are modeled as graphs to learn flexible, permutation-invariant representations.
  2. PBC enforcement is mandatory to capture the infinite nature of crystal lattices.
  3. RBF expansion of distances allows NNs to resolve fine structural differences.
  4. Message passing iteratively aggregates neighbor information to build global context.
  5. CGCNN uses gated convolutions; MEGNet adds a global state variable.
  6. Readout choice (sum vs mean) must align with property intensive/extensive nature.
  7. Over-smoothing occurs when excessive depth makes node features indistinguishable.
  8. GNNs are typically more data-efficient than fixed-descriptor MLPs for novel structures.
  9. Shortcut learning occurs when the model relies on spurious cell-size correlations.
  10. GNN ranking metrics (Spearman) are often more critical for discovery than MSE.

50. References + Unit 6 Bridge

  • Next Unit: Local Atomic Environments and many-body descriptors.
  • Read: Sandfeld Ch. 2.2, Neuer Ch. 4.5.
  • Exercise: Submit your GNN-construction notebook by next week.

Sandfeld, Stefan et al. 2024. Materials Data Science. Springer.

Example Notebook

Week 5: Descriptors + Regression — ChemicalElementsDataset