Materials Genomics
Unit 5: Graph-Based Crystal Representations

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg

  • Goal: Move beyond fixed descriptors to learn inductive biases from connectivity.
  • Workflow Position: Post-database query, pre-regression/discovery.

02. Learning objectives

By the end of this unit, students can:

  • explain why graphs are the natural language for crystal structures,
  • implement periodic boundary conditions (PBC) in graph construction,
  • describe the message passing mechanism in GNNs like CGCNN and MEGNet,
  • analyze common failure modes like shortcut learning and cutoff sensitivity.

03. Recap: Prerequisite map

  • Unit 2: Simulation as data generation.
  • Unit 3/4: Local environments and descriptors.
  • MFML: Neural network foundations (Unit 4) and Backprop (Unit 5).

Today’s Step

From hand-crafted local features to learned graph representations.

04. Why this matters: The failure of descriptors

  • Traditional descriptors (e.g., SOAP, PRDF) often require fixed-size vectors.
  • Large unit cells or varying stoichiometry cause dimension mismatch.
  • Graph inductive bias: learn features that are invariant to permutation and cell size by design.

05. Reading map

  • Core Theory: Sandfeld Ch. 2.2 (Structure encoding).
  • ML Architecture: Neuer Ch. 4.5.1–4.5.4 (GNNs for engineering).
  • Deep Learning: Murphy Ch. 35 (GNN section).

06. Crystals as periodic graphs

  • Nodes: Atoms (\(Z\), position, oxidation state).
  • Edges: Bonds or proximity relationships.
  • Attributes: Distance, bond type, cell parameters.
  • Global: Crystal system, space group, density.
```mermaid
graph TD
    A((Atom i)) -- rij --> B((Atom j))
    B -- rjk --> C((Atom k))
    A -- rik --> C
    subgraph Crystal Graph
    A; B; C;
    end
    style A fill:#f9f,stroke:#333
```

07. Periodic Boundary Conditions (PBC)

  • Crystals are infinite lattices, not finite molecules.
  • An atom at the cell edge is connected to an atom in the neighboring image.
  • Graph construction requires searching within a supercell or using a distance cutoff across images.
  • Failure to account for PBC leads to disconnected graphs and physically wrong features.

08. Graph construction workflow

```mermaid
graph LR
    A[CIF File] --> B[Atom Positions]
    B --> C[Distance Matrix]
    C --> D[Neighbor Search]
    D --> E[Edge Creation]
    E --> F[Feature Assignment]
    style D fill:#ff9,stroke:#333
```
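
A minimal Python sketch of this pipeline, assuming a recent pymatgen and a local file example.cif (the cutoff value is illustrative):

```python
# Sketch: CIF -> periodic crystal graph with pymatgen.
from pymatgen.core import Structure

structure = Structure.from_file("example.cif")  # assumed local CIF
r_cut = 5.0  # Angstrom; illustrative, see slide 09 for trade-offs

edges = []  # (i, j, distance), periodic images included
for i, neighbors in enumerate(structure.get_all_neighbors(r_cut)):
    for nb in neighbors:
        # nb.index maps the neighbor back to a site in the unit cell,
        # even when the neighbor itself sits in a periodic image.
        edges.append((i, nb.index, nb.nn_distance))

# Node features: atomic numbers (assumes an ordered structure)
atomic_numbers = [site.specie.Z for site in structure]
print(f"{len(structure)} atoms, {len(edges)} directed edges")
```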

09. Neighbor cutoff choices

Fixed Cutoff (\(r_c\))

  • Pros: Simple, captures bonding shell.
  • Cons: Sensitive to density changes.

Fixed Neighbors (\(k\)-NN)

  • Pros: Constant degree graph.
  • Cons: May skip relevant bonds in sparse regions.

10. Distance encoding: Radial Basis Functions (RBF)

  • Raw distance \(r_{ij}\) is often expanded into a vector using Gaussians:
  • \(e_{ij} = [ \exp(-\beta(r_{ij} - \mu_1)^2), \dots, \exp(-\beta(r_{ij} - \mu_K)^2) ]\)
  • This allows the network to distinguish small distance variations precisely.
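
A minimal numpy sketch of this expansion; the centers \(\mu_k\) and width \(\beta\) are illustrative hyperparameters, not canonical values:

```python
# Sketch of the Gaussian expansion above.
import numpy as np

def rbf_expand(r_ij, r_max=6.0, K=32, beta=10.0):
    """Expand a scalar distance into a K-dimensional Gaussian fingerprint."""
    mu = np.linspace(0.0, r_max, K)          # evenly spaced centers
    return np.exp(-beta * (r_ij - mu) ** 2)  # shape (K,)

e_ij = rbf_expand(2.35)  # e.g. a typical bond length in Angstrom
```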

11. Invariance and Equivariance

  • Permutation Invariance: Changing atom order doesn’t change property.
  • Translation Invariance: Shifting the cell doesn’t change property.
  • Rotation Invariance: Rotating the crystal doesn’t change property.
  • GNNs achieve these by using pooling (sum/mean) and relative distances.

12. Message Passing: The intuition

  1. Send: Neighbors send messages based on their state and edge features.
  2. Aggregate: Atom \(i\) collects messages (invariant sum/mean).
  3. Update: Atom \(i\) updates its hidden state.
  4. Repeat: Information spreads across the crystal.
graph BT
    N1[Neighbor 1] --> Agg[Σ]
    N2[Neighbor 2] --> Agg
    N3[Neighbor 3] --> Agg
    Agg --> Upd[Update Atom i]
    style Agg fill:#bfb
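
A toy numpy version of one such step; W_msg and W_upd stand in for learned weight matrices, and tanh is an arbitrary nonlinearity choice:

```python
# Toy Send/Aggregate/Update step (sketch, not an optimized implementation).
import numpy as np

def mp_step(h, edges, edge_feat, W_msg, W_upd):
    """h: (N, F) node states; edges: list of (i, j); edge_feat: (E, F_e)."""
    agg = np.zeros_like(h)
    for (i, j), e in zip(edges, edge_feat):
        msg = np.tanh(np.concatenate([h[j], e]) @ W_msg)  # 1. Send
        agg[i] += msg                                     # 2. Aggregate (sum)
    return np.tanh(np.concatenate([h, agg], axis=1) @ W_upd)  # 3. Update
```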

13. CGCNN: Crystal Graph Convolutional NN

  • Key Innovation: Gated convolution for crystal graphs.
  • Edge features (\(e_{ij}\), the RBF-expanded distance) and both node vectors are concatenated: \(z_{ij} = v_i^{(t)} \oplus v_j^{(t)} \oplus e_{ij}\).
  • \(v_i^{(t+1)} = v_i^{(t)} + \sum_{j \in N(i)} \sigma(W_z z_{ij} + b_z) \odot g(W_s z_{ij} + b_s)\)
  • Successfully used for formation energy and bandgap prediction (Sandfeld et al. 2024).
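
A hedged PyTorch sketch of this gated update (softplus for \(g\) follows the CGCNN paper; layer sizes and wiring are otherwise illustrative, not the reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CGConvSketch(nn.Module):
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        z_dim = 2 * node_dim + edge_dim         # z_ij = v_i ⊕ v_j ⊕ e_ij
        self.gate = nn.Linear(z_dim, node_dim)  # W_z, b_z (sigma branch)
        self.core = nn.Linear(z_dim, node_dim)  # W_s, b_s (g branch)

    def forward(self, v, edge_index, e):
        src, dst = edge_index                      # message j -> i
        z = torch.cat([v[dst], v[src], e], dim=1)  # (E, z_dim)
        msg = torch.sigmoid(self.gate(z)) * F.softplus(self.core(z))
        out = v.clone()
        out.index_add_(0, dst, msg)                # residual sum over N(i)
        return out
```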

14. MEGNet: Adding Global State

  • Extends CGCNN by adding a Global State Variable (\(u\)).
  • \(u\) captures properties like temperature, pressure, or overall density.
  • Update order per block: bond features (from their end atoms and \(u\)) \(\to\) atom features \(\to\) global state.
  • Enables multi-property learning and state-aware predictions.

15. Continuous-filter convolution (SchNet)

  • Filter weights vary continuously with the interatomic distance instead of being tied to discrete edge types.
  • The kernel \(W(r_{ij})\) is itself a small neural network (a filter-generating network).
  • Particularly effective for modeling potential energy surfaces and forces.
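
A hedged PyTorch sketch: the filter-generating network maps the RBF-expanded distance to per-edge filters (all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class CFConvSketch(nn.Module):
    def __init__(self, feat_dim, rbf_dim):
        super().__init__()
        self.filter_net = nn.Sequential(          # W(r_ij) as a network
            nn.Linear(rbf_dim, feat_dim), nn.Softplus(),
            nn.Linear(feat_dim, feat_dim))

    def forward(self, h, edge_index, rbf):
        src, dst = edge_index
        W = self.filter_net(rbf)                  # (E, feat_dim) filters
        out = torch.zeros_like(h)
        out.index_add_(0, dst, h[src] * W)        # element-wise filtering
        return out
```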

16. Readout functions (Pooling)

  • How do we go from atom-level features to a single crystal property?
  • Global Sum/Mean: Invariant to cell size but may lose local detail.
  • Set2Set: Uses attention to aggregate information.
  • Choice depends on property (Extensive: Energy \(\to\) Sum; Intensive: Bandgap \(\to\) Mean).
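
In code, the choice is a single line; this sketch handles one crystal at a time (slide 23 shows the batched version):

```python
import torch

def readout(h: torch.Tensor, extensive: bool) -> torch.Tensor:
    """h: (N_atoms, F) node features of one crystal.
    Sum for extensive targets (energy), mean for intensive ones (bandgap)."""
    return h.sum(dim=0) if extensive else h.mean(dim=0)
```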

17. Graph Depth and Over-smoothing

  • Each message passing step increases the “receptive field”.
  • Too few layers: Cannot see long-range interactions (e.g. ionic bonds).
  • Too many layers: Features become identical for all atoms (over-smoothing).
  • Typical depth for materials: 3–6 layers.

18. Data efficiency: Descriptors vs Graphs

Descriptors (MLP)

  • Fast to train.
  • Needs more data to generalize.
  • “Rigid” features.

Graphs (GNN)

  • Slower training.
  • Learns features directly from connectivity (“on the fly”).
  • High data efficiency for novel structures.

19. Handling variable-size unit cells

  • Standard NNs need a fixed input dimension \(D\).
  • GNNs process \(N\) atoms, where \(N\) can vary per sample.
  • The weight matrices are shared across all atoms, regardless of cell size.
  • This is critical for scaling from primitive to supercells.

20. Reproducibility in graph construction

  • Small changes in \(r_c\) (cutoff) can change graph topology.
  • Always document: Cutoff radius, RBF parameters, and neighbor search algorithm.
  • Deterministic construction is a prerequisite for scientific trust.

21. Beyond distances: Edge engineering

  • Beyond \(r_{ij}\), edges can encode:
    • Bond angles (triplets).
    • Dihedral angles.
    • Coordination numbers.
  • Adding higher-order geometric features improves force prediction.

22. Incorporating Composition Priors

  • Initialize node features with atomic properties (Electronegativity, Ionization energy).
  • This “primes” the graph with chemical knowledge before the first message is passed.
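
A short pymatgen-based sketch; attribute availability per element (e.g. electronegativity for noble gases) is an assumption to verify for your element set:

```python
from pymatgen.core.periodic_table import Element
import numpy as np

def init_node_features(symbols):
    feats = []
    for s in symbols:
        el = Element(s)
        # Z, Pauling electronegativity, atomic radius (0.0 if untabulated)
        feats.append([el.Z, el.X, el.atomic_radius or 0.0])
    return np.array(feats)  # (N, 3) chemistry-informed start

x0 = init_node_features(["Ba", "Ti", "O", "O", "O"])  # e.g. BaTiO3
```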

23. Training Stability and Mini-batches

  • Crystal graphs vary in size \(\to\) “jagged” batches.
  • Solution: Pack multiple graphs into one large disconnected graph.
  • Requires careful index tracking for pooling and gradients.
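
A minimal PyTorch sketch of the packing trick; the batch vector does the index tracking, so pooling (slide 16) stays strictly per-crystal:

```python
import torch

h = torch.randn(7, 16)                       # atoms of two packed crystals
batch = torch.tensor([0, 0, 0, 1, 1, 1, 1])  # graph id per atom

pooled_sum = torch.zeros(2, 16)
pooled_sum.index_add_(0, batch, h)           # per-graph sum readout
counts = torch.bincount(batch).float().unsqueeze(1)
pooled_mean = pooled_sum / counts            # per-graph mean readout
```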

24. Computational cost scaling

  • Graph construction: \(O(N \log N)\) or \(O(N^2)\) depending on search.
  • Message Passing: \(O(E)\) where \(E\) is the number of edges.
  • Scaling is linear with cell size, making it much faster than DFT (\(O(N^3)\)).

25. Interpretability: Attention maps

  • Which atoms contribute most to the predicted bandgap?
  • Use attention weights to visualize “importance” on the crystal structure.
  • Helps identify active sites in catalysts or structural defects.

26. Failure mode: Shortcut learning

  • If unit cell volume correlates with energy, the GNN might only learn volume.
  • Check: Perform an ablation study with randomized atom types.
  • If accuracy remains high, the model is cheating!

27. Failure mode: Cutoff artifacts

  • Sharp cutoffs cause “jumps” in energy as atoms cross the boundary.
  • Fix: Use a smoothing envelope function \(f_{cut}(r_{ij})\).

Figure: Effect of cutoff on force continuity.
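
A common choice is a smooth cosine envelope, sketched below (one of several functional forms used in the literature):

```python
import numpy as np

def f_cut(r, r_c):
    """Decays smoothly from 1 at r = 0 to 0 at r = r_c (and stays 0 beyond).
    Multiplying edge messages by f_cut removes the jump at the cutoff."""
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)
```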

28. Transferability: Across Chemistries

  • Can a GNN trained on Oxides predict Nitrides?
  • Success depends on the diversity of the training set and node feature initialization.
  • Representation learning helps capture universal chemical patterns.

29. OOD behavior: Unseen Prototypes

  • Predicting properties for a new crystal system (e.g. Perovskite \(\to\) Spinel).
  • GNNs are better than descriptors but still fragile.
  • Use uncertainty quantification (Unit 12) to flag these cases.

30. Baseline comparison protocol

  • Always compare GNN performance against:
    1. Mean target baseline.
    2. Simple composition-only MLP.
    3. Classical descriptor baseline (SOAP/PRDF).

31. Evaluation metrics for screening

  • MSE/MAE measure average accuracy, not discovery utility.
  • Top-k Recall: Did we find the best materials?
  • Spearman Rank Correlation: Is the ordering correct?
  • For discovery, ranking matters more than absolute values.
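
A short sketch of both metrics; scipy.stats.spearmanr is a real API, while the top-k helper and placeholder data are illustrative:

```python
import numpy as np
from scipy.stats import spearmanr

def top_k_recall(y_true, y_pred, k=50):
    """Fraction of the truly best k materials recovered in the predicted
    top k (assumes higher target = better)."""
    best = set(np.argsort(y_true)[-k:])
    found = set(np.argsort(y_pred)[-k:])
    return len(best & found) / k

y_true = np.random.rand(1000)                  # placeholder targets
y_pred = y_true + 0.2 * np.random.randn(1000)  # placeholder predictions
rho, _ = spearmanr(y_true, y_pred)
print(f"Spearman rho={rho:.2f}, recall@50={top_k_recall(y_true, y_pred):.2f}")
```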

32. Uncertainty in GNNs

  • Ensembles of GNNs provide variance estimates.
  • High variance \(\to\) Structural novelty.
  • Key for active learning loops in Materials Genomics.
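
A minimal sketch with placeholder predictions from M = 5 hypothetical ensemble members:

```python
import numpy as np

preds = np.random.rand(5, 1000)          # placeholder: (M, N) predictions
mean, std = preds.mean(axis=0), preds.std(axis=0)
novel = std > np.quantile(std, 0.95)     # flag the most uncertain 5%
```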

33. Case: Bandgap prediction

  • Challenge: Bandgaps depend on global symmetries.
  • GNNs capture local chemistry but need sufficient depth for global electronic structure.

34. Case: Formation Energy

  • Standard benchmark for CGCNN/MEGNet.
  • GNNs achieve MAE < 0.03 eV/atom, approaching DFT precision for many systems.

35. Case: Elasticity under limited data

  • Elastic constants are sparse in databases.
  • GNN transfer learning (from Energy \(\to\) Moduli) significantly improves accuracy.

36. Feeding Representation Learning (Unit 9)

  • Unit 9 will show how to use GNN encoders as feature extractors.
  • Discard the property head \(\to\) Latent vector \(\mathbf{z}\) represents the material structure.

37. Connection to Unit 6 (Local Environments)

  • GNN message passing is an iterative refinement of the local environments discussed in Unit 6.
  • Each layer “looks” one bond further away.

38. Consolidation: Multi-modal crystal graphs

  • Future direction: Combining crystal graphs with text (literature) and images (micrographs).
  • Nodes can represent not just atoms, but entire phases or grains.

39. Advanced: Equivariant GNNs (Tensor Field Networks)

  • Predict vectors (forces) and tensors (elasticity) that rotate correctly with the input structure.
  • Higher mathematical complexity, higher physical fidelity.

40. Scalability to complex microstructures

  • Can we graph an entire polycrystal?
  • Use hierarchical graphs: Atomic scale \(\to\) Grain scale \(\to\) Component scale.

41. Interpreting GNN “Chemical Intuition”

  • Do the learned embeddings \(\mathbf{z}\) match the periodic table?
  • Plotting t-SNE of learned atomic features usually recovers Mendeleev’s structure.
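
A short scikit-learn sketch with a placeholder embedding matrix:

```python
import numpy as np
from sklearn.manifold import TSNE

Z = np.random.rand(89, 64)    # placeholder: one row per element
Z2 = TSNE(n_components=2, perplexity=15).fit_transform(Z)
# Scatter-plot Z2 colored by group/period to check for periodic-table trends.
```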

42. Active Learning with GNNs

  • Step 1: Predict on 100k candidates.
  • Step 2: Select top candidates with high uncertainty.
  • Step 3: Run DFT validation.
  • Step 4: Retrain GNN.
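
A hedged sketch of one round; run_dft_stub and the prediction functions are stand-ins for project-specific components:

```python
import numpy as np

def run_dft_stub(x):
    return float(np.sum(x))  # stands in for an expensive DFT call

def active_learning_round(predict_fns, pool, budget=5):
    preds = np.stack([f(pool) for f in predict_fns])  # Step 1: predict
    mean, std = preds.mean(axis=0), preds.std(axis=0)
    picks = np.argsort(mean + std)[-budget:]          # Step 2: value + uncertainty
    labels = [run_dft_stub(pool[i]) for i in picks]   # Step 3: validate
    return picks, labels                              # Step 4: retrain on these

pool = np.random.rand(1000, 8)                        # candidate features
fns = [lambda X, w=w: X @ w for w in np.random.rand(3, 8)]
picks, labels = active_learning_round(fns, pool)
```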

43. Deployment in Automated Labs

  • GNNs provide the “brain” for self-driving laboratories.
  • Real-time structure-property feedback loop.

44. The limit of GNNs

  • GNNs are not physics solvers; they are interpolators.
  • They cannot predict truly new physics (e.g. superconductivity) without training data.

45. Exercise setup: Graph Construction

  • Goal: Build a crystal graph from a CIF file using pymatgen or ase.
  • Dataset: A subset of the Materials Project (e.g. ABX3 perovskites).

46. Exercise Task 1: The Pipeline

  • Load CIF \(\to\) Build Graph \(\to\) Visualize Neighbors.
  • Verify PBC implementation.
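
A starting-point sketch using ase (the file name and cutoff are assumptions); neighbor_list handles periodic images, so edges crossing the cell boundary appear automatically:

```python
import numpy as np
from ase.io import read
from ase.neighborlist import neighbor_list

atoms = read("example.cif")                        # assumed local CIF
i, j, d = neighbor_list("ijd", atoms, cutoff=5.0)  # PBC-aware neighbors

# PBC sanity check: no atom should end up disconnected, not even those
# sitting directly on the cell boundary.
assert len(np.unique(i)) == len(atoms), "disconnected atoms -> check PBC"
print(f"{len(atoms)} atoms, {len(i)} directed edges")
```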

47. Exercise Task 2: Ablation

  • Compare regression accuracy with \(r_c = 4.0\) Å vs \(r_c = 8.0\) Å.
  • Plot error vs coordination number.

48. Exercise Task 3: Failure Analysis

  • Identify one sample with very high error.
  • Is it a rare chemistry? A strange unit cell? A shortcut learning victim?

49. Unit summary: 10 exam statements

  1. Crystals are modeled as graphs to learn flexible, permutation-invariant representations.
  2. PBC enforcement is mandatory to capture the infinite nature of crystal lattices.
  3. RBF expansion of distances allows NNs to resolve fine structural differences.
  4. Message passing iteratively aggregates neighbor information to build global context.
  5. CGCNN uses gated convolutions; MEGNet adds a global state variable.
  6. Readout choice (sum vs mean) must align with property intensive/extensive nature.
  7. Over-smoothing occurs when excessive depth makes node features indistinguishable.
  8. GNNs are typically more data-efficient than fixed-descriptor MLPs for novel structures.
  9. Shortcut learning occurs when the model relies on spurious cell-size correlations.
  10. GNN ranking metrics (Spearman) are often more critical for discovery than MSE.

50. References + Unit 6 Bridge

  • Next Unit: Local Atomic Environments and many-body descriptors.
  • Read: Sandfeld Ch. 2.2, Neuer Ch. 4.5.
  • Exercise: Submit your GNN-construction notebook by next week.

Sandfeld, Stefan et al. 2024. Materials Data Science. Springer.

Example Notebook

Week 5: Descriptors + Regression — ChemicalElementsDataset