FAU Erlangen-Nürnberg
How do we hand a crystal to a regression model?
Today’s claim.
Recap — what we already have
Forward pointer
Today — Unit 6 in one line
By the end of 90 minutes, you can:
Four tiers of structural descriptor

The construction
Worked example: \(\mathrm{Li}_x \mathrm{Ni}_y \mathrm{Co}_z \mathrm{Mn}_w \mathrm{O}_2\)
For each elemental property, take the statistics {wmean, wstd, range, max, min}.

What matminer adds on top of Magpie
- `ElementProperty` (Magpie itself, plus alternative tables: Deml, Pymatgen, Slater).
- `OxidationStates`, `IonProperty`, `BandCenter`, `Stoichiometry`.
- Structure-based featurizers (when a pymatgen `Structure` object is supplied): coordination statistics, bond fractions, X-ray diffraction patterns, …
- Every featurizer exposes `feature_labels()` and a `featurize(...)` API.

Why matminer is the tier-1 working tool
- `MultipleFeaturizer([...])` stacks generators into one feature matrix.
- Feature names such as `MagpieData mean Electronegativity` survive into SHAP plots and feature-importance tables.

The empirical picture
Why it works
The pedagogical lesson, said aloud: every materials-ML project should report a composition-only baseline before any structure-aware model. If structure cannot beat composition, the bottleneck is the data, the target, or the split — not the representation.
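A minimal sketch of such a composition-only baseline, assuming matminer, pymatgen, and scikit-learn are installed; the formulas and targets below are placeholders, not real data:

```python
# Composition-only baseline (sketch): Magpie statistics -> random forest.
import numpy as np
from pymatgen.core import Composition
from matminer.featurizers.composition import ElementProperty
from sklearn.ensemble import RandomForestRegressor

formulas = ["LiCoO2", "LiNiO2", "LiMn2O4"]   # placeholder compositions
y = np.array([0.10, 0.25, 0.31])             # placeholder targets

ep = ElementProperty.from_preset("magpie")   # statistics over elemental-property tables
X = np.array([ep.featurize(Composition(f)) for f in formulas])

model = RandomForestRegressor(random_state=0).fit(X, y)
print(len(ep.feature_labels()), "Magpie features")  # labels survive into importance plots
```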
Polymorph blindness
Defects, doping, disorder
Decision rule. If the science depends on which polymorph, which defect, or how an alloy is ordered — composition alone is the wrong tool. Climb the ladder.
The total RDF \(g(r)\)
The partial RDF \(g_{AB}(r)\)
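A sketch of both quantities using ASE's `Analysis.get_rdf` (the input file is hypothetical, and the `elements` argument is assumed to select the species pair for the partial RDF):

```python
# Total and partial RDFs via ASE (sketch; file name hypothetical).
from ase.io import read
from ase.geometry.analysis import Analysis

atoms = read("structure.cif")
ana = Analysis(atoms)
g_total = ana.get_rdf(rmax=6.0, nbins=120)[0]                      # g(r)
g_NiO = ana.get_rdf(rmax=6.0, nbins=120, elements=["Ni", "O"])[0]  # partial g_AB(r)
```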
The descriptors
In matminer: `SiteStatsFingerprint`, `BondFractions`, `CrystalNNFingerprint`.

What this rung gains over the RDF
A fixed-length vector along the `n_bins` axis.

What it still discards
The recurring failure mode of tiers 1 and 2
The motif argument
The right question to ask of any descriptor: if the property of interest depends on a motif present at 1% of sites, does the descriptor preserve the 1% motif, or average it into the 99%? Tiers 1–2 average; tier 3 preserves.
Descriptor decision tree
The objects we keep
For atom \(i\) at position \(\mathbf{r}_i\) with species \(Z_i\):
We deliberately do not keep the absolute position of \(i\) in the cell.
Two ways to define \(\mathcal{N}(i)\)
The formula and the count
Smooth cutoffs
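The exact cutoff on the original slide is not preserved in these notes; a standard choice, and the one assumed wherever \(f_c\) appears below, is Behler's cosine cutoff:

\[
f_c(r) =
\begin{cases}
\tfrac{1}{2}\left[\cos\!\left(\dfrac{\pi r}{r_c}\right) + 1\right], & r \le r_c,\\[4pt]
0, & r > r_c,
\end{cases}
\]

which decays smoothly to zero at \(r_c\) with zero slope, so descriptors change continuously as atoms cross the cutoff.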
The geometry
The trap and the fix
ASE `NeighborList`, pymatgen `Structure.get_neighbors`, MDAnalysis `distance_array(box=...)`. Use them.

The first audit on any local-descriptor pipeline: plot the coordination number for ten boundary atoms and ten interior atoms, and check that they look chemically equivalent. If they don't, the periodic images are wrong, not the descriptor.
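A sketch of that audit with ASE's PBC-aware pair list (the input file is hypothetical, and the 0.05 fractional-coordinate margin is an arbitrary choice of "boundary"):

```python
# PBC audit (sketch): equivalent boundary and interior sites must show
# the same coordination number.
import numpy as np
from ase.io import read
from ase.neighborlist import neighbor_list

atoms = read("supercell.cif")                   # hypothetical input
i, j = neighbor_list("ij", atoms, cutoff=3.0)   # PBC-aware neighbour pairs
coord = np.bincount(i, minlength=len(atoms))

frac = atoms.get_scaled_positions()
boundary = np.any((frac < 0.05) | (frac > 0.95), axis=1)
print("boundary:", coord[boundary][:10])
print("interior:", coord[~boundary][:10])
```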
A useful local descriptor \(\phi(\text{environment of } i)\) must satisfy:
If a descriptor fails any of these, the model learns file conventions or noise instead of physical structure. Invariance is not pedantry — it is the difference between a descriptor that generalises and a descriptor that overfits to coordinate-system accidents.
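These invariances are cheap to unit-test. A sketch, where `phi` is any candidate descriptor mapping (positions, species) to a pooled, fixed-length vector:

```python
# Invariance unit test (sketch). `phi` is a hypothetical pooled descriptor:
# phi(pos: (N, 3) array, spec: (N,) array) -> fixed-length vector.
import numpy as np

def check_invariances(phi, pos, spec, rng=np.random.default_rng(0)):
    ref = phi(pos, spec)
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal transform
    perm = rng.permutation(len(pos))
    assert np.allclose(ref, phi(pos + rng.normal(size=3), spec))  # translation
    assert np.allclose(ref, phi(pos @ Q.T, spec))                 # rotation / reflection
    assert np.allclose(ref, phi(pos[perm], spec[perm]))           # atom relabelling
```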
Three failures, in one slide
A regression model fed raw Cartesians has to learn all three invariances from the data — which it almost never does cleanly, especially with small materials datasets.
The “data augmentation will save us” objection
The mathematical object
For atom \(i\), the local environment is the set \[ \mathcal{E}_i = \{(Z_j,\, \mathbf{r}_j - \mathbf{r}_i) : j \in \mathcal{N}(i)\}, \] plus the central species \(Z_i\) and (optionally) the cell metric.
The descriptor \(\phi\) is any function \(\mathcal{E}_i \mapsto \mathbb{R}^d\) that respects the five invariances of slide 16.
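As plain data, \(\mathcal{E}_i\) is just the central species plus (species, displacement) pairs; a minimal sketch with illustrative names:

```python
# The local environment E_i as plain data (sketch; names are illustrative).
from dataclasses import dataclass
import numpy as np

@dataclass
class LocalEnv:
    Z_center: int        # central species Z_i
    Z_neigh: np.ndarray  # (M,) neighbour species Z_j
    disp: np.ndarray     # (M, 3) displacements r_j - r_i, PBC-corrected
```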
The taxonomy of \(\phi\)
\(r_c\) is part of the model
Heuristics, not formulas
Per-atom coordination
\[ N_i(r_c) = \sum_{j \neq i} \mathbb{1}[r_{ij} < r_c]. \]
Species-resolved coordination
\[ N_{i, A}(r_c) = \sum_{j \neq i,\, Z_j = A} \mathbb{1}[r_{ij} < r_c]. \]
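Both counts in NumPy, assuming a pre-computed, PBC-corrected distance matrix:

```python
# Coordination counts (sketch). `dists` is an (N, N) PBC-corrected
# distance matrix; `species` an (N,) array of element symbols.
import numpy as np

def coordination(dists, r_c):
    """N_i(r_c): neighbours within the cutoff, self excluded."""
    return np.sum((dists > 0) & (dists < r_c), axis=1)

def coordination_by_species(dists, species, r_c, A):
    """N_{i,A}(r_c): neighbours of species A within the cutoff."""
    mask = (dists > 0) & (dists < r_c) & (np.asarray(species)[None, :] == A)
    return mask.sum(axis=1)
```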
Same count, different shape
Coordination counts neighbours; it does not arrange them.
Why this matters in materials
To recover shape, we need distances and angles.
A worked microscopic example
The Jahn–Teller example
Per-atom bond-length features
For atom \(i\) with neighbours \(\mathcal{N}(i)\):
For per-pair: same statistics restricted to \(Z_j = A\).
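A sketch of the per-atom statistics, assuming `neigh_dists[i]` already holds the distances \(r_{ij}\) for \(j \in \mathcal{N}(i)\):

```python
# Per-atom bond-length statistics (sketch).
import numpy as np

def bond_length_stats(neigh_dists):
    """One row [mean, std, min, max] of r_ij per atom."""
    return np.array([[np.mean(d), np.std(d), np.min(d), np.max(d)]
                     for d in neigh_dists])
```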
What these features see
Per-atom angular features
For atom \(i\) and pairs of neighbours \(j, k \in \mathcal{N}(i)\):
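A sketch of the angle computation for one atom, assuming `vecs` holds the displacement vectors \(\mathbf{r}_j - \mathbf{r}_i\) to its neighbours:

```python
# Bond angles at one atom (sketch). `vecs` has shape (M, 3).
import numpy as np
from itertools import combinations

def cos_angles(vecs):
    """cos(theta_jik) for every neighbour pair (j, k) of the central atom i."""
    u = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return np.array([u[a] @ u[b] for a, b in combinations(range(len(u)), 2)])
```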
What these features see
Setup
Tier-3 separation
Run `CrystalNNFingerprint` (slide 10) and `BondFractions` (slide 6) on it.

Setup
Mean vs histogram pooling, contrasted
Lesson: the pooling rule decides whether minority motifs survive. Mean pooling washes them out; histogram pooling preserves them. Match the pooling to the mechanism — see §E.
The construction
Properties
Implementations: `VoronoiNN`, and matminer's `CrystalNN` (refines Voronoi by chemistry).

Advantages
Caveats
Pragmatic default: combine Voronoi with a chemistry-aware refinement (`CrystalNN`) and a face-area threshold. Pure radial cutoffs and pure Voronoi are both edge cases of the more useful hybrid.
Radial term
\[ G_i^{\text{rad}} = \sum_j \exp[-\eta\,(r_{ij} - R_s)^2]\, f_c(r_{ij}). \]
Angular term
\[ G_i^{\text{ang}} = 2^{1-\zeta} \sum_{j,k} (1 + \lambda \cos\theta_{jik})^\zeta e^{-\eta'(r_{ij}^2 + r_{ik}^2 + r_{jk}^2)}\, f_c(r_{ij})f_c(r_{ik})f_c(r_{jk}). \]
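A dscribe sketch of both function families; the parameter grids are illustrative, not tuned, and the keyword names follow recent dscribe releases (older versions used `rcut`):

```python
# ACSF via dscribe (sketch; parameter grids illustrative only).
from ase.build import molecule
from dscribe.descriptors import ACSF

acsf = ACSF(
    species=["H", "O"],
    r_cut=6.0,                                # `rcut` in older dscribe
    g2_params=[[0.5, 0.0], [0.5, 2.0]],       # [eta, R_s] radial probes
    g4_params=[[0.01, 1, 1], [0.01, 2, -1]],  # [eta', zeta, lambda] angular probes
)
G = acsf.create(molecule("H2O"))              # one row of G-values per atom
```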
Each function asks one question
The descriptor as a question stack
Strengths
Implementations: `dscribe`, RuNNer, n2p2.

Weaknesses
Step 1: smooth the neighbourhood
Replace the discrete neighbour list with a Gaussian density:
\[ \rho_i(\mathbf{r}) = \sum_{j \in \mathcal{N}(i)} \exp\!\left(-\frac{|\mathbf{r} - \mathbf{r}_{ij}|^2}{2\sigma^2}\right) f_c(|\mathbf{r}_{ij}|). \]
Step 2: expand on a basis
\[ \rho_i^{(Z)}(\mathbf{r}) = \sum_{n,l,m} c^{(Z)}_{nlm}\, g_n(r)\, Y_l^m(\hat{\mathbf{r}}). \]
The problem with raw \(c_{nlm}\)
The power spectrum
\[ p^{(Z_1, Z_2)}_{nn'l} = \pi \sqrt{\frac{8}{2l+1}} \sum_m c^{(Z_1)*}_{nlm}\, c^{(Z_2)}_{n'lm}. \]
The normalised SOAP kernel
\[ k(\mathcal{E}_i, \mathcal{E}_j) = \left[ \frac{\mathbf{p}_i \cdot \mathbf{p}_j}{\sqrt{(\mathbf{p}_i \cdot \mathbf{p}_i)(\mathbf{p}_j \cdot \mathbf{p}_j)}} \right]^\zeta. \]
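A dscribe sketch that produces the power-spectrum vectors \(\mathbf{p}_i\) and evaluates the normalised kernel; hyperparameters are illustrative, keyword names per recent dscribe (older versions used `rcut`/`nmax`/`lmax`):

```python
# SOAP power spectra and the normalised kernel (sketch).
import numpy as np
from ase.build import bulk
from dscribe.descriptors import SOAP

soap = SOAP(species=["Si"], r_cut=5.0, n_max=8, l_max=6, sigma=0.5, periodic=True)
p = soap.create(bulk("Si", "diamond", a=5.43))   # one p-vector per atomic environment

def soap_kernel(p1, p2, zeta=2):
    """Normalised dot-product kernel between two environments."""
    return (p1 @ p2 / np.sqrt((p1 @ p1) * (p2 @ p2))) ** zeta

print(soap_kernel(p[0], p[1]))
```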
What “similarity” means here
SOAP pipeline
The pipeline
Where the cost lives
The honest comparison: ACSF and SOAP capture similar information at similar cost when matched in expressivity. The choice is governed by interpretability vs systematicity, not by raw accuracy.
Reach for ACSF when
Reach for SOAP when
The recommendation for tier-3 baselines: start with `CrystalNN`-aggregated coordination + bond-length / bond-angle stats (slides 22–23), and reach for ACSF or SOAP only when the simpler tier-3 features fail. Most matminer-shaped pipelines never need to go further.
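That recommended starting point, as a matminer sketch (the structure is a placeholder):

```python
# Recommended tier-3 baseline (sketch): CrystalNN site fingerprints
# pooled into per-structure statistics.
from pymatgen.core import Lattice, Structure
from matminer.featurizers.site import CrystalNNFingerprint
from matminer.featurizers.structure import SiteStatsFingerprint

struct = Structure(Lattice.cubic(4.2), ["Na", "Cl"],
                   [[0, 0, 0], [0.5, 0.5, 0.5]])   # placeholder CsCl-type cell

ssf = SiteStatsFingerprint(CrystalNNFingerprint.from_preset("ops"),
                           stats=("mean", "std_dev"))
features = ssf.featurize(struct)
print(len(ssf.feature_labels()), "pooled features")
```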
The setup
The pooling rule encodes a scientific assumption
The construction
\[ \Phi^{\text{mean}} = \frac{1}{N} \sum_{i=1}^{N} \phi_i. \]
Where it works, where it fails
The construction
\[ \Phi^{\text{sum}} = \sum_{i=1}^{N} \phi_i. \]
Where it works
The construction
For each component \(\phi_i^{(d)}\), build a histogram over a fixed binning:
\[ \Phi^{\text{hist}, (d)}_b = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\phi_i^{(d)} \in B_b]. \]
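A sketch contrasting mean and histogram pooling on a toy 99%/1% coordination mixture (bin edges illustrative); the minority motif survives only in the histogram:

```python
# Mean vs histogram pooling (sketch). `phi` holds one row per atom.
import numpy as np

def mean_pool(phi):
    return phi.mean(axis=0)

def hist_pool(phi, edges):
    """Per-component normalised histogram: minority motifs keep their own bins."""
    return np.concatenate([np.histogram(phi[:, d], bins=edges)[0] / len(phi)
                           for d in range(phi.shape[1])])

phi = np.vstack([np.full((99, 1), 6.0), [[4.0]]])  # 99% CN = 6, one CN = 4 site
print(mean_pool(phi))                              # 5.98: the motif is washed out
print(hist_pool(phi, edges=np.arange(0, 9)))       # 0.01 sits alone in the CN = 4 bin
```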
Where it earns its keep
The construction
\[ \Phi^{\text{spec}, (Z)} = \text{pool}\left(\{\phi_i : Z_i = Z\}\right). \]
Why this is often the right pooling
(matminer: `SiteStatsFingerprint(stats=[...], group_by=composition)`.)

Hand-built pooling (this lecture)
Learned pooling (GNNs, Unit 7)
The pragmatic decision: for small / medium datasets (\(\sim 10^3\)–\(10^4\) materials), hand-built pooling on tier-3 descriptors is competitive with learned tier-4 pooling. For larger datasets and harder targets, the learned aggregator pulls ahead. Both are “correct”; the choice is data-budget driven.
The trap
The audit
The traps
The audit
The diagnostic
What the swings mean
The trap
Examples
Mitigation: if the property depends on long-range structure, augment tier-3 with global features (cell parameters, dimensionality, framework descriptors) or move to tier 4 (GNNs).
The trap
Diagnosis and mitigation
Where local descriptors stop being enough
What to do
Before any tier-3 descriptor enters a regression or classification model:
The discipline: these eight checks matter more than the choice between ACSF and SOAP. A tier-3 pipeline that fails three of them is worse than a tier-1 pipeline that passes all of them.
The five rungs, in one sentence each
When to climb a rung
Unit 8 — Regression and generalisation
The through-line for the rest of the semester
Exercise (90 min, this afternoon)
- Add `SiteStatsFingerprint` features. Retrain. Report the gap.
- Compute SOAP descriptors with `dscribe`. Mean-pool, then histogram-pool. Train kernel ridge regression. Report MAE.

Reading for next week
Next week (Unit 8): baselines, split design, learning curves, leakage, OOD — the trust audit on the regression you got today.

© Philipp Pelz - Materials Genomics