Materials Genomics Unit 6: Local Atomic Environments
Prof. Dr. Philipp Pelz
FAU Erlangen-Nürnberg
The central modeling move is to represent a material not only by one global fingerprint, but by the environments around its atomic sites. {.fragment}
This matters because many materials properties are controlled by local coordination, local chemistry, and local distortions. {.fragment}
02. Learning outcomes
By the end of this unit, students should be able to:

- explain why local environments are an effective representation layer between chemistry and machine learning {.fragment}
- define the invariance requirements of a useful atom-centered descriptor {.fragment}
- compare simple geometric descriptors with ACSF- and SOAP-like representations {.fragment}
- explain how local descriptors are pooled into material-level vectors {.fragment}
- identify failure modes caused by cutoffs, parser mistakes, defects, and missing long-range physics {.fragment}
03. Recap: where this fits in the course
Unit 5 introduced graph-based representations, where atoms are nodes and bonds or neighbor relations are edges. {.fragment}
Unit 6 is more classical and more interpretable: we keep the focus on neighborhoods, but we summarize them into engineered descriptors. {.fragment}
Unit 7 will start from these descriptors and ask whether predictive performance actually generalizes. {.fragment}
04. Why local information matters
A single composition vector cannot tell whether a cation sits in tetrahedral or octahedral coordination. {.fragment}
A global crystal label may ignore defects, local strain, or minority motifs. {.fragment}
Yet many target properties depend precisely on these local features: diffusion barriers, catalytic site activity, defect energetics, local magnetism, and parts of elastic response. {.fragment}
05. Representation hierarchy
Global descriptors summarize the whole crystal in one object. {.fragment}
Local descriptors summarize the neighborhood around each site. {.fragment}
Learned graph representations infer useful local and semi-local features automatically. {.fragment}
Local atomic environments are attractive because they preserve physical interpretability while remaining compatible with standard machine-learning pipelines.
For a central atom \(i\), a local environment usually contains:

- the atomic species \(Z_i\) {.fragment}
- the neighboring species \(Z_j\) {.fragment}
- relative distances \(r_{ij}\) {.fragment}
- optional angular relations \(\theta_{jik}\) {.fragment}
The neighborhood can be defined by a radial cutoff or by a geometric rule such as Voronoi adjacency.
07. Neighbor construction under a radial cutoff
The simplest definition is: atom \(j\) belongs to the environment of atom \(i\) if \(r_{ij} < r_c\). {.fragment}
This immediately gives the coordination number {.fragment}
\[N_i(r_c) = \sum_{j \neq i} \mathbb{1}[r_{ij} < r_c]\]
The same formula already shows a major issue: the descriptor depends on the modeling choice \(r_c\). {.fragment}
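The cutoff dependence can be made concrete with a minimal sketch. The function name `coordination_number` and the cluster geometry below are illustrative assumptions, and periodic images are deliberately ignored here.

```python
import math

def coordination_number(positions, i, r_c):
    """Count atoms j != i with r_ij < r_c (hard cutoff, no periodic images)."""
    return sum(
        1
        for j, p in enumerate(positions)
        if j != i and math.dist(positions[i], p) < r_c
    )

# Hypothetical cluster (coordinates in Å): a central atom at the origin,
# four in-plane neighbors at 2.0 Å and two axial neighbors at 2.6 Å.
cluster = [
    (0.0, 0.0, 0.0),
    (2.0, 0.0, 0.0), (-2.0, 0.0, 0.0),
    (0.0, 2.0, 0.0), (0.0, -2.0, 0.0),
    (0.0, 0.0, 2.6), (0.0, 0.0, -2.6),
]

n_small = coordination_number(cluster, 0, r_c=2.3)  # only the four short bonds
n_large = coordination_number(cluster, 0, r_c=3.0)  # all six neighbors
print(n_small, n_large)
```

The same site is reported as four-coordinate or six-coordinate depending only on \(r_c\).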
08. Periodic boundary conditions are not optional
In a crystal, the local environment must be built with periodic images. {.fragment}
Otherwise, atoms near the unit-cell boundary lose neighbors that are physically present just across the boundary. {.fragment}
A wrong periodic treatment creates fake low-coordination sites and contaminates the dataset before any machine learning begins. {.fragment}
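The effect of missing periodic images can be sketched as follows. The helper `periodic_coordination`, the bcc-like two-atom cell, and the lattice parameter of 3.0 Å are assumptions chosen for illustration. The search covers only the 27 cells with shifts in {-1, 0, 1}, which is sufficient only if \(r_c\) does not exceed the shortest cell height.

```python
import itertools
import math

def periodic_coordination(frac_coords, lattice, i, r_c, use_images=True):
    """Count neighbors of atom i within r_c, optionally with periodic images.

    frac_coords: fractional coordinates (u, v, w) of the atoms in the cell
    lattice:     three lattice vectors as rows, in Å
    Assumes r_c is no larger than the shortest cell height, so that
    shifts in {-1, 0, 1} along each lattice vector are sufficient.
    """
    def to_cart(f):
        return tuple(sum(f[k] * lattice[k][d] for k in range(3)) for d in range(3))

    if use_images:
        shifts = list(itertools.product((-1, 0, 1), repeat=3))
    else:
        shifts = [(0, 0, 0)]

    center = to_cart(frac_coords[i])
    count = 0
    for j, f in enumerate(frac_coords):
        for s in shifts:
            if j == i and s == (0, 0, 0):
                continue  # skip the atom itself, but keep its periodic images
            image = to_cart((f[0] + s[0], f[1] + s[1], f[2] + s[2]))
            if math.dist(center, image) < r_c:
                count += 1
    return count

# Hypothetical bcc-like cell: a = 3.0 Å, atoms at the corner and the body center.
a = 3.0
lattice = [(a, 0.0, 0.0), (0.0, a, 0.0), (0.0, 0.0, a)]
frac = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)]

with_images = periodic_coordination(frac, lattice, 0, r_c=2.8)
without_images = periodic_coordination(frac, lattice, 0, r_c=2.8, use_images=False)
print(with_images, without_images)
```

With images the corner atom has its eight nearest neighbors; without them it appears to have one, which is exactly the fake low-coordination site described above.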
09. Local descriptors need symmetry discipline
A useful local descriptor should satisfy:

- translation invariance {.fragment}
- rotation invariance or equivariance {.fragment}
- permutation invariance over identical neighbors {.fragment}
- continuity under small atomic displacements {.fragment}
- sensitivity to chemical identity {.fragment}
If a descriptor fails one of these tests, the model may learn file conventions or noise instead of structure.
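The continuity requirement can be illustrated with a short sketch. A hard neighbor count jumps by one when an atom crosses \(r_c\). One common remedy, used for example in ACSF-type descriptors, is to weight each neighbor with a cosine cutoff \(f_c(r) = \tfrac{1}{2}\left[\cos(\pi r / r_c) + 1\right]\) for \(r < r_c\) and zero otherwise. The function names and distances below are illustrative assumptions.

```python
import math

def hard_count(distances, r_c):
    """Integer neighbor count with a sharp cutoff."""
    return sum(1 for r in distances if r < r_c)

def cosine_cutoff(r, r_c):
    """Smooth weight that decays from 1 at r = 0 to 0 at r = r_c."""
    if r >= r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0)

def soft_count(distances, r_c):
    """Smoothly weighted neighbor sum (no longer an integer)."""
    return sum(cosine_cutoff(r, r_c) for r in distances)

r_c = 3.0
before = [2.0, 2.0, 2.0, 2.99]  # fourth neighbor just inside the cutoff
after = [2.0, 2.0, 2.0, 3.01]   # same neighbor displaced by 0.02 Å

hard_jump = hard_count(before, r_c) - hard_count(after, r_c)
soft_jump = soft_count(before, r_c) - soft_count(after, r_c)
print(hard_jump, soft_jump)
```

The hard count changes by a full unit for a 0.02 Å displacement, while the smooth sum changes only marginally. The price is that the smooth quantity is no longer a literal coordination number.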
10. Why raw Cartesian coordinates are a poor baseline
Raw coordinates depend on where the origin is placed. {.fragment}
They depend on how the cell axes are oriented. {.fragment}
They change if identical atoms are listed in a different order. {.fragment}
This is why classical materials ML rarely feeds raw atomic coordinates directly into a regression model without additional symmetry-aware processing.
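The three problems with raw coordinates can be checked numerically. In this sketch the same hypothetical environment is rotated about the z axis, shifted, and listed in reverse order. The raw coordinate list changes, while a sorted list of distances to the central atom does not. The sorted-distance descriptor is only an illustration of invariance; it discards all angular information.

```python
import math

def sorted_distances(center, neighbors):
    """Distances to the central atom, sorted to remove ordering dependence."""
    return sorted(math.dist(center, n) for n in neighbors)

def rotate_z(p, angle):
    """Rotate point p about the z axis by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1], p[2])

def translate(p, t):
    """Shift point p by vector t."""
    return tuple(a + b for a, b in zip(p, t))

# Hypothetical environment (Å): three neighbors at distinct distances.
center = (0.0, 0.0, 0.0)
neighbors = [(1.9, 0.0, 0.0), (0.0, 2.1, 0.0), (0.0, 0.0, 2.4)]

# Same physical environment in a different file convention.
shift = (5.0, -3.0, 1.5)
center2 = translate(rotate_z(center, 0.7), shift)
neighbors2 = [translate(rotate_z(n, 0.7), shift) for n in reversed(neighbors)]

raw_changed = neighbors != neighbors2
d1 = sorted_distances(center, neighbors)
d2 = sorted_distances(center2, neighbors2)
descriptor_same = all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(d1, d2))
print(raw_changed, descriptor_same)
```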
11. Coordination number as a first descriptor
The coordination number is interpretable and chemically meaningful. {.fragment}
It distinguishes many familiar motifs immediately: isolated atoms, linear coordination, tetrahedral sites, octahedral sites, close-packed environments. {.fragment}
It is also very cheap to compute. {.fragment}
Its limitation is that different shapes can share the same count.
12. The loss of geometry in pure coordination counts
A tetrahedral site and a square-planar site both have coordination four. {.fragment}
A distorted octahedron and a near-perfect octahedron both have coordination six. {.fragment}
Therefore, counting neighbors is often necessary but rarely sufficient. {.fragment}
To recover shape information, we need distances, angles, or more expressive local densities.
13. Bond-length distributions
The set \(\{r_{ij}\}\) captures whether neighbors are uniformly arranged or split into short and long bonds. {.fragment}
Mean bond length and bond-length variance are often useful summary statistics. {.fragment}
Histograms of \(r_{ij}\) can separate compact, expanded, and distorted environments. {.fragment}
This is one step closer to local chemistry because it recognizes strain and distortion, not only connectivity.
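A minimal sketch of such summary statistics, using invented bond lengths for a regular and an elongated six-coordinate site. Both sites have the same coordination number and the same mean bond length, so only the variance separates them.

```python
import statistics

def bond_length_summary(distances):
    """Return (mean, population variance) of the bond lengths of one site."""
    return statistics.fmean(distances), statistics.pvariance(distances)

# Hypothetical bond lengths in Å (illustrative values, not from a real structure).
regular = [2.00] * 6                 # six equal bonds
elongated = [1.95] * 4 + [2.10] * 2  # four short and two long bonds

mean_reg, var_reg = bond_length_summary(regular)
mean_elo, var_elo = bond_length_summary(elongated)
print(mean_reg, var_reg)
print(mean_elo, var_elo)
```

Here the variance of the elongated site is about 0.005 Å², while the regular site has zero variance.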
14. Bond-angle distributions
Angular information tells us how neighbors are arranged around the central atom. {.fragment}
Tetrahedral, octahedral, trigonal-planar, and close-packed motifs become much easier to separate once angle statistics are included. {.fragment}
In practice, angle-based descriptors help distinguish environments that have the same coordination count but different geometry. {.fragment}
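As a sketch of this point, the following code computes all angles \(\theta_{jik}\) for an ideal tetrahedral and an ideal square-planar site. Both have four neighbors, but their angle lists differ. The function name `bond_angles` and the idealized coordinates are assumptions for illustration.

```python
import itertools
import math

def bond_angles(center, neighbors):
    """Return all j-i-k angles (degrees) around the central atom, sorted."""
    angles = []
    for a, b in itertools.combinations(neighbors, 2):
        va = [x - c for x, c in zip(a, center)]
        vb = [x - c for x, c in zip(b, center)]
        cos_t = sum(x * y for x, y in zip(va, vb)) / (math.hypot(*va) * math.hypot(*vb))
        cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding outside [-1, 1]
        angles.append(math.degrees(math.acos(cos_t)))
    return sorted(angles)

center = (0.0, 0.0, 0.0)
tetrahedral = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
square_planar = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]

angles_tet = bond_angles(center, tetrahedral)
angles_sqp = bond_angles(center, square_planar)
print([round(t, 2) for t in angles_tet])
print([round(t, 2) for t in angles_sqp])
```

The tetrahedral site gives six angles of about 109.47°, the square-planar site four angles of 90° and two of 180°.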
15. Example: tetrahedral versus octahedral coordination
Imagine two oxide materials with similar compositions. {.fragment}
If one cation occupies tetrahedral sites and the other octahedral sites, the local chemistry can be very different. {.fragment}
A descriptor based only on stoichiometry misses this distinction. {.fragment}
A descriptor with bond angles or SOAP similarity captures it naturally. {.fragment}
16. Voronoi neighborhoods
Instead of using a fixed cutoff, one can define neighbors geometrically via a Voronoi tessellation. {.fragment}
Two atoms are neighbors if their Voronoi cells share a face. {.fragment}
This adapts more naturally to local density variation than one fixed radius. {.fragment}
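Voronoi approaches reduce arbitrariness, but they can become unstable for noisy or highly distorted structures: a small atomic displacement can create or remove a tiny cell face and thereby change the neighbor list discontinuously.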
17. Voronoi advantages and caveats
Advantages
Less dependence on a hand-chosen \(r_c\) {.fragment}
These choices matter when properties depend on heterogeneity rather than only the average site.
30. Worked example: a defect-sensitive material
Consider a crystal where most sites are regular, but a small population near vacancies is highly distorted. {.fragment}
The average descriptor changes only a little. {.fragment}
The tail of the environment distribution changes strongly. {.fragment}
This is a case where histogram-based or defect-aware pooling is more faithful than mean pooling.
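The worked example can be put into numbers with a small sketch. Each site is described by one hypothetical scalar distortion value; the values, bin edges, and function names are assumptions chosen for illustration.

```python
def mean_pool(values):
    """Average the per-site feature over all sites."""
    return sum(values) / len(values)

def histogram_pool(values, edges):
    """Fraction of sites per bin [edges[k], edges[k+1]).

    The last bin also includes its right edge. Values outside the
    edges are ignored, so choose edges that cover the data.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(counts)):
            is_last = k == len(counts) - 1
            if edges[k] <= v < edges[k + 1] or (is_last and v == edges[-1]):
                counts[k] += 1
                break
    return [c / len(values) for c in counts]

# Hypothetical per-site distortion values for 100 sites.
pristine = [0.01] * 100
defective = [0.01] * 98 + [0.50] * 2  # two strongly distorted sites near vacancies

edges = [0.0, 0.1, 0.3, 1.0]
mean_shift = mean_pool(defective) - mean_pool(pristine)
hist_pristine = histogram_pool(pristine, edges)
hist_defective = histogram_pool(defective, edges)
print(mean_shift)
print(hist_pristine, hist_defective)
```

The mean moves by less than 0.01, while the highest-distortion bin goes from empty to 2 % of the sites, which is the signal a defect-sensitive property would need.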
31. Local descriptors for phase and motif discrimination
Local descriptors are often excellent for classifying structure motifs. {.fragment}
They can separate tetrahedral and octahedral environments, ordered and distorted variants, or defect-rich and defect-poor local neighborhoods. {.fragment}
Visualizing descriptor space with PCA or UMAP can reveal clusters, but the visualization is only a diagnostic, not a proof of physical truth. {.fragment}
32. Transferability across chemistry families
A good descriptor should map chemically similar local motifs to nearby regions of feature space. {.fragment}
But transferability is limited when similar local motifs appear inside very different global frameworks. {.fragment}
If the target depends on long-range connectivity, a purely local descriptor can create false analogies. {.fragment}
This is descriptor aliasing: local similarity does not always imply property similarity.
33. When local descriptors are not enough
Properties with strong long-range dependence challenge purely local models:

- band dispersion and topology {.fragment}
- extended magnetic order {.fragment}
- elastic anisotropy {.fragment}
- transport pathways and percolation {.fragment}
In such cases, local environments remain useful, but they must be combined with global or learned relational features.
34. Comparison to graph neural representations
Graph models also start from neighborhoods, but they learn the aggregation function rather than fixing it by hand. {.fragment}
Classical local descriptors remain valuable because they are interpretable, cheap, and effective in small-data settings. {.fragment}
In practice, they are often the baseline that more complex learned representations must beat. {.fragment}
35. Cutoff radius as a scientific hyperparameter
If \(r_c\) is too small, important neighbors are excluded. {.fragment}
If \(r_c\) is too large, the descriptor becomes expensive and may include weakly relevant structure. {.fragment}
The best cutoff depends on the chemistry and on the target property. {.fragment}
So the cutoff is part of the model, not merely preprocessing boilerplate.
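A cutoff scan makes this dependence visible. The sketch below counts neighbors in an ideal simple cubic lattice as a function of \(r_c\); the lattice choice, the function name, and the scanned values are illustrative assumptions.

```python
import itertools
import math

def simple_cubic_coordination(a, r_c, n_max=3):
    """Count lattice points within r_c of the origin in a simple cubic lattice.

    Enumerates integer translations (h, k, l) with |h|, |k|, |l| <= n_max,
    which is complete only if r_c <= (n_max + 1) * a.
    """
    count = 0
    for h, k, l in itertools.product(range(-n_max, n_max + 1), repeat=3):
        if (h, k, l) == (0, 0, 0):
            continue  # the central atom itself
        if a * math.sqrt(h * h + k * k + l * l) < r_c:
            count += 1
    return count

a = 1.0  # lattice parameter in arbitrary units
scan = {r_c: simple_cubic_coordination(a, r_c) for r_c in (0.9, 1.1, 1.5, 1.8, 2.1)}
for r_c, n in scan.items():
    print(r_c, n)
```

The count stays constant between neighbor shells and jumps at \(a\), \(a\sqrt{2}\), \(a\sqrt{3}\), and \(2a\). A cutoff placed on a plateau is more robust to small structural noise than one placed near a shell distance.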
36. Failure mode: parser and periodicity errors
Wrong unit-cell parsing or missing periodic images produce nonphysical local neighborhoods. {.fragment}
This creates systematic feature noise before any model is trained. {.fragment}
A high-capacity model may still fit such data and hide the preprocessing mistake. {.fragment}
The first debugging step should therefore be structural inspection, not hyperparameter tuning.
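Part of this inspection can be automated with simple sanity checks on the neighbor lists. The following sketch flags sites with no neighbors or with implausibly short nearest-neighbor distances. The thresholds `cn_min` and `d_min` are hypothetical defaults that would need to be adapted to the chemistry at hand, and such checks complement visual inspection rather than replace it.

```python
def inspect_sites(coordination, nearest_distance, cn_min=1, d_min=0.5):
    """Return (site index, reasons) for sites that look nonphysical.

    coordination:     neighbor count per site
    nearest_distance: shortest neighbor distance per site, in Å
    cn_min, d_min:    hypothetical thresholds, to be tuned per dataset
    """
    flagged = []
    for i, (cn, d) in enumerate(zip(coordination, nearest_distance)):
        reasons = []
        if cn < cn_min:
            reasons.append("too few neighbors")
        if d < d_min:
            reasons.append("overlapping atoms")
        if reasons:
            flagged.append((i, reasons))
    return flagged

# Hypothetical per-site summary for a four-atom structure.
coordination = [6, 6, 0, 6]
nearest_distance = [2.0, 2.0, 4.1, 0.2]

flagged = inspect_sites(coordination, nearest_distance)
print(flagged)
```

An isolated site may point to missing periodic images, while overlapping atoms often indicate a unit or coordinate-convention error in parsing.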
37. Failure mode: polymorph aliasing
Different polymorphs can share very similar local motifs. {.fragment}
A local descriptor may then suggest strong similarity even when bulk properties differ strongly. {.fragment}
This is particularly dangerous if the target depends on framework connectivity or long-range ordering. {.fragment}
The mitigation is to add global context or use a richer representation.
38. Quality checklist before downstream learning
Before using local descriptors in a predictive model, ask:

- are neighbor lists physically correct? {.fragment}
- is the cutoff scientifically justified? {.fragment}
- does the descriptor vary smoothly under small perturbations? {.fragment}
- does pooling preserve the motif statistics relevant to the property? {.fragment}
- will the train-test split prevent near-duplicate structures from leaking across sets? {.fragment}
39. Summary
Local environments describe the neighborhood of each atom rather than the entire crystal at once. {.fragment}
Coordination, distances, angles, and Voronoi geometry are useful first descriptors. {.fragment}
ACSF and SOAP add principled symmetry-aware expressivity. {.fragment}
Pooling converts atom-level features into material-level inputs. {.fragment}
The main risks are scientific mismatch, bad preprocessing, and missing long-range context. {.fragment}
40. Bridge to Unit 7
Once we have a material-level vector, the representation problem becomes a regression problem. {.fragment}
But predictive success in materials science depends on more than a low mean error. {.fragment}
Unit 7 therefore studies baselines, split design, learning curves, leakage, and out-of-distribution generalization. {.fragment}