Materials Genomics
Unit 6: Local Atomic Environments

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg

  • The central modeling move is to represent a material not only by one global fingerprint, but by the environments around its atomic sites. {.fragment}
  • This matters because many materials properties are controlled by local coordination, local chemistry, and local distortions. {.fragment}

02. Learning outcomes

By the end of this unit, students should be able to:

  • explain why local environments are an effective representation layer between chemistry and machine learning {.fragment}
  • define the invariance requirements of a useful atom-centered descriptor {.fragment}
  • compare simple geometric descriptors with ACSF- and SOAP-like representations {.fragment}
  • explain how local descriptors are pooled into material-level vectors {.fragment}
  • identify failure modes caused by cutoffs, parser mistakes, defects, and missing long-range physics {.fragment}

03. Recap: where this fits in the course

  • Unit 5 introduced graph-based representations, where atoms are nodes and bonds or neighbor relations are edges. {.fragment}
  • Unit 6 is more classical and more interpretable: we keep the focus on neighborhoods, but we summarize them into engineered descriptors. {.fragment}
  • Unit 7 will start from these descriptors and ask whether predictive performance actually generalizes. {.fragment}

04. Why local information matters

  • A single composition vector cannot tell whether a cation sits in tetrahedral or octahedral coordination. {.fragment}
  • A global crystal label may ignore defects, local strain, or minority motifs. {.fragment}
  • Yet many target properties depend precisely on these local features: diffusion barriers, catalytic site activity, defect energetics, local magnetism, and parts of elastic response. {.fragment}

05. Representation hierarchy

  • Global descriptors summarize the whole crystal in one object. {.fragment}
  • Local descriptors summarize the neighborhood around each site. {.fragment}
  • Learned graph representations infer useful local and semi-local features automatically. {.fragment}

Local atomic environments are attractive because they preserve physical interpretability while remaining compatible with standard machine-learning pipelines.

graph TD
    Root[Representation Hierarchy]
    Root --> Global[Global Descriptors]
    Root --> Local[Local Descriptors]
    Root --> Graph[Learned Graph Repr.]
    
    Global --> GlobalEx[Composition, Fingerprints]
    Local --> LocalEx[ACSF, SOAP, Voronoi]
    Graph --> GraphEx[GNNs, SchNet, MEGNet]

06. What is a local atomic environment?

For a central atom \(i\), a local environment usually contains:

  • the atomic species \(Z_i\) {.fragment}
  • the neighboring species \(Z_j\) {.fragment}
  • relative distances \(r_{ij}\) {.fragment}
  • optional angular relations \(\theta_{jik}\) {.fragment}

The neighborhood can be defined by a radial cutoff or by a geometric rule such as Voronoi adjacency.

07. Neighbor construction under a radial cutoff

  • The simplest definition is: atom \(j\) belongs to the environment of atom \(i\) if \(r_{ij} < r_c\). {.fragment}
  • This immediately gives the coordination number {.fragment}

\[N_i(r_c) = \sum_{j \neq i} \mathbb{1}[r_{ij} < r_c]\]

  • The same formula already shows a major issue: the descriptor depends on the modeling choice \(r_c\). {.fragment}
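The dependence on \(r_c\) can be made concrete with a minimal sketch. It assumes a finite bcc-like cluster with lattice parameter 1 (an illustrative choice, not taken from the slides), and the helper name `coordination_number` is hypothetical.

```python
import math

def coordination_number(center, neighbors, r_c):
    """Count neighbors strictly closer than r_c to the central atom."""
    return sum(1 for pos in neighbors if math.dist(center, pos) < r_c)

# Finite bcc-like cluster around a central atom (lattice parameter a = 1):
# 8 first-shell atoms at sqrt(3)/2 ~ 0.866 and 6 second-shell atoms at 1.0.
first_shell = [(sx * 0.5, sy * 0.5, sz * 0.5)
               for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]
second_shell = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                (0, -1, 0), (0, 0, 1), (0, 0, -1)]
cluster = first_shell + second_shell
center = (0.0, 0.0, 0.0)

n_small = coordination_number(center, cluster, r_c=0.9)  # first shell only
n_large = coordination_number(center, cluster, r_c=1.1)  # first and second shell
print(n_small, n_large)
```

The same atom is reported as 8-coordinated or 14-coordinated depending only on the cutoff, because the two shells are separated by about 0.13 lattice parameters.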

08. Periodic boundary conditions are not optional

  • In a crystal, the local environment must be built with periodic images. {.fragment}
  • Otherwise, atoms near the unit-cell boundary lose neighbors that are physically present just across the boundary. {.fragment}
  • A wrong periodic treatment creates fake low-coordination sites and contaminates the dataset before any machine learning begins. {.fragment}
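The effect of missing periodic images can be sketched for the simplest possible case. The example below assumes an orthorhombic cell and fractional coordinates; the function name `count_neighbors` and the lattice parameter are hypothetical, and a production pipeline would normally rely on a tested neighbor-list routine from a structure library instead.

```python
import itertools
import math

def count_neighbors(frac_positions, cell_lengths, i, r_c, periodic=True):
    """Count neighbors of atom i within r_c in an orthorhombic cell."""
    if periodic:
        # Enough image layers per axis to cover every atom within r_c.
        layers = [math.ceil(r_c / length) for length in cell_lengths]
        shifts = list(itertools.product(*(range(-n, n + 1) for n in layers)))
    else:
        shifts = [(0, 0, 0)]
    count = 0
    for j, frac_j in enumerate(frac_positions):
        for shift in shifts:
            if j == i and shift == (0, 0, 0):
                continue  # skip the atom itself, but keep its periodic images
            delta = [(frac_j[k] + shift[k] - frac_positions[i][k]) * cell_lengths[k]
                     for k in range(3)]
            if math.hypot(*delta) < r_c:
                count += 1
    return count

# Simple cubic crystal: one atom per cell, a = 3.0 (arbitrary length units)
frac_positions = [(0.0, 0.0, 0.0)]
cell_lengths = (3.0, 3.0, 3.0)

n_without_images = count_neighbors(frac_positions, cell_lengths, 0, r_c=3.3, periodic=False)
n_with_images = count_neighbors(frac_positions, cell_lengths, 0, r_c=3.3, periodic=True)
print(n_without_images, n_with_images)
```

Without images the atom appears isolated; with images it recovers the six nearest neighbors of the simple cubic lattice. The sketch does not handle non-orthogonal cells, which need the full lattice matrix.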

09. Local descriptors need symmetry discipline

A useful local descriptor should satisfy:

  • translation invariance {.fragment}
  • rotation invariance or equivariance {.fragment}
  • permutation invariance over identical neighbors {.fragment}
  • continuity under small atomic displacements {.fragment}
  • sensitivity to chemical identity {.fragment}

If a descriptor fails one of these tests, the model may learn file conventions or noise instead of structure.

10. Why raw Cartesian coordinates are a poor baseline

  • Raw coordinates depend on where the origin is placed. {.fragment}
  • They depend on how the cell axes are oriented. {.fragment}
  • They change if identical atoms are listed in a different order. {.fragment}

This is why classical materials ML rarely feeds raw atomic coordinates directly into a regression model without additional symmetry-aware processing.
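A small numerical check illustrates the difference between raw coordinates and a symmetry-aware quantity. The sketch below uses the sorted list of neighbor distances as a deliberately simple invariant; the coordinates, rotation angle, and helper names are hypothetical.

```python
import math
import random

def sorted_distances(center, neighbors):
    """Sorted neighbor distances: unchanged by translation, rotation, and reordering."""
    return sorted(math.dist(center, p) for p in neighbors)

def rotate_z(p, angle):
    """Rotate a 3D point about the z axis by `angle` (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1], p[2])

def translate(p, t):
    """Shift a 3D point by the vector t."""
    return tuple(a + b for a, b in zip(p, t))

center = (0.0, 0.0, 0.0)
neighbors = [(1.0, 0.0, 0.0), (0.0, 1.2, 0.0), (0.0, 0.0, 1.5), (-0.9, -0.9, 0.0)]

# Apply the same rigid motion to every atom, then shuffle the neighbor order.
angle, shift = 0.7, (2.0, -1.0, 0.5)
new_center = translate(rotate_z(center, angle), shift)
new_neighbors = [translate(rotate_z(p, angle), shift) for p in neighbors]
random.Random(0).shuffle(new_neighbors)

raw_changed = new_neighbors != neighbors
same_descriptor = all(
    math.isclose(a, b, abs_tol=1e-12)
    for a, b in zip(sorted_distances(center, neighbors),
                    sorted_distances(new_center, new_neighbors))
)
print(raw_changed, same_descriptor)
```

The raw coordinate list changes completely, while the distance-based summary is identical up to floating-point rounding.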

11. Coordination number as a first descriptor

  • The coordination number is interpretable and chemically meaningful. {.fragment}
  • It distinguishes many familiar motifs immediately: isolated atoms, linear coordination, tetrahedral sites, octahedral sites, close-packed environments. {.fragment}
  • It is also very cheap to compute. {.fragment}

Its limitation is that different shapes can share the same count.

12. The loss of geometry in pure coordination counts

  • A tetrahedral site and a square-planar site both have coordination four. {.fragment}
  • A distorted octahedron and a near-perfect octahedron both have coordination six. {.fragment}
  • Therefore, counting neighbors is often necessary but rarely sufficient. {.fragment}

To recover shape information, we need distances, angles, or more expressive local densities.

13. Bond-length distributions

  • The set \(\{r_{ij}\}\) captures whether neighbors are uniformly arranged or split into short and long bonds. {.fragment}
  • Mean bond length and bond-length variance are often useful summary statistics. {.fragment}
  • Histograms of \(r_{ij}\) can separate compact, expanded, and distorted environments. {.fragment}

This is one step closer to local chemistry because it recognizes strain and distortion, not only connectivity.
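As a minimal sketch, the two summary statistics can be compared for a regular and a tetragonally elongated octahedron. The bond lengths are hypothetical values chosen so that both environments share the same coordination number and the same mean bond length.

```python
import math
import statistics

def bond_length_stats(center, neighbors):
    """Return (mean, population variance) of the center-neighbor distances."""
    lengths = [math.dist(center, p) for p in neighbors]
    return statistics.fmean(lengths), statistics.pvariance(lengths)

def octahedron(r_equatorial, r_axial):
    """Six neighbors on the coordinate axes; the two z-axis bonds may differ."""
    return [(r_equatorial, 0, 0), (-r_equatorial, 0, 0),
            (0, r_equatorial, 0), (0, -r_equatorial, 0),
            (0, 0, r_axial), (0, 0, -r_axial)]

center = (0.0, 0.0, 0.0)
regular = bond_length_stats(center, octahedron(2.0, 2.0))
elongated = bond_length_stats(center, octahedron(1.9, 2.2))  # hypothetical distortion

print(regular)    # mean 2.0, variance 0.0
print(elongated)  # mean about 2.0, variance about 0.02
```

Coordination count and mean bond length cannot separate the two sites here; the variance can.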

14. Bond-angle distributions

  • Angular information tells us how neighbors are arranged around the central atom. {.fragment}
  • Tetrahedral, octahedral, trigonal-planar, and close-packed motifs become much easier to separate once angle statistics are included. {.fragment}
  • In practice, angle-based descriptors help distinguish environments that have the same coordination count but different geometry. {.fragment}
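The tetrahedral versus square-planar case can be checked directly. This is a minimal sketch with idealized neighbor positions; the function name `bond_angles` is hypothetical.

```python
import itertools
import math

def bond_angles(center, neighbors):
    """Sorted angles (degrees) over every neighbor pair j, k around the center."""
    vectors = [[p[d] - center[d] for d in range(3)] for p in neighbors]
    angles = []
    for u, v in itertools.combinations(vectors, 2):
        cos_theta = sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding outside [-1, 1]
        angles.append(math.degrees(math.acos(cos_theta)))
    return sorted(angles)

center = (0, 0, 0)
tetrahedral = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
square_planar = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]

tet_angles = bond_angles(center, tetrahedral)    # six angles near 109.47 degrees
sq_angles = bond_angles(center, square_planar)   # four angles of 90, two of 180 degrees
print([round(a, 2) for a in tet_angles])
print([round(a, 2) for a in sq_angles])
```

Both sites have coordination four, yet their angle lists share no common value, so even a coarse angle histogram separates them.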

15. Example: tetrahedral versus octahedral coordination

  • Imagine two oxide materials with similar compositions. {.fragment}
  • If one cation occupies tetrahedral sites and the other octahedral sites, the local chemistry can be very different. {.fragment}
  • A descriptor based only on stoichiometry misses this distinction. {.fragment}
  • A descriptor with bond angles or SOAP similarity captures it naturally. {.fragment}

16. Voronoi neighborhoods

  • Instead of using a fixed cutoff, one can define neighbors geometrically via a Voronoi tessellation. {.fragment}
  • Two atoms are neighbors if their Voronoi cells share a face. {.fragment}
  • This adapts more naturally to local density variation than one fixed radius. {.fragment}

Voronoi approaches reduce arbitrariness, but they can become unstable for noisy or highly distorted structures.

17. Voronoi advantages and caveats

Advantages

  • Less dependence on a hand-chosen \(r_c\) {.fragment}
  • Better reflects relative packing geometry {.fragment}
  • Adaptive to local density variations {.fragment}

Caveats

  • Tiny faces can create questionable neighbors {.fragment}
  • Stability issues with noisy structures {.fragment}
  • Needs area or distance thresholds {.fragment}

A practical pipeline often mixes geometric and radial criteria rather than treating them as mutually exclusive.

18. Descriptor continuity matters for learning

  • If a small structural distortion causes a large jump in the descriptor, the regression problem becomes unstable. {.fragment}
  • Relaxed structures, thermal noise, and small DFT geometry differences then look like unrelated inputs. {.fragment}
  • Good local descriptors should vary smoothly when the underlying structure varies smoothly. {.fragment}

This is one reason smooth basis-function descriptors are attractive.
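The continuity argument can be illustrated by comparing a hard neighbor count with a cutoff-weighted count. The cosine cutoff used here is an assumed, commonly used functional form, and the distances are hypothetical.

```python
import math

def hard_count(distances, r_c):
    """Coordination number with a sharp cutoff: jumps when a neighbor crosses r_c."""
    return sum(1 for r in distances if r < r_c)

def cosine_cutoff(r, r_c):
    """Smooth weight that decays from 1 at r = 0 to 0 at r = r_c (assumed form)."""
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0) if r < r_c else 0.0

def smooth_count(distances, r_c):
    """Cutoff-weighted neighbor count: varies continuously with every distance."""
    return sum(cosine_cutoff(r, r_c) for r in distances)

r_c = 3.0
inner_shell = [2.0] * 4
before = inner_shell + [2.99]  # fifth neighbor just inside the cutoff
after = inner_shell + [3.01]   # same neighbor displaced by 0.02, now just outside

hard_jump = hard_count(before, r_c) - hard_count(after, r_c)
smooth_jump = smooth_count(before, r_c) - smooth_count(after, r_c)
print(hard_jump, smooth_jump)
```

A displacement of 0.02 changes the hard count by a full neighbor, while the weighted count changes by a few parts in 100,000.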

19. Atom-centered symmetry functions (ACSF)

ACSF descriptors construct local features from radial and angular functions centered on atom \(i\). A typical radial term is

\[G_i^{\text{rad}} = \sum_{j \neq i} \exp[-\eta (r_{ij} - R_s)^2] f_c(r_{ij})\]

  • \(\eta\) controls sensitivity to distance {.fragment}
  • \(R_s\) shifts the radial focus {.fragment}
  • \(f_c\) smoothly suppresses contributions near the cutoff {.fragment}

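These functions are designed to be invariant under translation, rotation, and permutation of identical neighbors, and to be differentiable with respect to the atomic positions.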
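The radial term can be evaluated in a few lines. This is a minimal sketch: the cosine form of \(f_c\) is an assumption (the slides do not fix it), species weighting is omitted, and the distances and parameter values are hypothetical.

```python
import math

def cosine_cutoff(r, r_c):
    """Assumed cutoff function f_c: 1 at r = 0, smoothly reaching 0 at r = r_c."""
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0) if r < r_c else 0.0

def radial_acsf(distances, eta, r_s, r_c):
    """G_i^rad = sum_j exp(-eta (r_ij - R_s)^2) f_c(r_ij) for one central atom."""
    return sum(math.exp(-eta * (r - r_s) ** 2) * cosine_cutoff(r, r_c)
               for r in distances)

# Hypothetical environment: four neighbors at 2.0 and two at 3.0 (arbitrary units)
distances = [2.0] * 4 + [3.0] * 2
r_c = 5.0
shifts = [1.5, 2.0, 2.5, 3.0, 3.5]  # grid of R_s values, one feature per shift

features = [radial_acsf(distances, eta=4.0, r_s=r_s, r_c=r_c) for r_s in shifts]
print([round(g, 3) for g in features])
```

Each entry of `features` answers how many neighbors sit near one probe radius \(R_s\). The largest entry belongs to \(R_s = 2.0\), where the four-atom shell sits, and reordering `distances` leaves every entry unchanged.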

20. What ACSF is doing conceptually

  • Each symmetry function asks a specific question about the neighborhood. {.fragment}
  • How many neighbors are there near a certain radius? {.fragment}
  • Are there angular triplets with a certain shape? {.fragment}
  • How strongly are these patterns weighted by species and distance? {.fragment}

The result is a hand-designed feature vector whose components have a clear modeling role.

21. Strengths and weaknesses of ACSF

  • Strength: interpretable design logic {.fragment}
  • Strength: smooth and symmetry-aware {.fragment}
  • Strength: well suited for atom-centered models and machine-learned interatomic potentials {.fragment}
  • Weakness: many hyperparameters {.fragment}
  • Weakness: feature design can become manual and problem-specific {.fragment}

ACSF works well when we want controllable features and moderate descriptor complexity.

22. SOAP: smooth overlap of atomic positions

SOAP starts by replacing the discrete neighbor list with a smooth density around each atom:

\[\rho_i(\mathbf{r}) = \sum_j \exp\left(-\frac{|\mathbf{r} - \mathbf{r}_{ij}|^2}{2 \sigma^2}\right)\]

  • each neighbor contributes a Gaussian {.fragment}
  • \(\sigma\) controls how sharply positions are resolved {.fragment}
  • the local environment becomes a continuous field instead of a list of points {.fragment}
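The role of \(\sigma\) can be seen by evaluating the density at a few points. This is a minimal sketch of the formula above with hypothetical neighbor positions; it omits normalization prefactors and any cutoff function.

```python
import math

def gaussian_density(r, neighbor_vectors, sigma):
    """rho_i(r) = sum_j exp(-|r - r_ij|^2 / (2 sigma^2)), evaluated at one point r."""
    return sum(math.exp(-math.dist(r, r_ij) ** 2 / (2.0 * sigma ** 2))
               for r_ij in neighbor_vectors)

# Neighbor positions relative to the central atom (hypothetical values)
neighbor_vectors = [(2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]

on_neighbor = gaussian_density((2.0, 0.0, 0.0), neighbor_vectors, sigma=0.5)
between = gaussian_density((1.0, 1.0, 0.0), neighbor_vectors, sigma=0.5)
blurred = gaussian_density((1.0, 1.0, 0.0), neighbor_vectors, sigma=1.5)
print(on_neighbor, between, blurred)
```

With a narrow \(\sigma\) the density is concentrated on the neighbor sites and nearly vanishes between them. With a wide \(\sigma\) the two Gaussians merge and the individual positions are no longer resolved.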

23. From local density to invariant features

  • The smooth density is expanded in radial basis functions and spherical harmonics. {.fragment}
  • Raw expansion coefficients still depend on orientation. {.fragment}
  • SOAP therefore forms rotationally invariant combinations, often called the power spectrum. {.fragment}

This converts a geometric neighborhood into a descriptor that is both expressive and symmetry-aware.
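One common way to write these two steps is given below. The radial basis \(g_n\), the coefficient symbol \(c^{i}_{nlm}\), and the omission of normalization prefactors are notational assumptions; conventions differ between implementations.

\[\rho_i(\mathbf{r}) = \sum_{nlm} c^{i}_{nlm}\, g_n(r)\, Y_{lm}(\hat{\mathbf{r}}), \qquad p^{i}_{nn'l} = \sum_{m=-l}^{l} \left(c^{i}_{nlm}\right)^{*} c^{i}_{n'lm}\]

A rotation mixes the coefficients with different \(m\) at fixed \(l\) through a unitary matrix, so the sum over \(m\) in \(p^{i}_{nn'l}\) is unchanged.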

24. SOAP as a kernel similarity

  • SOAP is often used through a normalized kernel. {.fragment}
  • Two environments are similar when their smooth densities overlap strongly after accounting for rotation. {.fragment}
  • This gives a principled similarity score rather than a manually defined distance on raw features. {.fragment}

Conceptually, SOAP turns local geometry into a notion of environment similarity.
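A normalized kernel of this type can be sketched as a normalized dot product raised to a power. The exponent `zeta` and the three descriptor vectors are hypothetical; in practice the vectors would be power-spectrum descriptors produced by a SOAP implementation.

```python
import math

def normalized_kernel(p_a, p_b, zeta=2):
    """K = (p_a . p_b / (|p_a| |p_b|))^zeta for two descriptor vectors."""
    dot = sum(a * b for a, b in zip(p_a, p_b))
    norm_a = math.sqrt(sum(a * a for a in p_a))
    norm_b = math.sqrt(sum(b * b for b in p_b))
    return (dot / (norm_a * norm_b)) ** zeta

# Hypothetical power-spectrum vectors for three environments
p_reference = [1.0, 0.5, 0.2, 0.0]
p_distorted = [0.9, 0.6, 0.2, 0.1]   # small distortion of the reference
p_different = [0.1, 0.0, 0.3, 1.0]   # qualitatively different motif

k_self = normalized_kernel(p_reference, p_reference)
k_near = normalized_kernel(p_reference, p_distorted)
k_far = normalized_kernel(p_reference, p_different)
print(k_self, k_near, k_far)
```

The normalization fixes the self-similarity at 1, and a larger `zeta` sharpens the contrast between similar and dissimilar environments.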

25. Why SOAP is often powerful

  • Small geometric distortions change the descriptor smoothly. {.fragment}
  • Fine differences in local geometry can still be resolved. {.fragment}
  • Similarity between environments becomes a first-class mathematical object. {.fragment}

That is especially useful when the scientific distinction is not only coordination count, but subtle distortion of a coordination polyhedron.

26. ACSF versus SOAP

ACSF

  • Explicit and hand-designed {.fragment}
  • Easier to interpret feature by feature {.fragment}
  • Many hyperparameters {.fragment}

SOAP

  • Systematic and often more expressive {.fragment}
  • Strong for subtle distortions {.fragment}
  • Computationally heavier and more abstract {.fragment}

graph LR
    Input[Positions] --> Dens[Atomic Density]
    Dens --> ACSF[ACSF: Radial/Angular Functions]
    Dens --> SOAP[SOAP: Spherical Harmonics]
    ACSF --> Vec1[Feature Vector]
    SOAP --> Vec2[Power Spectrum]

Choosing between them is a scientific tradeoff, not a universal ranking.

27. From atom-level features to material-level features

Let \(\boldsymbol{\phi}_i\) denote the local descriptor of atom \(i\). For a crystal with many atoms, we need a material-level vector

\[\boldsymbol{\Phi} = \text{pool}(\{\boldsymbol{\phi}_i\})\]

This pooling step is essential because many targets, such as bulk modulus or band gap, are defined per material rather than per site.

28. Mean pooling and what it assumes

  • Mean pooling summarizes the average environment in a material. {.fragment}
  • It works best when the target is controlled by a typical or dominant local motif. {.fragment}
  • It fails when a rare motif or defect controls the property. {.fragment}

Mean pooling is therefore simple and efficient, but it can wash out minority environments that matter.

29. Histogram and moment pooling

  • Histogram pooling preserves more information about the distribution of local environments. {.fragment}
  • Moment summaries capture spread and skewness instead of only the mean. {.fragment}
  • Species-resolved pooling preserves chemistry-specific contributions. {.fragment}

These choices matter when properties depend on heterogeneity rather than only the average site.
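The pooling options can be compared on a toy material. This is a minimal sketch: the per-site features, bin edges, and function names are hypothetical, and species-resolved pooling would apply the same functions separately to the sites of each element.

```python
import statistics

def mean_pool(site_features):
    """Average each feature over all sites."""
    return [statistics.fmean(col) for col in zip(*site_features)]

def moment_pool(site_features):
    """Concatenate per-feature means and population standard deviations."""
    cols = list(zip(*site_features))
    return [statistics.fmean(c) for c in cols] + [statistics.pstdev(c) for c in cols]

def histogram_pool(values, edges):
    """Fraction of sites whose scalar feature falls in [edges[k], edges[k+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(edges) - 1):
            if edges[k] <= v < edges[k + 1]:
                counts[k] += 1
                break
    return [c / len(values) for c in counts]

# Hypothetical per-site features: (coordination number, mean bond length)
site_features = [(4, 1.95), (4, 1.97), (6, 2.10), (6, 2.12)]

pooled_mean = mean_pool(site_features)
pooled_moments = moment_pool(site_features)
cn_histogram = histogram_pool([f[0] for f in site_features], edges=[3.5, 4.5, 5.5, 6.5])
print(pooled_mean, pooled_moments, cn_histogram)
```

Mean pooling reports an average coordination of 5, which no site actually has. The histogram keeps the information that half the sites are four-coordinated and half are six-coordinated.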

30. Worked example: a defect-sensitive material

  • Consider a crystal where most sites are regular, but a small population near vacancies is highly distorted. {.fragment}
  • The average descriptor changes only a little. {.fragment}
  • The tail of the environment distribution changes strongly. {.fragment}

This is a case where histogram-based or defect-aware pooling is more faithful than mean pooling.
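The argument can be checked with made-up numbers. The sketch assumes a scalar distortion index per site, with 2 of 100 sites strongly distorted near a vacancy; all values and the helper name `tail_fraction` are hypothetical.

```python
import statistics

def tail_fraction(values, threshold):
    """Fraction of sites whose distortion index exceeds the threshold."""
    return sum(v > threshold for v in values) / len(values)

# Hypothetical distortion index per site
pristine = [0.10] * 100
defective = [0.10] * 98 + [0.30] * 2   # two strongly distorted sites near a vacancy

# Relative change of the mean (about 4 %) versus the maximum (about 200 %)
mean_change = statistics.fmean(defective) / statistics.fmean(pristine) - 1.0
max_change = max(defective) / max(pristine) - 1.0

tail_before = tail_fraction(pristine, threshold=0.2)
tail_after = tail_fraction(defective, threshold=0.2)
print(mean_change, max_change, tail_before, tail_after)
```

A model that sees only the mean must detect a shift of a few percent. A histogram bin or maximum that tracks the tail sees a new population appear from zero.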

31. Local descriptors for phase and motif discrimination

  • Local descriptors are often excellent for classifying structure motifs. {.fragment}
  • They can separate tetrahedral and octahedral environments, ordered and distorted variants, or defect-rich and defect-poor local neighborhoods. {.fragment}
  • Visualizing descriptor space with PCA or UMAP can reveal clusters, but the visualization is only a diagnostic, not a proof of physical truth. {.fragment}

32. Transferability across chemistry families

  • A good descriptor should map chemically similar local motifs to nearby regions of feature space. {.fragment}
  • But transferability is limited when similar local motifs appear inside very different global frameworks. {.fragment}
  • If the target depends on long-range connectivity, a purely local descriptor can create false analogies. {.fragment}

This is descriptor aliasing: local similarity does not always imply property similarity.

33. When local descriptors are not enough

Properties with strong long-range dependence challenge purely local models:

  • band dispersion and topology {.fragment}
  • extended magnetic order {.fragment}
  • elastic anisotropy {.fragment}
  • transport pathways and percolation {.fragment}

In such cases, local environments remain useful, but they must be combined with global or learned relational features.

34. Comparison to graph neural representations

  • Graph models also start from neighborhoods, but they learn the aggregation function rather than fixing it by hand. {.fragment}
  • Classical local descriptors remain valuable because they are interpretable, cheap, and effective in small-data settings. {.fragment}
  • In practice, they are often the baseline that more complex learned representations must beat. {.fragment}

35. Cutoff radius as a scientific hyperparameter

  • If \(r_c\) is too small, important neighbors are excluded. {.fragment}
  • If \(r_c\) is too large, the descriptor becomes expensive and may include weakly relevant structure. {.fragment}
  • The best cutoff depends on the chemistry and on the target property. {.fragment}

So the cutoff is part of the model, not merely preprocessing boilerplate.

36. Failure mode: parser and periodicity errors

  • Wrong unit-cell parsing or missing periodic images produce nonphysical local neighborhoods. {.fragment}
  • This creates systematic feature noise before any model is trained. {.fragment}
  • A high-capacity model may still fit such data and hide the preprocessing mistake. {.fragment}

The first debugging step should therefore be structural inspection, not hyperparameter tuning.

37. Failure mode: polymorph aliasing

  • Different polymorphs can share very similar local motifs. {.fragment}
  • A local descriptor may then suggest strong similarity even when bulk properties differ strongly. {.fragment}
  • This is particularly dangerous if the target depends on framework connectivity or long-range ordering. {.fragment}

The mitigation is to add global context or use a richer representation.

38. Quality checklist before downstream learning

Before using local descriptors in a predictive model, ask:

  • are neighbor lists physically correct? {.fragment}
  • is the cutoff scientifically justified? {.fragment}
  • does the descriptor vary smoothly under small perturbations? {.fragment}
  • does pooling preserve the motif statistics relevant to the property? {.fragment}
  • will the train-test split prevent near-duplicate structures from leaking across sets? {.fragment}

39. Summary

  • Local environments describe the neighborhood of each atom rather than the entire crystal at once. {.fragment}
  • Coordination, distances, angles, and Voronoi geometry are useful first descriptors. {.fragment}
  • ACSF and SOAP add principled symmetry-aware expressivity. {.fragment}
  • Pooling converts atom-level features into material-level inputs. {.fragment}
  • The main risks are scientific mismatch, bad preprocessing, and missing long-range context. {.fragment}

40. Bridge to Unit 7

  • Once we have a material-level vector, the representation problem becomes a regression problem. {.fragment}
  • But predictive success in materials science depends on more than a low mean error. {.fragment}
  • Unit 7 therefore studies baselines, split design, learning curves, leakage, and out-of-distribution generalization. {.fragment}