Materials Genomics
Unit 6: Local Atomic Environments

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg

01. Title: Local Atomic Environments

  • This unit asks how a crystal can be decomposed into physically meaningful local neighborhoods.
  • The central modeling move is to represent a material not only by one global fingerprint, but by the environments around its atomic sites.
  • This matters because many materials properties are controlled by local coordination, local chemistry, and local distortions.

02. Learning outcomes

By the end of this unit, students should be able to:

  • explain why local environments are an effective representation layer between chemistry and machine learning
  • define the invariance requirements of a useful atom-centered descriptor
  • compare simple geometric descriptors with ACSF- and SOAP-like representations
  • explain how local descriptors are pooled into material-level vectors
  • identify failure modes caused by cutoffs, parser mistakes, defects, and missing long-range physics

03. Recap: where this fits in the course

  • Unit 5 introduced graph-based representations, where atoms are nodes and bonds or neighbor relations are edges.
  • Unit 6 is more classical and more interpretable: we keep the focus on neighborhoods, but we summarize them into engineered descriptors.
  • Unit 7 will start from these descriptors and ask whether predictive performance actually generalizes.

04. Why local information matters

  • A single composition vector cannot tell whether a cation sits in tetrahedral or octahedral coordination.
  • A global crystal label may ignore defects, local strain, or minority motifs.
  • Yet many target properties depend precisely on these local features: diffusion barriers, catalytic site activity, defect energetics, local magnetism, and parts of elastic response.

05. Representation hierarchy

  • Global descriptors summarize the whole crystal in one object.
  • Local descriptors summarize the neighborhood around each site.
  • Learned graph representations infer useful local and semi-local features automatically.

Local atomic environments are attractive because they preserve physical interpretability while remaining compatible with standard machine-learning pipelines.

06. What is a local atomic environment?

For a central atom i, a local environment usually contains:

  • the atomic species Z_i
  • the neighboring species Z_j
  • relative distances r_ij
  • optional angular relations theta_jik

The neighborhood can be defined by a radial cutoff or by a geometric rule such as Voronoi adjacency.

07. Neighbor construction under a radial cutoff

  • The simplest definition is: atom j belongs to the environment of atom i if r_ij < r_c.
  • This immediately gives the coordination number

N_i(r_c) = sum_{j != i} 1[r_ij < r_c]

where 1[.] is the indicator function: it equals 1 if the condition holds and 0 otherwise.

  • The same formula already shows a major issue: the descriptor depends on the modeling choice r_c.
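The dependence on r_c can be made concrete with a minimal sketch. It assumes a body-centered cubic site with lattice parameter a = 1, whose first three neighbor shells (8, 6, and 12 atoms at distances sqrt(3)/2, 1, and sqrt(2)) are entered by hand. The function name coordination_number is illustrative, not taken from any library.

```python
import math

def coordination_number(distances, r_cut):
    """Count neighbors whose distance to the central atom is below r_cut."""
    return sum(1 for r in distances if r < r_cut)

# Neighbor shells of a bcc site in units of the lattice parameter a
# (assumed example, entered by hand rather than computed from a structure).
bcc_shells = [math.sqrt(3) / 2] * 8 + [1.0] * 6 + [math.sqrt(2)] * 12

for r_cut in (0.9, 1.1, 1.5):
    n = coordination_number(bcc_shells, r_cut)
    print(f"r_c = {r_cut:.1f} a -> N = {n}")
```

The same site is reported as 8-, 14-, or 26-coordinated depending only on the chosen cutoff, so the value of r_c has to be reported together with the descriptor.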

08. Periodic boundary conditions are not optional

  • In a crystal, the local environment must be built with periodic images.
  • Otherwise, atoms near the unit-cell boundary lose neighbors that are physically present just across the boundary.
  • A wrong periodic treatment creates fake low-coordination sites and contaminates the dataset before any machine learning begins.
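A small sketch can show how missing periodic images produce fake low-coordination sites. It assumes a cubic cell, fractional coordinates in [0, 1), and a cutoff no larger than the cell length, so that one shell of neighboring cells is sufficient. The function coordination_numbers and its arguments are illustrative names.

```python
import itertools
import math

def coordination_numbers(frac_coords, cell_length, r_cut, periodic=True):
    """Count neighbors within r_cut for each atom in a cubic cell.

    Assumes r_cut <= cell_length, so one shell of periodic images is enough.
    """
    if periodic:
        shifts = list(itertools.product((-1, 0, 1), repeat=3))
    else:
        shifts = [(0, 0, 0)]
    counts = []
    for i, fi in enumerate(frac_coords):
        n = 0
        for j, fj in enumerate(frac_coords):
            for s in shifts:
                if i == j and s == (0, 0, 0):
                    continue  # skip the atom itself, but keep its periodic images
                d = [(fj[k] + s[k] - fi[k]) * cell_length for k in range(3)]
                if math.sqrt(sum(x * x for x in d)) < r_cut:
                    n += 1
        counts.append(n)
    return counts

# Conventional bcc cell with two atoms, lattice parameter a = 1.
bcc = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)]
with_pbc = coordination_numbers(bcc, 1.0, 0.9, periodic=True)
without_pbc = coordination_numbers(bcc, 1.0, 0.9, periodic=False)
print("with periodic images:   ", with_pbc)
print("without periodic images:", without_pbc)
```

With periodic images both sites are 8-coordinated, as expected for bcc. Without them each site sees only the single partner inside the cell.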

09. Local descriptors need symmetry discipline

A useful local descriptor should satisfy:

  • translation invariance
  • rotation invariance or equivariance
  • permutation invariance over identical neighbors
  • continuity under small atomic displacements
  • sensitivity to chemical identity

If a descriptor fails one of these tests, the model may learn file conventions or noise instead of structure.

10. Why raw Cartesian coordinates are a poor baseline

  • Raw coordinates depend on where the origin is placed.
  • They depend on how the cell axes are oriented.
  • They change if identical atoms are listed in a different order.

This is why classical materials ML rarely feeds raw atomic coordinates directly into a regression model without additional symmetry-aware processing.
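The three problems can be checked numerically with a minimal sketch. It uses the sorted list of pairwise distances as a stand-in for a symmetry-aware descriptor. This is an illustrative choice, not a recommended descriptor, and the cluster coordinates are made up for the example.

```python
import math

def sorted_distances(points):
    """Sorted list of all pairwise distances in a small cluster."""
    d = []
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            d.append(math.dist(points[a], points[b]))
    return sorted(d)

def rotate_z(p, angle):
    """Rotate a 3D point about the z axis by the given angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (c * x - s * y, s * x + c * y, z)

cluster = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.5, 0.0), (0.3, 0.2, 2.0)]

# Same physical cluster after rotation, translation, and reordering of atoms.
moved = [rotate_z(p, 0.7) for p in cluster]
moved = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in moved]
moved = moved[::-1]

raw_changed = moved != cluster
distances_match = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(sorted_distances(cluster), sorted_distances(moved))
)
print("raw coordinates changed:  ", raw_changed)
print("sorted distances preserved:", distances_match)
```

The raw coordinate lists differ although the structure is physically identical, while the distance-based summary is unchanged.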

11. Coordination number as a first descriptor

  • The coordination number is interpretable and chemically meaningful.
  • It distinguishes many familiar motifs immediately: isolated atoms, linear coordination, tetrahedral sites, octahedral sites, close-packed environments.
  • It is also very cheap to compute.

Its limitation is that different shapes can share the same count.

12. The loss of geometry in pure coordination counts

  • A tetrahedral site and a square-planar site both have coordination four.
  • A distorted octahedron and a near-perfect octahedron both have coordination six.
  • Therefore, counting neighbors is often necessary but rarely sufficient.

To recover shape information, we need distances, angles, or more expressive local densities.

13. Bond-length distributions

  • The set {r_ij} captures whether neighbors are uniformly arranged or split into short and long bonds.
  • Mean bond length and bond-length variance are often useful summary statistics.
  • Histograms of r_ij can separate compact, expanded, and distorted environments.

This is one step closer to local chemistry because it recognizes strain and distortion, not only connectivity.
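As a minimal sketch, the summary statistics can be computed for two six-coordinated sites. The bond lengths are invented for illustration (one regular site, one with four short and two long bonds), and the function name bond_length_summary is hypothetical.

```python
from statistics import fmean, pvariance

def bond_length_summary(bond_lengths):
    """Coordination count plus mean and population variance of bond lengths."""
    return {
        "count": len(bond_lengths),
        "mean": fmean(bond_lengths),
        "variance": pvariance(bond_lengths),
    }

# Illustrative bond lengths in angstrom (assumed values, not measured data).
regular = [2.00] * 6
distorted = [1.95] * 4 + [2.30] * 2

summary_regular = bond_length_summary(regular)
summary_distorted = bond_length_summary(distorted)
print(summary_regular)
print(summary_distorted)
```

Both sites have the same coordination count, but only the distorted one has a nonzero bond-length variance.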

14. Bond-angle distributions

  • Angular information tells us how neighbors are arranged around the central atom.
  • Tetrahedral, octahedral, trigonal-planar, and close-packed motifs become much easier to separate once angle statistics are included.
  • In practice, angle-based descriptors help distinguish environments that have the same coordination count but different geometry.
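The following sketch computes all neighbor-center-neighbor angles for an ideal tetrahedral and an ideal square-planar site, both with coordination four. The central atom is placed at the origin, and the neighbor vectors are idealized geometries chosen for illustration.

```python
import itertools
import math

def bond_angles(neighbors):
    """All angles (degrees) between pairs of neighbor vectors from the center."""
    angles = []
    for a, b in itertools.combinations(neighbors, 2):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # guard rounding
        angles.append(math.degrees(math.acos(cos_theta)))
    return sorted(angles)

tetrahedral = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
square_planar = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]

tet_angles = bond_angles(tetrahedral)
sq_angles = bond_angles(square_planar)
print("tetrahedral:  ", [round(a, 2) for a in tet_angles])
print("square planar:", [round(a, 2) for a in sq_angles])
```

The tetrahedral site gives six angles near 109.47 degrees, while the square-planar site gives four angles of 90 degrees and two of 180 degrees. The coordination count alone cannot separate the two.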

15. Example: tetrahedral versus octahedral coordination

  • Imagine two oxide materials with similar compositions.
  • If one cation occupies tetrahedral sites and the other octahedral sites, the local chemistry can be very different.
  • A descriptor based only on stoichiometry misses this distinction.
  • A descriptor with bond angles or SOAP similarity captures it naturally.

16. Voronoi neighborhoods

  • Instead of using a fixed cutoff, one can define neighbors geometrically via a Voronoi tessellation.
  • Two atoms are neighbors if their Voronoi cells share a face.
  • This adapts more naturally to local density variation than one fixed radius.

Voronoi approaches reduce arbitrariness, but they can become unstable for noisy or highly distorted structures.

17. Voronoi advantages and caveats

  • Advantage: less dependence on a hand-chosen r_c
  • Advantage: often better reflects relative packing geometry
  • Caveat: tiny Voronoi faces can create questionable neighbors
  • Caveat: noisy structures may need extra thresholds on face area or distance

A practical pipeline often mixes geometric and radial criteria rather than treating them as mutually exclusive.

18. Descriptor continuity matters for learning

  • If a small structural distortion causes a large jump in the descriptor, the regression problem becomes unstable.
  • Relaxed structures, thermal noise, and small DFT geometry differences then look like unrelated inputs.
  • Good local descriptors should vary smoothly when the underlying structure varies smoothly.

This is one reason smooth basis-function descriptors are attractive.
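A minimal sketch of the continuity problem: one neighbor moves across the cutoff by 0.02 length units. The hard count is compared with a count weighted by a cosine cutoff function. The cosine form is an assumed, commonly used choice, and the distances are made up for the example.

```python
import math

def hard_count(distances, r_cut):
    """Integer neighbor count with a sharp cutoff."""
    return sum(1 for r in distances if r < r_cut)

def smooth_count(distances, r_cut):
    """Neighbor count weighted by a cosine cutoff (assumed functional form)."""
    return sum(
        0.5 * (math.cos(math.pi * r / r_cut) + 1.0)
        for r in distances
        if r < r_cut
    )

r_cut = 3.0
before = [2.0, 2.0, 2.0, 2.99]  # fourth neighbor just inside the cutoff
after = [2.0, 2.0, 2.0, 3.01]   # same neighbor just outside the cutoff

hard_jump = hard_count(before, r_cut) - hard_count(after, r_cut)
smooth_change = smooth_count(before, r_cut) - smooth_count(after, r_cut)
print("change in hard count:  ", hard_jump)
print("change in smooth count:", smooth_change)
```

The hard count jumps by a full unit, whereas the smoothly weighted count changes only marginally for the same displacement.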

19. Atom-centered symmetry functions (ACSF)

ACSF descriptors construct local features from radial and angular functions centered on atom i. A typical radial term is

G_i^rad = sum_{j != i} exp[-eta (r_ij - R_s)^2] f_c(r_ij)

  • eta controls sensitivity to distance
  • R_s shifts the radial focus
  • f_c smoothly suppresses contributions near the cutoff

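These functions are designed to be invariant under translation, rotation, and permutation of identical neighbors, and to be differentiable with respect to the atomic positions.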
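The radial term can be sketched directly from the formula above. The slides do not specify f_c, so the cosine cutoff used here is an assumption (a common choice in the ACSF literature). The neighbor distances and parameter values are illustrative.

```python
import math

def cosine_cutoff(r, r_cut):
    """Assumed cutoff function: decays smoothly to zero at r_cut."""
    if r >= r_cut:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_cut) + 1.0)

def radial_symmetry_function(distances, eta, r_shift, r_cut):
    """G_i^rad = sum_j exp[-eta (r_ij - R_s)^2] f_c(r_ij) over neighbor distances."""
    return sum(
        math.exp(-eta * (r - r_shift) ** 2) * cosine_cutoff(r, r_cut)
        for r in distances
    )

# Illustrative neighbor distances around one atom (made-up values).
distances = [1.9, 1.9, 2.0, 2.1, 3.5, 4.5]

# One feature per radial shift R_s: each probes a different distance range.
shifts = (1.0, 2.0, 3.0)
features = [
    radial_symmetry_function(distances, eta=4.0, r_shift=rs, r_cut=4.0)
    for rs in shifts
]
for rs, g in zip(shifts, features):
    print(f"R_s = {rs:.1f}: G = {g:.4f}")
```

The feature centered at R_s = 2.0 is the largest because four neighbors sit near that radius. The neighbor at 4.5 lies beyond the cutoff and contributes nothing.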

20. What ACSF is doing conceptually

  • Each symmetry function asks a specific question about the neighborhood.
  • How many neighbors are there near a certain radius?
  • Are there angular triplets with a certain shape?
  • How strongly are these patterns weighted by species and distance?

The result is a hand-designed feature vector whose components have a clear modeling role.

21. Strengths and weaknesses of ACSF

  • Strength: interpretable design logic
  • Strength: smooth and symmetry-aware
  • Strength: well suited for atom-centered models and machine-learned interatomic potentials
  • Weakness: many hyperparameters
  • Weakness: feature design can become manual and problem-specific

ACSF works well when we want controllable features and moderate descriptor complexity.

22. SOAP: smooth overlap of atomic positions

SOAP starts by replacing the discrete neighbor list with a smooth density around each atom:

rho_i(r) = sum_j exp(-|r - r_ij|^2 / (2 sigma^2))

  • each neighbor contributes a Gaussian
  • sigma controls how sharply positions are resolved
  • the local environment becomes a continuous field instead of a list of points
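The role of sigma can be illustrated with a minimal sketch that evaluates the density formula at a chosen point. The two neighbor vectors and the sigma values are assumptions made for the example, and local_density is a hypothetical function name.

```python
import math

def local_density(point, neighbor_vectors, sigma):
    """rho_i(r) = sum_j exp(-|r - r_ij|^2 / (2 sigma^2)) evaluated at one point."""
    total = 0.0
    for r_ij in neighbor_vectors:
        sq_dist = sum((p - q) ** 2 for p, q in zip(point, r_ij))
        total += math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total

# Two neighbors on opposite sides of the central atom (illustrative geometry).
neighbors = [(2.0, 0.0, 0.0), (-2.0, 0.0, 0.0)]

on_neighbor = local_density((2.0, 0.0, 0.0), neighbors, sigma=0.5)
sharp_mid = local_density((0.0, 0.0, 0.0), neighbors, sigma=0.5)
broad_mid = local_density((0.0, 0.0, 0.0), neighbors, sigma=2.0)
print("sigma = 0.5, at a neighbor position:", on_neighbor)
print("sigma = 0.5, midway between them:   ", sharp_mid)
print("sigma = 2.0, midway between them:   ", broad_mid)
```

With a small sigma the two neighbors are resolved as separate peaks. With a large sigma their Gaussians merge, and the density midway between them becomes substantial.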

23. From local density to invariant features

  • The smooth density is expanded in radial basis functions and spherical harmonics.
  • Raw expansion coefficients still depend on orientation.
  • SOAP therefore forms rotationally invariant combinations, often called the power spectrum.

This converts a geometric neighborhood into a descriptor that is both expressive and symmetry-aware.
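In symbols, the two steps can be sketched as follows. The notation is an assumption introduced here: g_n denotes the radial basis functions, Y_lm the spherical harmonics, and c^i_nlm the expansion coefficients for atom i. Normalization prefactors differ between conventions and are omitted.

```latex
\rho_i(\mathbf{r}) \approx \sum_{n,l,m} c^{i}_{nlm}\, g_n(r)\, Y_{lm}(\hat{\mathbf{r}}),
\qquad
p^{i}_{nn'l} = \sum_{m=-l}^{l} \left(c^{i}_{nlm}\right)^{*} c^{i}_{n'lm}
```

Under a rotation, the coefficients with a fixed l mix among their m components through a unitary transformation, so the sum over m in the power spectrum is left unchanged.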

24. SOAP as a kernel similarity

  • SOAP is often used through a normalized kernel.
  • Two environments are similar when their smooth densities overlap strongly after accounting for rotation.
  • This gives a principled similarity score rather than a manually defined distance on raw features.

Conceptually, SOAP turns local geometry into a notion of environment similarity.
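A normalized kernel of this kind can be sketched as a dot product between two descriptor vectors, divided by their norms and raised to a power. The exponent name zeta and the toy three-component vectors are assumptions for illustration. Real power-spectrum vectors are much longer and would come from a descriptor library.

```python
import math

def normalized_kernel(p, q, zeta=2):
    """Normalized dot-product kernel: (p.q / (|p| |q|)) ** zeta."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return (dot / (norm_p * norm_q)) ** zeta

# Toy descriptor vectors for three environments (made-up numbers).
env_a = [1.0, 0.2, 0.0]
env_b = [0.9, 0.3, 0.1]  # slightly distorted version of env_a
env_c = [0.0, 0.1, 1.0]  # very different environment

k_aa = normalized_kernel(env_a, env_a)
k_ab = normalized_kernel(env_a, env_b)
k_ac = normalized_kernel(env_a, env_c)
print(f"k(a, a) = {k_aa:.3f}")
print(f"k(a, b) = {k_ab:.3f}")
print(f"k(a, c) = {k_ac:.3f}")
```

The normalization makes the self-similarity equal to one, and a larger zeta sharpens the contrast between similar and dissimilar environments.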

25. Why SOAP is often powerful

  • Small geometric distortions change the descriptor smoothly.
  • Fine differences in local geometry can still be resolved.
  • Similarity between environments becomes a first-class mathematical object.

That is especially useful when the scientific distinction is not only coordination count, but subtle distortion of a coordination polyhedron.

26. ACSF versus SOAP

  • ACSF is explicit and hand-designed.
  • SOAP is more systematic and often more expressive for geometric similarity.
  • ACSF is easier to interpret feature by feature.
  • SOAP is stronger when subtle distortions matter, but it is computationally heavier and more abstract.

Choosing between them is a scientific tradeoff, not a universal ranking.

27. From atom-level features to material-level features

Let phi_i denote the local descriptor of atom i. For a crystal with many atoms, we need a material-level vector

Phi = pool({phi_i})

This pooling step is essential because many targets, such as bulk modulus or band gap, are defined per material rather than per site.

28. Mean pooling and what it assumes

  • Mean pooling summarizes the average environment in a material.
  • It works best when the target is controlled by a typical or dominant local motif.
  • It fails when a rare motif or defect controls the property.

Mean pooling is therefore simple and efficient, but it can wash out minority environments that matter.

29. Histogram and moment pooling

  • Histogram pooling preserves more information about the distribution of local environments.
  • Moment summaries capture spread and skewness instead of only the mean.
  • Species-resolved pooling preserves chemistry-specific contributions.

These choices matter when properties depend on heterogeneity rather than only the average site.
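A minimal sketch of these pooling variants, assuming every atom already has a descriptor vector of the same length. The two-component descriptors (coordination number, mean bond length), the species labels "A" and "B", and the function name pool are all illustrative.

```python
from statistics import fmean, pstdev

def pool(descriptors, species):
    """Pool per-atom vectors into mean, standard deviation, and per-species means."""
    columns = list(zip(*descriptors))  # one tuple per descriptor component
    pooled = {
        "mean": [fmean(c) for c in columns],
        "std": [pstdev(c) for c in columns],
    }
    for s in sorted(set(species)):
        rows = [d for d, sp in zip(descriptors, species) if sp == s]
        pooled[f"mean_{s}"] = [fmean(c) for c in zip(*rows)]
    return pooled

# Per-atom descriptors: (coordination number, mean bond length), assumed values.
descriptors = [[4.0, 1.9], [6.0, 2.1], [6.0, 2.1]]
species = ["A", "B", "B"]

pooled = pool(descriptors, species)
for name, vector in pooled.items():
    print(name, [round(v, 3) for v in vector])
```

The overall mean blends the tetrahedral A site with the octahedral B sites, while the species-resolved means keep the two chemistries apart.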

30. Worked example: a defect-sensitive material

  • Consider a crystal where most sites are regular, but a small population near vacancies is highly distorted.
  • The average descriptor changes only a little.
  • The tail of the environment distribution changes strongly.

This is a case where histogram-based or defect-aware pooling is more faithful than mean pooling.
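The effect can be sketched with a single scalar distortion measure per site. The numbers are invented: 100 regular sites in the pristine crystal, and 5 of them replaced by strongly distorted sites in the defective one. The bin edges are an arbitrary illustrative choice.

```python
from statistics import fmean

def histogram_pool(values, bin_edges):
    """Fraction of sites in each bin [edge_k, edge_{k+1}); out-of-range values are ignored."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for k in range(len(counts)):
            if bin_edges[k] <= v < bin_edges[k + 1]:
                counts[k] += 1
                break
    return [c / len(values) for c in counts]

# Per-site distortion measure (assumed values, arbitrary units).
pristine = [0.02] * 100
defective = [0.02] * 95 + [0.60] * 5
edges = [0.0, 0.1, 0.3, 1.0]

mean_shift = fmean(defective) - fmean(pristine)
hist_pristine = histogram_pool(pristine, edges)
hist_defective = histogram_pool(defective, edges)
print("shift in mean descriptor:", round(mean_shift, 3))
print("pristine histogram: ", hist_pristine)
print("defective histogram:", hist_defective)
```

The mean moves by only a few hundredths, but the highest bin goes from empty to populated, which is the signal a defect-sensitive model needs.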

31. Local descriptors for phase and motif discrimination

  • Local descriptors are often excellent for classifying structure motifs.
  • They can separate tetrahedral and octahedral environments, ordered and distorted variants, or defect-rich and defect-poor local neighborhoods.
  • Visualizing descriptor space with PCA or UMAP can reveal clusters, but the visualization is only a diagnostic, not a proof of physical truth.

32. Transferability across chemistry families

  • A good descriptor should map chemically similar local motifs to nearby regions of feature space.
  • But transferability is limited when similar local motifs appear inside very different global frameworks.
  • If the target depends on long-range connectivity, a purely local descriptor can create false analogies.

This is descriptor aliasing: local similarity does not always imply property similarity.

33. When local descriptors are not enough

Properties with strong long-range dependence challenge purely local models:

  • band dispersion and topology
  • extended magnetic order
  • elastic anisotropy
  • transport pathways and percolation

In such cases, local environments remain useful, but they must be combined with global or learned relational features.

34. Comparison to graph neural representations

  • Graph models also start from neighborhoods, but they learn the aggregation function rather than fixing it by hand.
  • Classical local descriptors remain valuable because they are interpretable, cheap, and effective in small-data settings.
  • In practice, they are often the baseline that more complex learned representations must beat.

35. Cutoff radius as a scientific hyperparameter

  • If r_c is too small, important neighbors are excluded.
  • If r_c is too large, the descriptor becomes expensive and may include weakly relevant structure.
  • The best cutoff depends on the chemistry and on the target property.

So the cutoff is part of the model, not merely preprocessing boilerplate.

36. Failure mode: parser and periodicity errors

  • Wrong unit-cell parsing or missing periodic images produce nonphysical local neighborhoods.
  • This creates systematic feature noise before any model is trained.
  • A high-capacity model may still fit such data and hide the preprocessing mistake.

The first debugging step should therefore be structural inspection, not hyperparameter tuning.

37. Failure mode: polymorph aliasing

  • Different polymorphs can share very similar local motifs.
  • A local descriptor may then suggest strong similarity even when bulk properties differ strongly.
  • This is particularly dangerous if the target depends on framework connectivity or long-range ordering.

The mitigation is to add global context or use a richer representation.

38. Quality checklist before downstream learning

Before using local descriptors in a predictive model, ask:

  • are neighbor lists physically correct?
  • is the cutoff scientifically justified?
  • does the descriptor vary smoothly under small perturbations?
  • does pooling preserve the motif statistics relevant to the property?
  • will the train-test split prevent near-duplicate structures from leaking across sets?

39. Summary

  • Local environments describe the neighborhood of each atom rather than the entire crystal at once.
  • Coordination, distances, angles, and Voronoi geometry are useful first descriptors.
  • ACSF and SOAP add principled symmetry-aware expressivity.
  • Pooling converts atom-level features into material-level inputs.
  • The main risks are scientific mismatch, bad preprocessing, and missing long-range context.

40. Bridge to Unit 7

  • Once we have a material-level vector, the representation problem becomes a regression problem.
  • But predictive success in materials science depends on more than a low mean error.
  • Unit 7 therefore studies baselines, split design, learning curves, leakage, and out-of-distribution generalization.