Materials Genomics
Unit 11: Clustering vs Discovery in Materials Spaces

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


01. Title: Clustering vs Discovery in Materials Spaces

  • Goal: Transition from exploratory data grouping to actionable materials discovery logic.
  • Scientific Claim: Clustering reveals structural organization; Discovery requires physical hypothesis validation.
  • Role in Workflow: Coarse-graining the vast materials design space into manageable prototype families.
  • “Clustering is a powerful tool, but it is not a discovery proof in itself.”

02. Learning outcomes for Unit 11

By the end of this unit, students can:

  • explain the hard vs. soft assignment logic in K-Means and GMM,
  • implement density-based clustering (DBSCAN) to find “islands of stability,”
  • evaluate cluster validity using Adjusted Rand Index (ARI) and Purity,
  • diagnose artifact clusters (source-bias and scaling issues),
  • connect outliers and novelties to potential materials discovery candidates.

03. Recap: From Units 9 & 10 (Representation & Latent Spaces)

  • Unit 9: How to learn the representation \(\mathcal{E}(x) = z\).
  • Unit 10: The geometry of the continuous latent space \(\mathbb{L}\).
  • Unit 11 (Today): How to partition \(\mathbb{L}\) or raw space \(\mathbb{R}^D\) into discrete, searchable groups.
  • Clustering is the bridge between continuous manifold talk and discrete prototype classification.

04. Similarity and Dissimilarity: The Choice of Metric

  • Similarity-based: Input is an \(N \times N\) dissimilarity matrix \(\mathbf{D}\).
  • Feature-based: Input is an \(N \times D\) design matrix \(\mathbf{X}\) (Murphy 2012).
  • Common Metrics:
    • Euclidean: \(\|\mathbf{x}_i - \mathbf{x}_j\|_2\). Standard for lattice constants.
    • Hamming: \(\sum [x_{ik} \neq x_{jk}]\). Standard for categorical symmetry labels.
    • Mahalanobis: Accounts for feature correlations using the covariance matrix.
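
The three metrics above can be sketched in a few lines of NumPy. The feature vectors and symmetry labels below are illustrative placeholders, not real compounds:

```python
import numpy as np

# Two hypothetical materials described by (lattice constant a [Å], density [g/cm^3]).
x_i = np.array([3.9, 5.2])
x_j = np.array([4.1, 6.1])

# Euclidean distance: standard for continuous descriptors like lattice constants.
d_euclid = float(np.linalg.norm(x_i - x_j))

# Hamming distance on categorical symmetry labels: count of mismatched entries.
sym_i = np.array(["cubic", "Fm-3m", "octahedral"])
sym_j = np.array(["cubic", "Pm-3m", "octahedral"])
d_hamming = int(np.sum(sym_i != sym_j))

# Mahalanobis distance: rescale by the inverse dataset covariance so that
# correlated features are not double-counted in the metric.
X = np.array([[3.9, 5.2], [4.1, 6.1], [4.0, 5.6], [3.8, 5.0], [4.2, 6.3]])
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = x_i - x_j
d_mahal = float(np.sqrt(diff @ cov_inv @ diff))
```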

05. Why the Metric is the Most Important Choice

  • Scaling Sensitivity: If density is in \(g/cm^3\) (order 1) and volume is in \(\text{\AA}^3\) (order 100), volume will dominate the Euclidean distance.
  • Feature Weighting: Implicitly decides which physical property “matters” more for family definition.
  • Rule: Clustering is only as meaningful as the metric space it operates in.

06. K-Means: The “Hard Assignment” Baseline

  • Partitions \(N\) materials into \(K\) disjoint sets by minimizing distortion \(J\): \[ J = \sum_{n=1}^N \sum_{k=1}^K r_{nk} \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2 \]
  • \(r_{nk} \in \{0, 1\}\) is a “hard” 1-of-K indicator variable.
  • Scientific Role: Each cluster center \(\boldsymbol{\mu}_k\) represents a structural prototype (Bishop 2006).

07. The EM Algorithm for K-Means: E-step (Assignment)

  • Fix prototypes \(\boldsymbol{\mu}_k\).
  • Assign each data point \(x_n\) to the nearest prototype: \[ r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2 \\ 0 & \text{otherwise} \end{cases} \]
  • Geometrically, this creates a Voronoi partitioning of the materials space.

08. The EM Algorithm for K-Means: M-step (Update)

  • Fix assignments \(r_{nk}\).
  • Update prototypes to be the mean of their assigned points: \[ \boldsymbol{\mu}_k = \frac{\sum_n r_{nk} \mathbf{x}_n}{\sum_n r_{nk}} \]
  • Guaranteed to decrease \(J\) at every step until a local minimum is reached (Bishop 2006).
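
The alternating E- and M-steps above fit in a few lines of NumPy. This is a minimal sketch on synthetic 2-D data with prototypes seeded deterministically for reproducibility, not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "materials" in a 2-D descriptor space: two well-separated families.
X = np.vstack([rng.normal([0, 0], 0.3, (50, 2)),
               rng.normal([3, 3], 0.3, (50, 2))])

K = 2
mu = X[[0, 50]].copy()  # initial prototypes (one seeded in each family for the demo)

for _ in range(20):
    # E-step: hard-assign each point to its nearest prototype (Voronoi cells).
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
    labels = d2.argmin(axis=1)
    # M-step: move each prototype to the mean of its assigned points.
    new_mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_mu, mu):
        break
    mu = new_mu

distortion = float(d2[np.arange(len(X)), labels].sum())  # the objective J
```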

09. K-Means Limitations in Materials Space

  • Spherical Bias: Assumes clusters are roughly equal-sized and isotropic. Chemical families are often elongated.
  • Sensitivity to Outliers: One “extreme” structure (e.g., a massive unit cell) can pull the cluster center away from the family core.
  • Fixed \(K\): Requires knowing the number of families in advance, which is rarely true in discovery.

10. Choosing \(K\): The Elbow and Silhouette Score

  • Elbow Method: Plot \(J\) vs. \(K\). Look for the “kink” where adding clusters gives diminishing returns.
  • Silhouette Score: Measures how well-separated and cohesive clusters are. \[ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \]
  • In MG, \(K\) is often guided by known crystal structure archetypes (e.g., perovskites, garnets).
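
A sketch of a Silhouette-based scan over \(K\) using scikit-learn, on synthetic data with three planted families (the data and the range of \(K\) are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Three synthetic prototype families in a 2-D descriptor space.
X = np.vstack([rng.normal(c, 0.25, (40, 2)) for c in [(0, 0), (3, 0), (0, 3)]])

# Scan K and record the Silhouette score for each clustering.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # K with the best separation/cohesion tradeoff
```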

11. Mixture Models: The Probabilistic View

  • Gaussian Mixture Model (GMM) assumes data follows \(K\) Gaussians: \[ p(\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \]
  • \(\pi_k\): Prior probability of belonging to cluster \(k\).
  • Soft Assignment: Instead of a hard label, we compute the responsibility \(\gamma(z_{nk})\) that material \(n\) belongs to cluster \(k\) (Bishop 2006).
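
Responsibilities can be read off directly from a fitted scikit-learn `GaussianMixture` via `predict_proba`. A minimal 1-D sketch with two overlapping synthetic families:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Two overlapping families; points between them get ambiguous responsibilities.
X = np.vstack([rng.normal(0.0, 0.5, (100, 1)),
               rng.normal(2.0, 0.5, (100, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
gamma = gmm.predict_proba(X)  # responsibilities gamma(z_nk); each row sums to 1

# A point midway between the two means sits near the "manifold boundary".
boundary = gmm.predict_proba([[1.0]])[0]
```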

12. Soft Assignments and Phase Boundaries

  • A material with \(\gamma(\text{Cluster 1}) = 0.51\) and \(\gamma(\text{Cluster 2}) = 0.49\) sits on a “manifold boundary.”
  • Materials Context: This characterizes materials near phase transitions or those with ambiguous symmetry.
  • Soft assignments are more physically honest for discovery than hard K-Means boundaries.

13. Hierarchical Clustering: Materials Taxonomies

  • Creates a nested tree (dendrogram) of partitions.
  • Agglomerative (Bottom-up): Start with each material as its own cluster, merge the most similar pairs.
  • Linkage criteria: Single (min distance), Complete (max distance), Average.
  • Benefit: Reveals “taxonomies” of materials, mirroring the hierarchical nature of chemistry (Murphy 2012).
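
Agglomerative clustering and dendrogram cutting are both available in SciPy. A sketch on synthetic data with a planted two-level taxonomy (two broad classes, each with two sub-families):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
# Nested structure: two broad "classes", each split into two sub-families.
centers = [(0, 0), (1, 0), (8, 0), (9, 0)]
X = np.vstack([rng.normal(c, 0.15, (20, 2)) for c in centers])

Z = linkage(X, method="average")  # bottom-up merging with average linkage

coarse = fcluster(Z, t=2, criterion="maxclust")  # "cut high": 2 broad classes
fine = fcluster(Z, t=4, criterion="maxclust")    # "cut low": 4 sub-families
```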

14. Dendrogram Cutting and Discovery Strategy

  • Cutting High: Broad material classes (e.g., Halides vs. Chalcogenides).
  • Cutting Low: Fine-grained structural nuances (e.g., different tilting patterns in the same prototype).
  • Hierarchical clustering is deterministic and doesn’t require pre-specifying \(K\)—you choose the “discovery resolution” post-hoc.

15. Density-Based Clustering: DBSCAN

  • Idea: Clusters are contiguous high-density regions separated by low-density noise (Neuer et al. 2024).
  • Hyperparameters: \(\epsilon\) (distance radius) and min_samples.
  • DBSCAN Pros:
    • Finds clusters of arbitrary shape (elongated chemical trends).
    • Robust to outliers.
    • Automatically identifies “noise” (atypical materials).
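
A sketch of DBSCAN's behavior with scikit-learn: two dense synthetic "islands" plus hand-placed outliers (the \(\epsilon\) and `min_samples` values are tuned to this toy data, not general recommendations):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)
# Two dense "islands of stability" plus a handful of isolated outliers.
islands = np.vstack([rng.normal((0, 0), 0.1, (50, 2)),
                     rng.normal((2, 2), 0.1, (50, 2))])
outliers = np.array([[5.0, -3.0], [-4.0, 4.0], [7.0, 7.0]])
X = np.vstack([islands, outliers])

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

n_clusters = len(set(labels.tolist()) - {-1})  # -1 marks noise / atypical materials
n_noise = int((labels == -1).sum())
```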

16. DBSCAN: Finding “Islands of Stability”

  • High Density: Regions where many structural variants exist \(\approx\) stable chemical prototypes.
  • Low Density: Regions of structural instability or unexplored chemistry.
  • Identifying “valleys” in the chemical manifold helps target regions for high-throughput screening.

17. SVD and PCA as Clustering Preprocessors

  • High-dimensional clustering fails because distances “concentrate” (all points look equidistant).
  • Workflow:
    1. Perform SVD/PCA on raw descriptors (Unit 4).
    2. Retain top \(L\) components covering \(>95\%\) variance.
    3. Cluster in this reduced space (McClarren 2021).
  • This focuses the algorithm on the physical drivers of variation, not the high-dimensional noise.
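
The three-step workflow above can be sketched with scikit-learn; note that `PCA(n_components=0.95)` directly implements "retain enough components for >95% variance." The 50-dimensional data below is synthetic, with two informative directions and 48 noise dimensions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# 2 informative descriptor directions + 48 pure-noise dimensions.
signal = np.vstack([rng.normal((0, 0), 0.3, (60, 2)),
                    rng.normal((4, 4), 0.3, (60, 2))])
noise = rng.normal(0, 0.05, (120, 48))
X = np.hstack([signal, noise])

# Step 1-2: PCA, keeping the smallest number of components covering >95% variance.
pca = PCA(n_components=0.95).fit(X)
Z = pca.transform(X)

# Step 3: cluster in the reduced space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
```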

18. Variance Explained vs. Clustering Resolution

  • Tradeoff: Dropping PCA components removes noise but may smooth over the subtle structural detail that defines a new family.
  • Diagnostic: Plot cluster stability as a function of retained PCA variance.
  • If clusters change drastically with one more PC, the discovery claim is unstable.

19. t-SNE: Preserving Local Neighborhoods

  • Mapping high-D to 2D by preserving neighborhood probabilities \(p(i|j)\) (Neuer et al. 2024).
  • KL-divergence minimization: \[ \text{KL}(p||\tilde{p}) = \sum p(i|j) \log \frac{p(i|j)}{\tilde{p}(i|j)} \]
  • Warning: t-SNE distances are not physical; local density in the plot is not local density in the data.

20. UMAP: The Current Discovery Standard

  • UMAP (Uniform Manifold Approximation and Projection).
  • Generally faster than t-SNE and better at preserving global structure (the relative positions of distant families).
  • Standard: Used to map the entire Materials Project (150k+ compounds) to see the “periodic table of structure families.”

21. Artifact Clusters: The Source Bias Problem

  • Case Study: Clustering a dataset combining VASP/PBE and Quantum Espresso/SCAN calculations.
  • Failure: The algorithm finds two giant clusters corresponding to the software used, not the chemistry.
  • Rule: Normalization must be done per-source to remove simulation artifacts before discovery.

22. Scaling Artifacts: The “Unit” Trap

  • Formation Energy: \(-10\) to \(0\) eV.
  • Unit Cell Volume: \(50\) to \(500\) \(\text{\AA}^3\).
  • Without Z-score standardization (\(x' = \frac{x - \mu}{\sigma}\)), Volume will completely define the distance metric.
  • Lesson: Feature engineering (Unit 4) is the prerequisite for unsupervised discovery.
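
A two-line NumPy demonstration of the trap and its fix; the feature ranges are the hypothetical ones quoted above:

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical features on wildly different scales:
e_form = rng.uniform(-10.0, 0.0, 100)   # formation energy [eV], order 1-10
volume = rng.uniform(50.0, 500.0, 100)  # unit cell volume [Å^3], order 100
X = np.column_stack([e_form, volume])

# Raw feature ranges: volume spans a far larger interval, so it dominates
# any Euclidean distance computed on the raw matrix.
span = X.max(axis=0) - X.min(axis=0)

# Z-score standardization: x' = (x - mean) / std, applied per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```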

23. Raw Descriptors (Unit 4) vs. Latent Codes (Unit 10)

  • Raw Space: Interpretable, but biased by human selection of what “matters.”
  • Latent Space: Captures “deep” structural relationships descriptors miss, but harder to explain.
  • Validation: Compare Adjusted Rand Index (ARI) of both cluster sets against known crystal systems.

24. Visualization Hallucinations

  • t-SNE/UMAP can show clusters even in perfectly random noise if hyperparameters are pushed.
  • Visual inspection is the beginning of discovery, not the proof.
  • Claims of “new families” must be backed by property correlation metrics, not just “clean” plots (Neuer et al. 2024).

25. External Validation: Purity and Rand Index

  • Purity: Average proportion of the majority class in each cluster.
  • Adjusted Rand Index (ARI): \[ \text{ARI} = \frac{\text{Index} - \text{Expected Index}}{\text{Max Index} - \text{Expected Index}} \]
  • High ARI against space groups/prototypes validates that the model has “learned” crystallography (Murphy 2012).
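
Both metrics are easy to compute: Purity from the cluster/label contingency counts, ARI via scikit-learn. The tiny label vectors below are a made-up example, standing in for predicted cluster IDs vs. known space-group labels:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def purity(labels_true, labels_pred):
    """Average proportion of the majority true class in each predicted cluster."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        total += np.bincount(members).max()  # size of the majority class
    return total / len(labels_true)

# Toy example: predicted clusters vs. "space group" ground truth.
true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
pred = np.array([0, 0, 0, 1, 1, 1, 1, 1])

p = purity(true, pred)
ari = adjusted_rand_score(true, pred)  # chance-corrected agreement
```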

26. Case Study: Discovery in the Materials Project

  • Objective: Find rare coordination environments in 150,000 materials.
  • Method: SOAP descriptors (Unit 6) + UMAP + DBSCAN.
  • Outcome: Identification of 50+ previously unlabeled structure archetypes.
  • Success: The “clusters” led directly to new entries in the inorganic structure database.

27. Case Study: Clustering 2004 Cars (Murphy 12.2)

  • Intuition building: Axes of Price, MPG, and Weight.
  • Discovery: “The Economy/SUV/Luxury” triad.
  • Analogy: In Materials Genomics, we replace these with “Conductivity/Stability/Stiffness” to find our “Luxury” materials.

28. Case Study: Spectral Clustering (McClarren 4.2)

  • Clustering plant species from hyperspectral leaf images.
  • Latent factors = Chlorophyll + Water content.
  • MG Bridge: We cluster XRD or EELS spectra to find structural phases in high-throughput experimental data.

29. The Discovery Objective: Selecting the “Golden Cluster”

  • Once clusters are found, we overlay properties:
    • Which cluster has the highest mean bandgap?
    • Which cluster is the most stable (\(E_{hull} \to 0\))?
  • Discovery: The “Golden Cluster” represents a structural family with a high probability of yielding high-performance materials.

31. Uncertainty in Cluster Assignments

  • Use GMM “responsibilities” as a proxy for assignment uncertainty.
  • If a material has no dominant cluster, it lives in a “structurally ambiguous” region.
  • Risk: Discovery claims in high-uncertainty regions require the most validation.

32. Robustness: The Bootstrap Check

  • Run clustering on different 90% subsets of the data.
  • Do the same materials always end up together?
  • Stability: If clusters are unstable, the “discovery” is likely a sampling artifact.

33. The Silhouette Score Trap

  • High Silhouette \(\neq\) Physical Meaning.
  • You can get a perfect Silhouette score on clusters that only represent “Database Source” (Slide 21).
  • Rule: Silhouette score is an internal metric; discovery requires external physical validation.

34. Feature Importance for Clustering

  • Which descriptor (Unit 4) drives the cluster separation?
  • Method: Train a Random Forest classifier to predict “Cluster ID” from raw features.
  • High importance for “Atomic Volume” means your discovery is likely driven by size effects.
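
The method above, sketched with scikit-learn on synthetic data where feature 0 (standing in for "Atomic Volume") is constructed to drive the separation and feature 1 is pure noise:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
# Feature 0 ("atomic volume" stand-in) separates the families; feature 1 is noise.
X = np.vstack([np.column_stack([rng.normal(0, 0.2, 80), rng.normal(0, 1, 80)]),
               np.column_stack([rng.normal(5, 0.2, 80), rng.normal(0, 1, 80)])])

cluster_id = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Train a classifier to predict Cluster ID, then inspect which feature drives it.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, cluster_id)
importances = rf.feature_importances_
```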

35. Over-Clustering: Structure from Noise

  • Clustering algorithms will find structure in random data if forced.
  • Permutation Test: Cluster shuffled data. If you still see “structure,” your original discovery is invalid.
  • Discovery must be “statistically significant” compared to the null hypothesis.
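
One way to sketch the permutation test: shuffle each feature column independently to destroy the joint structure (while keeping the marginal distributions), re-cluster, and compare Silhouette scores. The data and null-model details here are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(8)
# Data with genuine structure: two synthetic families.
X = np.vstack([rng.normal((0, 0), 0.3, (60, 2)),
               rng.normal((3, 3), 0.3, (60, 2))])

def kmeans_silhouette(data):
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    return silhouette_score(data, labels)

s_real = kmeans_silhouette(X)

# Null model: permute each column independently, breaking feature correlations.
s_null = []
for _ in range(20):
    X_perm = np.column_stack([rng.permutation(col) for col in X.T])
    s_null.append(kmeans_silhouette(X_perm))

# Fraction of null runs that match or beat the real score (a p-value proxy).
p_like = float(np.mean([s >= s_real for s in s_null]))
```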

36. Interpreting Prototypes: Centroids vs. Medoids

  • Centroid: The mathematical mean \(\boldsymbol{\mu}_k\). May not be a real material.
  • Medoid: The actual material in the dataset closest to the center.
  • Discovery Step: Use the Medoid as the “representative” candidate of the newly discovered family.
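
The Medoid is a one-liner once pairwise distances are available; a NumPy sketch on a single synthetic cluster:

```python
import numpy as np

rng = np.random.default_rng(9)
# One cluster of hypothetical materials in a 2-D descriptor space.
X = rng.normal((1.0, 2.0), 0.4, (30, 2))

centroid = X.mean(axis=0)  # the mathematical mean; may not be a real material

# Medoid: the actual dataset member minimizing total distance to all others.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
medoid_idx = int(D.sum(axis=1).argmin())
medoid = X[medoid_idx]  # a genuine candidate structure to hand to validation
```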

37. Bridging to Latent Traversals (Unit 10)

  • Cluster \(\to\) Medoid \(\to\) Traversal.
  • Once a cluster of interest is identified, explore its neighborhood by “morphing” the Medoid structure.
  • Local Discovery: Unsupervised grouping defines the “where,” traversal explores the “how.”

38. Cluster-Target Correlation

  • Measure the \(R^2\) of property \(y\) explained by Cluster ID.
  • If Cluster ID explains 80% of \(y\) variation, the structural grouping is a physically valid discovery lead.

39. Handling Categorical Descriptors

  • Mixing atomic radius (continuous) and crystal system (categorical).
  • Use Gower’s distance or MCA (Multiple Correspondence Analysis).
  • Prevents categorical “bins” from drowning out subtle structural signals.

40. The Role of Whitening and Decoupling

  • Features are often highly correlated (e.g., atomic volume and radius).
  • Whitening: Transforming features to have zero correlation and unit variance.
  • Prevents the same physical signal from being “double-counted” in the distance metric.

41. Anomaly vs. Novelty Detection (Neuer 5.5.3)

  • Anomaly: Likely a simulation or measurement error.
  • Novelty: A physically plausible but “unusual” structure type.
  • Filter: Unit 9 Autoencoders filter anomalies; Unit 11 Clustering identifies novelties.

42. Domain-Aware Outlier Filtering

  • Is the outlier just a material with a massive unit cell?
  • Normalization: Ensure outliers are chemically/structurally novel, not just computationally complex or large.

43. Failure Mode: Stoichiometry Shortcut

  • If composition is included in features, clusters often just recreate the Periodic Table.
  • Goal: Find clusters driven by bonding and symmetry, even when stoichiometry is identical.

44. Failure Mode: Feature Saturation

  • Using 1,000+ descriptors (Unit 4) makes every material look unique.
  • The “Curse of Dimensionality” leads to a single giant cluster or \(N\) single-point clusters.
  • Solution: Rigorous dimensionality reduction before clustering.

45. Summary of the Discovery Logic

  • Clustering: Organizes the structure backlog.
  • Prototypes: Define the “average” of a new family.
  • Outliers: Signal structural novelty.
  • Metrics: Validate physical consistency.

46. Next Unit Bridge: Uncertainty-Aware Discovery

  • Unit 11 used “hard” or “soft” boundaries.
  • Unit 12 replaces boundaries with continuous Uncertainty Maps.
  • Rule: Cluster for exploration, model uncertainty for exploitation.

47. Exercise Task 1: Spectral Clustering

  • Dataset: Leaf spectra from McClarren.
  • Task: Compare K-Means vs. DBSCAN on SVD-reduced spectra.
  • Deliverable: Plot Silhouette Score vs. \(K\).

48. Exercise Task 2: Structural Clustering

  • Dataset: Materials Project formation energy subset.
  • Task: Cluster structures using stoichiometry vs. SOAP descriptors.
  • Deliverable: Adjusted Rand Index (ARI) against space group labels.

49. Exercise Task 3: Artifact Detection

  • Task: Intentionally mis-scale a feature (e.g., multiply volume by 1000).
  • Deliverable: Before/after UMAP visualization showing the cluster map collapse.

50. Exam Checklist: Evidence-backed Discovery

References
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
McClarren, Ryan G. 2021. Machine Learning for Engineers: Using Data to Solve Problems for Physical Systems. Springer.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.