Materials Genomics
Unit 11: Clustering vs Discovery in Materials Spaces

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


01. Title: Clustering vs Discovery in Materials Spaces

  • Goal: Transition from exploratory data grouping to actionable materials discovery logic.
  • Scientific Claim: Clustering reveals structural organization; Discovery requires physical hypothesis validation.
  • Role in Workflow: Coarse-graining the vast materials design space into manageable prototype families.
  • “Clustering is a powerful tool, but it is not a discovery proof in itself.”

02. Learning outcomes for Unit 11

By the end of this unit, students can:

  • explain the hard vs. soft assignment logic in K-Means and GMM,
  • implement density-based clustering (DBSCAN) to find “islands of stability,”
  • evaluate cluster validity using Adjusted Rand Index (ARI) and Purity,
  • diagnose artifact clusters (source-bias and scaling issues),
  • connect outliers and novelties to potential materials discovery candidates.

03. Recap: From Units 9 & 10 (Representation & Latent Spaces)

  • Unit 9: How to learn the representation \(\mathcal{E}(x) = z\).
  • Unit 10: The geometry of the continuous latent space \(\mathbb{L}\).
  • Unit 11 (Today): How to partition \(\mathbb{L}\) or raw space \(\mathbb{R}^D\) into discrete, searchable groups.
  • Clustering is the bridge between continuous manifold talk and discrete prototype classification.

04. Similarity and Dissimilarity: The Choice of Metric

  • Similarity-based: Input is an \(N \times N\) dissimilarity matrix \(\mathbf{D}\).
  • Feature-based: Input is an \(N \times D\) design matrix \(\mathbf{X}\) (Murphy 2012).
  • Common Metrics:
    • Euclidean: \(\|\mathbf{x}_i - \mathbf{x}_j\|_2\). Standard for lattice constants.
    • Hamming: \(\sum [x_{ik} \neq x_{jk}]\). Standard for categorical symmetry labels.
    • Mahalanobis: Accounts for feature correlations using the covariance matrix.
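
The three metrics above can be sketched in a few lines of NumPy. The feature vectors and symmetry labels below are illustrative placeholders, not real compounds:

```python
import numpy as np

# Two hypothetical materials described by (lattice constant a [Å], density [g/cm^3]).
x_i = np.array([3.9, 5.2])
x_j = np.array([4.1, 6.1])

# Euclidean distance: standard for continuous descriptors like lattice constants.
d_euclid = float(np.linalg.norm(x_i - x_j))

# Hamming distance on categorical symmetry labels: count of mismatched entries.
sym_i = np.array(["cubic", "Fm-3m", "octahedral"])
sym_j = np.array(["cubic", "Pm-3m", "octahedral"])
d_hamming = int(np.sum(sym_i != sym_j))

# Mahalanobis distance: rescale by the inverse dataset covariance so that
# correlated features are not double-counted in the metric.
X = np.array([[3.9, 5.2], [4.1, 6.1], [4.0, 5.6], [3.8, 5.0], [4.2, 6.3]])
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = x_i - x_j
d_mahal = float(np.sqrt(diff @ cov_inv @ diff))
```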

05. Why the Metric is the Most Important Choice

  • Scaling Sensitivity: If density is in \(g/cm^3\) (order 1) and volume is in \(\text{\AA}^3\) (order 100), volume will dominate the Euclidean distance.
  • Feature Weighting: Implicitly decides which physical property “matters” more for family definition.
  • Rule: Clustering is only as meaningful as the metric space it operates in.

06. K-Means: The “Hard Assignment” Baseline

  • Partitions \(N\) materials into \(K\) disjoint sets by minimizing distortion \(J\): \[ J = \sum_{n=1}^N \sum_{k=1}^K r_{nk} \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2 \]
  • \(r_{nk} \in \{0, 1\}\) is a “hard” 1-of-K indicator variable.
  • Scientific Role: Each cluster center \(\boldsymbol{\mu}_k\) represents a structural prototype (Bishop 2006).

07. The EM Algorithm for K-Means: E-step (Assignment)

  • Fix prototypes \(\boldsymbol{\mu}_k\).
  • Assign each data point \(x_n\) to the nearest prototype: \[ r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2 \\ 0 & \text{otherwise} \end{cases} \]
  • Geometrically, this creates a Voronoi partitioning of the materials space.

08. The EM Algorithm for K-Means: M-step (Update)

  • Fix assignments \(r_{nk}\).
  • Update prototypes to be the mean of their assigned points: \[ \boldsymbol{\mu}_k = \frac{\sum_n r_{nk} \mathbf{x}_n}{\sum_n r_{nk}} \]
  • Guaranteed to decrease \(J\) at every step until a local minimum is reached (Bishop 2006).
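
The alternating E- and M-steps above fit in a few lines of NumPy. This is a minimal sketch on synthetic 2-D data with prototypes seeded deterministically for reproducibility, not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "materials" in a 2-D descriptor space: two well-separated families.
X = np.vstack([rng.normal([0, 0], 0.3, (50, 2)),
               rng.normal([3, 3], 0.3, (50, 2))])

K = 2
mu = X[[0, 50]].copy()  # initial prototypes (one seeded in each family for the demo)

for _ in range(20):
    # E-step: hard-assign each point to its nearest prototype (Voronoi cells).
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
    labels = d2.argmin(axis=1)
    # M-step: move each prototype to the mean of its assigned points.
    new_mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_mu, mu):
        break
    mu = new_mu

distortion = float(d2[np.arange(len(X)), labels].sum())  # the objective J
```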

09. K-Means Limitations in Materials Space

  • Spherical Bias: Assumes clusters are roughly equal-sized and isotropic. Chemical families are often elongated.
  • Sensitivity to Outliers: One “extreme” structure (e.g., a massive unit cell) can pull the cluster center away from the family core.
  • Fixed \(K\): Requires knowing the number of families in advance, which is rarely true in discovery.

10. Choosing \(K\): The Elbow and Silhouette Score

  • Elbow Method: Plot \(J\) vs. \(K\). Look for the “kink” where adding clusters gives diminishing returns.
  • Silhouette Score: Measures how well-separated and cohesive clusters are. \[ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \]
  • In MG, \(K\) is often guided by known crystal structure archetypes (e.g., perovskites, garnets).
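
A sketch of a Silhouette-based scan over \(K\) using scikit-learn, on synthetic data with three planted families (the data and the range of \(K\) are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Three synthetic prototype families in a 2-D descriptor space.
X = np.vstack([rng.normal(c, 0.25, (40, 2)) for c in [(0, 0), (3, 0), (0, 3)]])

# Scan K and record the Silhouette score for each clustering.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # K with the best separation/cohesion tradeoff
```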

11. Mixture Models: The Probabilistic View

  • Gaussian Mixture Model (GMM) assumes data follows \(K\) Gaussians: \[ p(\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \]
  • \(\pi_k\): Prior probability of belonging to cluster \(k\).
  • Soft Assignment: Instead of a hard label, we compute the responsibility \(\gamma(z_{nk})\) that material \(n\) belongs to cluster \(k\) (Bishop 2006).
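
Responsibilities can be read off directly from a fitted scikit-learn `GaussianMixture` via `predict_proba`. A minimal 1-D sketch with two overlapping synthetic families:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Two overlapping families; points between them get ambiguous responsibilities.
X = np.vstack([rng.normal(0.0, 0.5, (100, 1)),
               rng.normal(2.0, 0.5, (100, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
gamma = gmm.predict_proba(X)  # responsibilities gamma(z_nk); each row sums to 1

# A point midway between the two means sits near the "manifold boundary".
boundary = gmm.predict_proba([[1.0]])[0]
```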

12. Soft Assignments and Phase Boundaries

  • A material with \(\gamma(\text{Cluster 1}) = 0.51\) and \(\gamma(\text{Cluster 2}) = 0.49\) sits on a “manifold boundary.”
  • Materials Context: This characterizes materials near phase transitions or those with ambiguous symmetry.
  • Soft assignments are more physically honest for discovery than hard K-Means boundaries.

13. Hierarchical Clustering: Materials Taxonomies

  • Creates a nested tree (dendrogram) of partitions.
  • Agglomerative (Bottom-up): Start with each material as its own cluster, merge the most similar pairs.
  • Linkage criteria: Single (min distance), Complete (max distance), Average.
  • Benefit: Reveals “taxonomies” of materials, mirroring the hierarchical nature of chemistry (Murphy 2012).
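
Agglomerative clustering and dendrogram cutting are both available in SciPy. A sketch on synthetic data with a planted two-level taxonomy (two broad classes, each with two sub-families):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
# Nested structure: two broad "classes", each split into two sub-families.
centers = [(0, 0), (1, 0), (8, 0), (9, 0)]
X = np.vstack([rng.normal(c, 0.15, (20, 2)) for c in centers])

Z = linkage(X, method="average")  # bottom-up merging with average linkage

coarse = fcluster(Z, t=2, criterion="maxclust")  # "cut high": 2 broad classes
fine = fcluster(Z, t=4, criterion="maxclust")    # "cut low": 4 sub-families
```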

14. Dendrogram Cutting and Discovery Strategy

  • Cutting High: Broad material classes (e.g., Halides vs. Chalcogenides).
  • Cutting Low: Fine-grained structural nuances (e.g., different tilting patterns in the same prototype).
  • Hierarchical clustering is deterministic and doesn’t require pre-specifying \(K\)—you choose the “discovery resolution” post-hoc.

15. Density-Based Clustering: DBSCAN

  • Idea: Clusters are contiguous high-density regions separated by low-density noise (Neuer et al. 2024).
  • Hyperparameters: \(\epsilon\) (distance radius) and min_samples.
  • DBSCAN Pros:
    • Finds clusters of arbitrary shape (elongated chemical trends).
    • Robust to outliers.
    • Automatically identifies “noise” (atypical materials).
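
A sketch of DBSCAN's behavior with scikit-learn: two dense synthetic "islands" plus hand-placed outliers (the \(\epsilon\) and `min_samples` values are tuned to this toy data, not general recommendations):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)
# Two dense "islands of stability" plus a handful of isolated outliers.
islands = np.vstack([rng.normal((0, 0), 0.1, (50, 2)),
                     rng.normal((2, 2), 0.1, (50, 2))])
outliers = np.array([[5.0, -3.0], [-4.0, 4.0], [7.0, 7.0]])
X = np.vstack([islands, outliers])

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

n_clusters = len(set(labels.tolist()) - {-1})  # -1 marks noise / atypical materials
n_noise = int((labels == -1).sum())
```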

16. DBSCAN: Finding “Islands of Stability”

  • High Density: Regions where many structural variants exist \(\approx\) stable chemical prototypes.
  • Low Density: Regions of structural instability or unexplored chemistry.
  • Identifying “valleys” in the chemical manifold helps target regions for high-throughput screening.

17. SVD and PCA as Clustering Preprocessors

  • High-dimensional clustering fails because distances “concentrate” (all points look equidistant).
  • Workflow:
    1. Perform SVD/PCA on raw descriptors (Unit 4).
    2. Retain top \(L\) components covering \(>95\%\) variance.
    3. Cluster in this reduced space (McClarren 2021).
  • This focuses the algorithm on the physical drivers of variation, not the high-dimensional noise.
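
The three-step workflow above can be sketched with scikit-learn; note that `PCA(n_components=0.95)` directly implements "retain enough components for >95% variance." The 50-dimensional data below is synthetic, with two informative directions and 48 noise dimensions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# 2 informative descriptor directions + 48 pure-noise dimensions.
signal = np.vstack([rng.normal((0, 0), 0.3, (60, 2)),
                    rng.normal((4, 4), 0.3, (60, 2))])
noise = rng.normal(0, 0.05, (120, 48))
X = np.hstack([signal, noise])

# Step 1-2: PCA, keeping the smallest number of components covering >95% variance.
pca = PCA(n_components=0.95).fit(X)
Z = pca.transform(X)

# Step 3: cluster in the reduced space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
```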

18. Variance Explained vs. Clustering Resolution

  • Tradeoff: Dropping PCA components removes noise but may smooth over the subtle structural detail that defines a new family.
  • Diagnostic: Plot cluster stability as a function of retained PCA variance.
  • If clusters change drastically with one more PC, the discovery claim is unstable.

19. t-SNE: Preserving Local Neighborhoods

  • Mapping high-D to 2D by preserving neighborhood probabilities \(p(i|j)\) (Neuer et al. 2024).
  • KL-divergence minimization: \[ \text{KL}(p||\tilde{p}) = \sum p(i|j) \log \frac{p(i|j)}{\tilde{p}(i|j)} \]
  • Warning: t-SNE distances are not physical; local density in the plot is not local density in the data.

20. UMAP: The Current Discovery Standard

  • UMAP (Uniform Manifold Approximation and Projection).
  • Generally faster than t-SNE and better at preserving global structure (the relative positions of distant families).
  • Standard: Used to map the entire Materials Project (150k+ compounds) to see the “periodic table of structure families.”

21. Artifact Clusters: The Source Bias Problem

  • Case Study: Clustering a dataset combining VASP/PBE and Quantum Espresso/SCAN calculations.
  • Failure: The algorithm finds two giant clusters corresponding to the software used, not the chemistry.
  • Rule: Normalization must be done per-source to remove simulation artifacts before discovery.

22. Scaling Artifacts: The “Unit” Trap

  • Formation Energy: \(-10\) to \(0\) eV.
  • Unit Cell Volume: \(50\) to \(500\) \(\text{\AA}^3\).
  • Without Z-score standardization (\(x' = \frac{x - \mu}{\sigma}\)), Volume will completely define the distance metric.
  • Lesson: Feature engineering (Unit 4) is the prerequisite for unsupervised discovery.
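
A two-line NumPy demonstration of the trap and its fix; the feature ranges are the hypothetical ones quoted above:

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical features on wildly different scales:
e_form = rng.uniform(-10.0, 0.0, 100)   # formation energy [eV], order 1-10
volume = rng.uniform(50.0, 500.0, 100)  # unit cell volume [Å^3], order 100
X = np.column_stack([e_form, volume])

# Raw feature ranges: volume spans a far larger interval, so it dominates
# any Euclidean distance computed on the raw matrix.
span = X.max(axis=0) - X.min(axis=0)

# Z-score standardization: x' = (x - mean) / std, applied per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```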

23. Raw Descriptors (Unit 4) vs. Latent Codes (Unit 10)

  • Raw Space: Interpretable, but biased by human selection of what “matters.”
  • Latent Space: Captures “deep” structural relationships descriptors miss, but harder to explain.
  • Validation: Compare Adjusted Rand Index (ARI) of both cluster sets against known crystal systems.

24. Visualization Hallucinations

  • t-SNE/UMAP can show clusters even in perfectly random noise if hyperparameters are pushed.
  • Visual inspection is the beginning of discovery, not the proof.
  • Claims of “new families” must be backed by property correlation metrics, not just “clean” plots (Neuer et al. 2024).

25. External Validation: Purity and Rand Index

  • Purity: Average proportion of the majority class in each cluster.
  • Adjusted Rand Index (ARI): \[ \text{ARI} = \frac{\text{Index} - \text{Expected Index}}{\text{Max Index} - \text{Expected Index}} \]
  • High ARI against space groups/prototypes validates that the model has “learned” crystallography (Murphy 2012).
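
Both metrics are easy to compute: Purity from the cluster/label contingency counts, ARI via scikit-learn. The tiny label vectors below are a made-up example, standing in for predicted cluster IDs vs. known space-group labels:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def purity(labels_true, labels_pred):
    """Average proportion of the majority true class in each predicted cluster."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        total += np.bincount(members).max()  # size of the majority class
    return total / len(labels_true)

# Toy example: predicted clusters vs. "space group" ground truth.
true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
pred = np.array([0, 0, 0, 1, 1, 1, 1, 1])

p = purity(true, pred)
ari = adjusted_rand_score(true, pred)  # chance-corrected agreement
```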

26. Case Study: Discovery in the Materials Project

  • Objective: Find rare coordination environments in 150,000 materials.
  • Method: SOAP descriptors (Unit 6) + UMAP + DBSCAN.
  • Outcome: Identification of 50+ previously unlabeled structure archetypes.
  • Success: The “clusters” led directly to new entries in the inorganic structure database.

27. Case Study: Clustering 2004 Cars (Murphy 12.2)

  • Intuition building: Axes of Price, MPG, and Weight.
  • Discovery: “The Economy/SUV/Luxury” triad.
  • Analogy: In Materials Genomics, we replace these with “Conductivity/Stability/Stiffness” to find our “Luxury” materials.

28. Case Study: Spectral Clustering (McClarren 4.2)

  • Clustering plant species from hyperspectral leaf images.
  • Latent factors = Chlorophyll + Water content.
  • MG Bridge: We cluster XRD or EELS spectra to find structural phases in high-throughput experimental data.

29. The Discovery Objective: Selecting the “Golden Cluster”

  • Once clusters are found, we overlay properties:
    • Which cluster has the highest mean bandgap?
    • Which cluster is the most stable (\(E_{hull} \to 0\))?
  • Discovery: The “Golden Cluster” represents a structural family with a high probability of yielding high-performance materials.

31. Uncertainty in Cluster Assignments

  • Use GMM “responsibilities” as a proxy for assignment uncertainty.
  • If a material has no dominant cluster, it lives in a “structurally ambiguous” region.
  • Risk: Discovery claims in high-uncertainty regions require the most validation.

32. Robustness: The Bootstrap Check

  • Run clustering on different 90% subsets of the data.
  • Do the same materials always end up together?
  • Stability: If clusters are unstable, the “discovery” is likely a sampling artifact.

33. The Silhouette Score Trap

  • High Silhouette \(\neq\) Physical Meaning.
  • You can get a perfect Silhouette score on clusters that only represent “Database Source” (Slide 21).
  • Rule: Silhouette score is an internal metric; discovery requires external physical validation.

34. Feature Importance for Clustering

  • Which descriptor (Unit 4) drives the cluster separation?
  • Method: Train a Random Forest classifier to predict “Cluster ID” from raw features.
  • High importance for “Atomic Volume” means your discovery is likely driven by size effects.
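
The method above, sketched with scikit-learn on synthetic data where feature 0 (standing in for "Atomic Volume") is constructed to drive the separation and feature 1 is pure noise:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
# Feature 0 ("atomic volume" stand-in) separates the families; feature 1 is noise.
X = np.vstack([np.column_stack([rng.normal(0, 0.2, 80), rng.normal(0, 1, 80)]),
               np.column_stack([rng.normal(5, 0.2, 80), rng.normal(0, 1, 80)])])

cluster_id = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Train a classifier to predict Cluster ID, then inspect which feature drives it.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, cluster_id)
importances = rf.feature_importances_
```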

35. Over-Clustering: Structure from Noise

  • Clustering algorithms will find structure in random data if forced.
  • Permutation Test: Cluster shuffled data. If you still see “structure,” your original discovery is invalid.
  • Discovery must be “statistically significant” compared to the null hypothesis.
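
One way to sketch the permutation test: shuffle each feature column independently to destroy the joint structure (while keeping the marginal distributions), re-cluster, and compare Silhouette scores. The data and null-model details here are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(8)
# Data with genuine structure: two synthetic families.
X = np.vstack([rng.normal((0, 0), 0.3, (60, 2)),
               rng.normal((3, 3), 0.3, (60, 2))])

def kmeans_silhouette(data):
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    return silhouette_score(data, labels)

s_real = kmeans_silhouette(X)

# Null model: permute each column independently, breaking feature correlations.
s_null = []
for _ in range(20):
    X_perm = np.column_stack([rng.permutation(col) for col in X.T])
    s_null.append(kmeans_silhouette(X_perm))

# Fraction of null runs that match or beat the real score (a p-value proxy).
p_like = float(np.mean([s >= s_real for s in s_null]))
```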

36. Interpreting Prototypes: Centroids vs. Medoids

  • Centroid: The mathematical mean \(\boldsymbol{\mu}_k\). May not be a real material.
  • Medoid: The actual material in the dataset closest to the center.
  • Discovery Step: Use the Medoid as the “representative” candidate of the newly discovered family.
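
The Medoid is a one-liner once pairwise distances are available; a NumPy sketch on a single synthetic cluster:

```python
import numpy as np

rng = np.random.default_rng(9)
# One cluster of hypothetical materials in a 2-D descriptor space.
X = rng.normal((1.0, 2.0), 0.4, (30, 2))

centroid = X.mean(axis=0)  # the mathematical mean; may not be a real material

# Medoid: the actual dataset member minimizing total distance to all others.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
medoid_idx = int(D.sum(axis=1).argmin())
medoid = X[medoid_idx]  # a genuine candidate structure to hand to validation
```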

37. Bridging to Latent Traversals (Unit 10)

  • Cluster \(\to\) Medoid \(\to\) Traversal.
  • Once a cluster of interest is identified, explore its neighborhood by “morphing” the Medoid structure.
  • Local Discovery: Unsupervised grouping defines the “where,” traversal explores the “how.”

38. Cluster-Target Correlation

  • Measure the \(R^2\) of property \(y\) explained by Cluster ID.
  • If Cluster ID explains 80% of \(y\) variation, the structural grouping is a physically valid discovery lead.

39. Handling Categorical Descriptors

  • Mixing atomic radius (continuous) and crystal system (categorical).
  • Use Gower’s distance or MCA (Multiple Correspondence Analysis).
  • Prevents categorical “bins” from drowning out subtle structural signals.

40. The Role of Whitening and Decoupling

  • Features are often highly correlated (e.g., atomic volume and radius).
  • Whitening: Transforming features to have zero correlation and unit variance.
  • Prevents the same physical signal from being “double-counted” in the distance metric.

41. Anomaly vs. Novelty Detection (Neuer 5.5.3)

  • Anomaly: Likely a simulation or measurement error.
  • Novelty: A physically plausible but “unusual” structure type.
  • Filter: Unit 9 Autoencoders filter anomalies; Unit 11 Clustering identifies novelties.

42. Domain-Aware Outlier Filtering

  • Is the outlier just a material with a massive unit cell?
  • Normalization: Ensure outliers are chemically/structurally novel, not just computationally complex or large.

43. Failure Mode: Stoichiometry Shortcut

  • If composition is included in features, clusters often just recreate the Periodic Table.
  • Goal: Find clusters driven by bonding and symmetry, even when stoichiometry is identical.

44. Failure Mode: Feature Saturation

  • Using 1,000+ descriptors (Unit 4) makes every material look unique.
  • The “Curse of Dimensionality” leads to a single giant cluster or \(N\) single-point clusters.
  • Solution: Rigorous dimensionality reduction before clustering.

45. Summary of the Discovery Logic

  • Clustering: Organizes the structure backlog.
  • Prototypes: Define the “average” of a new family.
  • Outliers: Signal structural novelty.
  • Metrics: Validate physical consistency.

46. Next Unit Bridge: Uncertainty-Aware Discovery

  • Unit 11 used “hard” or “soft” boundaries.
  • Unit 12 replaces boundaries with continuous Uncertainty Maps.
  • Rule: Cluster for exploration, model uncertainty for exploitation.

47. Exercise Task 1: Spectral Clustering

  • Dataset: Leaf spectra from McClarren.
  • Task: Compare K-Means vs. DBSCAN on SVD-reduced spectra.
  • Deliverable: Plot Silhouette Score vs. \(K\).

48. Exercise Task 2: Structural Clustering

  • Dataset: Materials Project formation energy subset.
  • Task: Cluster structures using stoichiometry vs. SOAP descriptors.
  • Deliverable: Adjusted Rand Index (ARI) against space group labels.

49. Exercise Task 3: Artifact Detection

  • Task: Intentionally mis-scale a feature (e.g., multiply volume by 1000).
  • Deliverable: Before/after UMAP visualization showing the cluster map collapse.

50. Exam Checklist: Evidence-backed Discovery

References
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
McClarren, Ryan G. 2021. Machine Learning for Engineers: Using Data to Solve Problems for Physical Systems. Springer.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.