\(C_k\): set of points assigned to cluster \(k\). \(\boldsymbol{\mu}_k\): centroid of cluster \(k\).
This is an NP-hard optimization problem — K-Means finds a local minimum [@neuer2024machine].
K-Means algorithm: two alternating steps
Assignment step: assign each point to the nearest centroid: \[
c_i = \arg\min_k \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2
\]
Update step: recompute centroids as the mean of assigned points: \[
\boldsymbol{\mu}_k = \frac{1}{|C_k|}\sum_{i \in C_k} \mathbf{x}_i
\]
```{mermaid}
%%| fig-align: center
graph TD
  A[Initialize Centroids] --> B[Assign Points to Nearest Centroid]
  B --> C[Update Centroids to Mean of Assigned Points]
  C --> D{Convergence?}
  D -- No --> B
  D -- Yes --> E[Final Clusters]
```
Repeat until assignments stop changing.
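A minimal NumPy sketch of the loop (illustrative only: plain random initialization rather than K-Means++, and it assumes no cluster ever goes empty):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-Means: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # random initial centroids
    for _ in range(n_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
        c = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_mu = np.array([X[c == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_mu, mu):  # assignments have stabilized
            break
        mu = new_mu
    return c, mu
```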
K-Means as coordinate descent
The assignment step minimizes \(J\) with respect to cluster assignments (centroids fixed).
The update step minimizes \(J\) with respect to centroids (assignments fixed).
Each step reduces or maintains \(J\) — this is coordinate descent on the distortion cost.
Since \(J\) is bounded below (by 0) and decreases monotonically, convergence is guaranteed.
K-Means convergence
\(J\) decreases (or stays constant) at every step.
There are finitely many possible assignments → algorithm must terminate.
Convergence is to a local minimum — the result depends on initialization.
Typical convergence: 10–50 iterations for moderate datasets [@mcclarren2021machine].
K-Means: initialization matters
Random initialization can lead to poor local minima.
K-Means++: choose the first centroid uniformly at random, then each subsequent centroid with probability proportional to its squared distance from the nearest existing centroid (sketched after this list).
Spreads initial centroids → much better results on average.
Multiple restarts: run K-Means several times, keep the result with lowest \(J\).
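A hedged NumPy sketch of K-Means++ seeding (the function name and data layout are illustrative assumptions):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-Means++ seeding: spread initial centroids apart."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform at random
    for _ in range(K - 1):
        # Squared distance from each point to its nearest chosen centroid.
        C = np.array(centroids)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()  # sample next centroid with probability ∝ d2
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```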
K-Means: choosing K
Elbow method: plot \(J\) vs. \(K\) and look for the “elbow” where adding clusters yields diminishing returns (see the sketch after this list).
Silhouette score: for each point, compare its mean within-cluster distance to its mean distance to the nearest other cluster; average across all points.
Domain knowledge: sometimes \(K\) is known (e.g., number of material phases).
No universally best method — combine quantitative and qualitative assessment.
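A quick way to compare candidate values of \(K\) with scikit-learn; the data here is a synthetic stand-in:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder feature matrix

for K in range(2, 8):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    print(K,
          km.inertia_,                       # distortion J, for the elbow plot
          silhouette_score(X, km.labels_))   # higher is better
```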
K-Means: limitations
Assumes spherical clusters of roughly equal size and density.
Hard assignments: each point belongs to exactly one cluster — no uncertainty.
Sensitive to outliers (centroids pulled toward outliers).
Cannot represent clusters with different shapes or orientations.
These limitations motivate probabilistic clustering with GMMs.
K-Means: worked example
2D data with 3 Gaussian blobs (well-separated).
Initialize: place 3 centroids (K-Means++).
Iteration 1: assign points → update centroids.
Iteration 5: centroids stabilize at cluster centers. Assignments are correct.
Total distortion \(J\) decreases from ~150 to ~45 over 5 iterations.
K-Medoids: robust variant
Instead of centroids (means), use actual data points as cluster representatives (medoids).
Objective: minimize sum of distances (not necessarily squared) to nearest medoid.
Robust to outliers: the medoid is a real data point, not pulled by extreme values.
More expensive: \(O(N^2)\) per iteration vs \(O(NK)\) for K-Means.
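A minimal alternating K-Medoids sketch (Voronoi-style iteration rather than the full PAM algorithm; assumes no cluster goes empty):

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_medoids(X, K, n_iters=50, seed=0):
    """Alternate: assign to nearest medoid, then recompute each medoid."""
    rng = np.random.default_rng(seed)
    D = cdist(X, X)  # O(N^2) pairwise distances (not squared)
    medoids = rng.choice(len(X), size=K, replace=False)
    for _ in range(n_iters):
        c = D[:, medoids].argmin(axis=1)  # assignment step
        new_medoids = medoids.copy()
        for k in range(K):
            members = np.where(c == k)[0]
            # New medoid: the member minimizing total distance within its cluster.
            new_medoids[k] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return c, medoids
```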
Checkpoint: K-Means failure mode
Question: K-Means splits a single elongated cluster into two halves. Why?
Answer: K-Means assumes spherical clusters. An elongated cluster has high variance along one axis — two spherical clusters fit it better under the K-Means objective.
Fix: use GMM with full covariance matrices.
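A small synthetic demonstration of this checkpoint, on assumed toy data: K-Means with \(K=2\) splits the elongated blob, while BIC over full-covariance GMMs favors a single component:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * np.array([10.0, 1.0])  # one elongated cluster

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# K-Means happily splits the blob along its long axis.

for K in (1, 2):
    gmm = GaussianMixture(n_components=K, covariance_type="full",
                          random_state=0).fit(X)
    print(K, gmm.bic(X))  # BIC should favor the single full-covariance component
```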
From hard to soft clustering
Hard Clustering (K-Means)
Each point belongs to exactly one cluster.
Membership is binary: \(\{0, 1\}\).
No representation of uncertainty.
Soft Clustering (GMM)
Each point has a probability of belonging to each cluster.
Membership is continuous: \([0, 1]\).
Naturally handles overlap and uncertainty.
Gaussian Mixture Models provide this probabilistic framework.
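In scikit-learn the contrast is a one-line difference: `predict` returns hard labels, `predict_proba` returns responsibilities (toy data assumed):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(200, 2))  # placeholder data

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# hard[i] is a single label in {0, 1}

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft = gmm.predict_proba(X)
# soft[i] is a probability vector in [0, 1]^2; each row sums to 1
```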
Worked example: after 10 EM iterations, the log-likelihood converges and the elliptical clusters are correctly identified.
Checkpoint: EM understanding
Question: After the E-step, what do the responsibilities tell you?
Answer: The posterior probability of each cluster \(k\) given each data point \(\mathbf{x}_i\), under the current model parameters. They quantify how much each cluster “explains” each point.
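Concretely, under the standard GMM E-step, the responsibility of component \(k\) for point \(\mathbf{x}_i\) is the posterior
\[
r_{ik} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}.
\]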
Bernoulli mixture models
For binary data (e.g., presence/absence of features):
Replace Gaussian with Bernoulli likelihood: \(p(\mathbf{x} | z = k) = \prod_j \mu_{kj}^{x_j}(1-\mu_{kj})^{1-x_j}\).
EM updates: \(\mu_{kj} = \frac{\sum_i r_{ik} x_{ij}}{N_k}\), with \(N_k = \sum_i r_{ik}\): the responsibility-weighted mean of each binary feature per cluster.
Application: document clustering, genetic data analysis.
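A compact NumPy sketch of this EM loop (function name and initialization are illustrative; the E-step runs in log space for numerical stability):

```python
import numpy as np

def bernoulli_mixture_em(X, K, n_iters=100, seed=0, eps=1e-9):
    """EM for a Bernoulli mixture on binary data X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                    # mixing weights
    mu = rng.uniform(0.25, 0.75, size=(K, D))   # per-cluster Bernoulli means
    for _ in range(n_iters):
        # E-step: log r_ik ∝ log pi_k + sum_j [x_j log mu_kj + (1-x_j) log(1-mu_kj)]
        log_p = X @ np.log(mu + eps).T + (1 - X) @ np.log(1 - mu + eps).T
        log_r = np.log(pi + eps) + log_p
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)       # responsibilities, rows sum to 1
        # M-step: responsibility-weighted means of the binary features.
        Nk = r.sum(axis=0)
        mu = (r.T @ X) / Nk[:, None]
        pi = Nk / N
    return pi, mu, r
```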
Model selection: choosing K
BIC (Bayesian Information Criterion):
\[
\text{BIC} = \ell(\hat{\theta}) - \frac{K_{\text{params}}}{2}\log N
\]
Select the \(K\) that maximizes BIC. (The common textbook form, \(-2\ell(\hat{\theta}) + K_{\text{params}}\log N\), flips the sign and is minimized instead.) A weaker penalty admits more components.
BIC vs AIC
BIC: stronger penalty for complexity (\((\log N)/2\) per parameter in the form above). Preferred for large \(N\).
AIC: weaker penalty (1 per parameter). More permissive — tends to select larger \(K\).
BIC is consistent (selects true \(K\) as \(N \to \infty\)). AIC is not.
In practice: compare both and use domain knowledge to adjudicate.
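A sketch of this comparison with scikit-learn; note that its `bic()` and `aic()` return the \(-2\ell + \text{penalty}\) form, so lower is better (the opposite sign convention from the formula above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))  # placeholder data

for K in range(1, 7):
    gmm = GaussianMixture(n_components=K, covariance_type="full",
                          random_state=0).fit(X)
    print(K, gmm.bic(X), gmm.aic(X))  # pick K at the minimum of each criterion
```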
Initialization strategies for EM
Random: fast but unreliable — can converge to poor local maxima.
K-Means: run K-Means first, use resulting centroids as initial means. Most common.
Multiple restarts: run EM several times with different initializations, keep best log-likelihood.
Good initialization is critical for EM convergence quality.
Singular covariance matrices
If a cluster collapses to a single point, \(\boldsymbol{\Sigma}_k\) becomes singular (determinant = 0).
The likelihood goes to infinity — a pathological solution.
Fix: add a small diagonal regularization \(\boldsymbol{\Sigma}_k + \epsilon \mathbf{I}\).
Alternative: constrain covariance matrices to be diagonal or shared across clusters.
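In scikit-learn this regularization is exposed as the `reg_covar` parameter (already a small positive value by default), and the constrained alternatives as `covariance_type`:

```python
from sklearn.mixture import GaussianMixture

# reg_covar adds epsilon to the diagonal of every covariance matrix;
# covariance_type="diag" or "tied" are the constrained alternatives.
gmm = GaussianMixture(n_components=3, reg_covar=1e-6, covariance_type="full")
```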
Checkpoint: model selection
Question: BIC selects \(K=3\) but you know there are 5 distinct groups. What might be wrong?
Answer: The groups may significantly overlap, making them statistically indistinguishable. BIC prefers parsimony — 3 components may fit the data nearly as well as 5.
Application: image segmentation
Model pixel colors (RGB) as a GMM with \(K\) components.
Each component corresponds to a material or region (background, grain, defect).
Segment the image by assigning each pixel to its most probable cluster.
Soft assignments provide uncertainty at boundaries.
Unsupervised segmentation of a polycrystalline microstructure.
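A minimal sketch of the pipeline, assuming a hypothetical RGB image array:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

image = np.random.default_rng(0).random((64, 64, 3))  # stand-in RGB image in [0, 1]

pixels = image.reshape(-1, 3)  # one row per pixel
gmm = GaussianMixture(n_components=3, random_state=0).fit(pixels)
labels = gmm.predict(pixels).reshape(image.shape[:2])  # hard segmentation map
conf = gmm.predict_proba(pixels).max(axis=1).reshape(image.shape[:2])
# conf is low near region boundaries, exposing segmentation uncertainty
```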
Application: vector quantization
Replace each data point with its nearest centroid from a learned codebook.
Used for signal compression: store cluster index instead of full data.
Materials application: fingerprint alloy microstructures by their cluster assignments.
Storage cost: \(\log_2(\text{codebook size})\) bits per data point.
Vector quantization partitions feature space into Voronoi cells.
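A sketch of the encode/decode cycle with a K-Means codebook (data and codebook size are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(1000, 8))  # hypothetical signal vectors

codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(X)
codes = codebook.predict(X)                 # store log2(16) = 4 bits per vector
X_hat = codebook.cluster_centers_[codes]    # lossy reconstruction from codebook
```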
Application: regime identification in manufacturing
Cluster sensor data over time; each cluster represents an operating regime.
Transitions between clusters indicate regime changes (startup, steady state, degradation).
Real-time clustering enables automated process monitoring and control.
Placeholder: Time-series sensor data with clusters indicating different regimes
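One hedged way to set this up: summarize sliding windows by simple statistics, then cluster the window features (the signal and window length are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

signal = np.random.default_rng(0).normal(size=5000)  # stand-in 1-D sensor stream

w = 100  # window length
windows = signal[: len(signal) // w * w].reshape(-1, w)
features = np.column_stack([windows.mean(axis=1), windows.std(axis=1)])
regimes = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
# Changes in `regimes` across consecutive windows flag regime transitions.
```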
Application: anomaly detection via density
Fit a GMM to normal operating data.
For a new observation, compute \(p(\mathbf{x}_{\text{new}})\) under the GMM.
Low density \(\to\) anomaly.
This provides a principled, probabilistic anomaly score (unlike ad-hoc thresholds).
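A minimal sketch, assuming `X_normal` holds normal-operation data; `score_samples` returns the per-point log-density under the fitted GMM:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X_normal = np.random.default_rng(0).normal(size=(1000, 4))  # placeholder data

gmm = GaussianMixture(n_components=3, random_state=0).fit(X_normal)
log_density = gmm.score_samples(X_normal)
threshold = np.quantile(log_density, 0.01)  # e.g. flag the lowest 1% as anomalous

x_new = np.random.default_rng(1).normal(size=(5, 4))
is_anomaly = gmm.score_samples(x_new) < threshold
```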
Clustering vs autoencoder embeddings
Clustering operates in the original data space and is therefore affected by the curse of dimensionality.
Autoencoders provide a learned latent space where clustering may work better.
Best of both: train autoencoder (Unit 9), then cluster in latent space.
The latent space concentrates relevant variation, making cluster separation clearer.
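A hedged sketch of the combined pipeline; the `encoder` below is a hypothetical stand-in for a trained autoencoder's encoder (Unit 9):

```python
import numpy as np
from sklearn.cluster import KMeans

def encoder(X):
    """Hypothetical trained encoder; here a trivial projection for illustration."""
    return X[:, :2]

X = np.random.default_rng(0).normal(size=(500, 10))  # placeholder data
Z = encoder(X)                  # map to the learned latent space
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)
```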
Materials example: alloy phase identification
Input: X-ray diffraction (XRD) patterns from unknown alloy samples.