Mathematical Foundations of AI & ML
Unit 5: Clustering and Autoencoders

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


Where we are in the course

Behind us (Units 1-4):

  • Risk minimization with labels: \((x_i, y_i)\) pairs.
  • Linear models, generalized linear models, neural networks.
  • Optimization (Unit 3) and architectures (Unit 4).

Today (Unit 5):

  • The labels disappear: only \(\{x_i\}\) remains.
  • Two complementary perspectives on finding structure without labels.
  • Classical clustering (K-means, GMM) and neural autoencoders.

Note: backpropagation is self-study this term

  • Unit 4 covered the architectures; how networks actually train (the chain rule, vanishing/exploding gradients) is in a self-study supplement.
  • See 02_backprop_self_study.qmd in the Unit 4 folder, plus the two example notebooks 18.3_Backpropagation and 18.5_Python_Implementation.
  • Today’s autoencoder section uses PyTorch autograd: loss.backward() handles the gradients.
  • A short chain-rule warm-up is on the next exercise sheet.

The big leap

  • All previous units assumed each datapoint comes with a target \(y_i\).
  • In practice, most data has no labels: alloy compositions in a database, micrographs from a new sample, spectra from a new instrument.
  • We can still ask: what structure is there? What groups together? What axes of variation matter?
  • Today’s tools: clustering for discrete structure, autoencoders for continuous structure.

Learning outcomes

By the end of this unit, students can:

  • Distinguish supervised vs unsupervised learning and recognize where each fits.
  • Run K-means by hand on a small dataset and explain its convergence and failure modes.
  • State the GMM likelihood and articulate why EM is needed when latent variables are present.
  • Describe the autoencoder architecture and explain why a linear AE recovers PCA.
  • Use an autoencoder for two practical tasks: data compression and anomaly detection.
  • Anticipate how the latent space sets up Unit 9 (representation learning).

Roadmap of today’s 90 min

  1. Unsupervised landscape (~10 min) — what counts as “structure”?
  2. K-means (~15 min) — the workhorse.
  3. Hierarchical clustering (~5 min) — when you don’t pick \(K\) in advance.
  4. GMM + EM (~20 min) — probabilistic clustering.
  5. Autoencoders (~25 min) — the neural counterpart.
  6. Variants + applications (~10 min) — denoising, compression, anomaly detection.
  7. Materials examples + bridge to Unit 6 (~5 min).

The unsupervised landscape

  • Clustering: assign each \(x_i\) to a discrete group.
  • Dimensionality reduction: find low-dim coordinates that summarize \(x_i\).
  • Density estimation: model \(p(x)\) directly.
  • Generative modeling: sample new \(x \sim p(x)\). (Unit 11 will return here.)

Today: clustering (slides 6-32) and dimensionality reduction (slides 33-58). Density and generation come back in Units 8, 11, 12.

What counts as “structure”?

Compactness

Points within a group are close.

Separation

Groups are far from each other.

Plus: the structure must be interpretable — relate to something we care about (alloy family, defect type, processing regime). A “good” cluster on a materials dataset is one a metallurgist can explain.

Overview of 4D-STEM with the visual representation of datasets and results. (a) Diagram of the 4D-STEM dataset. (b) Maximum diffraction patterns and (c) true cluster labels for the three simulated datasets, Ag1 (top), Ag2 (middle), and Ag3 (bottom). (d) Example of a successful model for each dataset. (e) Example of a failed model for each dataset. (Bruefach et al. 2023)

Why unsupervised matters in materials

  • Most data starts unlabeled — a database of alloy compositions, a folder of micrographs, a stack of spectra.
  • Labels often require expensive characterization (TEM, mechanical testing, EBSD).
  • Unsupervised methods let us:
    • Explore before committing to a label scheme.
    • Compress (1000-channel spectrum → 10 latents).
    • Flag anomalies for expert attention before testing them all.

K-means: objective and Lloyd’s algorithm

Assign \(N\) points \(\{x_1, \ldots, x_N\} \subset \mathbb{R}^d\) to \(K\) groups \(C_1, \ldots, C_K\) by minimizing:

\[ J(C_1, \ldots, C_K, \mu_1, \ldots, \mu_K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2. \]

The objective: each point should be close to its assigned centroid. Each cluster is represented by a single point \(\mu_k\).

Lloyd’s algorithm

Alternate two steps until assignments stop changing:

  1. Assign: each point joins its nearest centroid’s cluster, so \(C_k = \{x_i : k = \arg\min_j \|x_i - \mu_j\|^2\}\).
  2. Update: for each \(k\), set \(\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i\).

This is coordinate descent on \(J\): each step strictly decreases \(J\) unless we are already at a fixed point. Convergence in finitely many steps is guaranteed.
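A minimal NumPy sketch of Lloyd’s algorithm, for illustration (it assumes no cluster ever goes empty; the function name kmeans is just a placeholder):

    import numpy as np

    def kmeans(X, K, n_iters=100, seed=0):
        """Minimal Lloyd's algorithm on an (N, d) array X."""
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), K, replace=False)]              # init: K random data points
        labels = None
        for _ in range(n_iters):
            # Assign step: each point goes to its nearest centroid
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
            new_labels = d2.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break                                             # assignments stopped changing
            labels = new_labels
            # Update step: each centroid becomes the mean of its assigned points
            mu = np.stack([X[labels == k].mean(axis=0) for k in range(K)])
        J = ((X - mu[labels]) ** 2).sum()                         # within-cluster sum of squares
        return labels, mu, J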

K-means iterations on the Old Faithful data (Bishop 2006, fig. 9.1). Each row shows one E-step (assign) and one M-step (update centroids). Converges in 3 M-steps.

Worked example: 6 points, \(K=2\)

Points: \((1,1), (1,2), (2,1)\), \((8,8), (9,8), (8,9)\).

Initial centroids: \(\mu_1 = (1,1)\), \(\mu_2 = (2,2)\).

Step 0: Initialization.

Step 1: Assign to closest \(\mu\).

Step 2: Update \(\mu\) to cluster mean.

Step 3: Assign (No change).

The bad initialization still found the right clusters here — luck. With harder geometries, initialization matters a lot.
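One way to check the worked example is scikit-learn’s KMeans with exactly this starting configuration (a quick sanity check; not part of the course notebooks):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]], dtype=float)
    init = np.array([[1, 1], [2, 2]], dtype=float)    # the initialization from this slide

    km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
    print(km.labels_)            # e.g. [0 0 0 1 1 1]: the two obvious groups
    print(km.cluster_centers_)   # roughly [[1.33, 1.33], [8.33, 8.33]]
    print(km.inertia_)           # the final value of J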

K-means is sensitive to initialization

  • Different initial centroids → different local optima.
  • Standard fix: multiple random initializations, keep lowest \(J\).
  • K-means++: smart initialization.
    • Pick the first centroid uniformly at random from the data.
    • Pick each subsequent centroid with probability \(\propto D(x)^2\), where \(D(x)\) is the distance from \(x\) to its nearest already-chosen centroid.
    • Provably bounds the expected \(J\) to within a factor \(O(\log K)\) of the optimum.
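A sketch of the k-means++ seeding rule (scikit-learn’s KMeans(init="k-means++") does this internally; the helper below is only illustrative):

    import numpy as np

    def kmeanspp_init(X, K, seed=0):
        """Pick K starting centroids: first uniformly, the rest with probability proportional to D(x)^2."""
        rng = np.random.default_rng(seed)
        centroids = [X[rng.integers(len(X))]]                    # first centroid: uniform over the data
        for _ in range(K - 1):
            # squared distance from every point to its nearest already-chosen centroid
            d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
            centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])   # far points are more likely
        return np.stack(centroids)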

Choosing \(K\): the elbow method

  • Run K-means for \(K = 1, 2, 3, \ldots, K_{\max}\).
  • Plot \(J(K)\) vs \(K\).
  • Look for the elbow: the point where \(J\) stops dropping fast.
  • Heuristic, not principled, but widely used.

\(J(K)\) always decreases with \(K\) — at \(K = N\), every point is its own cluster and \(J = 0\).

The elbow signals diminishing returns from extra clusters.
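A possible elbow sweep with scikit-learn (X is placeholder data; the fitted attribute inertia_ is exactly \(J\)):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).normal(size=(300, 2))     # placeholder; substitute your own data

    for K in range(1, 11):
        J = KMeans(n_clusters=K, n_init=10).fit(X).inertia_
        print(f"K={K:2d}  J={J:8.1f}")                      # look for where J stops dropping fast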

Cost function \(J\) after each E-step (blue) and M-step (red) for the Old Faithful example. Converged after 3 M-steps. (Bishop 2006, fig. 9.2)

Choosing \(K\): the silhouette score

For each point \(x_i\), define:

\[ s(x_i) = \frac{b(x_i) - a(x_i)}{\max(a(x_i), b(x_i))}, \]

where \(a(x_i)\) is the average distance to other points in \(x_i\)’s cluster, and \(b(x_i)\) is the average distance to points in the nearest other cluster.

  • \(s(x_i) \approx 1\): well-clustered. \(s(x_i) \approx 0\): on a boundary. \(s(x_i) < 0\): probably misclustered.
  • Pick the \(K\) that maximizes the mean silhouette.
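The same sweep with the silhouette score (placeholder data again; scikit-learn requires at least two clusters here):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.random.default_rng(0).normal(size=(300, 2))      # placeholder data
    for K in range(2, 11):
        labels = KMeans(n_clusters=K, n_init=10).fit_predict(X)
        print(K, silhouette_score(X, labels))                # pick the K with the highest mean score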

Visual interpretation of the silhouette score components: \(a(x_i)\) measures intra-cluster cohesion, while \(b(x_i)\) measures inter-cluster separation.

K-means: the spherical assumption

  • K-means uses Euclidean distance to a single centroid.
  • Implicitly assumes clusters are spherical and equal-sized.
  • Fails when:
    • Clusters are elongated (Anisotropic).
    • Geometry is non-convex (Moons, Rings).

These failure cases motivate GMM (slides 24+).

K-medoids: a robust variant

  • K-means uses the mean as a centroid → sensitive to outliers.
  • K-medoids restricts the centroid to be an actual data point (the medoid).
  • Update step: in each cluster, pick the point that minimizes the sum of distances to others.
  • Slower (no closed-form update) but robust.

Useful when: outliers contaminate the data, or distances are non-Euclidean (e.g., edit distance for SMILES strings).
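The medoid update for one cluster is a pairwise-distance argmin; a short illustrative sketch:

    import numpy as np

    def medoid(points):
        """Return the member of `points` with the smallest total distance to all other members."""
        D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)   # pairwise distances
        return points[D.sum(axis=1).argmin()]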

The K-means centroid (orange cross) is pulled away from the main cluster by a single outlier, whereas the K-medoid (blue circle) remains safely anchored to a real data point in the dense region.

Hierarchical clustering: no \(K\) in advance

  • Agglomerative: start with each point as its own cluster; repeatedly merge the closest pair.
  • Divisive: start with one big cluster; recursively split.
  • Linkage criteria for “closest”:
    • Single: nearest pair across clusters (chains).
    • Complete: farthest pair (compact).
    • Average: mean pair distance.
    • Ward: minimize variance increase (popular default).

Visual comparison of linkage criteria between two clusters. The top line shows Single Linkage (closest points), the middle line shows Average Linkage, and the bottom line shows Complete Linkage (furthest points).

Dendrograms

  • The merge sequence is a tree: leaves are points, internal nodes are merges, height = merge distance.
  • Cut the dendrogram at any height to obtain a clustering.
  • Materials use case: dendrograms over compositions reveal natural alloy families and the heights tell you how distinct each family is.
  • Linkage choice matters: single, complete, average, or Ward produce qualitatively different trees.
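With SciPy, the full merge sequence and the cut are a few lines (Ward linkage shown; the array below is a placeholder standing in for a composition table):

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

    X = np.random.default_rng(0).normal(size=(50, 12))    # e.g. 50 samples x 12 elemental fractions
    Z = linkage(X, method="ward")                          # merge tree; try "single", "complete", "average"
    labels = fcluster(Z, t=4, criterion="maxclust")        # cut the dendrogram into 4 clusters
    # dendrogram(Z)  # draw the tree (with matplotlib) to pick the cut height visually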

Effect of linkage on the dendrogram for yeast gene-expression data: (a) single link — chaining artifacts; (b) complete link — compact clusters; (c) average link — a compromise. (Murphy 2012, fig. 25.15)

From hard to soft assignments

  • K-means gives a hard assignment: each point belongs to exactly one cluster.
  • Reality: a point near a boundary could plausibly belong to either neighbor.
  • A soft assignment gives a probability over clusters: \(\gamma_{ik} = P(\text{cluster } k \mid x_i)\).
  • Soft assignments come naturally from a probabilistic model: the Gaussian Mixture Model.

Same 500 points from a 3-Gaussian mixture: (a) complete data with hard labels, (b) unlabeled (incomplete data), (c) soft coloring by responsibilities \(\gamma_{ik}\) — purple = uncertain. (Bishop 2006, fig. 9.5)

Gaussian Mixture Model (GMM)

A weighted sum of \(K\) Gaussian densities:

\[ p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x; \mu_k, \Sigma_k), \qquad \pi_k \geq 0, \quad \sum_k \pi_k = 1. \]

  • \(\pi_k\): mixture weight (prior probability of cluster \(k\)).
  • \(\mu_k\): cluster center; \(\Sigma_k\): cluster shape (covariance).
  • Each \(\mathcal{N}(x; \mu_k, \Sigma_k)\) is a multivariate Gaussian — Unit 7 will derive these formally.

A mixture of 3 Gaussians in 2D: (a) contours of each component; (b) the combined density surface. Each component can have its own shape and orientation. (Murphy 2012, fig. 11.3)

The latent variable view

Introduce \(z_i \in \{1, \ldots, K\}\) — the (unobserved) cluster index for \(x_i\).

\[ p(x_i, z_i = k) = \pi_k \mathcal{N}(x_i; \mu_k, \Sigma_k), \qquad p(x_i) = \sum_k p(x_i, z_i = k). \]

  • We observe \(x_i\) and don’t observe \(z_i\).
  • Maximizing the log-likelihood \(\sum_i \log p(x_i; \theta)\) directly is hard because each term contains a sum over \(k\) inside the log.
  • The trick: alternate between guessing \(z_i\) and updating \(\theta\). This is EM.

EM algorithm: E-step

For each point \(i\) and cluster \(k\), compute the responsibility:

\[ \gamma_{ik} = P(z_i = k \mid x_i, \theta) = \frac{\pi_k \mathcal{N}(x_i; \mu_k, \Sigma_k)}{\sum_{j} \pi_j \mathcal{N}(x_i; \mu_j, \Sigma_j)}. \]

This is the soft assignment: how strongly does the model believe point \(i\) belongs to cluster \(k\), given the current parameters?

EM algorithm: M-step

Update parameters using the responsibilities:

\[ \mu_k = \frac{\sum_i \gamma_{ik} x_i}{\sum_i \gamma_{ik}}, \qquad \Sigma_k = \frac{\sum_i \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_i \gamma_{ik}}, \qquad \pi_k = \frac{1}{N}\sum_i \gamma_{ik}. \]

Each update is a weighted average — points contribute to a cluster in proportion to their responsibility.

EM for GMM on Old Faithful: (a) initial random parameters; (b) responsibilities after 1st E-step (purple = uncertain); (c)–(f) convergence over 16 iterations. (Murphy 2012, fig. 11.11)

EM as alternating optimization

  1. Initialize \(\{\pi_k, \mu_k, \Sigma_k\}\) (e.g., from K-means output).
  2. E-step: compute all \(\gamma_{ik}\) given current parameters.
  3. M-step: update parameters given current \(\gamma_{ik}\).
  4. Repeat until log-likelihood converges.

Property: each EM iteration is guaranteed not to decrease the data log-likelihood \(\sum_i \log p(x_i; \theta)\). (Proof: EM optimizes a lower bound on the log-likelihood — Bishop Ch. 9 has the full derivation.)
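A minimal NumPy sketch of the full EM loop, following the E-step and M-step formulas above (illustrative; production code should also monitor the log-likelihood and guard against degenerate covariances):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iters=100, seed=0):
        N, d = X.shape
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(N, K, replace=False)]            # init means from data (or from K-means)
        Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
        pi = np.full(K, 1.0 / K)
        for _ in range(n_iters):
            # E-step: responsibilities gamma[i, k] = P(z_i = k | x_i, theta)
            dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                             for k in range(K)], axis=1)
            gamma = dens / dens.sum(axis=1, keepdims=True)
            # M-step: weighted updates of pi, mu, Sigma
            Nk = gamma.sum(axis=0)                          # effective cluster sizes
            pi = Nk / N
            mu = (gamma.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mu[k]
                Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        return pi, mu, Sigma, gamma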

EM for a 2-component GMM on the Old Faithful data. (a) initialization; (b)–(f) E/M steps shown with soft color-coded responsibilities. Ellipses show the 1-\(\sigma\) contours of each Gaussian. (Bishop 2006, fig. 9.8)

K-means vs GMM

                 K-means                    GMM
Assignment       hard                       soft
Cluster shape    spherical                  ellipsoidal (full \(\Sigma\))
Cluster size     implicit                   learned via \(\Sigma\)
Output           partition                  density
Cost             \(O(NKd)\) per iter        \(O(NKd^2)\) per iter

Rule of thumb: K-means for fast, geometric, well-separated clusters. GMM when clusters overlap, vary in shape, or you need probabilities downstream.

Choosing \(K\) for GMM: BIC

The Bayesian Information Criterion penalizes complexity:

\[ \text{BIC}(K) = -2 \log p(\mathcal{D}; \hat\theta_K) + p_K \log N, \]

where \(p_K\) is the number of free parameters in a \(K\)-component GMM.

  • Fit GMMs for \(K = 1, 2, \ldots\), pick the \(K\) minimizing BIC.
  • BIC works because GMM has a likelihood. K-means does not — that’s why we used elbow/silhouette there.
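A possible BIC sweep with scikit-learn’s GaussianMixture (placeholder data):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.default_rng(0).normal(size=(500, 2))    # placeholder data
    bic = {K: GaussianMixture(n_components=K, n_init=5).fit(X).bic(X) for K in range(1, 8)}
    best_K = min(bic, key=bic.get)                         # the K with the lowest BIC wins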

Bridge: from centroids to learned representations

  • K-means represents each cluster by a single point \(\mu_k\).
  • GMM represents each cluster by a distribution \(\mathcal{N}(\mu_k, \Sigma_k)\).
  • Both compress the data into a small number of “summary” objects.
  • What if we want continuous structure — a smooth low-dimensional surface that describes the data?

That is what an autoencoder learns. The encoder maps \(x\) to a low-dim latent code \(z\); the decoder reconstructs \(x\) from \(z\).

The manifold hypothesis

  • High-dimensional data is rarely “filled in” — a 1024-channel spectrum lives in \(\mathbb{R}^{1024}\), but real spectra concentrate on a much lower-dimensional surface (manifold).
  • The manifold’s intrinsic dimension is governed by physics (number of phases, processing parameters).
  • A linear method (PCA) finds a flat low-dim subspace.
  • A nonlinear method (autoencoder) can curve to follow the manifold.

The autoencoder architecture

        encoder              decoder
   x  ──────►  z (bottleneck)  ──────►  x̂
 R^d           R^k (k ≪ d)              R^d
  • Encoder \(f_\phi: \mathbb{R}^d \to \mathbb{R}^k\): compresses input to a code \(z\).
  • Bottleneck \(z \in \mathbb{R}^k\): forced low-dimensional representation.
  • Decoder \(g_\theta: \mathbb{R}^k \to \mathbb{R}^d\): reconstructs input from code.
  • Loss: reconstruction error, typically MSE.
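A minimal PyTorch sketch of this architecture; the sizes (\(d = 784\), one hidden layer of 256, \(k = 32\)) are placeholders:

    import torch.nn as nn

    d, k = 784, 32                                          # input and bottleneck dimensions

    encoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, k))   # f_phi: R^d -> R^k
    decoder = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))   # g_theta: R^k -> R^d
    # forward pass: x_hat = decoder(encoder(x)); loss = mean squared reconstruction error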

A deep autoencoder: (a) greedily pre-train RBM layers; (b) unroll into encoder/decoder by tying weights; (c) fine-tune end-to-end with backprop. The bottleneck (middle layer) is the latent code \(z\). (Murphy 2012, fig. 28.3)

The reconstruction objective

\[ \mathcal{L}(\theta, \phi) = \frac{1}{N}\sum_{i=1}^{N} \|x_i - g_\theta(f_\phi(x_i))\|^2. \]

  • No labels — only the inputs themselves serve as targets.
  • Minimizing \(\mathcal{L}\) forces the bottleneck \(z\) to retain enough information to reconstruct \(x\).
  • Without the bottleneck, the network could just learn the identity. The bottleneck creates a useful constraint.

Linear autoencoder = PCA

Take the simplest possible AE:

\[ f_\phi(x) = W_e x, \qquad g_\theta(z) = W_d z, \qquad W_e \in \mathbb{R}^{k \times d}, \quad W_d \in \mathbb{R}^{d \times k}. \]

Theorem. With MSE loss, the optimal \((W_e, W_d)\) satisfy \(W_d W_e = U_k U_k^T\), where \(U_k\) contains the top-\(k\) left singular vectors of the centered data — i.e., the PCA subspace.

A linear autoencoder is just PCA in disguise.
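The claim can be checked numerically: the best any linear encoder/decoder pair can achieve is projection onto the top-\(k\) principal subspace (a sketch on synthetic data, not a proof):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))   # correlated placeholder data
    Xc = X - X.mean(axis=0)                                       # center the data

    k = 5
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T @ Vt[:k]                  # U_k U_k^T: projection onto the top-k principal subspace
    X_pca = Xc @ P                         # the best rank-k linear reconstruction
    print(((Xc - X_pca) ** 2).mean())      # the lowest MSE a linear AE with bottleneck k can reach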

Why nonlinearity matters

  • Linear AEs find the best flat subspace.
  • Real data manifolds are usually curved: alloy compositions on a phase diagram, micrographs under varying lighting.
  • Adding a nonlinearity (ReLU, tanh) and a hidden layer to encoder and decoder lets the AE bend the latent space.
  • Result: a nonlinear AE can capture variance that PCA leaves on the table.

This is the same lesson as Unit 4: nonlinearity is what lets neural networks go beyond linear models.

Choosing the bottleneck dimension \(k\)

  • Too small: the AE underfits — reconstruction is bad even on training data.
  • Too large: the AE overfits — it just learns the identity through a wide pipe.
  • Procedure: sweep \(k\), plot validation reconstruction error vs \(k\), look for the elbow.
  • Sanity check: compare to PCA at the same \(k\). Nonlinear AE should do at least as well.

Convolutional autoencoders

For spatial data (images, fields), use convolution:

  • Encoder: stack of strided conv → pool blocks (downsample).
  • Decoder: stack of transposed conv (or upsample + conv) blocks.
  • The bottleneck is now a small feature map.
  • Same architectural reasoning as Unit 4: locality + weight sharing.
  • Materials uses: micrograph compression, simulation field compression.
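A sketch of such an encoder/decoder pair in PyTorch (channel counts and the assumed \(1 \times 64 \times 64\) input size are placeholders):

    import torch.nn as nn

    # two strided convs halve the spatial size twice; two transposed convs undo it
    encoder = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # 64x64 -> 32x32
        nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16 bottleneck map
    )
    decoder = nn.Sequential(
        nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 32x32
        nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1),              # 32x32 -> 64x64
    )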

Training: autograd handles it

  • An AE is a standard neural network with a peculiar loss.

  • PyTorch (one step of the training loop):

    optimizer.zero_grad()                          # clear gradients from the previous step
    loss = ((x - decoder(encoder(x)))**2).mean()   # MSE reconstruction loss
    loss.backward()                                # autograd applies the chain rule
    optimizer.step()                               # update encoder and decoder parameters
  • All the backprop machinery (chain rule, gradient flow, Xavier/He init) from the self-study supplement applies here directly.

Denoising autoencoder

Train the AE with corrupted inputs:

\[ \mathcal{L} = \frac{1}{N}\sum_i \|x_i - g_\theta(f_\phi(\tilde x_i))\|^2, \qquad \tilde x_i = x_i + \epsilon_i. \]

  • The AE must denoise — recover the clean \(x_i\) from a noisy \(\tilde x_i\).
  • Forces the latent code to capture robust features, not noise.
  • Practical hyperparameter: noise level. Too low → trivial; too high → unrecoverable.
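The change to the training step is one line of corruption; a sketch reusing the encoder, decoder, and batch x from the previous slide (sigma is the noise-level hyperparameter):

    import torch

    sigma = 0.1                                              # noise level: the key hyperparameter
    x_noisy = x + sigma * torch.randn_like(x)                # corrupt the input ...
    loss = ((x - decoder(encoder(x_noisy))) ** 2).mean()     # ... but reconstruct the clean target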

Sparse autoencoders (briefly)

  • Add a penalty that encourages most latent activations to be near zero.
  • Forces the AE to use few latent units per input — interpretable, disentangled features.
  • Loss: \(\mathcal{L}_{\text{recon}} + \lambda \|z\|_1\) (Lasso-like) or KL penalty against a Bernoulli prior.

Application 1 — anomaly detection

  • Train the AE only on normal data.
  • At test time, compute reconstruction error per sample.
  • Anomalies (defects, instrument failures, novel phases) reconstruct poorly — they’re outside the manifold the AE learned.
  • Threshold: choose at, say, the 99th percentile of training reconstruction error.
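A sketch of the thresholding step, assuming a trained encoder/decoder and flattened batches x_train (normal data only) and x_test:

    import torch

    def recon_error(x):
        """Per-sample reconstruction error (mean squared error over features)."""
        with torch.no_grad():
            return ((x - decoder(encoder(x))) ** 2).mean(dim=1)

    threshold = torch.quantile(recon_error(x_train), 0.99)   # 99th percentile on normal training data
    is_anomaly = recon_error(x_test) > threshold              # flag poorly reconstructed samples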

Short Story: Crystal Defect Detection. Prifti et al. (2023) used a Convolutional Variational Autoencoder (CVAE) on Scanning Transmission Electron Microscopy (STEM) images. Trained purely on perfect crystal lattices, the CVAE flags point defects (e.g., vacancies or anti-sites) simply because it fails to reconstruct them. It identifies anomalies without ever seeing a defect during training!

Application 2 — features for downstream tasks

  • Train an AE on a large unlabeled corpus.
  • Discard the decoder; use the encoder \(z = f_\phi(x)\) as a feature extractor.
  • Train a small supervised model (linear regression, MLP) on \(z\) instead of \(x\).
  • Works when labels are scarce and the AE has seen enough unlabeled data to learn the manifold.
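A sketch of the hand-off, assuming the trained encoder from the slides above (the commented linear probe is just one possible downstream model):

    import torch

    with torch.no_grad():
        z_train = encoder(x_train)     # frozen features from the unlabeled corpus; the decoder is discarded

    # train any small supervised model on z instead of x, e.g. a linear probe:
    # from sklearn.linear_model import LogisticRegression
    # clf = LogisticRegression().fit(z_train.numpy(), y_train)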

This is transfer learning with self-supervision — a precursor to today’s foundation models.

The latent space is a coordinate system

  • The bottleneck \(z\) is not just a compression target — it is a learned coordinate system for the data.
  • Two questions about a latent space:
    • Geometry: how are points arranged? Are similar samples close?
    • Interpolation: does the line between \(z_A\) and \(z_B\) correspond to a smooth transition in \(x\)?

2D latent space of a deep autoencoder trained on Reuters news articles. Left: LSA (linear, overlapping topics). Right: deep AE (nonlinear, well-separated topic clusters). Labels were not used during training. (Murphy 2012, fig. 28.5)

Latent space arithmetic (teaser)

  • In some learned latents, vector arithmetic is meaningful.
  • Famous example (faces): \(z_{\text{man with glasses}} - z_{\text{man}} + z_{\text{woman}} \approx z_{\text{woman with glasses}}\).
  • For materials: \(z_{\text{brittle alloy}} - z_{\text{steel}} + z_{\text{aluminum}}\) might give a brittle aluminum alloy — a starting point for inverse design.
  • Standard AEs don’t guarantee this structure; VAEs (Unit 11) do, by constraining the latent distribution.

Bridge to Unit 9 and Unit 11

  • Unit 9: what makes a latent space good? Visualization (t-SNE, UMAP), contrastive learning, foundation embeddings.
  • Unit 11: generative models that sample from the latent: VAEs and diffusion.
  • Today plants the seed: an AE bottleneck is a learned representation, and learned representations are the substrate for everything that follows.

Materials example 1 — alloy composition clustering

  • 5000 alloys, each described by 12 elemental fractions.
  • Run K-means with \(K = 8\), k-means++ init, 10 restarts.
  • Resulting clusters track known alloy families (austenitic stainless, martensitic stainless, low-alloy steels, …).
  • Outliers in each cluster: candidate novel compositions worth lab investigation.

Materials example 2 — spectral compression

  • 1D conv autoencoder on 8000 XRD patterns, each with 2000 angular channels.
  • Bottleneck 32 → reconstruction error \(< 2\%\) on held-out patterns.
  • 60× compression of the dataset.
  • Encoder output usable as input to a downstream phase-classifier with 50× fewer parameters.

Materials example 3 — defect anomaly detection

  • Train conv AE on micrographs of defect-free material (no labels needed).
  • Test on 200 micrographs, including some with cracks, voids, or unusual texture.
  • Pixel-wise reconstruction error highlights defect locations as bright spots.
  • ROC-AUC > 0.9 for flagging defective images, without ever showing a defect at training time.

Three exam-must-knows

  1. K-means minimizes the within-cluster sum of squares; convergence is to a local optimum and depends on initialization (use K-means++ + restarts). Spherical clusters only.
  2. EM for GMM alternates an E-step (compute responsibilities \(\gamma_{ik}\)) and an M-step (weighted update of \(\pi_k, \mu_k, \Sigma_k\)); each step is guaranteed not to decrease the data log-likelihood.
  3. Linear autoencoder = PCA; nonlinearity + bottleneck generalize PCA to curved manifolds; conv AEs do this for spatial data.

Reading and bridge to Unit 6

Note

Reading for Unit 6. Skim Neuer Ch. 5 (unsupervised) and McClarren Ch. 4 + Ch. 8 to consolidate today’s content. For Unit 6 (Loss Landscapes & Optimization), read Sandfeld’s chapters on gradient descent and ADAM.

Unit 6: now that we have a richer set of objective functions (clustering objectives, reconstruction loss), what does their landscape look like? When does ADAM beat plain gradient descent? Why does flat vs sharp matter for generalization?


Notebook companion + references

Week 5 notebooks (in example_notebooks/ once added)

  • K-means by hand (NumPy) on alloy compositions.
  • K-means vs GMM on synthetic Gaussian mixture (sklearn).
  • AE on Fashion-MNIST: train, plot 2-D latent, reconstruct.
  • AE for anomaly detection: corrupt 5% of test images, threshold reconstruction error, report ROC-AUC.

Self-study supplement (Unit 4): the chain-rule and gradient-flow material is in 04_neural_networks_backprop/02_backprop_self_study.qmd plus the 18.3 and 18.5 notebooks. A short warm-up question on the next exercise sheet uses it.

Learning outcomes — recap

By the end of this unit, students can:

  • Distinguish supervised vs unsupervised learning and recognize where each fits.
  • Run K-means by hand and explain its convergence and failure modes.
  • State the GMM likelihood and intuit the EM algorithm as alternating optimization.
  • Describe the autoencoder architecture and explain why a linear AE recovers PCA.
  • Use an autoencoder for compression and for anomaly detection.
  • Anticipate how the latent space connects to Unit 9 (representation learning) and Unit 11 (generative models).
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
Bruefach, Alexandra, Colin Ophus, and M. C. Scott. 2023. “Robust Design of Semi-Automated Clustering Models for 4D-STEM Datasets.” APL Machine Learning 1 (1): 016106. https://doi.org/10.1063/5.0130546.
Gómez-Bombarelli, Rafael, Jennifer N Wei, David Duvenaud, et al. 2018. “Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules.” ACS Central Science 4 (2): 268–76.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.