Mathematical Foundations of AI & ML
Foundations course for the ECLIPSE teaching track in machine learning, computational imaging, and materials data science.
1 Instructors
- Philipp Pelz
- Stefan Hiemer
2 Recommended readings
We base much of the lecture on the following books:
- Neuer (2024), Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.
- McClarren (2021), Machine Learning for Engineers: Using Data to Solve Problems for Physical Systems. Springer.
As complementary background, we also recommend the following books:
- Murphy (2012), Machine Learning: A Probabilistic Perspective. MIT Press.
- Bishop (2006), Pattern Recognition and Machine Learning. Springer, New York.
| MFML Week | MFML Lecture Focus (Revised) | Neuer – Required Reading | Neuer – Optional / Skim | McClarren – Contextual / Optional | Bishop – Targeted Depth (Optional) |
|---|---|---|---|---|---|
| 1 | Learning vs data analysis; models, loss functions | Ch. 1.1 Data-Based Modeling; 1.1.1 Concept of Model | 1.1.3 Criticism of Data-Based Modeling | Ch. 1 Introduction (ML in physical systems) | Ch. 1 §1.1–1.2 (what is a model, pattern recognition view) |
| 2 | Linear algebra refresher; covariance, PCA/SVD (R) | Ch. 5.2 PCA (skim, notation & geometry only) | PCA implementation details | Ch. 5 Dimension Reduction (ROM intuition) | Ch. 12 §12.1–12.2 PCA derivation (selective) |
| 3 | Regression as loss minimization | Ch. 4.2.2 Regression; Ch. 4.4.1 LMS theory | LMS algorithm variants | Ch. 4 Regression (physical meaning of regression) | Ch. 3 §3.1–3.3 Linear regression, least squares geometry |
| 4 | Neural networks early: neuron, activations, backprop | Ch. 4.5.1 Neuron; 4.5.3 Activations; 4.5.4 Training | Framework-specific NN sections | Ch. 8 Neural Networks (surrogate perspective) | Ch. 5 §5.1–5.3 NN basics & Backpropagation |
| 5 | Clustering & Autoencoders | Ch. 5.3 K-Means; Ch. 5.5 Autoencoder | Advanced clustering | Ch. 5 Dimension Reduction | Ch. 9 Mixture Models; Ch. 12.3 Nonlinear PCA |
| 6 | Loss landscapes & optimization behavior | Ch. 4.4.6 Hyperparameters; Ch. 4.5.5 Optimization | Detailed optimizer variants | Ch. 7 Optimization | Ch. 3 §3.4 Regularization; §3.5 Bayesian view (skim) |
| 7 | Generalization, bias-variance, regularization, tree ensembles (RF & gradient boosting) | Ch. 4.5.9 Overfitting & Cross-Validation | — | Ch. 3 Decision Trees & Random Forests; Ch. 6 Model Selection & Validation | Ch. 3 §3.2 Bias-variance decomposition; Ch. 14 §14.3–14.4 (combining models, bagging, boosting) |
| 8 | Probabilistic view of learning; noise | Ch. 2.2 Distinguishing Uncertainties; Ch. 6.4 Uncertainty | Bayesian details | Ch. 3 Error and Uncertainty | Ch. 2 §2.1–2.3 Gaussian distributions & moments |
| 9 | Latent Spaces & Advanced Representation Learning (t-SNE, UMAP, contrastive, foundation embeddings) | Ch. 5.5 Autoencoder (recap) | — | Ch. 5 Dimension Reduction | Ch. 9 §9.1–9.4 mixture models / latent variables; Ch. 12 §12.3 nonlinear latent models |
| 10 | Attention & Transformers (self-attention, multi-head, ViT, foundation models) | — | — | Ch. 8 Neural Networks (context only) | — (covered via Vaswani et al., 2017; Dosovitskiy et al., 2021) |
| 11 | Generative Models: VAE & Diffusion (ELBO, reparameterization, forward/reverse process, classifier-free guidance) | Ch. 5.5 AE (foundation for VAE) | — | Ch. 8 Neural Networks (autoencoder context) | Ch. 9 §9.4 EM as ELBO precursor; Ch. 13 §13.3 deep generative perspectives |
| 12 | Uncertainty in predictions | Ch. 6.4 Stochastic Methods for Uncertainty | Advanced stochastic methods | Ch. 3 Error and Uncertainty | Ch. 3 §3.5 Bayesian regularization (skim) |
| 13 | Physics-informed & constrained learning | Ch. 6.1–6.3 Physics-Informed Learning | Semantic technologies | Ch. 11 Physics-Informed & Hybrid Models | Ch. 1 §1.6 Model complexity & Occam's razor |
| 14 | Explainability, limits, scientific trust | Ch. 7 Explainability (discussion & outlook) | — | Ch. 12 Limitations and Outlook | Ch. 1 §1.1–1.2 Reflection on model limits |
3 Week 1 Summary: Learning vs Data Analysis; Models, Loss Functions
Lecture: Monday, 13.04.2026, 10:15-11:45
Slides: Open
- Models: simplified representations for prediction / explanation (white / grey / black-box)
- Learning types: supervised (regression, classification), unsupervised, reinforcement
- Empirical risk minimization: learning as optimization, not statistics
- Loss zoo:
- Regression: MSE, MAE
- Classification: 0-1, softmax + cross-entropy
- Train / val / test splits, cross-validation, data-leakage taxonomy
- Bias–variance intuition; Occam’s razor and regularization
- Uncertainty preview: aleatoric vs epistemic
- Limits: no-free-lunch, curse of dimensionality
- Frequentist vs Bayesian lenses (set-up for Unit 8)
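The "loss zoo" above maps directly to a few lines of NumPy. The following is an illustrative sketch (not the course exercise code) with made-up toy predictions and targets, showing MSE, MAE, and softmax cross-entropy side by side:

```python
import numpy as np

# Regression losses on toy predictions vs targets
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.5])

mse = np.mean((y_pred - y_true) ** 2)           # mean squared error
mae = np.mean(np.abs(y_pred - y_true))          # mean absolute error

# Softmax + cross-entropy for a single 3-class example
logits = np.array([[2.0, 0.5, -1.0]])           # raw scores for one sample
labels = np.array([0])                          # true class index

z = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponentials
probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
cross_entropy = -np.log(probs[np.arange(len(labels)), labels]).mean()

print(mse, mae, cross_entropy)
```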
4 Week 2 Summary: Linear Algebra Refresher; Covariance, PCA/SVD
Lecture: Monday, 20.04.2026, 10:15-11:45
Slides: Open
- LA refresher: vector spaces, basis, rank; column / row / null spaces and identifiability
- Projection geometry; least squares as projection onto column space
- Condition number and numerical stability
- Spectral decomposition of symmetric / PSD matrices; covariance-matrix geometry
- PCA: linear dimensionality reduction by variance maximization
- Scree plots for intrinsic dimensionality (“elbow”)
- SVD: factorization for any matrix; low-rank approximation (Eckart–Young)
- NMF: parts-based decomposition for non-negative spectra / images
- Pseudo-inverse and least-squares solvability
- L1 vs L2 regularization (geometric intuition); whitening and multicollinearity
- Kernel hint from inner products (sets up later units)
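A minimal sketch of PCA via the SVD on random toy data (illustrative only; the numbers and dimensions are arbitrary). It connects the bullets above: centering, the scree quantities, and the Eckart–Young low-rank reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # toy data: 200 samples, 5 features

Xc = X - X.mean(axis=0)                        # center each column
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_var = S**2 / (len(X) - 1)            # eigenvalues of the covariance matrix
ratio = explained_var / explained_var.sum()    # scree-plot quantities ("elbow")

k = 2
scores = Xc @ Vt[:k].T                         # projection onto the first k principal axes
X_lowrank = scores @ Vt[:k] + X.mean(axis=0)   # rank-k reconstruction (Eckart–Young optimal)
```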
5 Week 3 Summary: Regression as Loss Minimization
Lecture: Monday, 27.04.2026, 10:15-11:45
Slides: Open
- Supervised Framework: Minimizing a cost function (MSE) to find optimal parameters.
- Optimization: Analytical (Ordinary Least Squares) vs. Iterative (Gradient Descent).
- Basis Functions: Expanding linear models to fit non-linear data using transformations (polynomials, splines).
- Runge’s Phenomenon: Overfitting risk with high-order global polynomials.
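To make "regression as loss minimization" concrete, here is a hedged sketch (toy data, not the lecture notebook) that solves the same linear model analytically via the normal equations and iteratively via gradient descent on the MSE:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 0.5 + 0.1 * rng.normal(size=100)   # toy ground truth: slope 2, intercept 0.5

X = np.column_stack([np.ones_like(x), x])        # design matrix with a bias column

# Analytical solution: ordinary least squares via the normal equations
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Iterative solution: batch gradient descent on the mean squared error
w = np.zeros(2)
lr = 0.1
for _ in range(2000):
    grad = 2 / len(y) * X.T @ (X @ w - y)        # gradient of the MSE
    w -= lr * grad

print(w_ols, w)                                  # both should approach [0.5, 2.0]
```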
6 Week 4 Summary: Neural Networks — From Neurons to CNNs
Lecture: Monday, 04.05.2026, 10:15-11:45
Slides: Open · Backprop self-study
- Fixed bases (Fourier / wavelet / polynomial) → motivation for learned representations
- The modern neuron and dense layer; why non-linear activations are non-negotiable
- Universal approximation vs the parameter explosion of dense layers on images
- Invariance vs equivariance: what we want from image models
- Convolution from weight sharing; cross-correlation, feature maps, receptive fields
- Padding, stride, pooling, \(1\times1\) channel mixing
- Architectures: LeNet → VGG → NiN → DenseNet → U-Net for dense prediction
- Why CNNs fit microscopy / materials data — and where they fall short
Note: Backpropagation is covered in the self-study supplement 02_backprop_self_study.qmd appended to this unit.
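As a small illustration of "convolution from weight sharing" (a sketch only, independent of the slides), the loop below applies one shared kernel across an image and passes the result through a ReLU, producing a valid-mode feature map:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide one shared kernel over the image."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.random.default_rng(0).normal(size=(8, 8))     # toy "micrograph"
edge_kernel = np.array([[1.0, -1.0]])                     # crude horizontal-gradient filter
feature_map = np.maximum(cross_correlate2d(image, edge_kernel), 0)  # ReLU activation
print(feature_map.shape)                                   # (8, 7): a valid feature map
```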
7 Week 5 Summary: Clustering & Autoencoders
Lecture: Monday, 11.05.2026, 10:15-11:45
Slides: Open
- K-Means / K-Medoids: hard clustering by minimizing within-cluster distance
- Sensitive to initialization (use k-means++); assumes spherical clusters
- GMM + EM: probabilistic clustering with soft assignments
- E-step: responsibilities; M-step: parameter update
- Each EM step never decreases the log-likelihood
- Autoencoders: encoder–bottleneck–decoder, trained on reconstruction loss
- Linear AE recovers PCA
- Non-linear AE captures curved manifolds
- Applications: compression, anomaly detection (reconstruction error), feature extraction
Note: Backpropagation has moved to a self-study supplement appended to Unit 4 (02_backprop_self_study.qmd); this freed the Unit 5 slot for unsupervised learning.
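A compact K-Means sketch on synthetic 2-D blobs (illustrative; it uses naive random initialization, whereas the lecture recommends k-means++):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # naive init (k-means++ is safer)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])  # two toy blobs
labels, centers = kmeans(X, k=2)
```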
8 Week 6 Summary: Loss Landscapes & Optimization Behavior
Lecture: Monday, 18.05.2026, 10:15-11:45
Slides: Open
- Loss Landscape: High-dimensional topography determining optimization success.
- Curvature: the Hessian describes local curvature; ill-conditioned landscapes (large condition numbers) slow gradient descent.
- Saddle Points: Common traps in high dimensions that hinder optimizers.
- Advanced Optimizers: Momentum and adaptive learning rates (Adam) navigate complex landscapes more robustly (see the sketch below).
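A minimal sketch of why ill-conditioning hurts plain gradient descent and why momentum helps, on a toy quadratic bowl with condition number 100 (all numbers are illustrative):

```python
import numpy as np

# Ill-conditioned quadratic bowl: f(w) = 0.5 * w^T H w, condition number 100
H = np.diag([1.0, 100.0])

def run(momentum, lr=0.009, steps=200):
    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        grad = H @ w                      # gradient of the quadratic
        v = momentum * v - lr * grad      # heavy-ball / momentum update
        w = w + v
    return np.linalg.norm(w)              # distance to the optimum at the origin

print(run(momentum=0.0))   # plain gradient descent: slow along the flat direction
print(run(momentum=0.9))   # momentum: far closer to the minimum after the same budget
```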
9 Week 7 Summary: Generalization, Bias-Variance, Regularization, Tree Ensembles
Lecture: Monday, 25.05.2026, 10:15-11:45
Slides: Open
- Generalization: performance on unseen data
- Bias–variance tradeoff: \(\mathrm{MSE} = \mathrm{Bias}^2 + \mathrm{Variance} + \mathrm{noise}\)
- Regularization: L2 (Ridge), L1 (Lasso), Dropout
- Validation: cross-validation as unbiased generalization estimate
- Never tune on the test set
- Random Forests: bagging + random feature subsets per split
- → variance reduction via decorrelated trees
- Gradient Boosting: sequential weak learners on residuals
- → bias reduction
- XGBoost / LightGBM / CatBoost as the practical workhorses
- Trees vs NNs: gradient boosting usually wins on tabular data (\(N < 10^5\))
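A short sketch contrasting the two tree ensembles on a synthetic tabular regression task, assuming scikit-learn is available (this snippet is illustrative, not the course exercise code; dataset and hyperparameters are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=500, noise=0.5, random_state=0)   # synthetic tabular data

rf = RandomForestRegressor(n_estimators=200, random_state=0)      # bagging: variance reduction
gb = GradientBoostingRegressor(n_estimators=200, random_state=0)  # boosting: bias reduction

for name, model in [("random forest", rf), ("gradient boosting", gb)]:
    scores = cross_val_score(model, X, y, cv=5)                   # 5-fold cross-validated R^2
    print(name, f"{scores.mean():.3f}")
```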
10 Week 8 Summary: Probabilistic View of Learning; Noise
Lecture: Monday, 01.06.2026, 10:15-11:45
Slides: Open
- Aleatoric (irreducible noise) vs epistemic (lack of data) uncertainty
- Gaussian as the maximum-entropy distribution; multivariate Gaussian and covariance geometry; CLT
- Entropy and KL divergence, including KL between Gaussians (used later for VAEs)
- MLE: log-likelihood maximization; for a Gaussian, MLE recovers MSE
- Bayes’ theorem workflow: prior · likelihood ∝ posterior; predictive distribution
- MAP estimation; MAP = regularized MLE
- Frequentist vs Bayesian comparison; credible vs confidence intervals
- Robustness: Student’s t-distribution mitigates outliers better than Gaussian
- Stochastic enrichment and mixture-density networks (preview of Unit 12)
- Practical diagnostic: calibration plots
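The two key identities above (Gaussian MLE reduces to least squares; MAP with a Gaussian prior reduces to ridge regression) fit in a few lines. A hedged NumPy sketch with toy data and arbitrarily chosen noise and prior scales:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 1.5 * x + rng.normal(scale=0.2, size=200)
X = np.column_stack([np.ones_like(x), x])

# MLE under Gaussian noise: minimizing the negative log-likelihood in w
# is exactly least squares (the sigma terms do not depend on w)
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP with a zero-mean Gaussian prior on w: the log-prior adds an L2 penalty,
# so MAP estimation is ridge regression with lambda = sigma^2 / tau^2
sigma2, tau2 = 0.2**2, 1.0**2
lam = sigma2 / tau2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print(w_mle, w_map)   # the MAP estimate is slightly shrunk toward zero
```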
11 Week 9 Summary: Latent Spaces & Advanced Representation Learning
Lecture: Monday, 08.06.2026, 10:15-11:45
Slides: Open
- What makes a latent space “good”
- Compactness within concept, separation across concepts
- Smooth interpolation, transferability
- None guaranteed by reconstruction alone
- t-SNE: KL between high-dim Gaussian and low-dim Student-t similarities
- Heavy tail solves the crowding problem
- Cluster sizes and between-cluster distances are not quantitatively meaningful
- UMAP: preserves more global structure than t-SNE; scales to millions of points
- Contrastive learning (SimCLR, InfoNCE): label-free latent shaping
- Pull augmentations of the same sample together, push others apart
- Foundation embeddings (DINOv2, CLIP): pretrained encoders as feature extractors
- Linear probe is often the strongest baseline for label-scarce tasks
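The contrastive objective is easier to grasp in code than in words. Below is a minimal InfoNCE / NT-Xent-style sketch in NumPy (one direction only, toy embeddings, illustrative temperature), showing that aligned augmentation pairs yield a much lower loss than unrelated pairs:

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE-style loss: z1[i], z2[i] are embeddings of two augmentations of sample i;
    all other rows act as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)   # cosine similarity via unit vectors
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                       # N x N similarity matrix
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                    # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))  # near-identical pairs: low loss
loss_random = info_nce(z, rng.normal(size=z.shape))              # unrelated pairs: high loss
print(loss_aligned, loss_random)
```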
12 Week 10 Summary: Attention & Transformers
Lecture: Monday, 15.06.2026, 10:15-11:45
Slides: Open
- Why attention: limits of RNNs / LSTMs for long sequences
- Self-attention: \(\mathrm{softmax}(QK^T/\sqrt{d_k})V\)
- Similarity-weighted average of value vectors with content-based weights
- Positions choose whom to listen to (no fixed locality prior)
- Multi-head attention: parallel heads on learned subspaces, then concat + project
- Different heads specialize on different relationships
- Positional encoding: needed because attention is permutation-equivariant
- Sinusoidal, learned, or RoPE
- Transformer block: multi-head attention + MLP with residual + LayerNorm; stack many
- ViT: image as a sequence of patch tokens
- Beats CNNs at scale; loses with little data (no locality prior)
- Foundation models: GPT, BERT, ViT, DINO, CLIP
- Pretrain at scale, freeze, reuse via embeddings + small heads
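Single-head scaled dot-product attention, \(\mathrm{softmax}(QK^T/\sqrt{d_k})V\), as a self-contained NumPy sketch (toy token count and dimensions, random projection matrices):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # each row: whom this token listens to
    return weights @ V                          # similarity-weighted average of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                     # 5 tokens, embedding dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)             # shape (5, 4); permutation-equivariant in the tokens
```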
13 Week 11 Summary: Generative Models — VAE & Diffusion
Lecture: Monday, 22.06.2026, 10:15-11:45
Slides: Open
- Why generative: vanilla autoencoders cannot sample new data
- VAE: stochastic encoder produces \(\mathcal{N}(\mu, \sigma^2)\)
- Trained on the ELBO = reconstruction − KL to a Gaussian prior
- Reparameterization trick \(z = \mu + \sigma \odot \epsilon\) for differentiable sampling
- Diffusion: predict the noise \(\epsilon_\theta(x_t, t)\) added at a random timestep
- Loss is plain MSE on the noise
- Sample from \(\mathcal{N}(0, I)\) and iterate the learned reverse process
- Classifier-free guidance: train conditional + unconditional jointly; mix at sampling
- Trade-offs:
- VAE: fast sampling, lower-bound likelihood, blurry samples
- Diffusion: slow sampling, state-of-the-art quality
- GANs: fast, no likelihood
- Normalizing flows: exact likelihood, restricted architectures
- Materials applications:
- Inverse design (compositions matching a target property)
- Microstructure generation
- Physics-constrained spectral synthesis
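The two VAE ingredients named above, the reparameterization trick and the analytic KL term of the ELBO, in a hedged NumPy sketch (pretend encoder outputs; the reconstruction term would come from a decoder and is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend encoder outputs for one small batch: mean and log-variance of q(z|x)
mu = rng.normal(size=(4, 2))
log_var = rng.normal(scale=0.1, size=(4, 2))

# Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable in (mu, sigma)
eps = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# Analytic KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims, averaged over the batch
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1).mean()

# ELBO = reconstruction term - kl  (decoder-side reconstruction omitted here)
print(z.shape, kl)
```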
14 Week 12 Summary: Uncertainty in Predictions
Lecture: Monday, 29.06.2026, 10:15-11:45
Slides: Open
- Why point predictions aren’t enough; aleatoric vs epistemic recap
- Bayesian predictive distribution and variance decomposition
- Evidence framework: marginal likelihood as automatic Occam’s razor; effective number of parameters; empirical Bayes
- Gaussian Processes as the unit’s main tool
- Mean and kernel function (RBF and others)
- Closed-form prior and posterior; hyperparameter learning
- Strengths and limits (\(O(n^3)\) cost, kernel choice)
- MC Dropout and deep ensembles as cheaper UQ
- Mixture-density networks for multi-modal predictive distributions
- Calibration plots and recalibration
- Choosing a UQ method: comparison table
- Active learning via GP uncertainty (materials acceleration platforms)
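A compact GP regression sketch with an RBF kernel (toy 1-D data, arbitrary length scale and noise level) showing the closed-form posterior mean and the epistemic error bars used for active learning:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.3, variance=1.0):
    d2 = (A[:, None] - B[None, :]) ** 2               # squared distances for 1-D inputs
    return variance * np.exp(-0.5 * d2 / length_scale**2)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 8)
y_train = np.sin(2 * np.pi * x_train) + 0.05 * rng.normal(size=8)
x_test = np.linspace(0, 1, 100)

noise = 0.05**2
K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
K_s = rbf_kernel(x_test, x_train)

alpha = np.linalg.solve(K, y_train)
mean = K_s @ alpha                                              # posterior mean
cov = rbf_kernel(x_test, x_test) - K_s @ np.linalg.solve(K, K_s.T)
std = np.sqrt(np.clip(np.diag(cov), 0, None))                   # epistemic uncertainty band
```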
15 Week 13 Summary: Physics-informed & Constrained Learning
Lecture: Monday, 06.07.2026, 10:15-11:45
Slides: Open
- PINNs: Integrating physical laws directly into the loss function to reduce data needs.
- Data Enrichment: Applying known mathematical transformations (FFT, derivatives).
- Neural Integrators: Using NNs with automatic differentiation as flexible differential equation solvers.
- Scientific Trust: Physics constraints act as powerful regularizers promoting Occam’s Razor.
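A toy illustration of the penalty idea behind PINNs (a sketch under simplifying assumptions, not a full PINN): the model is linear in its coefficients, so adding the ODE residual \(du/dt + ku = 0\) as a soft penalty at label-free collocation points still has a closed-form solution.

```python
import numpy as np

# Fit u(t) ~ exp(-k t) from 5 noisy points, with the ODE du/dt + k u = 0
# enforced as a soft penalty on 50 collocation points.
k, lam, degree = 2.0, 10.0, 4                          # illustrative physics and penalty weights
rng = np.random.default_rng(0)

t_data = np.array([0.0, 0.2, 0.5, 0.8, 1.0])
u_data = np.exp(-k * t_data) + 0.02 * rng.normal(size=t_data.shape)
t_col = np.linspace(0, 1, 50)                          # collocation points (no labels needed)

powers = np.arange(degree + 1)
Phi = t_data[:, None] ** powers                        # model: u(t) = sum_j c_j t^j
dPhi = powers * t_col[:, None] ** np.clip(powers - 1, 0, None)   # d/dt of each basis function
Phi_res = dPhi + k * t_col[:, None] ** powers          # ODE residual operator applied to the basis

# Minimize ||Phi c - u||^2 + lam * ||Phi_res c||^2  (data loss + physics penalty)
A = Phi.T @ Phi + lam * Phi_res.T @ Phi_res
c = np.linalg.solve(A, Phi.T @ u_data)

u_pred = (np.linspace(0, 1, 20)[:, None] ** powers) @ c   # physics-regularized fit
```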
16 Week 14 Summary: Explainability, Limits, Scientific Trust
Lecture: Monday, 13.07.2026, 10:15-11:45
Slides: Open
- Why XAI: from black-box to transparent, justifiable decisions; interpretability vs explainability
- Six levels of explainability (E1–E6): data → process → feature → model → prediction → decision
- Semantic structures: controlled vocabularies, taxonomies, ontologies for materials data
- Sensitivity analysis: perturbation-based, global vs local; feature importance from sensitivity
- Attribution methods:
- SHAP (waterfall, beeswarm)
- LIME (local linear approximation)
- Integrated Gradients for deep networks
- Causality vs correlation: causal process chains; detection vs prediction
- Counterfactuals: “what-if” explanations
- Limits: data bias, extrapolation, OOD detection, fairness — when models should NOT be trusted
- Course retrospective: how the 14-unit arc fits together
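As a final worked example of perturbation-based sensitivity, here is a hedged sketch of permutation feature importance on toy data (the `model_fn` stands in for any trained model; everything here is hypothetical):

```python
import numpy as np

def permutation_importance(model_fn, X, y, n_repeats=10, seed=0):
    """Perturbation-based global sensitivity: how much does shuffling one feature hurt?"""
    rng = np.random.default_rng(seed)
    base_error = np.mean((model_fn(X) - y) ** 2)
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])          # break the feature-target link
            drops.append(np.mean((model_fn(Xp) - y) ** 2) - base_error)
        importance[j] = np.mean(drops)                     # error increase = importance
    return importance

# Toy check: y depends strongly on feature 0, weakly on feature 1, not at all on feature 2
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 3 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * rng.normal(size=500)
model_fn = lambda Z: 3 * Z[:, 0] + 0.3 * Z[:, 1]           # stand-in for a trained model
print(permutation_importance(model_fn, X, y).round(2))
```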
17 Mathematical Foundations of AI & ML – Unified Syllabus Overview (with ML-PC & MG)
Legend
- ? First serious use – concept must be introduced in MFML before being used in ML-PC or MG
- ? Reinforcement / application – concept is applied or deepened, but not introduced
- (R) Refresher – topic was covered in a prior course and is only briefly revisited
- MFML Mathematical Foundations of AI & ML
- ML-PC Machine Learning in Materials Processing & Characterization
- MG Materials Genomics
| Week | MFML – Mathematical Foundations (revised) | ML-PC – ML in Materials Processing & Characterization (revised) | MG – Materials Genomics (revised) | Exercise (90 min, Python-based) | Dependency Logic |
|---|---|---|---|---|---|
| 1 | Learning vs data analysis; models, loss functions, prediction vs explanation | Role of ML in processing & characterization; ML vs physics models | Role of ML in materials discovery; databases & targets | NumPy refresher; vectors, dot products, simple loss (MSE) | MFML defines “learning” as optimization, not statistics |
| 2 | Linear algebra refresher for learning: covariance, PCA/SVD (R) | PCA as a tool for spectra & images (?) | PCA & low-D structure in materials spaces (?) | PCA refresher on known dataset; visualize variance directions | PCA assumed known; MFML aligns notation & geometry |
| 3 | Regression as loss minimization; linear models revisited | Regression as surrogate modeling for processes & properties (?) | Regression & correlation in materials datasets (?) | Linear regression from scratch via loss minimization | Regression reframed explicitly as learning problem |
| 4 | Neural networks early: neuron, activations, backprop | NN regression for materials properties (?) | NN models for structure-property relations (?) | Single-neuron + backprop (manual forward/backward pass) | MFML must precede any NN usage |
| 5 | Clustering & Autoencoders | Clustering & process drift detection (?) | Clustering vs discovery in materials space (?) | K-Means & simple Autoencoder implementation | MFML supplies unsupervised models |
| 6 | Loss landscapes, conditioning, optimization behavior | Hyperparameters, robustness, convergence issues (?) | Model robustness & sensitivity (?) | Gradient descent experiments: learning rate & conditioning | Optimization treated as learning dynamics |
| 7 | Generalization, bias-variance, regularization; tree ensembles (RF, gradient boosting) | Overfitting control in models (?); RF / XGBoost as workhorses for tabular characterization data (?) | Limits of high-D regression (?); tree ensembles for property prediction over tabular materials descriptors (?) | Overfitting demo: polynomial vs NN models; tree-ensemble baseline (RF & XGBoost) on alloy regression | Critical conceptual gate for both applied courses; introduces the tabular workhorse |
| 8 | Probabilistic view of learning: noise & likelihood | Noise-aware modeling & error propagation (?) | Noise & uncertainty in materials datasets (?) | Noise injection; likelihood vs MSE comparison | MFML reframes probability for ML |
| 9 | Latent spaces & advanced representation learning: t-SNE, UMAP, contrastive learning, foundation embeddings | Visualization & quality assessment of learned features (?) | Foundation embeddings for materials descriptors (?) | t-SNE/UMAP comparison; SimCLR + linear probe on a foundation embedding | Critical for both applied courses; sets up modern self-supervised paradigm |
| 10 | Attention & Transformers: self-attention, multi-head, positional encoding, ViT, foundation models | Transformers for sequences and images in characterization (?) | Transformers for compositions & sequences (SMILES, crystal tokens) (?) | Single-head attention from scratch; tiny ViT vs CNN on a small dataset | Architecture behind all modern foundation models |
| 11 | Generative Models: VAE & Diffusion (ELBO, reparameterization, forward/reverse process, classifier-free guidance) | Generative models for data augmentation & process simulation (?) | Inverse design via conditional generation (?) | VAE on Fashion-MNIST + toy diffusion model (200-line DDPM) | Enables inverse design and modern generative applications |
| 12 | Uncertainty in predictions (aleatoric vs epistemic); Gaussian Processes (conceptual) | Trust & confidence in ML-assisted decisions; surrogate models (?) | Discovery & screening with uncertainty; exploration vs exploitation (?) | Predictive uncertainty: GP regression vs NN ensembles | Enables responsible ML & accelerator concepts |
| 13 | Physics-informed & constrained learning | Physics-informed ML for processes & characterization (?) | Physical constraints in materials ML (?) | Constrained NN / penalty-based PINN demo | MFML leads constraints & PINN concepts |
| 14 | Explainability, limits, scientific trust | Integrated case studies & failure modes | Limits & ethics of data-driven discovery | Mini end-to-end synthesis project | All courses converge conceptually |