Mathematical Foundations of AI & ML

Authors

Philipp Pelz

Stefan Hiemer

Published

May 13, 2026

Keywords

Machine Learning, Artificial Intelligence, Mathematics, Linear Algebra, Probability, Optimization

ECLIPSE Lab Teaching

Mathematical Foundations of AI & ML

Foundations course for the ECLIPSE teaching track in machine learning, computational imaging, and materials data science.

Semester
Summer Semester 2026
Format
2h lecture + exercises
Credits
5 ECTS
Audience
Students in Materials Science and related quantitative programmes
Prerequisites
Linear algebra, calculus, and basic Python recommended
How to use this course site. Use this page as the central hub for syllabus, lecture structure, reading, notebooks, and course materials. Formal announcements and enrollment remain on StudOn; code and openly shared resources live in the linked GitHub repository.

1 Instructors

  • Philipp Pelz
  • Stefan Hiemer

2 References

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
McClarren, Ryan G. 2021. Machine Learning for Engineers: Using Data to Solve Problems for Physical Systems. Springer.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Neuer, Marcus J. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.

3 Week 1 Summary: Learning vs Data Analysis; Models, Loss Functions

Lecture: Monday, 13.04.2026, 10:15-11:45

Slides: Open

  • Models: simplified representations for prediction / explanation (white / grey / black-box)
  • Learning types: supervised (regression, classification), unsupervised, reinforcement
  • Empirical risk minimization: learning as optimization, not statistics
  • Loss zoo (sketched in code after this list):
    • Regression: MSE, MAE
    • Classification: 0-1, softmax + cross-entropy
  • Train / val / test splits, cross-validation, data-leakage taxonomy
  • Bias–variance intuition; Occam’s razor and regularization
  • Uncertainty preview: aleatoric vs epistemic
  • Limits: no-free-lunch, curse of dimensionality
  • Frequentist vs Bayesian lenses (set-up for Unit 8)
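
To make the loss zoo concrete, here is a minimal NumPy sketch of the losses listed above; the data, shapes, and function names are illustrative rather than taken from the course materials:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error for regression."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error: more robust to outliers than MSE."""
    return np.mean(np.abs(y_true - y_pred))

def softmax_cross_entropy(logits, labels):
    """Cross-entropy of softmax probabilities against integer class labels."""
    # Shift logits for numerical stability before exponentiating.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.3])
print(mse(y_true, y_pred), mae(y_true, y_pred))

logits = np.array([[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]])
labels = np.array([0, 1])
print(softmax_cross_entropy(logits, labels))
```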

4 Week 2 Summary: Linear Algebra Refresher; Covariance, PCA/SVD

Lecture: Monday, 20.04.2026, 10:15-11:45

Slides: Open

  • LA refresher: vector spaces, basis, rank; column / row / null spaces and identifiability
  • Projection geometry; least squares as projection onto column space
  • Condition number and numerical stability
  • Spectral decomposition of symmetric / PSD matrices; covariance-matrix geometry
  • PCA: linear dimensionality reduction by variance maximization
  • Scree plots for intrinsic dimensionality (“elbow”)
  • SVD: factorization for any matrix; low-rank approximation (Eckart–Young; see the sketch after this list)
  • NMF: parts-based decomposition for non-negative spectra / images
  • Pseudo-inverse and least-squares solvability
  • L1 vs L2 regularization (geometric intuition); whitening and multicollinearity
  • Kernel hint from inner products (sets up later units)
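
The following NumPy sketch ties the covariance, PCA, and SVD threads together; the synthetic data and the choice of rank \(k = 2\) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features

# Center, then SVD of the data matrix gives principal directions (rows of Vt).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_var = S**2 / (len(Xc) - 1)   # eigenvalues of the covariance matrix
scores = Xc @ Vt[:2].T                 # project onto the first two PCs
print("explained variance:", explained_var)
print("PC scores shape:", scores.shape)

# Eckart–Young: the best rank-k approximation keeps the top-k singular values.
k = 2
X_k = U[:, :k] * S[:k] @ Vt[:k]
print("rank-2 reconstruction error:", np.linalg.norm(Xc - X_k))
```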

5 Week 3 Summary: Regression as Loss Minimization

Lecture: Monday, 27.04.2026, 10:15-11:45

Slides: Open

  • Supervised Framework: Minimizing a cost function (MSE) to find optimal parameters.
  • Optimization: Analytical (Ordinary Least Squares) vs. Iterative (Gradient Descent); both are compared in the sketch after this list.
  • Basis Functions: Expanding linear models to fit non-linear data using transformations (polynomials, splines).
  • Runge’s Phenomenon: Overfitting risk with high-order global polynomials.
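
A minimal sketch of both optimization routes from the list above, fitting a polynomial basis analytically and by gradient descent (the dataset, degree, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.size)

# Polynomial basis expansion: a linear model in the transformed features.
degree = 3
Phi = np.vander(x, degree + 1, increasing=True)   # columns 1, x, x^2, x^3

# Analytical route: ordinary least squares.
w_ols = np.linalg.lstsq(Phi, y, rcond=None)[0]

# Iterative route: gradient descent on the same MSE.
w = np.zeros(degree + 1)
lr = 0.1
for _ in range(5000):
    grad = 2 / len(y) * Phi.T @ (Phi @ w - y)
    w -= lr * grad

print("OLS loss:", np.mean((Phi @ w_ols - y) ** 2))
print("GD  loss:", np.mean((Phi @ w - y) ** 2))
# Re-run with a much higher degree to watch Runge-style oscillations appear.
```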

6 Week 4 Summary: Neural Networks — From Neurons to CNNs

Lecture: Monday, 04.05.2026, 10:15-11:45

Slides: Open · Backprop self-study

  • Fixed bases (Fourier / wavelet / polynomial) → motivation for learned representations
  • The modern neuron and dense layer; why non-linear activations are non-negotiable
  • Universal approximation vs the parameter explosion of dense layers on images
  • Invariance vs equivariance: what we want from image models
  • Convolution from weight sharing; cross-correlation, feature maps, receptive fields (see the sketch below)
  • Padding, stride, pooling, \(1\times1\) channel mixing
  • Architectures: LeNet → VGG → NiN → DenseNet → U-Net for dense prediction
  • Why CNNs fit microscopy / materials data — and where they fall short

Note: Backpropagation is covered in the self-study supplement 02_backprop_self_study.qmd appended to this unit.
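
To ground the convolution bullet above, here is a minimal NumPy sketch of "valid" cross-correlation. A CNN learns its kernels from data; the fixed Sobel kernel below is purely illustrative:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """'Valid' 2D cross-correlation: what a conv layer actually computes."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # One output pixel = inner product of the kernel with a patch;
            # the same weights are reused at every location (weight sharing).
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])    # edge detector as a fixed kernel
print(cross_correlate2d(image, sobel_x))   # a 3x3 feature map
```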

7 Week 5 Summary: Clustering & Autoencoders

Lecture: Monday, 11.05.2026, 10:15-11:45

Slides: Open

  • K-Means / K-Medoids: hard clustering by minimizing within-cluster distance (sketched in code after this list)
    • Sensitive to initialization (use k-means++); assumes spherical clusters
  • GMM + EM: probabilistic clustering with soft assignments
    • E-step: responsibilities; M-step: parameter update
    • Each EM step never decreases the log-likelihood
  • Autoencoders: encoder–bottleneck–decoder, trained on reconstruction loss
    • Linear AE recovers PCA
    • Non-linear AE captures curved manifolds
  • Applications: compression, anomaly detection (reconstruction error), feature extraction
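
A minimal Lloyd's-iteration sketch of K-Means as described above, with plain random initialization for brevity (the lecture's k-means++ recommendation still applies; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm on a point cloud X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center moves to the mean of its cluster.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(100, 2)) for loc in ([0, 0], [3, 3])])
labels, centers = kmeans(X, k=2)
print(centers)
```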

Note: Backpropagation has moved to a self-study supplement appended to Unit 4 (02_backprop_self_study.qmd); this freed the Unit 5 slot for unsupervised learning.

8 Week 6 Summary: Loss Landscapes & Optimization Behavior

Lecture: Monday, 18.05.2026, 10:15-11:45

Slides: Open

  • Loss Landscape: High-dimensional topography determining optimization success.
  • Curvature: The Hessian matrix encodes local curvature; ill-conditioned landscapes slow gradient descent.
  • Saddle Points: Common traps in high dimensions that hinder optimizers.
  • Advanced Optimizers: Momentum and adaptive learning rates (Adam) navigate complex landscapes more robustly; see the sketch below.
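
A minimal sketch of why momentum helps on ill-conditioned landscapes: plain gradient descent against heavy-ball momentum on a toy quadratic with condition number 100 (the learning rate and momentum value are illustrative):

```python
import numpy as np

# Ill-conditioned quadratic loss: f(w) = 0.5 * w^T H w, condition number 100.
H = np.diag([1.0, 100.0])

def run(momentum, lr=0.009, steps=200):
    w = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        grad = H @ w
        v = momentum * v - lr * grad   # momentum accumulates past gradients
        w = w + v
    return w

# The flat direction (eigenvalue 1) dominates the runtime; momentum
# makes progress along it without destabilizing the steep direction.
print("plain GD :", run(momentum=0.0))
print("momentum :", run(momentum=0.9))
```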

9 Week 7 Summary: Generalization, Bias-Variance, Regularization, Tree Ensembles

Lecture: Monday, 25.05.2026, 10:15-11:45

Slides: Open

  • Generalization: performance on unseen data
  • Bias–variance tradeoff: \(\mathrm{MSE} = \mathrm{Bias}^2 + \mathrm{Variance} + \mathrm{noise}\)
  • Regularization: L2 (Ridge), L1 (Lasso), Dropout
  • Validation: cross-validation as unbiased generalization estimate (see the sketch after this list)
    • Never tune on the test set
  • Random Forests: bagging + random feature subsets per split
    • → variance reduction via decorrelated trees
  • Gradient Boosting: sequential weak learners on residuals
    • → bias reduction
    • XGBoost / LightGBM / CatBoost as the practical workhorses
  • Trees vs NNs: gradient boosting usually wins on tabular data (\(N < 10^5\))
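
A minimal sketch of the validation discipline above: closed-form ridge regression tuned on a held-out validation split, never on the test set (the synthetic data and alpha grid are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=80)
y = np.sin(2 * x) + 0.2 * rng.normal(size=x.size)
Phi = np.vander(x, 10, increasing=True)   # degree-9 polynomial features

# Split once; tune the regularization strength on validation only.
idx = rng.permutation(len(y))
tr, va = idx[:50], idx[50:]

def ridge(Phi, y, alpha):
    """Closed-form L2-regularized least squares."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + alpha * np.eye(d), Phi.T @ y)

for alpha in [0.0, 1e-3, 1e-1, 1e1]:
    w = ridge(Phi[tr], y[tr], alpha)
    val_mse = np.mean((Phi[va] @ w - y[va]) ** 2)
    print(f"alpha={alpha:g}  validation MSE={val_mse:.4f}")
```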

10 Week 8 Summary: Probabilistic View of Learning; Noise

Lecture: Monday, 01.06.2026, 10:15-11:45

Slides: Open

  • Aleatoric (irreducible noise) vs epistemic (lack of data) uncertainty
  • Gaussian as the maximum-entropy distribution; multivariate Gaussian and covariance geometry; CLT
  • Entropy and KL divergence, including KL between Gaussians (used later for VAEs)
  • MLE: log-likelihood maximization; for a Gaussian, MLE recovers MSE (demonstrated in the sketch after this list)
  • Bayes’ theorem workflow: posterior ∝ likelihood · prior; predictive distribution
  • MAP estimation; MAP = regularized MLE
  • Frequentist vs Bayesian comparison; credible vs confidence intervals
  • Robustness: Student’s t-distribution mitigates outliers better than Gaussian
  • Stochastic enrichment and mixture-density networks (preview of Unit 12)
  • Practical diagnostic: calibration plots
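
A minimal numerical check of the "MLE recovers MSE" bullet: for Gaussian noise with fixed variance, the negative log-likelihood and the MSE are minimized by the same parameter (the synthetic data and grid search are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=2.5, scale=1.0, size=1000)

def gaussian_nll(mu, y, sigma=1.0):
    """Negative log-likelihood of i.i.d. Gaussian observations."""
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (y - mu) ** 2 / (2 * sigma**2))

# Up to constants, the NLL is a scaled sum of squared errors, so the
# MLE for mu coincides with the MSE minimizer: the sample mean.
grid = np.linspace(0, 5, 501)
nll = [gaussian_nll(m, y) for m in grid]
mse = [np.mean((y - m) ** 2) for m in grid]
print("argmin NLL :", grid[np.argmin(nll)])
print("argmin MSE :", grid[np.argmin(mse)])
print("sample mean:", y.mean())
```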

11 Week 9 Summary: Latent Spaces & Advanced Representation Learning

Lecture: Monday, 08.06.2026, 10:15-11:45

Slides: Open

  • What makes a latent space “good”
    • Compactness within concept, separation across concepts
    • Smooth interpolation, transferability
    • None guaranteed by reconstruction alone
  • t-SNE: KL between high-dim Gaussian and low-dim Student-t similarities
    • Heavy tail solves the crowding problem
    • Cluster sizes and between-cluster distances are not quantitatively meaningful
  • UMAP: preserves more global structure than t-SNE; scales to millions of points
  • Contrastive learning (SimCLR, InfoNCE): label-free latent shaping (loss sketched after this list)
    • Pull augmentations of the same sample together, push others apart
  • Foundation embeddings (DINOv2, CLIP): pretrained encoders as feature extractors
    • Linear probe is often the strongest baseline for label-scarce tasks
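
A simplified NumPy sketch of the InfoNCE loss mentioned above, using only cross-view pairs as negatives (full SimCLR also uses within-view negatives; the embeddings here are random stand-ins for encoder outputs):

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE over a batch: row i of z1 and row i of z2 are a positive pair."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                  # temperature-scaled cosine similarities
    sim -= sim.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; the rest of each row acts as negatives.
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(4)
z = rng.normal(size=(8, 16))
z_aug = z + 0.05 * rng.normal(size=z.shape)   # stand-in for an augmented view
print(info_nce(z, z_aug))                     # low loss: pairs already aligned
print(info_nce(z, rng.normal(size=z.shape)))  # higher loss: random pairing
```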

12 Week 10 Summary: Attention & Transformers

Lecture: Monday, 15.06.2026, 10:15-11:45

Slides: Open

  • Why attention: limits of RNNs / LSTMs for long sequences
  • Self-attention: \(\mathrm{softmax}(QK^T/\sqrt{d_k})V\) (implemented in the sketch after this list)
    • Similarity-weighted average of value vectors with content-based weights
    • Positions choose whom to listen to (no fixed locality prior)
  • Multi-head attention: parallel heads on learned subspaces, then concat + project
    • Different heads specialize on different relationships
  • Positional encoding: needed because attention is permutation-equivariant
    • Sinusoidal, learned, or RoPE
  • Transformer block: multi-head attention + MLP with residual + LayerNorm; stack many
  • ViT: image as a sequence of patch tokens
    • Beats CNNs at scale; loses with little data (no locality prior)
  • Foundation models: GPT, BERT, ViT, DINO, CLIP
    • Pretrain at scale, freeze, reuse via embeddings + small heads
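
A minimal NumPy implementation of the single-head formula above; the dimensions and weight matrices are illustrative, and real transformers add masking, batching, and multiple heads:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Content-based weights: each position scores every other position.
    A = softmax(Q @ K.T / np.sqrt(d_k))
    return A @ V   # similarity-weighted average of value vectors

rng = np.random.default_rng(5)
n_tokens, d_model, d_k = 6, 32, 8
X = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)   # (6, 8): one updated representation per token
```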

13 Week 11 Summary: Generative Models — VAE & Diffusion

Lecture: Monday, 22.06.2026, 10:15-11:45

Slides: Open

  • Why generative: vanilla autoencoders cannot sample new data
  • VAE: stochastic encoder outputs the parameters of \(\mathcal{N}(\mu, \sigma^2)\)
    • Trained on the ELBO = reconstruction − KL to a Gaussian prior
    • Reparameterization trick \(z = \mu + \sigma \odot \epsilon\) for differentiable sampling (sketched after this list)
  • Diffusion: predict the noise \(\epsilon_\theta(x_t, t)\) added at a random timestep
    • Loss is plain MSE on the noise
    • Sample from \(\mathcal{N}(0, I)\) and iterate the learned reverse process
  • Classifier-free guidance: train conditional + unconditional jointly; mix at sampling
  • Trade-offs:
    • VAE: fast sampling, lower-bound likelihood, blurry samples
    • Diffusion: slow sampling, state-of-the-art quality
    • GANs: fast, no likelihood
    • Normalizing flows: exact likelihood, restricted architectures
  • Materials applications:
    • Inverse design (compositions matching a target property)
    • Microstructure generation
    • Physics-constrained spectral synthesis
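
A minimal sketch of the two VAE ingredients above, the reparameterization trick and the closed-form KL to a standard-normal prior, on stand-in encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(6)

def reparameterize(mu, log_var):
    """z = mu + sigma * eps keeps sampling differentiable w.r.t. mu and sigma."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

# Stand-in encoder outputs for one sample with a 4-dimensional latent.
mu = np.array([0.5, -0.2, 0.0, 1.0])
log_var = np.array([-1.0, -0.5, 0.0, -2.0])

z = reparameterize(mu, log_var)
print("sampled z:", z)
print("KL term  :", kl_to_standard_normal(mu, log_var))
# ELBO = reconstruction term - KL term; the KL pulls the posterior toward
# the prior so that decoding fresh N(0, I) samples produces sensible data.
```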

14 Week 12 Summary: Uncertainty in Predictions

Lecture: Monday, 29.06.2026, 10:15-11:45

Slides: Open

  • Why point predictions aren’t enough; aleatoric vs epistemic recap
  • Bayesian predictive distribution and variance decomposition
  • Evidence framework: marginal likelihood as automatic Occam’s razor; effective number of parameters; empirical Bayes
  • Gaussian Processes as the unit’s main tool
    • Mean and kernel function (RBF and others)
    • Closed-form prior and posterior; hyperparameter learning (posterior equations sketched after this list)
    • Strengths and limits (\(O(n^3)\) cost, kernel choice)
  • MC Dropout and deep ensembles as cheaper UQ
  • Mixture-density networks for multi-modal predictive distributions
  • Calibration plots and recalibration
  • Choosing a UQ method: comparison table
  • Active learning via GP uncertainty (materials acceleration platforms)
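
A minimal NumPy sketch of the closed-form GP posterior referenced above, with an RBF kernel and fixed hyperparameters (the training points and noise level are illustrative; the linear solve is the \(O(n^3)\) step):

```python
import numpy as np

def rbf(A, B, length=0.5, var=1.0):
    """RBF (squared-exponential) kernel between two sets of 1-D inputs."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return var * np.exp(-0.5 * d2 / length**2)

rng = np.random.default_rng(7)
X = np.array([-1.5, -0.5, 0.3, 1.2])             # training inputs
y = np.sin(2 * X) + 0.05 * rng.normal(size=4)    # noisy observations
Xs = np.linspace(-2, 2, 5)                       # test inputs
noise = 0.05**2

# Closed-form GP posterior with a zero prior mean.
K = rbf(X, X) + noise * np.eye(len(X))
Ks = rbf(X, Xs)
Kss = rbf(Xs, Xs)
alpha = np.linalg.solve(K, y)
mean = Ks.T @ alpha
cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
print("posterior mean:", mean)
print("posterior std :", np.sqrt(np.diag(cov)))
```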

15 Week 13 Summary: Physics-informed & Constrained Learning

Lecture: Monday, 06.07.2026, 10:15-11:45

Slides: Open

  • PINNs: Integrating physical laws directly into the loss function to reduce data needs (a minimal sketch follows this list).
  • Data Enrichment: Applying known mathematical transformations (FFT, derivatives).
  • Neural Integrators: Using NNs with automatic differentiation as flexible differential equation solvers.
  • Scientific Trust: Physics constraints act as powerful regularizers promoting Occam’s Razor.
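
A minimal penalty-based PINN sketch, assuming PyTorch is available: the network is fitted to the residual of \(du/dx = -u\) with \(u(0) = 1\) instead of to labeled data (architecture, learning rate, and collocation grid are illustrative):

```python
import torch

# Toy PINN: learn u(x) satisfying du/dx = -u with u(0) = 1 (exact: exp(-x)).
# No labeled data: the loss is the physics residual plus a boundary penalty.
torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

x = torch.linspace(0.0, 2.0, 64).reshape(-1, 1).requires_grad_(True)
x0 = torch.zeros(1, 1)

for step in range(2000):
    u = net(x)
    # Automatic differentiation gives du/dx at the collocation points.
    du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    residual = torch.mean((du + u) ** 2)           # enforce du/dx = -u
    boundary = torch.mean((net(x0) - 1.0) ** 2)    # enforce u(0) = 1
    loss = residual + boundary                     # physics acts as the regularizer
    opt.zero_grad()
    loss.backward()
    opt.step()

print(net(torch.tensor([[1.0]])).item())   # should approach exp(-1) ~ 0.368
```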

16 Week 14 Summary: Explainability, Limits, Scientific Trust

Lecture: Monday, 13.07.2026, 10:15-11:45

Slides: Open

  • Why XAI: from black-box to transparent, justifiable decisions; interpretability vs explainability
  • Six levels of explainability (E1–E6): data → process → feature → model → prediction → decision
  • Semantic structures: controlled vocabularies, taxonomies, ontologies for materials data
  • Sensitivity analysis: perturbation-based, global vs local; feature importance from sensitivity (permutation importance sketched after this list)
  • Attribution methods:
    • SHAP (waterfall, beeswarm)
    • LIME (local linear approximation)
    • Integrated Gradients for deep networks
  • Causality vs correlation: causal process chains; detection vs prediction
  • Counterfactuals: “what-if” explanations
  • Limits: data bias, extrapolation, OOD detection, fairness — when models should NOT be trusted
  • Course retrospective: how the 14-unit arc fits together
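
One concrete perturbation-based sensitivity method is permutation importance, sketched below on a synthetic linear problem (the data and model are illustrative stand-ins for any fitted predictor):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.0 * X[:, 2] + 0.1 * rng.normal(size=500)

# Any fitted model works here; a linear fit keeps the sketch self-contained.
w = np.linalg.lstsq(X, y, rcond=None)[0]

def predict(A):
    return A @ w

base_mse = np.mean((predict(X) - y) ** 2)

# Permutation importance: shuffle one feature, measure how much the error grows.
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importance = np.mean((predict(Xp) - y) ** 2) - base_mse
    print(f"feature {j}: importance = {importance:.3f}")
```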

17 Mathematical Foundations of AI & ML – Unified Syllabus Overview (with ML-PC & MG)

Legend

  • (I) First serious use – concept must be introduced in MFML before being used in ML-PC or MG
  • (A) Reinforcement / application – concept is applied or deepened, but not introduced
  • (R) Refresher – topic was covered in a prior course and is only briefly revisited
  • MFML – Mathematical Foundations of AI & ML
  • ML-PC – Machine Learning in Materials Processing & Characterization
  • MG – Materials Genomics
| Week | MFML – Mathematical Foundations (revised) | ML-PC – ML in Materials Processing & Characterization (revised) | MG – Materials Genomics (revised) | Exercise (90 min, Python-based) | Dependency Logic |
|------|-------------------------------------------|------------------------------------------------------------------|-----------------------------------|----------------------------------|------------------|
| 1 | Learning vs data analysis; models, loss functions, prediction vs explanation | Role of ML in processing & characterization; ML vs physics models | Role of ML in materials discovery; databases & targets | NumPy refresher; vectors, dot products, simple loss (MSE) | MFML defines “learning” as optimization, not statistics |
| 2 | Linear algebra refresher for learning: covariance, PCA/SVD (R) | PCA as a tool for spectra & images (A) | PCA & low-D structure in materials spaces (A) | PCA refresher on known dataset; visualize variance directions | PCA assumed known; MFML aligns notation & geometry |
| 3 | Regression as loss minimization; linear models revisited | Regression as surrogate modeling for processes & properties (A) | Regression & correlation in materials datasets (A) | Linear regression from scratch via loss minimization | Regression reframed explicitly as learning problem |
| 4 | Neural networks early: neuron, activations, backprop | NN regression for materials properties (A) | NN models for structure-property relations (A) | Single-neuron + backprop (manual forward/backward pass) | MFML must precede any NN usage |
| 5 | Clustering & autoencoders | Clustering & process drift detection (A) | Clustering vs discovery in materials space (A) | K-Means & simple autoencoder implementation | MFML supplies unsupervised models |
| 6 | Loss landscapes, conditioning, optimization behavior | Hyperparameters, robustness, convergence issues (A) | Model robustness & sensitivity (A) | Gradient descent experiments: learning rate & conditioning | Optimization treated as learning dynamics |
| 7 | Generalization, bias-variance, regularization; tree ensembles (RF, gradient boosting) | Overfitting control in models (A); RF / XGBoost as workhorses for tabular characterization data (A) | Limits of high-D regression (A); tree ensembles for property prediction over tabular materials descriptors (A) | Overfitting demo: polynomial vs NN models; tree-ensemble baseline (RF & XGBoost) on alloy regression | Critical conceptual gate for both applied courses; introduces the tabular workhorse |
| 8 | Probabilistic view of learning: noise & likelihood | Noise-aware modeling & error propagation (A) | Noise & uncertainty in materials datasets (A) | Noise injection; likelihood vs MSE comparison | MFML reframes probability for ML |
| 9 | Latent spaces & advanced representation learning: t-SNE, UMAP, contrastive learning, foundation embeddings | Visualization & quality assessment of learned features (A) | Foundation embeddings for materials descriptors (A) | t-SNE/UMAP comparison; SimCLR + linear probe on a foundation embedding | Critical for both applied courses; sets up modern self-supervised paradigm |
| 10 | Attention & transformers: self-attention, multi-head, positional encoding, ViT, foundation models | Transformers for sequences and images in characterization (A) | Transformers for compositions & sequences (SMILES, crystal tokens) (A) | Single-head attention from scratch; tiny ViT vs CNN on a small dataset | Architecture behind all modern foundation models |
| 11 | Generative models: VAE & diffusion (ELBO, reparameterization, forward/reverse process, classifier-free guidance) | Generative models for data augmentation & process simulation (A) | Inverse design via conditional generation (A) | VAE on Fashion-MNIST + toy diffusion model (200-line DDPM) | Enables inverse design and modern generative applications |
| 12 | Uncertainty in predictions (aleatoric vs epistemic); Gaussian Processes (conceptual) | Trust & confidence in ML-assisted decisions; surrogate models (A) | Discovery & screening with uncertainty; exploration vs exploitation (A) | Predictive uncertainty: GP regression vs NN ensembles | Enables responsible ML & accelerator concepts |
| 13 | Physics-informed & constrained learning | Physics-informed ML for processes & characterization (A) | Physical constraints in materials ML (A) | Constrained NN / penalty-based PINN demo | MFML leads constraints & PINN concepts |
| 14 | Explainability, limits, scientific trust | Integrated case studies & failure modes | Limits & ethics of data-driven discovery | Mini end-to-end synthesis project | All courses converge conceptually |