Materials Genomics
Unit 7: Regression and Generalization in Materials Data

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


01. Title: Regression and Generalization in Materials Data

  • Unit 6 asked how to represent a material.
  • Unit 7 asks how to predict a property from that representation and how to decide whether the result is scientifically trustworthy.
  • In materials ML, regression and evaluation must be taught together because weak validation can make a weak model look strong.

02. Learning outcomes

By the end of this unit, students should be able to:

  • formulate a materials-property prediction task as a supervised regression problem
  • explain the role of linear, regularized, and nonlinear baselines
  • choose evaluation metrics that match the scientific question
  • explain why random splits are often misleading in materials datasets
  • diagnose models using grouped splits, learning curves, residual analysis, and leakage checks

03. Connection to Unit 6

  • Unit 6 produced material-level feature vectors from local environments.
  • Those vectors are now the inputs x.
  • The target y might be band gap, formation energy, elastic modulus, conductivity, or another property.

What changes is not only the objective function, but the meaning of success.

04. Why this unit matters

  • A low error on a random split does not imply discovery power.
  • A complex model that beats a linear baseline by a tiny margin may not justify its complexity.
  • A benchmark with leakage or near-duplicate structures can produce impressive but scientifically empty results.

This unit is therefore about honest predictive modeling.

05. Regression as empirical risk minimization

We fit model parameters theta by minimizing

hat(theta) = arg min_theta (1/N) sum_i ell(f_theta(x_i), y_i) + lambda Omega(theta)

  • ell is the prediction loss
  • Omega regularizes model complexity
  • lambda balances fit against stability

This is the standard machine-learning view, but the scientific question lies in what data and what split we choose.

06. Choosing the target variable

Different targets imply different regression problems:

  • band gap: often sensitive to chemistry and electronic-structure approximations
  • energy above hull: connected to thermodynamic competition and reference states
  • elastic modulus: may have broad scales and anisotropy effects
  • conductivity: may be noisy, sparse, and strongly nonlocal

Target choice determines both learnability and the meaning of evaluation.

07. Noise and target realism

  • Some targets are noisy because the underlying calculations are approximate.
  • Some targets mix multiple physical mechanisms.
  • Some datasets combine values generated under different computational settings.

A regression model cannot be more reliable than the target definition allows. This is why “dataset quality” is not a background issue; it is part of the modeling problem.

08. Start with the simplest honest baseline

The first baseline should usually be linear regression:

hat(y) = w^T x + b

This tests whether the feature representation already contains a strong approximately linear signal. A simple model is not a weak baseline. It is a diagnostic tool for understanding whether additional flexibility is actually needed.
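A minimal sketch of this baseline in scikit-learn, using synthetic stand-in features (the dataset sizes, feature count, and noise level are illustrative assumptions, not values from the unit):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 "materials", 10 descriptor features,
# a mostly linear target plus small noise.
X = rng.normal(size=(200, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=200)

# Fit on the first 150 examples, evaluate on the held-out 50.
model = LinearRegression().fit(X[:150], y[:150])
mae = mean_absolute_error(y[150:], model.predict(X[150:]))
print(f"held-out MAE: {mae:.3f}")
```

If the held-out MAE of this one-line baseline is already close to the noise floor, added model complexity has little room to help.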

09. Least squares as projection

With mean squared error, linear regression chooses the vector Xw that best approximates y in the column space of X.

  • if the representation is informative, even a linear model may work well
  • if the representation is poor, increasing model complexity may only fit noise

The geometry reminds us that model quality and feature quality cannot be separated cleanly.

10. Ridge regression

Ridge regularization modifies least squares to

hat(w) = arg min_w ||y - Xw||_2^2 + lambda ||w||_2^2

  • correlated features are common in materials descriptors
  • ridge shrinks coefficients and stabilizes the fit
  • the model becomes less sensitive to sampling noise

Ridge is therefore one of the most important baseline models in materials regression.
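The stabilizing effect is easy to demonstrate with two nearly collinear descriptors, a situation common for engineered materials features (the data and the alpha value below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
# Two nearly collinear descriptors: x2 is almost a duplicate of x1.
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS coefficients can explode in opposite directions along the
# near-null direction; ridge shrinks them to a stable shared solution.
print("OLS  :", ols.coef_)
print("Ridge:", ridge.coef_)
```

The two models give nearly identical predictions on this data, but only the ridge coefficients remain interpretable and stable under resampling.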

11. Lasso and elastic net

  • Lasso adds an L1 penalty and can set coefficients exactly to zero.
  • Elastic net mixes L1 and L2 penalties.

These methods are useful when the feature space is large, redundant, or partly irrelevant. They can improve both stability and interpretability, especially for engineered descriptors.
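A sketch of sparsity in action, assuming a descriptor set where only a few features carry signal (the feature counts and the alpha value are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
# 50 descriptors, but only the first 3 actually influence the target.
X = rng.normal(size=(300, 50))
y = 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=300)

# The L1 penalty drives most irrelevant coefficients exactly to zero.
lasso = Lasso(alpha=0.2).fit(X, y)
n_active = int(np.sum(lasso.coef_ != 0))
print(f"non-zero coefficients: {n_active} of 50")
```

Inspecting which coefficients survive is itself a crude but useful form of descriptor selection.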

12. Why simple baselines remain scientifically valuable

  • they expose the information content of the representation
  • they are interpretable
  • they often perform surprisingly well in small-data regimes
  • they provide a realistic bar for more complex models

If a large neural model barely improves on ridge under a strong split, the scientific story is not “deep learning wins.”

13. Nonlinear baselines

Materials-property relations are often nonlinear, so we also consider models such as:

  • decision trees
  • random forests and gradient-boosted trees
  • shallow neural networks

These models can capture interactions between features, but they also raise the risk of overfitting and split exploitation.

14. Baseline comparisons must be fair

A fair benchmark requires:

  • the same split
  • the same feature preprocessing
  • the same target transformation
  • the same reporting metrics

Without this, model comparisons collapse into implementation accidents rather than scientific evidence.
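One way to enforce these conditions is to hold the split, preprocessing, and metric fixed while only the final estimator varies; a sketch with synthetic data (model choices and hyperparameters here are illustrative assumptions):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 8))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=300)

# One fixed split, one preprocessing recipe, one metric for every model.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "forest": make_pipeline(StandardScaler(),
                            RandomForestRegressor(random_state=0)),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = mean_absolute_error(y_te, model.predict(X_te))
    print(f"{name}: MAE = {scores[name]:.3f}")
```

Because everything except the estimator is shared, any score difference reflects the model class rather than an implementation accident.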

15. Bias and variance in materials ML

  • High bias: the model is too simple to capture the relationship.
  • High variance: the model is too sensitive to idiosyncrasies of the training set.

Materials datasets are often modest in size and highly structured. That makes variance control particularly important.

16. Overfitting is often hidden by the split

  • In image tasks, overfitting may show up as a large train-test gap.
  • In materials tasks, overfitting can hide behind a random split because close chemical relatives appear in both sets.
  • The model then looks better than it actually is at extrapolating.

This is why split design is the center of the unit.

17. Metrics: what they do and do not say

  • MAE measures average absolute error in physical units.
  • RMSE penalizes rare large errors more strongly.
  • R^2 measures explained variance relative to a constant baseline.

Metrics summarize error on the chosen evaluation set. They do not certify that the evaluation set represents the real deployment challenge.
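The different sensitivities are visible on a toy example with one rare large error (the numbers and the eV unit are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 10.0])  # one rare large value
y_pred = np.array([1.1, 2.1, 2.9, 4.2, 7.0])   # one rare large error

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
# RMSE is dominated by the single 3 eV miss; MAE is not.
print(f"MAE = {mae:.2f} eV, RMSE = {rmse:.2f} eV, R^2 = {r2:.2f}")
```

Here MAE = 0.70 eV while RMSE is roughly twice as large, driven almost entirely by the single outlier.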

18. Metric choice should match the decision problem

  • Use MAE when average physical deviation matters.
  • Use RMSE when large mistakes are especially costly.
  • Use ranking metrics if the goal is candidate prioritization rather than calibrated prediction.

The correct metric comes from the scientific action that follows the prediction.

19. Random train-test splits

  • Random splits are easy to implement and often look statistically clean.
  • But in materials datasets they commonly place very similar compounds, prototypes, or polymorphs in both train and test.
  • The model therefore receives a much easier task than the intended discovery scenario.

Random splits are not always wrong, but their scientific meaning is often narrow.

20. Grouped chemistry-aware splits

  • A grouped split withholds entire chemistry families or composition groups.
  • This tests whether the model transfers beyond near neighbors.
  • It is typically harsher, but it is much closer to the discovery question “can we predict a new family?”

If the claim is generalization to new chemistry, the split must enforce new chemistry.
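Grouped withholding can be expressed directly with scikit-learn's group-aware splitters; the family labels below are hypothetical stand-ins for real chemistry annotations:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 5))
# Hypothetical chemistry-family labels: 12 families of 10 compounds each.
groups = np.repeat(np.arange(12), 10)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=groups))

# By construction, no family appears on both sides of the split.
overlap = set(groups[train_idx]) & set(groups[test_idx])
print("shared families:", overlap)
```

`GroupKFold` provides the cross-validated analogue when every family should appear in the test role exactly once.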

21. Prototype-aware splits

  • Chemistry shift and structure shift are not the same.
  • A model might generalize across composition changes inside one structural motif but fail on a new prototype.
  • Prototype-aware splits therefore test a different and often harder claim.

This distinction matters for any paper that talks about extrapolation.

22. Validation, test, and model selection

  • training data fit parameters
  • validation data select hyperparameters and model family
  • test data estimate final performance once

If the test set is used repeatedly during tuning, the final performance estimate becomes optimistic.

23. Nested validation, conceptually

Nested validation separates model selection from final evaluation. It is especially useful when datasets are small and many hyperparameter choices are possible.

The practical message is simple: every tuning decision spends evaluation credibility. Nested validation limits that damage.
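A compact sketch of nested validation: hyperparameter search runs in an inner loop, while the outer loop sees each fold only once for scoring (the model family and grid are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 10))
y = X[:, 0] + 0.1 * rng.normal(size=150)

# Inner loop: hyperparameter selection on the training portion only.
inner = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                     cv=KFold(5, shuffle=True, random_state=0))
# Outer loop: unbiased performance estimate of the whole procedure.
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(5, shuffle=True, random_state=1),
                               scoring="neg_mean_absolute_error")
print("outer-fold MAE:", -outer_scores)
```

The outer score evaluates the entire pipeline including tuning, so no fold is ever both tuned on and reported.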

24. Feature-target leakage

Leakage in materials ML often appears as:

  • duplicate or nearly duplicate structures across splits
  • preprocessing fit on the full dataset
  • target-adjacent quantities sneaking into the feature set
  • data-curation steps that accidentally mix train and test information

Leakage is dangerous because it produces a high score without genuine generalization.
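The preprocessing variant of leakage has a standard fix: fit the transformer inside each training fold via a pipeline. A sketch contrasting the two patterns (for plain standardization the numerical effect is often tiny, but the pattern generalizes to riskier preprocessing):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
y = X[:, 0] + 0.1 * rng.normal(size=100)

# Leaky pattern: scaler statistics computed on the full dataset,
# so test-fold information influences the training-fold features.
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(Ridge(), X_leaky, y, cv=5)

# Safe pattern: the scaler is refit inside each training fold.
safe = cross_val_score(make_pipeline(StandardScaler(), Ridge()), X, y, cv=5)
print(f"leaky R^2: {leaky.mean():.3f}, pipeline R^2: {safe.mean():.3f}")
```

Any preprocessing that learns from data, including feature selection and target scaling, belongs inside the pipeline for the same reason.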

25. Learning curves as diagnosis

  • If training error is low and validation error is high, the model has high variance.
  • If both are high, the representation or model is too weak.
  • If validation error keeps improving strongly with more data, data scarcity is the likely bottleneck.

Learning curves help decide whether to collect more data, redesign features, or simplify the model.
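A learning curve can be computed directly with scikit-learn (the synthetic data and train-size grid are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.2 * rng.normal(size=200)

# Train on growing subsets; score train and validation folds at each size.
sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0), X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5),
    scoring="neg_mean_absolute_error")

for n, tr, va in zip(sizes, -train_scores.mean(axis=1),
                     -val_scores.mean(axis=1)):
    print(f"n={n:4d}  train MAE={tr:.3f}  val MAE={va:.3f}")
```

If the validation curve is still falling steeply at the largest training size, collecting more data is likely the best next investment.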

26. Residual analysis by chemistry family

A model can have a respectable global MAE and still fail badly for one chemistry family.

Residual analysis should therefore ask:

  • where are the largest errors concentrated?
  • are some structure families systematically biased?
  • does error grow with target magnitude?

These questions are often more informative than another decimal place in the leaderboard table.
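Per-family residual summaries require only a few lines; the family labels and residuals below are hypothetical, constructed so that one family carries a systematic bias the global MAE hides:

```python
import numpy as np

rng = np.random.default_rng(8)
# Hypothetical per-sample residuals and chemistry-family labels.
families = np.array(["oxide"] * 40 + ["sulfide"] * 40 + ["halide"] * 20)
residuals = np.concatenate([
    0.1 * rng.normal(size=40),         # oxides: well modeled
    0.1 * rng.normal(size=40),         # sulfides: well modeled
    0.5 + 0.1 * rng.normal(size=20),   # halides: systematic bias
])

print(f"global MAE: {np.mean(np.abs(residuals)):.3f}")
for fam in ["oxide", "sulfide", "halide"]:
    r = residuals[families == fam]
    print(f"{fam:8s} MAE={np.mean(np.abs(r)):.3f}  bias={r.mean():+.3f}")
```

The global MAE looks respectable, yet every halide prediction is offset in the same direction, which is exactly what a single leaderboard number cannot reveal.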

27. Heteroscedastic noise

  • Not all materials examples carry equal uncertainty.
  • Some regions of chemical space are intrinsically harder.
  • Some targets have error levels that change across the range of values.

Ignoring heteroscedasticity can make the model look uniformly reliable when it is not.

28. Worked example: random versus grouped split

Consider band-gap prediction on a dataset containing many related compounds.

  • Under a random split, ridge regression may achieve a very good MAE because close neighbors appear in both train and test.
  • Under a grouped chemistry-aware split, the same model may degrade sharply.

The lesson is not that the model became worse. The lesson is that the earlier evaluation question was easier.
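The effect can be reproduced on synthetic data where family members sit close together in descriptor space and share a family-specific target level (the family structure, model, and scales are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, KFold, GroupKFold

rng = np.random.default_rng(9)
# 20 hypothetical compound families, 10 near-duplicate members each.
n_fam, per_fam = 20, 10
groups = np.repeat(np.arange(n_fam), per_fam)
centroids = rng.normal(size=(n_fam, 5))
X = centroids[groups] + 0.05 * rng.normal(size=(n_fam * per_fam, 5))
y = rng.normal(size=n_fam)[groups] + 0.05 * rng.normal(size=n_fam * per_fam)

knn = KNeighborsRegressor(n_neighbors=3)
# Random split: same-family neighbors leak into the training set.
random_mae = -cross_val_score(
    knn, X, y, cv=KFold(5, shuffle=True, random_state=0),
    scoring="neg_mean_absolute_error").mean()
# Grouped split: whole families are withheld at once.
grouped_mae = -cross_val_score(
    knn, X, y, groups=groups, cv=GroupKFold(5),
    scoring="neg_mean_absolute_error").mean()
print(f"random MAE: {random_mae:.3f}  grouped MAE: {grouped_mae:.3f}")
```

The same model, data, and metric yield sharply different scores; only the split, and hence the question being asked, has changed.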

29. Worked example: linear versus random forest

Suppose a random forest improves on ridge by a small margin.

The right questions are:

  • does the improvement survive grouped evaluation?
  • are the hard chemistry families improved?
  • is the added complexity justified by the use case?

Model choice must be judged under the split that matches the scientific claim.

30. Domain shift across databases

  • A model trained on one database may fail on another even for the same nominal target.
  • Causes include different DFT functionals, different relaxation settings, different curation choices, and different coverage of chemical space.

This is domain shift, and it means cross-database evaluation can be more informative than a single in-database benchmark.

31. When should a surrogate be trusted?

Before acting on model predictions, ask:

  • does the model beat strong simple baselines?
  • is the split aligned with the intended deployment scenario?
  • do residuals look acceptable in important chemistry families?
  • is there evidence of graceful degradation under OOD evaluation?
  • have leakage risks been addressed?

These are trust criteria, not optional extras.

32. Interpretability remains useful

  • Linear and sparse models help identify which descriptors carry signal.
  • Even when a nonlinear model wins, interpretable baselines can reveal whether the improvement is physically plausible.
  • Interpretability is especially important when data are scarce and expert judgment matters.

33. The anti-pattern of leaderboard optimization

An easy way to produce a misleading result is to:

  • tune aggressively on a weak split
  • compare only against weak baselines
  • report one metric
  • claim discovery power

This is methodologically weak because the benchmark no longer matches the scientific question.

34. Summary

  • Regression in materials science is not only about fitting a function.
  • The target, representation, model class, metric, and split jointly define the scientific claim.
  • Grouped evaluation, learning curves, residuals, and leakage checks are essential.
  • A useful surrogate is one that remains credible under the deployment scenario it is meant for.

35. Bridge to Unit 8

  • Unit 8 will introduce neural surrogates for materials properties.
  • The central discipline from Unit 7 remains unchanged: more flexible models must earn their complexity under honest validation.