Materials Genomics
Unit 7: Regression and Generalization in Materials Data
FAU Erlangen-Nürnberg
By the end of this unit, students should be able to: - formulate a materials-property prediction task as a supervised regression problem - explain the role of linear, regularized, and nonlinear baselines - choose evaluation metrics that match the scientific question - explain why random splits are often misleading in materials datasets - diagnose models using grouped splits, learning curves, residual analysis, and leakage checks
The target y might be band gap, formation energy, elastic modulus, conductivity, or another property. What changes is not only the objective function, but the meaning of success.
This unit is therefore about honest predictive modeling.
We fit model parameters theta by minimizing
hat(theta) = arg min_theta (1/N) sum_i ell(f_theta(x_i), y_i) + lambda Omega(theta)
Here ell is the prediction loss, Omega regularizes model complexity, and lambda balances fit against stability. This is the standard machine-learning view, but the scientific question lies in what data and what split we choose.
Different targets imply different regression problems: - band gap: often sensitive to chemistry and electronic structure approximations - energy above hull: connected to thermodynamic competition and reference states - elastic modulus: may have broad scales and anisotropy effects - conductivity: may be noisy, sparse, and strongly nonlocal
Target choice determines both learnability and the meaning of evaluation.
A regression model cannot be more reliable than the target definition allows. This is why “dataset quality” is not a background issue; it is part of the modeling problem.
The first baseline should usually be linear regression:
hat(y) = w^T x + b
This tests whether the feature representation already contains a strong approximately linear signal. A simple model is not a weak baseline. It is a diagnostic tool for understanding whether additional flexibility is actually needed.
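A minimal numpy sketch of this baseline (synthetic stand-in data, not from the course materials): ordinary least squares with an explicit intercept, exactly the hat(y) = w^T x + b model above.

```python
import numpy as np

# Illustrative linear baseline: descriptors X, property y are synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # 200 "materials", 5 descriptors
true_w = np.array([1.5, -2.0, 0.0, 0.5, 0.0])
y = X @ true_w + 0.3 + 0.1 * rng.normal(size=200)

# Append a column of ones so the intercept b is fit jointly with w.
X1 = np.hstack([X, np.ones((200, 1))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
w, b = coef[:-1], coef[-1]

residual = y - (X @ w + b)
print("intercept:", b)                         # should recover roughly 0.3
print("train MAE:", np.mean(np.abs(residual)))
```

If this already drives the error down to the noise level, extra model flexibility has little left to explain.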
With mean squared error, linear regression chooses the vector Xw that best approximates y in the column space of X.
The geometry reminds us that model quality and feature quality cannot be separated cleanly.
Ridge regularization modifies least squares to
hat(w) = arg min_w ||y - Xw||_2^2 + lambda ||w||_2^2
Ridge is therefore one of the most important baseline models in materials regression.
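A numpy sketch of the closed-form ridge solution w = (X^T X + lambda I)^(-1) X^T y, using deliberately near-duplicate descriptor columns (a common situation with engineered features) to show how lambda stabilizes the weights. Data are synthetic.

```python
import numpy as np

# Two nearly collinear descriptor columns: least squares becomes unstable.
rng = np.random.default_rng(1)
base = rng.normal(size=(100, 1))
X = np.hstack([base, base + 1e-5 * rng.normal(size=(100, 1))])
y = (X[:, 0] + X[:, 1]) / 2 + 0.05 * rng.normal(size=100)

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ols = ridge(X, y, 0.0)    # near-singular system: huge, unstable weights
w_ridge = ridge(X, y, 1.0)  # penalty shrinks both weights toward 0.5
print(w_ols, w_ridge)
```

The ridge weights stay close to the well-posed answer (each about 0.5), while the unregularized weights explode along the nearly null direction.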
Lasso applies an L1 penalty and can set coefficients exactly to zero. Elastic net combines the L1 and L2 penalties. These methods are useful when the feature space is large, redundant, or partly irrelevant. They can improve both stability and interpretability, especially for engineered descriptors.
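A minimal coordinate-descent sketch of the L1 (lasso) update, assuming roughly standardized features; the soft-thresholding step is what drives irrelevant coefficients exactly to zero. This is a teaching toy, not a production solver.

```python
import numpy as np

def soft_threshold(a, lam):
    # The proximal operator of the L1 penalty: shrink toward zero, clip at zero.
    return np.sign(a) * max(abs(a) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Minimize (1/2n)||y - Xw||^2 + lam ||w||_1 by cyclic coordinate descent.
    n, d = X.shape
    w = np.zeros(d)
    z = (X ** 2).sum(axis=0) / n            # per-feature curvature
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]  # partial residual excluding j
            rho = X[:, j] @ r / n
            w[j] = soft_threshold(rho, lam) / z[j]
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=300)  # features 2, 3 irrelevant
w = lasso_cd(X, y, lam=0.1)
print(w)  # the two irrelevant coefficients are exactly zero
```

The exact zeros are the interpretability payoff: the model states which descriptors it ignores.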
If a large neural model barely improves on ridge under a strong split, the scientific story is not “deep learning wins.”
Materials-property relations are often nonlinear, so we also consider models such as: - decision trees - random forests and gradient-boosted trees - shallow neural networks
These models can capture interactions between features, but they also raise the risk of overfitting and split exploitation.
A fair benchmark requires: - the same split - the same feature preprocessing - the same target transformation - the same reporting metrics
Without this, model comparisons collapse into implementation accidents rather than scientific evidence.
Materials datasets are often modest in size and highly structured. That makes variance control particularly important.
This is why split design is the center of the unit.
MAE measures average absolute error in physical units. RMSE penalizes rare large errors more strongly. R^2 measures explained variance relative to a constant baseline. Metrics summarize error on the chosen evaluation set. They do not certify that the evaluation set represents the real deployment challenge.
Choose MAE when average physical deviation matters and RMSE when large mistakes are especially costly. The correct metric comes from the scientific action that follows the prediction.
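The three metrics are simple enough to write down directly; a small worked example with one rare large error shows how they diverge.

```python
import numpy as np

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

def r2(y, yhat):
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # constant-prediction baseline
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 10.0])
yhat = np.array([1.1, 1.9, 3.2, 7.0])        # one rare large error
print(mae(y, yhat), rmse(y, yhat), r2(y, yhat))
```

The single outlier dominates RMSE far more than MAE, which is exactly the behavior to exploit (or avoid) depending on the downstream decision.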
Random splits are not always wrong, but their scientific meaning is often narrow.
If the claim is generalization to new chemistry, the split must enforce new chemistry.
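A grouped split can be sketched in a few lines: hold out entire chemistry families so the test set really contains new chemistry. The family labels below are illustrative placeholders.

```python
import numpy as np

# Each entry is the chemistry family of one structure in the dataset.
families = np.array(["oxide", "oxide", "nitride", "nitride",
                     "sulfide", "sulfide", "oxide", "sulfide"])
test_families = {"sulfide"}                    # claim: generalize to sulfides

test_mask = np.isin(families, list(test_families))
train_idx = np.where(~test_mask)[0]
test_idx = np.where(test_mask)[0]

# No family appears on both sides of the split.
assert set(families[train_idx]).isdisjoint(families[test_idx])
print(train_idx, test_idx)
```

A random split would scatter sulfides across both sides and quietly answer an easier question.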
This distinction matters for any paper that talks about extrapolation.
If the test set is used repeatedly during tuning, the final performance estimate becomes optimistic.
Nested validation separates model selection from final evaluation. It is especially useful when datasets are small and many hyperparameter choices are possible.
The practical message is simple: every tuning decision spends evaluation credibility. Nested validation limits that damage.
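A deliberately simplified nested-validation sketch (single inner split rather than full inner cross-validation, synthetic data): lambda is selected on an inner validation set, and the outer test set is touched exactly once.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 4))
y = X @ np.array([1.0, -0.5, 0.0, 0.3]) + 0.1 * rng.normal(size=150)

idx = rng.permutation(150)
outer_train, outer_test = idx[:120], idx[120:]               # evaluation only
inner_train, inner_val = outer_train[:90], outer_train[90:]  # tuning only

def fit(lam, rows):
    d = X.shape[1]
    return np.linalg.solve(X[rows].T @ X[rows] + lam * np.eye(d),
                           X[rows].T @ y[rows])

def mae(w, rows):
    return np.mean(np.abs(y[rows] - X[rows] @ w))

# Model selection spends only the inner validation set.
lams = [0.01, 0.1, 1.0, 10.0]
best_lam = min(lams, key=lambda lam: mae(fit(lam, inner_train), inner_val))

# Refit on all outer-training data; the outer test is spent once, at the end.
final_mae = mae(fit(best_lam, outer_train), outer_test)
print(best_lam, final_mae)
```

However many lambdas are tried, `outer_test` never influences the choice, so `final_mae` keeps its meaning.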
Leakage in materials ML often appears as: - duplicate or nearly duplicate structures across splits - preprocessing fit on the full dataset - target-adjacent quantities sneaking into the feature set - data curation steps that accidentally mix train and test information
Leakage is dangerous because it produces a high score without genuine generalization.
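One of the leakage mechanisms above, preprocessing fit on the full dataset, is easy to demonstrate: standardization statistics computed over all rows let test-set information flow into the training transform. The fix is to fit the scaler on the training rows only.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(loc=2.0, size=(100, 3))
train, test = np.arange(80), np.arange(80, 100)

# Leaky: test rows influence the mean used to transform the training rows.
mu_leaky = X.mean(axis=0)

# Correct: statistics come from the training partition alone.
mu_clean = X[train].mean(axis=0)

X_clean = X - mu_clean        # apply the train-fit transform to all rows
print(mu_leaky - mu_clean)    # nonzero difference = information flow
```

The numerical difference is small here, but the same pattern with duplicates or target-adjacent features can inflate scores substantially.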
Learning curves help decide whether to collect more data, redesign features, or simplify the model.
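A learning-curve sketch with synthetic data: train on growing subsets, track validation error. When the curve flattens near the noise floor, collecting more of the same data will not help; the features or the model need to change.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + 0.2 * rng.normal(size=400)

idx = rng.permutation(400)
val = idx[300:]                               # fixed validation set

def val_mae(n_train):
    rows = idx[:n_train]
    w, *_ = np.linalg.lstsq(X[rows], y[rows], rcond=None)
    return np.mean(np.abs(y[val] - X[val] @ w))

curve = {n: round(val_mae(n), 3) for n in [8, 30, 100, 300]}
print(curve)   # error drops with n, then flattens near the noise floor
```

Plotting training error alongside would separate high-variance (curves far apart) from high-bias (curves converged but high) regimes.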
A model can have a respectable global MAE and still fail badly for one chemistry family.
Residual analysis should therefore ask: - where are the largest errors concentrated? - are some structure families systematically biased? - does error grow with target magnitude?
These questions are often more informative than another decimal place in the leaderboard table.
Ignoring heteroscedasticity can make the model look uniformly reliable when it is not.
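The grouped residual check is a one-liner per family; the residuals and family labels below are illustrative numbers constructed to show how a respectable global MAE can hide one badly served, systematically biased family.

```python
import numpy as np

residuals = np.array([0.05, -0.04, 0.06, -0.05,   # oxides: small, centered
                      0.40, 0.35, 0.45, 0.38])    # sulfides: large, one-sided
families = np.array(["oxide"] * 4 + ["sulfide"] * 4)

global_mae = np.mean(np.abs(residuals))
per_family = {f: np.mean(np.abs(residuals[families == f]))
              for f in np.unique(families)}
bias = {f: np.mean(residuals[families == f]) for f in np.unique(families)}

print(global_mae)    # looks respectable on its own
print(per_family)    # the sulfide family is far worse
print(bias)          # ...and its error is systematic, not just noisy
```

The per-family bias is the heteroscedasticity warning: errors differ in both scale and direction across chemistry families.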
Consider band-gap prediction on a dataset containing many related compounds.
The lesson is not that the model became worse. The lesson is that the earlier evaluation question was easier.
Suppose a random forest improves on ridge by a small margin.
The right questions are: - does the improvement survive grouped evaluation? - are the hard chemistry families improved? - is the added complexity justified by the use case?
Model choice must be judged under the split that matches the scientific claim.
This is domain shift, and it means cross-database evaluation can be more informative than a single in-database benchmark.
Before acting on model predictions, ask: - does the model beat strong simple baselines? - is the split aligned with the intended deployment scenario? - do residuals look acceptable in important chemistry families? - is there evidence of graceful degradation under OOD evaluation? - have leakage risks been addressed?
These are trust criteria, not optional extras.
An easy way to produce a misleading result is: - tune aggressively on a weak split - compare only against weak baselines - report one metric - claim discovery power
This is methodologically weak because the benchmark no longer matches the scientific question.

© Philipp Pelz - Materials Genomics