Materials Genomics
Unit 7: Regression and Generalization in Materials Data

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


01. Title: Regression and Generalization in Materials Data

  • Unit 6 asked how to represent a material.
  • Unit 7 asks how to predict a property from that representation and how to decide whether the result is scientifically trustworthy.
  • In materials ML, regression and evaluation must be taught together because weak validation can make a weak model look strong.

02. Learning outcomes

By the end of this unit, students should be able to:

  • formulate a materials-property prediction task as a supervised regression problem
  • explain the role of linear, regularized, and nonlinear baselines
  • choose evaluation metrics that match the scientific question
  • explain why random splits are often misleading in materials datasets
  • diagnose models using grouped splits, learning curves, residual analysis, and leakage checks

03. Connection to Unit 6

  • Unit 6 produced material-level feature vectors from local environments.
  • Those vectors are now the inputs x.
  • The target y might be band gap, formation energy, elastic modulus, conductivity, or another property.

What changes is not only the objective function, but the meaning of success.

04. Why this unit matters

  • A low error on a random split does not imply discovery power.
  • A complex model that beats a linear baseline by a tiny margin may not justify its complexity.
  • A benchmark with leakage or near-duplicate structures can produce impressive but scientifically empty results.

This unit is therefore about honest predictive modeling.

05. Regression as empirical risk minimization

We fit model parameters theta by minimizing

hat(theta) = arg min_theta (1/N) sum_i ell(f_theta(x_i), y_i) + lambda Omega(theta)

  • ell is the prediction loss
  • Omega regularizes model complexity
  • lambda balances fit against stability

This is the standard machine-learning view, but the scientific question lies in what data and what split we choose.

06. Choosing the target variable

Different targets imply different regression problems:

  • band gap: often sensitive to chemistry and electronic-structure approximations
  • energy above hull: connected to thermodynamic competition and reference states
  • elastic modulus: may have broad scales and anisotropy effects
  • conductivity: may be noisy, sparse, and strongly nonlocal

Target choice determines both learnability and the meaning of evaluation.

07. Noise and target realism

  • Some targets are noisy because the underlying calculations are approximate.
  • Some targets mix multiple physical mechanisms.
  • Some datasets combine values generated under different computational settings.

A regression model cannot be more reliable than the target definition allows. This is why “dataset quality” is not a background issue; it is part of the modeling problem.

08. Start with the simplest honest baseline

The first baseline should usually be linear regression:

hat(y) = w^T x + b

This tests whether the feature representation already contains a strong approximately linear signal. A simple model is not a weak baseline. It is a diagnostic tool for understanding whether additional flexibility is actually needed.
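A minimal sketch of this baseline in scikit-learn, using synthetic stand-in features (the dataset sizes, feature count, and noise level are illustrative assumptions, not values from the unit):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 "materials", 10 descriptor features,
# a mostly linear target plus small noise.
X = rng.normal(size=(200, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=200)

# Fit on the first 150 examples, evaluate on the held-out 50.
model = LinearRegression().fit(X[:150], y[:150])
mae = mean_absolute_error(y[150:], model.predict(X[150:]))
print(f"held-out MAE: {mae:.3f}")
```

If the held-out MAE of this one-line baseline is already close to the noise floor, added model complexity has little room to help.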

09. Least squares as projection

With mean squared error, linear regression chooses the vector Xw that best approximates y in the column space of X.

  • if the representation is informative, even a linear model may work well
  • if the representation is poor, increasing model complexity may only fit noise

The geometry reminds us that model quality and feature quality cannot be separated cleanly.

10. Ridge regression

Ridge regularization modifies least squares to

hat(w) = arg min_w ||y - Xw||_2^2 + lambda ||w||_2^2

  • correlated features are common in materials descriptors
  • ridge shrinks coefficients and stabilizes the fit
  • the model becomes less sensitive to sampling noise

Ridge is therefore one of the most important baseline models in materials regression.
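The stabilizing effect is easy to demonstrate with two nearly collinear descriptors, a situation common for engineered materials features (the data and the alpha value below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
# Two nearly collinear descriptors: x2 is almost a duplicate of x1.
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS coefficients can explode in opposite directions along the
# near-null direction; ridge shrinks them to a stable shared solution.
print("OLS  :", ols.coef_)
print("Ridge:", ridge.coef_)
```

The two models give nearly identical predictions on this data, but only the ridge coefficients remain interpretable and stable under resampling.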

11. Lasso and elastic net

  • Lasso adds an L1 penalty and can set coefficients exactly to zero.
  • Elastic net mixes L1 and L2 penalties.

These methods are useful when the feature space is large, redundant, or partly irrelevant. They can improve both stability and interpretability, especially for engineered descriptors.
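A sketch of sparsity in action, assuming a descriptor set where only a few features carry signal (the feature counts and the alpha value are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
# 50 descriptors, but only the first 3 actually influence the target.
X = rng.normal(size=(300, 50))
y = 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=300)

# The L1 penalty drives most irrelevant coefficients exactly to zero.
lasso = Lasso(alpha=0.2).fit(X, y)
n_active = int(np.sum(lasso.coef_ != 0))
print(f"non-zero coefficients: {n_active} of 50")
```

Inspecting which coefficients survive is itself a crude but useful form of descriptor selection.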

12. Why simple baselines remain scientifically valuable

  • they expose the information content of the representation
  • they are interpretable
  • they often perform surprisingly well in small-data regimes
  • they provide a realistic bar for more complex models

If a large neural model barely improves on ridge under a strong split, the scientific story is not “deep learning wins.”

13. Nonlinear baselines

Materials-property relations are often nonlinear, so we also consider models such as:

  • decision trees
  • random forests and gradient-boosted trees
  • shallow neural networks

These models can capture interactions between features, but they also raise the risk of overfitting and split exploitation.

14. Baseline comparisons must be fair

A fair benchmark requires:

  • the same split
  • the same feature preprocessing
  • the same target transformation
  • the same reporting metrics

Without this, model comparisons collapse into implementation accidents rather than scientific evidence.
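One way to enforce these conditions is to hold the split, preprocessing, and metric fixed while only the final estimator varies; a sketch with synthetic data (model choices and hyperparameters here are illustrative assumptions):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 8))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=300)

# One fixed split, one preprocessing recipe, one metric for every model.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "forest": make_pipeline(StandardScaler(),
                            RandomForestRegressor(random_state=0)),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = mean_absolute_error(y_te, model.predict(X_te))
    print(f"{name}: MAE = {scores[name]:.3f}")
```

Because everything except the estimator is shared, any score difference reflects the model class rather than an implementation accident.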

15. Bias and variance in materials ML

  • High bias: the model is too simple to capture the relationship.
  • High variance: the model is too sensitive to idiosyncrasies of the training set.

Materials datasets are often modest in size and highly structured. That makes variance control particularly important.

16. Overfitting is often hidden by the split

  • In image tasks, overfitting may show up as a large train-test gap.
  • In materials tasks, overfitting can hide behind a random split because close chemical relatives appear in both sets.
  • The model then looks better than it actually is at extrapolating.

This is why split design is the center of the unit.

17. Metrics: what they do and do not say

  • MAE measures average absolute error in physical units.
  • RMSE penalizes rare large errors more strongly.
  • R^2 measures explained variance relative to a constant baseline.

Metrics summarize error on the chosen evaluation set. They do not certify that the evaluation set represents the real deployment challenge.
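The different sensitivities are visible on a toy example with one rare large error (the numbers and the eV unit are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 10.0])  # one rare large value
y_pred = np.array([1.1, 2.1, 2.9, 4.2, 7.0])   # one rare large error

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
# RMSE is dominated by the single 3 eV miss; MAE is not.
print(f"MAE = {mae:.2f} eV, RMSE = {rmse:.2f} eV, R^2 = {r2:.2f}")
```

Here MAE = 0.70 eV while RMSE is roughly twice as large, driven almost entirely by the single outlier.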

18. Metric choice should match the decision problem

  • Use MAE when average physical deviation matters.
  • Use RMSE when large mistakes are especially costly.
  • Use ranking metrics if the goal is candidate prioritization rather than calibrated prediction.

The correct metric comes from the scientific action that follows the prediction.

19. Random train-test splits

  • Random splits are easy to implement and often look statistically clean.
  • But in materials datasets they commonly place very similar compounds, prototypes, or polymorphs in both train and test.
  • The model therefore receives a much easier task than the intended discovery scenario.

Random splits are not always wrong, but their scientific meaning is often narrow.

20. Grouped chemistry-aware splits

  • A grouped split withholds entire chemistry families or composition groups.
  • This tests whether the model transfers beyond near neighbors.
  • It is typically harsher, but it is much closer to the discovery question “can we predict a new family?”

If the claim is generalization to new chemistry, the split must enforce new chemistry.
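Grouped withholding can be expressed directly with scikit-learn's group-aware splitters; the family labels below are hypothetical stand-ins for real chemistry annotations:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 5))
# Hypothetical chemistry-family labels: 12 families of 10 compounds each.
groups = np.repeat(np.arange(12), 10)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=groups))

# By construction, no family appears on both sides of the split.
overlap = set(groups[train_idx]) & set(groups[test_idx])
print("shared families:", overlap)
```

`GroupKFold` provides the cross-validated analogue when every family should appear in the test role exactly once.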

21. Prototype-aware splits

  • Chemistry shift and structure shift are not the same.
  • A model might generalize across composition changes inside one structural motif but fail on a new prototype.
  • Prototype-aware splits therefore test a different and often harder claim.

This distinction matters for any paper that talks about extrapolation.

22. Validation, test, and model selection

  • training data fit parameters
  • validation data select hyperparameters and model family
  • test data estimate final performance once

If the test set is used repeatedly during tuning, the final performance estimate becomes optimistic.

23. Nested validation, conceptually

Nested validation separates model selection from final evaluation. It is especially useful when datasets are small and many hyperparameter choices are possible.

The practical message is simple: every tuning decision spends evaluation credibility. Nested validation limits that damage.
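A compact sketch of nested validation: hyperparameter search runs in an inner loop, while the outer loop sees each fold only once for scoring (the model family and grid are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 10))
y = X[:, 0] + 0.1 * rng.normal(size=150)

# Inner loop: hyperparameter selection on the training portion only.
inner = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                     cv=KFold(5, shuffle=True, random_state=0))
# Outer loop: unbiased performance estimate of the whole procedure.
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(5, shuffle=True, random_state=1),
                               scoring="neg_mean_absolute_error")
print("outer-fold MAE:", -outer_scores)
```

The outer score evaluates the entire pipeline including tuning, so no fold is ever both tuned on and reported.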

24. Feature-target leakage

Leakage in materials ML often appears as:

  • duplicate or nearly duplicate structures across splits
  • preprocessing fit on the full dataset
  • target-adjacent quantities sneaking into the feature set
  • data-curation steps that accidentally mix train and test information

Leakage is dangerous because it produces a high score without genuine generalization.
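The preprocessing variant of leakage has a standard fix: fit the transformer inside each training fold via a pipeline. A sketch contrasting the two patterns (for plain standardization the numerical effect is often tiny, but the pattern generalizes to riskier preprocessing):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
y = X[:, 0] + 0.1 * rng.normal(size=100)

# Leaky pattern: scaler statistics computed on the full dataset,
# so test-fold information influences the training-fold features.
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(Ridge(), X_leaky, y, cv=5)

# Safe pattern: the scaler is refit inside each training fold.
safe = cross_val_score(make_pipeline(StandardScaler(), Ridge()), X, y, cv=5)
print(f"leaky R^2: {leaky.mean():.3f}, pipeline R^2: {safe.mean():.3f}")
```

Any preprocessing that learns from data, including feature selection and target scaling, belongs inside the pipeline for the same reason.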

25. Learning curves as diagnosis

  • If training error is low and validation error is high, the model has high variance.
  • If both are high, the representation or model is too weak.
  • If validation error keeps improving strongly with more data, data scarcity is the likely bottleneck.

Learning curves help decide whether to collect more data, redesign features, or simplify the model.
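A learning curve can be computed directly with scikit-learn (the synthetic data and train-size grid are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.2 * rng.normal(size=200)

# Train on growing subsets; score train and validation folds at each size.
sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0), X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5),
    scoring="neg_mean_absolute_error")

for n, tr, va in zip(sizes, -train_scores.mean(axis=1),
                     -val_scores.mean(axis=1)):
    print(f"n={n:4d}  train MAE={tr:.3f}  val MAE={va:.3f}")
```

If the validation curve is still falling steeply at the largest training size, collecting more data is likely the best next investment.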

26. Residual analysis by chemistry family

A model can have a respectable global MAE and still fail badly for one chemistry family.

Residual analysis should therefore ask:

  • where are the largest errors concentrated?
  • are some structure families systematically biased?
  • does error grow with target magnitude?

These questions are often more informative than another decimal place in the leaderboard table.
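Per-family residual summaries require only a few lines; the family labels and residuals below are hypothetical, constructed so that one family carries a systematic bias the global MAE hides:

```python
import numpy as np

rng = np.random.default_rng(8)
# Hypothetical per-sample residuals and chemistry-family labels.
families = np.array(["oxide"] * 40 + ["sulfide"] * 40 + ["halide"] * 20)
residuals = np.concatenate([
    0.1 * rng.normal(size=40),         # oxides: well modeled
    0.1 * rng.normal(size=40),         # sulfides: well modeled
    0.5 + 0.1 * rng.normal(size=20),   # halides: systematic bias
])

print(f"global MAE: {np.mean(np.abs(residuals)):.3f}")
for fam in ["oxide", "sulfide", "halide"]:
    r = residuals[families == fam]
    print(f"{fam:8s} MAE={np.mean(np.abs(r)):.3f}  bias={r.mean():+.3f}")
```

The global MAE looks respectable, yet every halide prediction is offset in the same direction, which is exactly what a single leaderboard number cannot reveal.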

27. Heteroscedastic noise

  • Not all materials examples carry equal uncertainty.
  • Some regions of chemical space are intrinsically harder.
  • Some targets have error levels that change across the range of values.

Ignoring heteroscedasticity can make the model look uniformly reliable when it is not.

28. Worked example: random versus grouped split

Consider band-gap prediction on a dataset containing many related compounds.

  • Under a random split, ridge regression may achieve a very good MAE because close neighbors appear in both train and test.
  • Under a grouped chemistry-aware split, the same model may degrade sharply.

The lesson is not that the model became worse. The lesson is that the earlier evaluation question was easier.
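The effect can be reproduced on synthetic data where family members sit close together in descriptor space and share a family-specific target level (the family structure, model, and scales are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, KFold, GroupKFold

rng = np.random.default_rng(9)
# 20 hypothetical compound families, 10 near-duplicate members each.
n_fam, per_fam = 20, 10
groups = np.repeat(np.arange(n_fam), per_fam)
centroids = rng.normal(size=(n_fam, 5))
X = centroids[groups] + 0.05 * rng.normal(size=(n_fam * per_fam, 5))
y = rng.normal(size=n_fam)[groups] + 0.05 * rng.normal(size=n_fam * per_fam)

knn = KNeighborsRegressor(n_neighbors=3)
# Random split: same-family neighbors leak into the training set.
random_mae = -cross_val_score(
    knn, X, y, cv=KFold(5, shuffle=True, random_state=0),
    scoring="neg_mean_absolute_error").mean()
# Grouped split: whole families are withheld at once.
grouped_mae = -cross_val_score(
    knn, X, y, groups=groups, cv=GroupKFold(5),
    scoring="neg_mean_absolute_error").mean()
print(f"random MAE: {random_mae:.3f}  grouped MAE: {grouped_mae:.3f}")
```

The same model, data, and metric yield sharply different scores; only the split, and hence the question being asked, has changed.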

29. Worked example: linear versus random forest

Suppose a random forest improves on ridge by a small margin.

The right questions are:

  • does the improvement survive grouped evaluation?
  • are the hard chemistry families improved?
  • is the added complexity justified by the use case?

Model choice must be judged under the split that matches the scientific claim.

30. Domain shift across databases

  • A model trained on one database may fail on another even for the same nominal target.
  • Causes include different DFT functionals, different relaxation settings, different curation choices, and different coverage of chemical space.

This is domain shift, and it means cross-database evaluation can be more informative than a single in-database benchmark.

31. When should a surrogate be trusted?

Before acting on model predictions, ask:

  • does the model beat strong simple baselines?
  • is the split aligned with the intended deployment scenario?
  • do residuals look acceptable in important chemistry families?
  • is there evidence of graceful degradation under OOD evaluation?
  • have leakage risks been addressed?

These are trust criteria, not optional extras.

32. Interpretability remains useful

  • Linear and sparse models help identify which descriptors carry signal.
  • Even when a nonlinear model wins, interpretable baselines can reveal whether the improvement is physically plausible.
  • Interpretability is especially important when data are scarce and expert judgment matters.

33. The anti-pattern of leaderboard optimization

An easy way to produce a misleading result is to:

  • tune aggressively on a weak split
  • compare only against weak baselines
  • report one metric
  • claim discovery power

This is methodologically weak because the benchmark no longer matches the scientific question.

34. Summary

  • Regression in materials science is not only about fitting a function.
  • The target, representation, model class, metric, and split jointly define the scientific claim.
  • Grouped evaluation, learning curves, residuals, and leakage checks are essential.
  • A useful surrogate is one that remains credible under the deployment scenario it is meant for.

35. Bridge to Unit 8

  • Unit 8 will introduce neural surrogates for materials properties.
  • The central discipline from Unit 7 remains unchanged: more flexible models must earn their complexity under honest validation.