This is the standard machine-learning view, but the scientific question lies in which data we use and which split we choose.
06. Choosing the target variable
Different targets imply different regression problems:
band gap: often sensitive to chemistry and electronic structure approximations
energy above hull: connected to thermodynamic competition and reference states
elastic modulus: values may span broad scales and carry anisotropy effects
conductivity: may be noisy, sparse, and strongly nonlocal
Target choice determines both learnability and the meaning of evaluation.
07. Noise and target realism
Some targets are noisy because the underlying calculations are approximate.
Some targets mix multiple physical mechanisms.
Some datasets combine values generated under different computational settings.
A regression model cannot be more reliable than the target definition allows. This is why “dataset quality” is not a background issue; it is part of the modeling problem.
08. Start with the simplest honest baseline
The first baseline should usually be linear regression:
\[\hat{y} = \mathbf{w}^T \mathbf{x} + b\]
This tests whether the feature representation already contains a strong approximately linear signal. A simple model is not a weak baseline. It is a diagnostic tool for understanding whether additional flexibility is actually needed.
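As a minimal sketch (scikit-learn, with synthetic placeholder descriptors standing in for a real feature matrix `X` and property vector `y`), this diagnostic takes only a few lines:

```python
# Minimal linear baseline sketch. X and y are synthetic placeholders here;
# in a real study X would be a descriptor matrix and y a measured/computed property.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))                           # placeholder descriptors
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=300)    # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
baseline = LinearRegression().fit(X_tr, y_tr)
print("baseline MAE:", mean_absolute_error(y_te, baseline.predict(X_te)))
```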
09. Least squares as projection
With mean squared error, linear regression chooses the vector \(\mathbf{X}\mathbf{w}\) that best approximates \(\mathbf{y}\) in the column space of \(\mathbf{X}\).
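With the intercept absorbed into \(\mathbf{X}\) and \(\mathbf{X}\) assumed to have full column rank, the projection can be written explicitly through the normal equations:
\[\hat{\mathbf{w}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}, \qquad \hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}}\]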
if the representation is informative, even a linear model may work well
if the representation is poor, increasing model complexity may only fit noise
The geometry reminds us that model quality and feature quality cannot be separated cleanly.
10. Ridge regression
Ridge adds an \(L_2\) penalty on the coefficients. This matters in materials regression because:
correlated features are common in materials descriptors
ridge shrinks coefficients and stabilizes the fit
the model becomes less sensitive to sampling noise
Ridge is therefore one of the most important baseline models in materials regression.
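A hedged sketch of this baseline (scikit-learn RidgeCV with standardized features; the descriptor matrix and target below are synthetic stand-ins):

```python
# Ridge baseline sketch: standardize features, then pick the L2 strength by cross-validation.
# X and y are synthetic stand-ins for a real descriptor matrix and property vector.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))
X[:, 15:] = X[:, :15] + 0.05 * rng.normal(size=(200, 15))   # deliberately correlated columns
y = X[:, :5].sum(axis=1) + 0.2 * rng.normal(size=200)

ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13)))
scores = cross_val_score(ridge, X, y, cv=5, scoring="neg_mean_absolute_error")
print("ridge CV MAE:", -scores.mean())
```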
11. Lasso and elastic net
Lasso adds an \(L_1\) penalty and can set coefficients exactly to zero.
Elastic net mixes \(L_1\) and \(L_2\) penalties.
These methods are useful when the feature space is large, redundant, or partly irrelevant. They can improve both stability and interpretability, especially for engineered descriptors.
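A sketch of the sparse variants (again with synthetic placeholder data); the point is to inspect how many descriptors survive the \(L_1\) penalty:

```python
# Lasso / elastic net sketch: L1-type penalties can zero out redundant descriptors.
# X and y are synthetic placeholders with only a few truly informative columns.
import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 50))
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + 0.1 * rng.normal(size=150)

for name, model in [("lasso", LassoCV(cv=5)), ("elastic net", ElasticNetCV(cv=5, l1_ratio=0.5))]:
    fit = make_pipeline(StandardScaler(), model).fit(X, y)
    coefs = fit[-1].coef_
    print(f"{name}: {np.sum(np.abs(coefs) > 1e-8)} of {coefs.size} coefficients nonzero")
```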
12. Why linear baselines matter
Strong linear baselines matter for several reasons:
they expose the information content of the representation
they are interpretable
they often perform surprisingly well in small-data regimes
they provide a realistic bar for more complex models
If a large neural model barely improves on ridge under a strong split, the scientific story is not “deep learning wins”; it is that the representation already carried most of the usable signal.
13. Nonlinear baselines
Materials-property relations are often nonlinear, so we also consider models such as:
decision trees
random forests and gradient-boosted trees
shallow neural networks
These models can capture interactions between features, but they also raise the risk of overfitting and split exploitation.
14. Baseline comparisons must be fair
A fair benchmark requires:
the same split
the same feature preprocessing
the same target transformation
the same reporting metrics
Without this, model comparisons collapse into implementation accidents rather than scientific evidence.
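One way to keep the comparison fair is to share the split object, the preprocessing pipeline, and the scoring function across all candidate models; a minimal sketch with placeholder data:

```python
# Fair-comparison sketch: every model sees the same folds, preprocessing, and metric.
# X and y are synthetic placeholders for a featurized materials dataset.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 20))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=0)          # one shared split
models = {
    "ridge": Ridge(alpha=1.0),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)             # same preprocessing for all
    mae = -cross_val_score(pipe, X, y, cv=cv, scoring="neg_mean_absolute_error").mean()
    print(f"{name}: CV MAE = {mae:.3f}")
```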
15. Bias and variance in materials ML
High bias: the model is too simple to capture the relationship.
High variance: the model is too sensitive to idiosyncrasies of the training set.
Materials datasets are often modest in size and highly structured. That makes variance control particularly important.
16. Overfitting is often hidden by the split
In image tasks, overfitting may show up as a large train-test gap.
In materials tasks, overfitting can hide behind a random split because close chemical relatives appear in both sets.
The model then looks better than it actually is at extrapolating.
This is why split design is the center of the unit.
17. Metrics: what they do and do not say
MAE measures average absolute error in physical units.
RMSE penalizes rare large errors more strongly.
\(R^2\) measures explained variance relative to a constant baseline.
Metrics summarize error on the chosen evaluation set. They do not certify that the evaluation set represents the real deployment challenge.
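Computed side by side (the predictions below are placeholders), the three metrics answer different questions about the same residuals:

```python
# Metric sketch: MAE, RMSE, and R^2 computed on the same (placeholder) predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([0.5, 1.2, 2.0, 3.1, 0.0])   # placeholder targets (e.g. band gaps in eV)
y_pred = np.array([0.6, 1.0, 2.4, 2.9, 0.3])   # placeholder model predictions

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
```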
18. Metric choice should match the decision problem
Use MAE when average physical deviation matters.
Use RMSE when large mistakes are especially costly.
Use ranking metrics if the goal is candidate prioritization rather than calibrated prediction.
The correct metric comes from the scientific action that follows the prediction.
19. Random train-test splits
Random splits are easy to implement and often look statistically clean.
But in materials datasets they commonly place very similar compounds, prototypes, or polymorphs in both train and test.
The model therefore receives a much easier task than the intended discovery scenario.
```mermaid
graph LR
    T[Split Taxonomy] --> R[Random]
    T --> G[Grouped]
    T --> P[Prototype]
    R --> R1[IID assumption]
    G --> G1[Chemistry/Family aware]
    P --> P1[Structural motif aware]
```
Random splits are not always wrong, but their scientific meaning is often narrow.
20. Grouped chemistry-aware splits
A grouped split withholds entire chemistry families or composition groups.
This tests whether the model transfers beyond near neighbors.
It is typically harsher, but it is much closer to the discovery question “can we predict a new family?”
If the claim is generalization to new chemistry, the split must enforce new chemistry.
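A minimal sketch of a chemistry-grouped split with scikit-learn's GroupKFold; the data and group labels are synthetic placeholders, and the same mechanism works with prototype labels for the structure-aware variant of the next section:

```python
# Grouped split sketch: whole chemistry families are held out together.
# X, y, and the group labels are synthetic placeholders; in practice the groups
# could be chemical systems, composition families, or structure prototypes.
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 400
groups = rng.integers(0, 20, size=n)             # 20 hypothetical chemistry families
X = rng.normal(size=(n, 15)) + groups[:, None] * 0.1
y = X[:, 0] + 0.5 * groups + 0.1 * rng.normal(size=n)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv = GroupKFold(n_splits=5)
mae = -cross_val_score(model, X, y, cv=cv, groups=groups,
                       scoring="neg_mean_absolute_error").mean()
print("grouped CV MAE:", round(mae, 3))
```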
21. Prototype-aware splits
Chemistry shift and structure shift are not the same.
A model might generalize across composition changes inside one structural motif but fail on a new prototype.
Prototype-aware splits therefore test a different and often harder claim.
This distinction matters for any paper that talks about extrapolation.
22. Validation, test, and model selection
training data: fit parameters
validation data: select hyperparameters and model family
test data: estimate final performance once
```mermaid
graph LR
    Train[Training Set] --> Fit[Fit Parameters]
    Val[Validation Set] --> Tune[Hyperparameter Tuning]
    Test[Test Set] --> Eval[Final Evaluation]
    Fit --> Tune
    Tune -.-> Fit
    Tune --> Eval
```
If the test set is used repeatedly during tuning, the final performance estimate becomes optimistic.
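A sketch of this discipline: tune by cross-validation inside the training portion and touch the test set exactly once (synthetic data, hypothetical hyperparameter grid):

```python
# Tuning sketch: hyperparameters are selected by CV inside the training set;
# the held-out test set is evaluated once, at the end.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 20))
y = X[:, :3].sum(axis=1) + 0.2 * rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
search = GridSearchCV(pipe, {"ridge__alpha": np.logspace(-3, 3, 7)},
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X_tr, y_tr)                                    # all tuning happens here
print("best alpha:", search.best_params_["ridge__alpha"])
print("final test MAE:", mean_absolute_error(y_te, search.predict(X_te)))  # used once
```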
23. Nested validation, conceptually
Nested validation separates model selection from final evaluation. It is especially useful when datasets are small and many hyperparameter choices are possible.
The practical message is simple: every tuning decision spends evaluation credibility. Nested validation limits that damage.
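Conceptually, nested validation just wraps the tuned estimator in an outer loop; a compact sketch on placeholder data:

```python
# Nested validation sketch: the inner CV selects hyperparameters, the outer CV
# estimates performance of the whole tuning procedure. Data are placeholders.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 15))
y = X[:, 0] - 2 * X[:, 1] + 0.3 * rng.normal(size=200)

inner = KFold(n_splits=5, shuffle=True, random_state=1)
outer = KFold(n_splits=5, shuffle=True, random_state=2)

tuned = GridSearchCV(make_pipeline(StandardScaler(), Ridge()),
                     {"ridge__alpha": np.logspace(-3, 3, 7)},
                     scoring="neg_mean_absolute_error", cv=inner)
nested_mae = -cross_val_score(tuned, X, y, cv=outer,
                              scoring="neg_mean_absolute_error").mean()
print("nested CV MAE:", round(nested_mae, 3))
```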
24. Feature-target leakage
Leakage in materials ML often appears as:
- duplicate or nearly duplicate structures across splits
- preprocessing fit on the full dataset
- target-adjacent quantities sneaking into the feature set
- data curation steps that accidentally mix train and test information
Leakage is dangerous because it produces a high score without genuine generalization.
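The preprocessing variant is the easiest to prevent: keep every fitted transform inside a pipeline so it only ever sees training folds. A minimal sketch contrasting the two patterns (synthetic data; the numerical gap is small for plain scaling but can be large for supervised steps such as feature selection or imputation):

```python
# Leakage sketch: fitting the scaler on the full dataset lets test-set statistics
# influence training; keeping it inside a Pipeline avoids that.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))
y = X[:, 0] + 0.1 * rng.normal(size=200)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Leaky pattern: the scaler has already seen every row, including future test folds.
X_leaky = StandardScaler().fit_transform(X)
leaky = -cross_val_score(Ridge(), X_leaky, y, cv=cv, scoring="neg_mean_absolute_error").mean()

# Safe pattern: the scaler is refit on each training fold only.
safe = -cross_val_score(make_pipeline(StandardScaler(), Ridge()), X, y,
                        cv=cv, scoring="neg_mean_absolute_error").mean()
print(f"leaky preprocessing MAE={leaky:.3f}  pipeline MAE={safe:.3f}")
```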
25. Learning curves as diagnosis
If training error is low and validation error is high, the model has high variance.
If both are high, the representation or model is too weak.
If validation error keeps improving strongly with more data, data scarcity is the likely bottleneck.
Learning curves help decide whether to collect more data, redesign features, or simplify the model.
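scikit-learn's learning_curve produces the train/validation error pairs behind this diagnosis; a sketch on placeholder data:

```python
# Learning-curve sketch: compare training and validation error as the training set grows.
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 20))
y = X[:, :4].sum(axis=1) + 0.5 * rng.normal(size=500)

sizes, train_scores, val_scores = learning_curve(
    make_pipeline(StandardScaler(), Ridge()), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    scoring="neg_mean_absolute_error")

for n, tr, va in zip(sizes, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f"n={n:4d}  train MAE={tr:.3f}  val MAE={va:.3f}")
```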
26. Residual analysis by chemistry family
A model can have a respectable global MAE and still fail badly for one chemistry family.
Residual analysis should therefore ask:
- where are the largest errors concentrated?
- are some structure families systematically biased?
- does error grow with target magnitude?
These questions are often more informative than another decimal place in the leaderboard table.
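A sketch of per-family residual analysis with pandas; the family labels, targets, and predictions below are hypothetical placeholders:

```python
# Residual-analysis sketch: group absolute errors by chemistry family.
# Families, targets, and predictions are synthetic placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
families = rng.choice(["oxide", "halide", "sulfide", "intermetallic"], size=400)
y_true = rng.normal(size=400)
y_pred = y_true + rng.normal(scale=np.where(families == "halide", 0.5, 0.1))

df = pd.DataFrame({"family": families, "abs_err": np.abs(y_true - y_pred)})
print(df.groupby("family")["abs_err"].agg(["mean", "max", "count"]).round(3))
```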
27. Heteroscedastic noise
Not all materials examples carry equal uncertainty.
Some regions of chemical space are intrinsically harder.
Some targets have error levels that change across the range of values.
Ignoring heteroscedasticity can make the model look uniformly reliable when it is not.
28. Worked example: random versus grouped split
Consider band-gap prediction on a dataset containing many related compounds.
Under a random split, ridge regression may achieve a very good MAE because close neighbors appear in both train and test.
Under a grouped chemistry-aware split, the same model may degrade sharply.
The lesson is not that the model became worse. The lesson is that the earlier evaluation question was easier.
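The contrast can be made concrete by scoring the same ridge model under both protocols. In this synthetic sketch, near-duplicate descriptors within each family plus family-specific target levels stand in for chemical similarity; nothing here uses real band-gap data:

```python
# Random vs grouped split sketch: the same model, two evaluation protocols.
# Data are synthetic; family-dependent offsets stand in for chemical similarity.
import numpy as np
from sklearn.model_selection import KFold, GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(10)
n, n_fam, n_feat = 400, 20, 30
fam = rng.integers(0, n_fam, size=n)
centroids = rng.normal(size=(n_fam, n_feat))               # one descriptor "prototype" per family
offsets = rng.normal(scale=2.0, size=n_fam)                # family-specific target levels
X = centroids[fam] + 0.05 * rng.normal(size=(n, n_feat))   # near-duplicates within a family
y = offsets[fam] + 0.1 * rng.normal(size=n)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
random_mae = -cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=0),
                              scoring="neg_mean_absolute_error").mean()
grouped_mae = -cross_val_score(model, X, y, cv=GroupKFold(5), groups=fam,
                               scoring="neg_mean_absolute_error").mean()
print(f"random-split MAE={random_mae:.3f}  grouped-split MAE={grouped_mae:.3f}")
```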
29. Worked example: linear versus random forest
Suppose a random forest improves on ridge by a small margin.
The right questions are:
- does the improvement survive grouped evaluation?
- are the hard chemistry families improved?
- is the added complexity justified by the use case?
Model choice must be judged under the split that matches the scientific claim.
30. Domain shift across databases
A model trained on one database may fail on another even for the same nominal target.
Causes include different DFT functionals, different relaxation settings, different curation choices, and different coverage of chemical space.
This is domain shift, and it means cross-database evaluation can be more informative than a single in-database benchmark.
31. When should a surrogate be trusted?
Before acting on model predictions, ask:
does the model beat strong simple baselines?
is the split aligned with the intended deployment scenario?
do residuals look acceptable in important chemistry families?
is there evidence of graceful degradation under OOD evaluation?
have leakage risks been addressed?
These are trust criteria, not optional extras.
32. Interpretability remains useful
Linear and sparse models help identify which descriptors carry signal.
Even when a nonlinear model wins, interpretable baselines can reveal whether the improvement is physically plausible.
Interpretability is especially important when data are scarce and expert judgment matters.
33. The anti-pattern of leaderboard optimization
An easy way to produce a misleading result is to:
- tune aggressively on a weak split
- compare only against weak baselines
- report one metric
- claim discovery power
This is methodologically weak because the benchmark no longer matches the scientific question.
34. Summary: 10 Key Exam Statements
Supervised regression in materials science maps material representations \(\mathbf{x}\) to properties \(y\) by minimizing empirical risk \(R(\boldsymbol{\theta})\).
Simple linear baselines (Ridge, Lasso) are essential to quantify the signal strength in a representation before moving to complex models.
Regularization (e.g., Ridge) is critical because materials features are often highly correlated and datasets are small.
Standard random splits often lead to optimistic performance estimates because they fail to separate chemically similar structures.
Grouped chemistry-aware or prototype-aware splits are required to evaluate a model’s true extrapolation and discovery power.
Evaluation metrics (MAE, RMSE, \(R^2\)) must be selected based on the specific scientific or engineering decision the model supports.
Data leakage, such as including near-duplicates in both train and test sets, can invalidate scientific claims of generalization.
Learning curves are a powerful diagnostic for distinguishing high bias (the model or representation is too weak) from high variance (the model is too sensitive to the particular training sample, often mitigated by more data).
Residual analysis should be performed by chemistry family to ensure the model does not have systematic biases in specific chemical regions.
A model’s complexity is only justified if it significantly outperforms a strong baseline under a split that matches the intended deployment scenario.
35. Bridge to Unit 8
Unit 8 will introduce neural surrogates for materials properties.
The central discipline from Unit 7 remains unchanged: more flexible models must earn their complexity under honest validation.