Materials Genomics
Unit 8: Neural Networks for Materials Properties
FAU Erlangen-Nürnberg
By the end of this unit, students should be able to:
- explain when a descriptor-based MLP is a sensible next baseline after ridge and random forest
- distinguish raw dataset size from effective sample size in materials problems
- discuss single-target and multi-target neural surrogates for materials properties
- identify domain shift, extrapolation, and false-confidence failure modes
- argue for or against the use of an MLP under a concrete materials benchmark design
This is not a “deep learning replaces classical ML” unit. It is a “when does extra flexibility pay off?” unit.
Unit 8 assumes the representation is already fixed:
- composition vectors
- engineered structural descriptors
- pooled local-environment features
- other tabular materials features
We are only changing the predictor, not the representation itself. Learned representations come in Unit 9.
For fixed materials features, the relevant benchmark stack is:
- ridge as the strong linear baseline
- random forest as a robust nonlinear non-neural baseline
- MLP as the flexible neural baseline
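A minimal sketch of this three-model stack with scikit-learn. The descriptor matrix and target below are synthetic stand-ins, and the hyperparameters are illustrative defaults, not a recommended configuration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))  # stand-in for fixed descriptor vectors
# linear signal plus a small descriptor interaction and noise
y = X[:, 0] + 0.5 * X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=300)

models = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "mlp": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
    ),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```

Scaling sits inside each pipeline so that cross-validation never leaks test statistics into the fit; the random forest needs no scaling and is left bare.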
The scientific question is never whether an MLP can fit a dataset. The question is whether it improves the benchmark in a meaningful and defensible way.
The same network class behaves differently depending on the input:
- with composition-only descriptors, it often interpolates within chemistry families
- with structure-enriched descriptors, it can capture richer local and geometric effects
- with weak descriptors, extra nonlinearity may mainly amplify noise
This is why model choice cannot be separated from representation quality.
An MLP is therefore not the default. It is a hypothesis to test.
An MLP is plausible when:
- simple baselines show systematic underfitting
- descriptor interactions appear important
- the dataset is diverse enough to support additional flexibility
- the use case benefits from a fast nonlinear surrogate
The decision is empirical, but it should be grounded in the data regime and the deployment goal.
In Materials Genomics, an MLP often acts as a surrogate for:
- expensive DFT calculations
- repeated property evaluation in screening loops
- process-property mappings where simulation or experiment is costly
This changes the evaluation lens. We care not only about low error, but about whether wrong predictions are likely in the parts of space where we would use the surrogate.
For fixed descriptor inputs, architecture choice is often modest:
- one to a few hidden layers
- moderate width
- output head chosen for one or multiple properties
Very large networks are rarely justified here. Limited data and high feature correlation usually favor smaller MLPs over deep architectures.
Raw row count can be misleading. A dataset with many close chemical relatives may contain far less independent information than it appears.
If capacity is chosen according to nominal dataset size rather than effective sample size, the MLP is likely to overfit family-specific patterns that do not transfer.
Materials datasets often contain:
- near-duplicate structures
- many compounds within one chemistry family
- polymorph variations of the same system
- entries produced by the same workflow and reference data
This means that ten thousand rows may behave statistically more like a much smaller dataset once correlations are respected.
This is why grouped evaluation matters even more for neural surrogates than for simpler baselines.
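One way to respect these correlations is a grouped split, sketched here with scikit-learn's GroupKFold. The data are synthetic, and the integer group labels stand in for chemistry-family IDs you would assign from domain knowledge:

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10)
# illustrative: one group ID per chemistry family
groups = rng.integers(0, 20, size=200)

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups):
    # entire families are held out together: no family leaks across the split
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
    model = Ridge().fit(X[train_idx], y[train_idx])
    print(f"held-out R^2: {model.score(X[test_idx], y[test_idx]):.3f}")
```

The same `groups` array can be passed to `cross_val_score` via its `groups` argument, so the grouped protocol applies identically to every model in the benchmark.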
The simplest setup is one network for one property:
x -> MLP -> y
This is appropriate when:
- the target is scientifically central
- coupled outputs are weak or irrelevant
- interpretability of task definition matters more than shared output structure
Many materials benchmarks should start here.
A multi-target network predicts several properties from one shared hidden representation:
x -> shared trunk -> (y_1, y_2, ..., y_k)
This can help when the targets share physical drivers, but it can also hurt if unrelated targets force the representation to compromise. The benefit depends on the degree of shared signal in the data.
Multitask learning is therefore not a generic upgrade. It is a materials hypothesis that needs evidence.
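As a sketch of the shared-trunk idea: scikit-learn's MLPRegressor accepts a two-dimensional target, so its hidden layers act as a shared trunk with one linear output per property. Everything below is synthetic, with two targets deliberately driven by one latent factor:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 15))
latent = X[:, 0] + X[:, 1] ** 2        # shared physical driver (illustrative)
Y = np.column_stack([
    latent + 0.1 * rng.normal(size=400),        # property 1
    2.0 * latent + 0.1 * rng.normal(size=400),  # property 2
])

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000, random_state=0),
)
mlp.fit(X_tr, Y_tr)                     # 2-D target -> shared trunk, 2 outputs
print("prediction shape:", mlp.predict(X_te).shape)
```

Comparing this multi-target fit against two independent single-target fits on the same grouped split is exactly the evidence the multitask hypothesis needs.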
A neural surrogate fits the target formulation we give it, not the idealized property in our heads.
The MLP must be compared under:
- the same grouped split
- the same feature preprocessing logic
- the same target transformation
- the same evaluation metrics
Anything less turns model comparison into an artifact of protocol differences.
A comparison under an identical protocol is exactly the evidence we need to judge whether the model is useful.
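One way to enforce this parity is to fold the preprocessing and the target transform into each estimator itself, so every candidate sees exactly the same protocol under the same grouped split. A sketch on synthetic data, with a log transform standing in for whatever target transformation the benchmark specifies:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = np.exp(0.3 * X[:, 0] + 0.1 * rng.normal(size=200))  # skewed positive target
groups = rng.integers(0, 10, size=200)                   # illustrative family IDs

def wrap(estimator):
    # identical scaler and identical log-target transform for every candidate
    return TransformedTargetRegressor(
        regressor=make_pipeline(StandardScaler(), estimator),
        func=np.log, inverse_func=np.exp,
    )

cv = GroupKFold(n_splits=5)
candidates = [
    ("ridge", Ridge()),
    ("mlp", MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)),
]
for name, est in candidates:
    scores = cross_val_score(wrap(est), X, y, groups=groups, cv=cv,
                             scoring="neg_mean_absolute_error")
    print(f"{name}: MAE = {-scores.mean():.3f}")
```

Because the split, scaling, target transform, and metric are fixed once, any remaining score difference reflects the predictor, not the protocol.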
Consider a band-gap benchmark using descriptor vectors:
- ridge gives a stable baseline
- random forest captures some nonlinear effects
- an MLP may beat both under a random split
The key question is whether the gain survives a grouped chemistry-aware split. If not, the neural advantage is mostly in-domain interpolation.
Neural surrogates trained on one dataset can fail on another because:
- the DFT functional changes
- relaxation settings change
- curation rules differ
- chemistry coverage shifts
This matters because a model may partly learn the conventions of a database rather than a transportable structure-property rule.
MLPs are powerful interpolators, but they are usually unreliable extrapolators.
In materials discovery, the practical question is often whether the model can say something useful about a chemistry family not represented in training. That question is much harder than random-split evaluation suggests.
The core risk is not only error; it is error without warning.
Without explicit uncertainty modeling, an MLP gives point predictions, not trustworthy confidence estimates.
That means:
- low average error does not imply calibrated trust
- some domains may be much less reliable than others
- uncertainty must be handled later with dedicated methods, not assumed from the neural architecture itself
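A crude way to surface this unreliability is an ensemble of identically configured MLPs with different seeds: prediction spread across members gives an uncalibrated disagreement signal, not a substitute for dedicated uncertainty methods. A synthetic sketch:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 5))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.normal(size=300)

# five MLPs differing only in their random seed
ensemble = [
    make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000, random_state=seed),
    ).fit(X, y)
    for seed in range(5)
]

def spread(X_query):
    # mean standard deviation of predictions across ensemble members
    return np.std([m.predict(X_query) for m in ensemble], axis=0).mean()

X_in = rng.uniform(-1, 1, size=(50, 5))   # inside training support
X_out = rng.uniform(2, 3, size=(50, 5))   # outside training support
print(f"in-domain spread:     {spread(X_in):.3f}")
print(f"out-of-domain spread: {spread(X_out):.3f}")
```

Members typically agree where the data are, and disagree outside training support; but this spread is not calibrated, so it should be read as a warning flag rather than a confidence interval.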
For small or irregular tabular materials datasets, random forest may remain preferable:
- strong low-data performance
- reduced sensitivity to scaling
- less fragile tuning
- easier interpretation of failure patterns
Unit 8 should make this explicit so the lecture does not collapse into “NNs are the future.”
A good MLP use case looks like:
- structure-enriched or descriptor-rich features
- evidence of nonlinear interactions between descriptors
- enough diversity to support the fit
- a grouped validation protocol aligned with deployment
- a need for fast repeated prediction
Under these conditions, the MLP becomes a justified surrogate rather than a fashionable choice.
One strength of Materials Genomics is that physically informed features can be combined with flexible nonlinear predictors.
This hybrid strategy often works well because:
- domain knowledge enters through the representation
- nonlinear interactions are still modeled
- data demands remain lower than for end-to-end representation learning
It is often the right intermediate step before moving to learned representations.
This is the exact course-level transition the lecture should make visible.
Do not choose the MLP when:
- simpler baselines already match its performance under the right split
- the data regime is too small or too correlated
- deployment requires extrapolation beyond training support
- interpretability is central and the accuracy gain is negligible
Complexity needs a scientific return.
Before using a descriptor-based MLP in practice, ask:
- does it beat strong baselines under the deployment-relevant split?
- do residuals remain acceptable in important chemistry families?
- is there evidence against leakage and dataset artifacts?
- does it degrade gracefully under shift?
- is the deployment domain close enough to training support?
If these questions cannot be answered, the model is not ready.

© Philipp Pelz - Materials Genomics