Materials Genomics
Unit 1: Materials Data as a Design Space

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


01. Title + course role in program

  • Materials Genomics operationalizes discovery as a data-driven design process.
  • Unit 1 builds conceptual scaffolding for later representation and modeling units.

02. What “genomics” means (and what it does NOT mean)

  • Analogy: map compositional/structural diversity to property phenotypes.
  • Not literal biology transfer; it is an organizational discovery metaphor.

03. Learning outcomes Unit 1

  • Define design-space thinking and discovery-loop logic.
  • Diagnose bias/leakage risks before model comparison.

04. Discovery bottleneck in classical materials workflows

  • Classical trial-and-error is too slow for large combinatorial spaces.
  • Data-driven prioritization reduces expensive experimental/DFT evaluations.

05. Data-rich turn in materials science

  • Digitization of simulation and instrumentation created reusable data assets.
  • Opportunity comes with data governance and quality burdens.

06. Where this course connects to MFML and ML-PC

  • MFML supplies formal risk/validation language.
  • ML-PC contributes cautionary patterns from experimental data that complement MG.

07. 90-minute roadmap

  • Motivation -> data assets -> representation assumptions -> validity -> exercise bridge.
  • Focus on scientific reliability, not benchmark chasing.

08. Checkpoint prompt: “Where does ML add value here?”

  • Ask where ML is additive versus redundant to existing pipelines.
  • Force explicit statement of decision target and cost.

09. Periodic table + structure space as searchable manifold

  • Treat composition+structure space as high-dimensional manifold.
  • Search requires representations that preserve relevant invariances.

10. PSPP graph for materials discovery

  • Discovery links processing, structure, properties, and performance.
  • MG emphasizes structure/property edges and candidate ranking.

11. Targets: formation energy, stability, bandgap, moduli, etc.

  • Common targets: formation energy, hull distance, bandgap, moduli.
  • Target semantics determine feasible model and metric choices.

12. Direct simulation vs surrogate modeling

  • Surrogates trade some fidelity for massive throughput gains.
  • Best practice: active loop with high-fidelity validation of top candidates.

13. Screening logic: rank then validate

  • Rank by expected utility, not only predicted mean.
  • Incorporate uncertainty to avoid overconfident exploitation.

14. Why uncertainty is required for candidate prioritization

  • Without uncertainty, candidate prioritization is fragile under shift.
  • Use uncertainty for exploration-exploitation balance.

15. Domain knowledge constraints

  • Charge neutrality, stability, symmetry act as hard/soft constraints.
  • Constraints reduce implausible regions and improve sample efficiency.
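A minimal sketch of charge neutrality as a hard constraint: a candidate is kept only if some assignment of common oxidation states balances to zero. The small oxidation-state table below is an illustrative stand-in for a curated source (e.g. the tables shipped with pymatgen), not an authoritative one.

```python
from itertools import product

# Common oxidation states per element (illustrative subset only;
# a real screen would pull these from a curated reference table).
OXIDATION_STATES = {
    "Na": [1], "K": [1], "Mg": [2], "Ca": [2],
    "Al": [3], "Ti": [2, 3, 4], "Fe": [2, 3],
    "O": [-2], "Cl": [-1], "S": [-2],
}

def is_charge_neutral(composition):
    """True if any oxidation-state assignment sums to zero charge.

    composition: dict element -> stoichiometric count, e.g. {"Na": 1, "Cl": 1}.
    """
    elements = list(composition)
    choices = [OXIDATION_STATES[el] for el in elements]
    for states in product(*choices):
        total = sum(q * composition[el] for q, el in zip(states, elements))
        if total == 0:
            return True
    return False

# Hard constraint: prune candidates that cannot be charge balanced.
candidates = [{"Na": 1, "Cl": 1}, {"Mg": 1, "Cl": 1}, {"Fe": 2, "O": 3}]
feasible = [c for c in candidates if is_charge_neutral(c)]
```

Pruning MgCl (not balanceable with Mg²⁺/Cl⁻) while keeping NaCl and Fe₂O₃ is exactly the sample-efficiency gain the bullet describes: the model never wastes evaluations on implausible regions.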

16. First-principles + data-driven hybrid strategy

  • Combine first-principles computations with learned surrogates.
  • Hybrid loops enable rapid hypothesis generation with physical anchoring.

17. Micro-case: when pure data fit fails physically

  • Show a failure case where the model learns dataset provenance instead of chemistry.
  • Use this to motivate robust split design.

18. Database landscape (MP, OQMD, AFLOW, NOMAD)

  • Materials Project, OQMD, AFLOW, NOMAD differ in scope/provenance.
  • Document source and version for reproducible claims.

19. What each database gives / misses

  • Databases provide broad coverage but uneven labels and metadata quality.
  • Absence patterns are informative and can induce bias.

20. Data object types: composition, structure, process metadata

  • Inputs include composition vectors, crystal graphs, process metadata.
  • Different object types imply different model classes.

21. Thermodynamic quantities used in ML datasets

  • Formation energy and hull distance are central for stability-aware tasks.
  • Clarify units and reference states before modeling.

22. Representation problem statement

  • Representation must encode invariance/equivariance requirements.
  • Poor representation can dominate model error.

23. Classical descriptors vs learned representations (preview)

  • Classical descriptors are interpretable, but their predictive performance may saturate.
  • Learned representations can capture nonlinear interactions.

24. Symmetry and invariance constraints

  • Permutation, rotational, and lattice symmetries must be respected.
  • Violation causes data inefficiency and spurious patterns.

25. Data quality dimensions

  • Assess completeness, consistency, uncertainty, and provenance.
  • Quality gates should be explicit before training.

26. Metadata and provenance importance

  • Track workflow origin, method settings, and preprocessing steps.
  • Provenance enables debugging and scientific trust.

27. Dataset shift across generation pipelines

  • Shift arises across computational settings, labs, and synthesis protocols.
  • Evaluate robustness with split strategies reflecting deployment.

28. Bias map: coverage, publication, synthesis bias

  • Coverage and publication biases distort apparent performance.
  • Report limitations with subgroup diagnostics.

29. Leakage map in materials datasets

  • Family-level and polymorph-level leakage are common pitfalls.
  • Use grouped splits by composition/structure families.
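A minimal sketch of a leakage-safe split: group entries by chemical system (the sorted element set), then assign whole families to train or test via a stable hash, so polymorphs and doped variants of one family can never straddle the split. Function names here are illustrative.

```python
from hashlib import md5

def composition_family(elements):
    """Group key: the chemical system, e.g. ["Li","Fe","O"] -> "Fe-Li-O"."""
    return "-".join(sorted(elements))

def grouped_split(entries, test_fraction=0.2):
    """Assign entire composition families to train or test.

    entries: list of (entry_id, elements) pairs. A stable hash of the
    family key decides the fold, which blocks the family-level and
    polymorph-level leakage modes this slide warns about.
    """
    train, test = [], []
    for entry_id, elements in entries:
        key = composition_family(elements)
        bucket = int(md5(key.encode()).hexdigest(), 16) % 100
        (test if bucket < test_fraction * 100 else train).append(entry_id)
    return train, test
```

Hashing the group key (rather than sampling rows at random) makes the split deterministic and reproducible across reruns, which matters for the reproducibility checklist later in the unit.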

30. Recap: data assumptions that must be explicit

  • State assumptions explicitly: target definition, split logic, and uncertainty handling.
  • Assumptions are part of the model.

31. Task formulations in MG

  • MG uses regression, classification, and ranking depending on decision stage.
  • Task choice should mirror downstream screening action.

32. Regression baseline + error interpretation

  • Start with simple baselines for sanity and residual analysis.
  • Inspect error heterogeneity across chemical subspaces.
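One way the two bullets above could be sketched together: a trivial group-mean baseline, plus a per-subgroup residual report that exposes error heterogeneity across chemical subspaces. This is an illustrative minimal baseline, not a recommended production model.

```python
from collections import defaultdict
from statistics import mean

def fit_group_mean_baseline(groups, y):
    """Predict the training mean of each group; global mean as fallback."""
    by_group = defaultdict(list)
    for g, yi in zip(groups, y):
        by_group[g].append(yi)
    global_mean = mean(y)
    means = {g: mean(v) for g, v in by_group.items()}
    return lambda g: means.get(g, global_mean)

def residuals_by_group(predict, groups, y):
    """Mean absolute residual per subgroup: a sanity check for
    error heterogeneity before any complex model is trained."""
    errs = defaultdict(list)
    for g, yi in zip(groups, y):
        errs[g].append(abs(yi - predict(g)))
    return {g: mean(v) for g, v in errs.items()}
```

If a complex model cannot beat this baseline per subgroup, the extra complexity is buying noise, not chemistry.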

33. Classification/ranking formulations for screening

  • Classification supports filter stages; ranking supports prioritization.
  • Ranking metrics may be more relevant than MSE in screening.

34. Train/val/test under compositional grouping

  • Random splits overestimate performance under compositional correlation.
  • Use grouped/time-aware splits to emulate deployment.

35. OOD behavior in chemical space

  • Evaluate extrapolation beyond seen chemistry/structure domains.
  • OOD uncertainty should increase; if it does not, treat this as a warning signal.

36. Uncertainty-aware ranking concept

  • Select by acquisition criteria combining mean and uncertainty.
  • Avoid deterministic top-k overconfident traps.
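As a minimal sketch of an acquisition criterion combining mean and uncertainty, an Upper Confidence Bound ranking (one common choice among several; the parameter name `kappa` is conventional, not from the source):

```python
def ucb_rank(candidates, kappa=1.0):
    """Rank candidates by Upper Confidence Bound: mean + kappa * std.

    candidates: dict name -> (predicted_mean, predicted_std).
    kappa = 0 recovers deterministic top-k (pure exploitation);
    larger kappa rewards uncertain candidates (exploration).
    """
    score = lambda ms: ms[0] + kappa * ms[1]
    return sorted(candidates, key=lambda c: score(candidates[c]), reverse=True)
```

Note how the ranking flips with `kappa`: a slightly worse but much more uncertain candidate can overtake the deterministic leader, which is precisely the overconfident top-k trap the bullet warns against.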

37. Exploration vs exploitation (concept only)

  • Pure exploitation misses novel regions; pure exploration wastes budget.
  • Balance adaptively with uncertainty-aware acquisition.

38. Decision rule with uncertainty and cost

  • Translate predictions to decisions via explicit utility/cost.
  • Document thresholds and rationale.
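A minimal sketch of such an explicit rule, assuming a Gaussian predictive distribution: validate a candidate only when the expected gain (success probability times the value of a hit) exceeds the validation cost. All names and the utility form are illustrative choices, not prescribed by the slide.

```python
from math import erf, sqrt

def p_exceeds(mean, std, threshold):
    """P(property > threshold) under a Gaussian predictive distribution."""
    if std == 0:
        return 1.0 if mean > threshold else 0.0
    z = (threshold - mean) / (std * sqrt(2))
    return 0.5 * (1 - erf(z))

def worth_validating(mean, std, threshold, value_of_hit, validation_cost):
    """Explicit decision rule: expected gain must beat the validation cost.

    Writing the rule down like this forces the threshold and the cost
    rationale to be documented, as the slide requires.
    """
    return p_exceeds(mean, std, threshold) * value_of_hit > validation_cost
```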

39. Explainability expectations in scientific ML

  • Scientific discovery requires interpretable hypotheses, not only scores.
  • Use attribution and counterfactual checks cautiously.

40. Reproducibility checklist for discovery claims

  • Share data split definitions, code, model version, and seeds.
  • Reproducibility is part of scientific validity.

41. Common failure post-mortems

  • Analyze false discoveries and missed candidates systematically.
  • Turn failures into design rules for next cycle.

42. Exercise objective and dataset

  • Reproduce a mini discovery pipeline on a curated subset.
  • End with one defensible recommendation.

43. Step 1: query and clean materials table

  • Query and clean table; track missingness and units.
  • Define train/val/test with leakage safeguards.
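A minimal sketch of the query-and-clean step, assuming a CSV export with units encoded in the column names (the sample table and column names below are invented for illustration): parse, coerce numerics, and record per-column missingness rather than silently dropping rows.

```python
import csv
from io import StringIO

# Hypothetical query result; units live in the column names.
RAW = """formula,band_gap_eV,formation_energy_eV_per_atom
NaCl,5.0,-2.1
MgO,,-3.0
Fe2O3,2.2,
"""

def load_and_audit(text):
    """Parse the table, coerce numeric columns, and count missing cells.

    Returns (clean_rows, missingness) so the missingness pattern is an
    explicit artifact of the pipeline, not a silent side effect.
    """
    rows = list(csv.DictReader(StringIO(text)))
    numeric_cols = ["band_gap_eV", "formation_energy_eV_per_atom"]
    missing = {c: 0 for c in numeric_cols}
    clean = []
    for row in rows:
        parsed = {"formula": row["formula"]}
        for c in numeric_cols:
            if row[c] == "":
                missing[c] += 1
                parsed[c] = None
            else:
                parsed[c] = float(row[c])
        clean.append(parsed)
    return clean, missing
```

Keeping the missingness report next to the cleaned table makes the slide-19 point concrete: absence patterns are data too.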

44. Step 2: feature table v1 (simple descriptors)

  • Construct baseline descriptor set and document assumptions.
  • Run a simple baseline model before complexity.
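One common v1 descriptor construction is fraction-weighted means of elemental properties; a minimal sketch (the tiny property table holds standard atomic numbers and Pauling electronegativities, but a real feature set would come from a curated source such as matminer):

```python
ELEMENT_PROPS = {  # illustrative subset; swap in a curated table in practice
    "Na": {"Z": 11, "electronegativity": 0.93},
    "Cl": {"Z": 17, "electronegativity": 3.16},
    "Mg": {"Z": 12, "electronegativity": 1.31},
    "O":  {"Z": 8,  "electronegativity": 3.44},
}

def composition_descriptors(composition):
    """Fraction-weighted elemental-property means as a baseline feature set.

    composition: dict element -> stoichiometric count. The documented
    assumption: properties mix linearly in atomic fraction -- crude, but
    that is exactly what a v1 descriptor table should state up front.
    """
    total = sum(composition.values())
    desc = {}
    for prop in ["Z", "electronegativity"]:
        desc[f"mean_{prop}"] = sum(
            n / total * ELEMENT_PROPS[el][prop] for el, n in composition.items()
        )
    return desc
```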

45. Step 3: baseline model + grouped split

  • Evaluate with task-relevant metrics and uncertainty proxy.
  • Compare grouped vs random split outcomes.

46. Step 4: error and bias diagnosis

  • Diagnose one concrete bias artifact and propose mitigation.
  • Quantify impact on candidate ranking.

47. Step 5: one uncertainty proxy and discussion

  • Add uncertainty-aware selection and discuss changed priorities.
  • Reflect on risk tolerance in discovery context.
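One cheap uncertainty proxy that fits this step is ensemble disagreement: train several models on resampled data and use the spread of their predictions as the uncertainty estimate. A minimal sketch (the input format is an assumption for illustration):

```python
from statistics import mean, pstdev

def ensemble_uncertainty(predictions_per_model):
    """Ensemble disagreement as an uncertainty proxy.

    predictions_per_model: list of dicts, candidate -> prediction, one dict
    per ensemble member. Returns candidate -> (mean, std); the std feeds
    directly into an uncertainty-aware selection rule.
    """
    candidates = predictions_per_model[0].keys()
    out = {}
    for c in candidates:
        preds = [p[c] for p in predictions_per_model]
        out[c] = (mean(preds), pstdev(preds))
    return out
```

A candidate on which the ensemble members disagree strongly gets a large std and is handled cautiously (or explored deliberately), which is the changed-priorities discussion this step asks for.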

48. What to report in notebook (scientific style)

  • Include problem statement, split logic, metrics, error analysis, and recommendations.
  • Avoid reporting only best score without context.

49. Unit summary: 10 exam-relevant statements

  • Capture 10 exam-relevant statements on validity, uncertainty, and workflow.
  • Practice concise justification language.

50. References + reading for Unit 2

  • Data science in materials is interdisciplinary and explicitly tied to domain knowledge (Sandfeld Ch. 2.1).
  • ML is a method family within a broader AI/data-science ecosystem; avoid buzzword conflation (Sandfeld Ch. 2.1; McClarren Ch. 1).
  • Model trust requires explainability and uncertainty framing; “black-box by default” is not acceptable for scientific discovery (Neuer Ch. 1.1.2–1.1.3).
  • Lecture: design-space thinking, dataset caveats, validity criteria, uncertainty-aware decision logic.
  • Exercise: practical query/build/evaluate loop with one explicit bias diagnosis.
  • Sandfeld Ch. 2.1–2.3
  • Neuer Ch. 1.1.2–1.1.3
  • McClarren Ch. 1.1 + 1.5
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
McClarren, Ryan G. 2021. Machine Learning for Engineers: Using Data to Solve Problems for Physical Systems. Springer.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Neuer, Michael et al. 2024. Machine Learning for Engineers: Introduction to Physics-Informed, Explainable Learning Methods for AI in Engineering Applications. Springer Nature.
Sandfeld, Stefan et al. 2024. Materials Data Science. Springer.