Materials Genomics
Unit 2: Simulation Methods as Data Generators

Prof. Dr. Philipp Pelz

FAU Erlangen-Nürnberg


01. Title: Simulation Methods as Data Generators

  • Unit scope and role.

02. Learning objectives

  • What students should be able to do.

03. Why simulations dominate materials data generation

  • Controlled mapping from assumptions to outputs.

04. Recap from Unit 1

  • Design space and validity.

05. Simulation as a map from assumptions to data

  • Inputs, solver, outputs, metadata.
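
One way to make the four parts concrete is a small record type. The sketch below is illustrative only: the class name `SimulationRecord`, the field contents, and all example values are assumptions, not a schema prescribed by the course.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class SimulationRecord:
    """One simulation run, split into the four parts named on the slide."""
    inputs: dict    # what was simulated: structure, composition, conditions
    solver: dict    # how it was simulated: method and its settings
    outputs: dict   # computed quantities that may later become ML targets
    metadata: dict = field(default_factory=dict)  # provenance for reuse

    def to_json(self) -> str:
        # sort_keys gives a stable text form that is easy to diff or hash
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical placeholder values, for illustration only
record = SimulationRecord(
    inputs={"formula": "Si", "n_atoms": 2},
    solver={"method": "DFT", "functional": "PBE", "kpoints": [8, 8, 8]},
    outputs={"total_energy_eV": -10.0},
    metadata={"code_version": "x.y.z", "run_id": "example-001"},
)
print(record.to_json())
```

Keeping solver settings separate from outputs makes it harder to lose the assumptions that produced a label.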

06. Length and time scales in materials modeling

  • Continuum to atomistic to electronic.

07. FEM outputs

  • Stress, strain, fields, constitutive response.

08. MD outputs

  • Trajectories, forces, diffusion observables.
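
As an example of a diffusion observable, the mean squared displacement (MSD) of a trajectory can be turned into a diffusion coefficient through the Einstein relation, MSD(t) ≈ 2·d·D·t in d dimensions at long times. The following is a minimal sketch that assumes unwrapped positions and a single time origin; the function names and the synthetic trajectory are made up for illustration.

```python
import math


def mean_squared_displacement(trajectory):
    """MSD relative to the first frame.

    trajectory: list of frames; each frame is a list of (x, y, z) positions,
    one per atom, assumed unwrapped (no periodic-boundary jumps).
    """
    reference = trajectory[0]
    msd = []
    for frame in trajectory:
        total = 0.0
        for (x, y, z), (x0, y0, z0) in zip(frame, reference):
            total += (x - x0) ** 2 + (y - y0) ** 2 + (z - z0) ** 2
        msd.append(total / len(reference))
    return msd


def diffusion_coefficient(msd, dt, dim=3):
    """Least-squares slope of MSD versus time, divided by 2 * dim."""
    times = [i * dt for i in range(len(msd))]
    t_mean = sum(times) / len(times)
    m_mean = sum(msd) / len(msd)
    num = sum((t - t_mean) * (m - m_mean) for t, m in zip(times, msd))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den / (2 * dim)


# Synthetic single-atom trajectory constructed so that MSD(t) = 6 * D * t
D_true, dt = 0.5, 1.0
trajectory = [[(math.sqrt(6 * D_true * i * dt), 0.0, 0.0)] for i in range(20)]
D_est = diffusion_coefficient(mean_squared_displacement(trajectory), dt)
```

A production analysis would typically average over many time origins and exclude the short-time ballistic regime before fitting.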

09. MC outputs

  • Thermodynamic sampling and phase averages.

10. DFT outputs

  • Energies, forces, band-related quantities.

11. Cost vs accuracy vs scale

  • Why no single simulator dominates all tasks.

12. Hidden bias from simulation choices

  • Functionals, force fields, boundary conditions.

13. What becomes an ML target

  • Labels, constraints, and proxy observables.

14. What remains metadata

  • Provenance needed for trust and reuse.

15. Simulation consistency vs physical accuracy

  • Reproducibility tradeoffs: a dataset can be internally consistent and reproducible yet still physically inaccurate.

16. Which method for which property

  • Screening logic by task.

17. Failure mode: mismatched fidelity

  • Labels computed at a fidelity that does not match the question being asked.

18. Failure mode: missing provenance

  • Irreproducible datasets.

19. Bridge to databases

  • Why records need method metadata.

20. Bridge to Week 3

  • Atomistic and electronic simulations in detail.

21. Bridge to Week 4

  • Stability and continuum outputs as ML context.

22. Feature leakage risks

  • Label proxies and duplicates.
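
A coarse duplicate check can be sketched by mapping each composition to a canonical reduced formula and looking for keys that appear in both splits. This assumes integer element counts and deliberately ignores structure, so polymorphs of the same composition collapse onto one key; the function names and example entries are hypothetical.

```python
from functools import reduce
from math import gcd


def reduced_formula_key(composition):
    """Map an element -> integer count dict to a canonical reduced tuple."""
    divisor = reduce(gcd, composition.values())
    return tuple(sorted((el, n // divisor) for el, n in composition.items()))


def leaked_keys(train, test):
    """Reduced-formula keys that occur in both the train and the test split."""
    train_keys = {reduced_formula_key(c) for c in train}
    test_keys = {reduced_formula_key(c) for c in test}
    return sorted(train_keys & test_keys)


# Hypothetical entries: Ti2O4 in the test split duplicates TiO2 in training
train = [{"Ti": 1, "O": 2}, {"Na": 1, "Cl": 1}]
test = [{"Ti": 2, "O": 4}, {"Mg": 1, "O": 1}]
overlap = leaked_keys(train, test)
```

Any non-empty `overlap` is a prompt to inspect the entries, not proof of leakage on its own.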

23. Train/val/test with structure families

  • Grouped split rationale.
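
The idea behind a grouped split is that every member of a structure family lands on the same side of the split. A minimal standard-library sketch follows; the helper `grouped_split` and the family labels are illustrative assumptions, and which key defines a "family" (prototype, chemical system, etc.) is a choice the analyst has to make.

```python
import random


def grouped_split(groups, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) such that no group appears on both sides.

    groups: one group label per sample.
    test_fraction: fraction of *groups* (not samples) assigned to the test set.
    """
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(test_fraction * len(unique)))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical family labels for eight samples
families = ["perovskite", "perovskite", "spinel", "spinel",
            "rocksalt", "rocksalt", "garnet", "garnet"]
train_idx, test_idx = grouped_split(families, test_fraction=0.25)
```

In practice, scikit-learn's `GroupShuffleSplit` or `GroupKFold` serve the same purpose and are likely the more convenient choice.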

24. Distribution shift in crystal data

  • Dataset transfer issues.

25. Target examples

  • Band gap, formation energy, stability.

26. Physical constraints in predictions

  • Plausibility checks.
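
Plausibility checks can be as simple as testing predictions against bounds that hold by definition. The sketch below assumes hypothetical key names and units (`band_gap_eV`, `energy_above_hull_eV`, `density_g_cm3`); a real task would define its own list of constraints.

```python
def plausibility_issues(prediction):
    """Return a list of violated sanity constraints for one prediction.

    Missing keys are skipped. Key names and units are assumptions
    made for this sketch.
    """
    issues = []
    if prediction.get("band_gap_eV", 0.0) < 0.0:
        issues.append("band gap is negative")
    # Non-negative by definition when the hull includes the material itself
    if prediction.get("energy_above_hull_eV", 0.0) < 0.0:
        issues.append("energy above hull is negative")
    if prediction.get("density_g_cm3", 1.0) <= 0.0:
        issues.append("density is not positive")
    return issues


# Hypothetical model output that violates one constraint
flagged = plausibility_issues({"band_gap_eV": -0.3, "density_g_cm3": 5.0})
```

Flagged predictions can be clipped, rejected, or routed to a human, depending on the task.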

27. Error analysis by structure class

  • Beyond aggregate metrics.
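
To see why an aggregate metric can mislead, the error can be broken down per structure class. This is a small sketch with made-up numbers and class labels, chosen so that one class carries all of the error.

```python
from collections import defaultdict


def mae_by_class(y_true, y_pred, classes):
    """Mean absolute error computed separately for each class label."""
    errors = defaultdict(list)
    for t, p, c in zip(y_true, y_pred, classes):
        errors[c].append(abs(t - p))
    return {c: sum(e) / len(e) for c, e in errors.items()}


# Hypothetical targets, predictions, and class labels
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 2.0, 6.0]
classes = ["oxide", "oxide", "nitride", "nitride"]

per_class = mae_by_class(y_true, y_pred, classes)
overall = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here the overall MAE is 0.75, while the per-class values are 0.0 for the oxides and 1.5 for the nitrides.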

28. Uncertainty in structure-property models

  • Confidence-aware decisions.
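
One common, though not the only, proxy for predictive uncertainty is the spread across an ensemble of models. The sketch below assumes the ensemble predictions are already available as plain lists; the function names and the threshold are illustrative, and ensemble spread is not guaranteed to be calibrated.

```python
from statistics import mean, stdev


def ensemble_summary(member_predictions):
    """Per-sample (mean, sample standard deviation) across ensemble members.

    member_predictions: one list of predictions per model, all equal length,
    with at least two models.
    """
    per_sample = zip(*member_predictions)
    return [(mean(p), stdev(p)) for p in per_sample]


def needs_review(summary, max_std):
    """Indices of samples whose ensemble spread exceeds max_std."""
    return [i for i, (_, s) in enumerate(summary) if s > max_std]


# Hypothetical predictions from three models for two samples
summary = ensemble_summary([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
uncertain = needs_review(summary, max_std=0.5)
```

A confidence-aware workflow would act on `uncertain`, for example by deferring those candidates to a higher-fidelity simulation.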

29. Outliers and anomaly handling

  • Discovery vs data errors.

30. Data provenance importance

  • Reproducibility and trust.

31. FAIR perspective (light)

  • Reuse-oriented data practice: Findable, Accessible, Interoperable, Reusable.

32. Minimal baseline workflow

  • Parse, featurize, split, train, evaluate.

33. Metrics choice by target type

  • Regression vs classification.
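
The distinction can be illustrated with standard definitions: MAE and RMSE for continuous targets, and precision, recall, and F1 for binary labels such as stable versus unstable. The values below are made up for illustration.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error for regression targets."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def precision_recall_f1(y_true, y_pred):
    """Binary classification metrics for boolean labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Regression example: one large error dominates RMSE more than MAE
reg_mae = mae([0.0, 1.0, 2.0], [0.0, 1.0, 5.0])
reg_rmse = rmse([0.0, 1.0, 2.0], [0.0, 1.0, 5.0])

# Classification example: is the material predicted to be stable?
clf = precision_recall_f1([True, True, False, False], [True, False, True, False])
```

The same three-point regression example gives an MAE of 1.0 but an RMSE of about 1.73, which shows how the choice of metric changes the weight of outliers.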

34. Model card for materials task

  • Document assumptions.

35. Common failure mode #1

  • Overfit to narrow chemistry.

36. Common failure mode #2

  • Hidden duplicates/leakage.

37. Common failure mode #3

  • Domain shift across databases.

38. Mitigation checklist

  • Practical guardrails.

39. Case sketch: crystal subset study

  • End-to-end scaffold.

40. Case sketch: split comparison

  • Random vs grouped outcomes.

41. MFML dependency map

  • Terms reused from MFML.

43. Exercise scaffold: task setup

  • Dataset and constraints.

44. Exercise scaffold: parsing step

  • CIF processing.
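
For orientation, the sketch below pulls only the unit-cell parameters out of CIF text using the standard library. It is not a full CIF parser: it assumes simple "tag value" lines and ignores loops, symmetry, and atom sites. The example cell is hypothetical. For real work, an established parser such as pymatgen's `Structure.from_file` or ASE's `ase.io.read` is the more robust route.

```python
import re


def parse_cell_parameters(cif_text):
    """Extract _cell_length_* and _cell_angle_* values from CIF text.

    Strips a trailing standard uncertainty in parentheses, e.g. 4.000(1).
    """
    cell = {}
    for line in cif_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith(("_cell_length_", "_cell_angle_")):
            value = re.sub(r"\(\d+\)$", "", parts[1])
            cell[parts[0]] = float(value)
    return cell


# Hypothetical cubic cell, for illustration only
example_cif = """\
data_example
_cell_length_a 4.000(1)
_cell_length_b 4.000(1)
_cell_length_c 4.000(1)
_cell_angle_alpha 90
_cell_angle_beta 90
_cell_angle_gamma 90
"""
cell = parse_cell_parameters(example_cif)
```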

45. Exercise scaffold: feature table

  • Baseline descriptors.
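
A baseline feature table can start from composition alone. The sketch below computes three simple descriptors per entry; the choice of descriptors, the function name, and the tiny element lookup are assumptions made for illustration, not the course's prescribed feature set.

```python
# Atomic numbers for the few elements used in this example
ATOMIC_NUMBER = {"O": 8, "Na": 11, "Cl": 17, "Ti": 22}


def composition_features(composition):
    """Composition-only descriptors from an element -> count dict."""
    total = sum(composition.values())
    fractions = {el: n / total for el, n in composition.items()}
    return {
        "n_elements": len(composition),
        # fraction-weighted mean atomic number
        "mean_Z": sum(ATOMIC_NUMBER[el] * f for el, f in fractions.items()),
        "max_fraction": max(fractions.values()),
    }


# One row per material; rows share the same keys, so they form a table
compositions = [{"Ti": 1, "O": 2}, {"Na": 1, "Cl": 1}]
feature_rows = [composition_features(c) for c in compositions]
```

Because these features ignore structure, they give a useful lower bar against which structure-aware descriptors can be compared.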

46. Exercise scaffold: split + model

  • Group-aware baseline.

47. Exercise scaffold: diagnostics

  • One bias/leakage analysis.

48. Exam-oriented key statements

  • High-yield concepts.

49. Summary slide

  • What to retain.

50. References + reading assignment

  • Transition to Unit 3.