Week 1 Summary: What makes materials data special?
Cross-Book Summary
1. The Concept of Data-Based Modeling (Neuer, Sandfeld, McClarren)
- Data-based vs. First-Principle Models: In engineering, we traditionally rely on first-principle models (bottom-up, derived from axioms like Newton’s laws). Machine Learning introduces data-based modeling (top-down), where the model is extracted from observed process data.
- Traceability (White, Grey, Black Boxes): First-principle models are “White-Box” (fully explainable). ML models are often labeled “Black-Box,” though techniques for “Grey-Box” (hybrid) and explainability are emerging to build scientific trust.
- Overfitting: A historical perspective (Sandfeld) shows that even Aristotle’s 56-sphere model of the heavens was a form of overfitting—being too complex for the task at hand.
2. Foundations of Data (Neuer, Bishop)
- Data Types: Nominal (names/categories), Ordinal (ordered), Cardinal (numeric), and Binary.
- Scales: Nominal, Ordinal, Interval (e.g., Celsius), and Ratio (e.g., Kelvin with absolute zero). Understanding the scale is critical for correct normalization and interpretation.
- Units and Uncertainty: Experimental data is meaningless without units and an estimation of measurement uncertainty.
3. Materials Science Specifics (ML-PC index, Sandfeld)
- The PSPP Paradigm: Processing–Structure–Property–Performance forms a dependency graph. Materials data is inherently multi-scale and multi-modal (images, spectra, logs).
- Experimental Noise: Unlike “clean” toy datasets, materials data contains physical noise, sampling artifacts (aliasing), and biases from instrument resolution.
- Data Scarcity: Obtaining high-quality materials data is slow and expensive, making standard “big data” approaches often inapplicable.
90-Minute Lecture Strategy (50 Slides)
Part 1: Introduction & Philosophy (Slides 1-10)
- Course goals and the AI 4 Materials program.
- The “Hype Cycle” vs. Reality in scientific AI.
- Why now? The convergence of high-throughput experiments, simulation, and ML.
Part 2: Models in Engineering (Slides 11-20)
- Defining a “Model”: Prediction vs. Explanation.
- First-Principles (physics-based) vs. Data-Driven approaches.
- The Black-Box stigma and the move toward White-Box ML.
Part 3: What Makes Materials Data Special? (Slides 21-35)
- The PSPP Chain as a data graph.
- Types of data: Micrographs (pixels), EBSD (orientations), EDS (spectra), Process logs (time-series).
- The “Small Data” problem in Materials Science.
- Physical Priors: How physics limits the possible data space.
Part 4: Data Foundations & Quality (Slides 36-45)
- Categorizing data: Nominal, Ordinal, Cardinal.
- The critical role of Metadata and Units.
- Measurement uncertainty and its propagation into ML models.
Part 5: The CRISP-DM Workflow for Labs (Slides 46-50)
- Adapting the industrial standard to the materials lab.
- From “Scientific Understanding” to “Deployment” in production.
- Correlation != Causality (The Ice Cream and Crime rate example).
Quarto Website Update (Summary)
Summary for ML-PC Week 1:
This unit introduces the transition from classical physics-based modeling to data-driven discovery in materials science. We explore the unique challenges of experimental materials data, including its multi-modal nature, high acquisition cost, and the fundamental Processing-Structure-Property-Performance (PSPP) relationships. Key concepts include data scales, measurement uncertainty, and the CRISP-DM process adapted for scientific workflows.