Week 1 Summary: What makes materials data special?

Cross-Book Summary

1. The Concept of Data-Based Modeling (Neuer, Sandfeld, McClarren)

  • Data-based vs. First-Principle Models: In engineering, we traditionally rely on first-principle models (bottom-up, derived from axioms like Newton’s laws). Machine Learning introduces data-based modeling (top-down), where the model is extracted from observed process data.
  • Traceability (White, Grey, Black Boxes): First-principle models are “White-Box” (fully explainable). ML models are often labeled “Black-Box,” though “Grey-Box” (hybrid) models and explainability techniques are emerging to build scientific trust.
  • Overfitting: A historical perspective (Sandfeld) shows that even Aristotle’s 56-sphere model of the heavens was a form of overfitting: the model was more complex than the task required.
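
The overfitting point can be illustrated with a minimal sketch. It assumes a made-up linear trend with alternating ±0.5 offsets standing in for measurement noise (none of these numbers come from the books), and compares a two-parameter line with a polynomial that passes through every training point.

```python
# Synthetic data (an assumption for illustration): a linear trend plus
# alternating +/-0.5 offsets that stand in for measurement noise.
def truth(x):
    return 2.0 * x + 1.0

x_train = [float(i) for i in range(10)]
y_train = [truth(x) + 0.5 * (-1) ** i for i, x in enumerate(x_train)]

# Held-out points halfway between the training points, without noise.
x_test = [x + 0.5 for x in x_train[:-1]]
y_test = [truth(x) for x in x_test]


def fit_line(xs, ys):
    """Ordinary least-squares straight line (2 parameters)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return lambda x: slope * x + intercept


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every training point (degree 9 here)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for k, xk in enumerate(xs):
                if k != i:
                    basis *= (x - xk) / (xi - xk)
            total += yi * basis
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


line = fit_line(x_train, y_train)
interpolant = fit_interpolant(x_train, y_train)

for name, model in [("line", line), ("degree-9 polynomial", interpolant)]:
    print(f"{name:>20}: train MSE = {mse(model, x_train, y_train):.3f}, "
          f"test MSE = {mse(model, x_test, y_test):.3f}")
```

The polynomial reproduces the training data exactly, yet it does far worse than the line on the held-out points: it has fitted the noise rather than the trend.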

2. Foundations of Data (Neuer, Bishop)

  • Data Types: Nominal (names/categories), Ordinal (ordered), Cardinal (numeric), and Binary.
  • Scales: Nominal, Ordinal, Interval (e.g., Celsius), and Ratio (e.g., Kelvin, whose zero is absolute zero). Understanding the scale is critical for correct normalization and interpretation; for example, a statement such as “twice as hot” is only meaningful on a ratio scale.
  • Units and Uncertainty: Experimental data is meaningless without units and an estimate of measurement uncertainty.
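
A short sketch of why the scale matters, using two arbitrary example temperatures (10 °C and 20 °C, chosen only for illustration): differences agree on the interval and ratio scales, but ratios do not.

```python
import math


def celsius_to_kelvin(t_celsius):
    """Convert from the interval scale (deg C) to the ratio scale (K)."""
    return t_celsius + 273.15


t1_c, t2_c = 10.0, 20.0  # example values, not measured data
t1_k, t2_k = celsius_to_kelvin(t1_c), celsius_to_kelvin(t2_c)

# Differences are meaningful on both scales and agree,
# because one kelvin and one degree Celsius have the same size.
diff_c = t2_c - t1_c
diff_k = t2_k - t1_k

# Ratios depend on where zero sits, so only the Kelvin ratio is physical.
ratio_c = t2_c / t1_c
ratio_k = t2_k / t1_k

print(f"difference: {diff_c:.2f} degC vs {diff_k:.2f} K")
print(f"ratio:      {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
print("differences agree:", math.isclose(diff_c, diff_k))
```

On the Celsius scale the second temperature looks "twice" the first, while the Kelvin ratio shows an increase of only a few percent. The same caution applies to any normalization that divides by a value on an interval scale.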

3. Materials Science Specifics (ML-PC index, Sandfeld)

  • The PSPP Paradigm: Processing–Structure–Property–Performance forms a dependency graph. Materials data is inherently multi-scale and multi-modal (images, spectra, logs).
  • Experimental Noise: Unlike “clean” toy datasets, materials data contains physical noise, sampling artifacts (aliasing), and biases from instrument resolution.
  • Data Scarcity: Obtaining high-quality materials data is slow and expensive, making standard “big data” approaches often inapplicable.
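
One way to make the PSPP dependency graph concrete is a small adjacency structure. The sketch below is an assumption for illustration: it models PSPP as a simple chain, and the assignment of data modalities to stages is a plausible example rather than something the sources prescribe.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it directly depends on.
PSPP_DEPENDS_ON = {
    "processing": set(),
    "structure": {"processing"},
    "property": {"structure"},
    "performance": {"property"},
}

# Illustrative (assumed) mapping from stages to typical data modalities.
MODALITIES = {
    "processing": ["process logs (time series)"],
    "structure": ["micrographs (images)", "EDS (spectra)"],
}


def upstream(stage, graph=PSPP_DEPENDS_ON):
    """Return every stage that `stage` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph[stage])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen


# A valid order in which the stages can be analysed or modelled.
order = list(TopologicalSorter(PSPP_DEPENDS_ON).static_order())
print("evaluation order:", " -> ".join(order))
print("performance depends on:", sorted(upstream("performance")))
```

Writing the chain as a graph makes it explicit which upstream data a property or performance model may legitimately use as input.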

90-Minute Lecture Strategy (50 Slides)

Part 1: Introduction & Philosophy (Slides 1-10)

  • Course goals and the AI 4 Materials program.
  • The “Hype Cycle” vs. Reality in scientific AI.
  • Why now? The convergence of high-throughput experiments, simulation, and ML.

Part 2: Models in Engineering (Slides 11-20)

  • Defining a “Model”: Prediction vs. Explanation.
  • First-Principles (physics-based) vs. Data-Driven approaches.
  • The Black-Box stigma and the move toward Grey-Box (hybrid) and explainable ML.

Part 3: What Makes Materials Data Special? (Slides 21-35)

  • The PSPP Chain as a data graph.
  • Types of data: Micrographs (pixels), EBSD (orientations), EDS (spectra), Process logs (time-series).
  • The “Small Data” problem in Materials Science.
  • Physical Priors: How physics limits the possible data space.

Part 4: Data Foundations & Quality (Slides 36-45)

  • Categorizing data: Nominal, Ordinal, Cardinal.
  • The critical role of Metadata and Units.
  • Measurement uncertainty and its propagation into ML models.
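
How measurement uncertainty propagates into a model prediction can be sketched in two common ways: first-order (linearized) propagation and Monte Carlo sampling. In the sketch below, `toy_model`, the input value, and its uncertainty are hypothetical stand-ins, not taken from the lecture material.

```python
import random
import statistics


def toy_model(x):
    """Hypothetical stand-in for a trained regression model."""
    return 0.5 * x ** 2


def propagate_linear(model, x0, sigma_x, h=1e-4):
    """First-order propagation: sigma_y is approximately |dy/dx| * sigma_x."""
    slope = (model(x0 + h) - model(x0 - h)) / (2 * h)  # central difference
    return abs(slope) * sigma_x


def propagate_monte_carlo(model, x0, sigma_x, n_samples=20_000, seed=0):
    """Sample inputs from a normal distribution and measure the output spread."""
    rng = random.Random(seed)
    outputs = [model(rng.gauss(x0, sigma_x)) for _ in range(n_samples)]
    return statistics.stdev(outputs)


x0, sigma_x = 10.0, 0.1  # assumed measured value and standard uncertainty
sigma_lin = propagate_linear(toy_model, x0, sigma_x)
sigma_mc = propagate_monte_carlo(toy_model, x0, sigma_x)

print(f"first-order estimate of sigma_y: {sigma_lin:.3f}")
print(f"Monte Carlo estimate of sigma_y: {sigma_mc:.3f}")
```

For small input uncertainty the two estimates agree closely. The Monte Carlo variant makes no linearity assumption, so it remains usable when the model is strongly nonlinear over the uncertainty range.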

Part 5: The CRISP-DM Workflow for Labs (Slides 46-50)

  • Adapting the industrial standard to the materials lab.
  • From “Scientific Understanding” to “Deployment” in production.
  • Correlation != Causality (the ice cream and crime rate example).
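
The ice cream and crime example can be reproduced with simulated data. The sketch below assumes the usual explanation, namely that temperature acts as a common cause of both quantities; all coefficients and noise levels are invented for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(42)

# Assumed common cause: temperature drives both quantities independently.
temperature = [rng.uniform(0.0, 30.0) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0.0, 5.0) for t in temperature]
crime = [1.0 * t + rng.gauss(0.0, 3.0) for t in temperature]

r_raw = pearson(ice_cream, crime)
r_partial = partial_correlation(ice_cream, crime, temperature)

print(f"correlation(ice cream, crime):               {r_raw:.2f}")
print(f"same, controlling for temperature (partial): {r_partial:.2f}")
```

The raw correlation is strong even though neither variable influences the other in the simulation. Once temperature is controlled for, the association largely disappears.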

Quarto Website Update (Summary)

Summary for ML-PC Week 1:
This unit introduces the transition from classical physics-based modeling to data-driven discovery in materials science. We explore the unique challenges of experimental materials data, including its multi-modal nature, high acquisition cost, and the fundamental Processing–Structure–Property–Performance (PSPP) relationships. Key concepts include data scales, measurement uncertainty, and the CRISP-DM process adapted for scientific workflows.