ML for Characterization and Processing
Lecture 3: Data Quality, Preprocessing, and Robust Validation

Philipp Pelz

FAU Erlangen-Nürnberg


Welcome

Week 3 – Garbage In, Garbage Out

Course goals for today:

  • Understand the lifecycle of materials data from sensor to digital representation
  • Learn robust data cleaning and outlier detection strategies
  • Master data transformations: scaling, frequency domain, and time-frequency analysis
  • Explore labeling challenges and uncertainty in materials datasets
  • Implement robust validation to prevent data leakage and overfitting

Outline

  1. Data Cleaning & Outlier Detection
  2. Transformations & Scaling
  3. Labeling Challenges & Uncertainty
  4. Robust Validation (K-Fold, Data Leakage)
  5. Summary & Error Measures

1. Data Cleaning: The First Line of Defense

Data quality is not accidental.

Raw data often contains:

  • Structural problems: Typographical errors, inconsistent units, misaligned timestamps.
  • Duplicates: Multiple readings of the same physical event.
  • Missing values (NaNs): Caused by sensor drops or out-of-range events.

Strategy: “Always clean at the source first.” If hardware fixes aren’t possible, use digital repair (interpolation or markers).
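One common form of digital repair is linear interpolation across NaN gaps. A minimal sketch with numpy (the function name `repair_nans` is illustrative, not from the lecture):

```python
import numpy as np

def repair_nans(signal):
    """Linearly interpolate over NaN gaps in a 1-D sensor trace."""
    signal = np.asarray(signal, dtype=float)
    nans = np.isnan(signal)
    if nans.all():
        raise ValueError("signal contains no valid samples")
    idx = np.arange(signal.size)
    repaired = signal.copy()
    # Fill missing samples from the nearest valid neighbors
    repaired[nans] = np.interp(idx[nans], idx[~nans], signal[~nans])
    return repaired

print(repair_nans([1.0, np.nan, 3.0, np.nan, 5.0]))  # [1. 2. 3. 4. 5.]
```

For large gaps, an explicit missing-data marker is often the safer choice than interpolation, since interpolated values can masquerade as real measurements downstream.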

Outlier Detection

Outliers are not just “wrong” data; they carry information.

  • Point Outliers: Individual measurements far from the distribution.
  • Contextual Outliers: Measurements that look unremarkable globally but deviate within their local context (e.g., a spike in an otherwise steady time segment).
  • Collective Outliers: Groups of points that deviate together.

Crucial Decision: Is it an artifact of the instrument or a rare physical event (e.g., crack initiation)?
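A simple way to flag point outliers is a robust z-score based on the median absolute deviation (MAD), which, unlike the plain mean/std z-score, is not itself distorted by the outliers it is trying to find. A sketch (function name and threshold are illustrative):

```python
import numpy as np

def point_outliers(x, threshold=3.0):
    """Flag samples more than `threshold` robust z-scores from the median."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    # 1.4826 scales the MAD to match the std of a normal distribution
    robust_z = np.abs(x - med) / (1.4826 * mad)
    return robust_z > threshold

x = np.array([1.0, 1.1, 0.9, 1.05, 12.0])
print(np.where(point_outliers(x))[0])  # [4]
```

Whether the flagged point is then discarded or kept as a rare physical event is exactly the domain decision described above.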

2. Transformations: Changing Perspectives

Scaling & Normalization

Distance- and variance-based algorithms such as kNN and PCA are sensitive to the absolute magnitudes of input features.

  • Standardization: Mean 0, Std 1 (Z-score).
  • Min-Max Scaling: Mapping to [0,1].
  • Non-dimensionalization: Removing units to reveal underlying physics.
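The first two scalings can be written in a few lines of numpy (function names are illustrative):

```python
import numpy as np

def standardize(X):
    """Z-score: subtract per-feature means, divide by per-feature stds."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max(X):
    """Map each feature to the interval [0, 1]."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Two features on very different scales (e.g., strain vs. stress in MPa)
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
Z = standardize(X)
print(Z.mean(axis=0), Z.std(axis=0))  # ~[0 0] and [1 1]
```

Note that the scaling parameters (means, mins, maxes) must be estimated on the training set only and then applied to the test set, a point that returns below under data leakage.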

Advanced Transformations

  • Differentiation: \(f'(x)\) removes constant baselines and highlights dynamic changes.
  • FFT (Fourier): Moving from time/space to frequency domain to find oscillations.
  • Wavelets (CWT): Localizing time-frequency features (e.g., acoustic emissions).
  • Triggering: Extracting repetitive process windows from long streams.
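As a small illustration of the frequency-domain view, the dominant oscillation in a synthetic signal can be recovered with numpy's FFT (the 1 kHz sampling rate and test frequencies are assumed for the example):

```python
import numpy as np

fs = 1000.0                       # sampling rate in Hz (assumed)
t = np.arange(0, 1, 1 / fs)       # 1 s of signal
# Two superimposed oscillations: 50 Hz (strong) and 120 Hz (weaker)
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)
print(freqs[np.argmax(spectrum)])  # 50.0 — the dominant frequency
```

Wavelets extend this idea by localizing such frequency content in time, which is why they suit transient events like acoustic emissions.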

3. Labeling Challenges

Microstructure = Latent Truth?

  • Hand-labeling: Subjective and slow.
  • Inter-Annotator Variance: Two experts rarely agree 100% on grain boundaries or phase masks.
  • Uncertainty: Labels should be probabilistic. Models should output confidence (Softmax).

Takeaway: Ground truth in materials science is often a consensus, not an absolute.
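The softmax mentioned above turns raw classifier scores into a probability distribution, so the model reports confidence rather than a hard label. A minimal sketch (the 3-class logits are hypothetical):

```python
import numpy as np

def softmax(logits):
    """Convert raw model scores into a probability distribution."""
    z = logits - np.max(logits)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical 3-phase classifier output: a clear favorite, but not certainty
p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.round(3))  # [0.659 0.242 0.099]
```

A probability of 0.66 rather than 1.0 is exactly the kind of hedged label that reflects inter-annotator variance.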

4. Robust Validation

Overfitting: The “Memorization” Trap

A model that fits noise perfectly will fail on new data.

Parsimony (Occam’s Razor): Prefer the simplest model that explains the data.

Validation Strategies:

  • K-Fold CV: Iterative training/testing to average out split bias.
  • LOOCV: Leave-one-out for very small datasets.
  • Stratified Split: Maintaining class balance.
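K-fold CV can be implemented in a few lines without any ML library; a sketch (function name is illustrative, and in practice a library routine such as scikit-learn's `KFold` would be used):

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)        # shuffle once, then split
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

for train, test in k_fold_indices(10, k=5):
    print(len(train), len(test))  # 8 2 on every fold
```

LOOCV is the special case k = n_samples; stratified splitting additionally preserves class proportions inside each fold.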

Data Leakage: The Silent Killer

Leakage occurs when test information “infects” the training process.

  • Temporal Leakage: Using future data to predict the past.
  • Group Leakage: Splitting multiple images from the same sample between train and test.

Solution: Group-based and time-aware splitting strategies.
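A group-based split can be sketched as follows: every micrograph from one physical sample stays on the same side of the split (sample IDs and function name are hypothetical; scikit-learn's `GroupKFold` offers the same idea as a library routine):

```python
import numpy as np

def group_split(groups, test_groups):
    """Keep all measurements from one physical sample on one side of the split."""
    groups = np.asarray(groups)
    test_mask = np.isin(groups, test_groups)
    return np.where(~test_mask)[0], np.where(test_mask)[0]

# Hypothetical: six micrographs taken from three physical samples
groups = ["A", "A", "B", "B", "C", "C"]
train_idx, test_idx = group_split(groups, test_groups=["C"])
print(train_idx, test_idx)  # [0 1 2 3] [4 5]
```

A random per-image split would instead scatter sample C's images across train and test, letting the model "recognize" the sample rather than generalize.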

5. Summary & Error Measures

How do we measure success?

  • Regression: MAE, MSE, RMSE, \(R^2\).
  • Classification: Precision, Recall, F1 (Dice), IoU (Jaccard).
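The regression measures can be computed directly from their definitions (function name is illustrative):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for a regression fit."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mae = np.abs(err).mean()                       # mean absolute error
    rmse = np.sqrt((err ** 2).mean())              # root mean squared error
    ss_res = (err ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot                     # fraction of variance explained
    return mae, rmse, r2

mae, rmse, r2 = regression_metrics([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
print(mae, rmse, r2)
```

MSE weights large errors more heavily than MAE, so the choice between them encodes how much a few big misses should dominate the score.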

Key Takeaway:

Robust ML requires domain knowledge in preprocessing and rigorous skepticism in validation.

Questions?

Use the chalkboard!
