Welcome
Week 3 – Garbage In, Garbage Out
Course goals for today:
- Understand the lifecycle of materials data from sensor to digital representation
- Learn robust data cleaning and outlier detection strategies
- Master data transformations: scaling, frequency domain, and time-frequency analysis
- Explore labeling challenges and uncertainty in materials datasets
- Implement robust validation to prevent data leakage and overfitting
Outline
- Data Cleaning & Outlier Detection
- Transformations & Scaling
- Labeling Challenges & Uncertainty
- Robust Validation (K-Fold, Data Leakage)
- Summary & Error Measures
1. Data Cleaning: The First Line of Defense
Data quality is not accidental.
Raw data often contains:
- Structural problems: Typographical errors, inconsistent units, misaligned timestamps.
- Duplicates: Multiple readings of the same physical event.
- Missing values (NaNs): Caused by sensor drops or out-of-range events.
Strategy: “Always clean at the source first.” If hardware fixes aren’t possible, use digital repair (interpolation or explicit missing-value markers).
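The digital-repair step can be sketched with pandas; the sensor trace, column names, and values below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor trace with a duplicated reading and a dropout (NaN).
df = pd.DataFrame({
    "t": [0.0, 1.0, 1.0, 2.0, 3.0],
    "temp_C": [20.0, 21.0, 21.0, np.nan, 23.0],
})

# Digital repair: drop the duplicate reading, then bridge the gap linearly.
df = df.drop_duplicates(subset="t")
df["temp_C"] = df["temp_C"].interpolate(method="linear")

print(df["temp_C"].tolist())  # → [20.0, 21.0, 22.0, 23.0]
```

Interpolation is only safe for short gaps in smooth signals; for long dropouts, an explicit missing-value marker is usually the more honest choice.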
Outlier Detection
Outliers are not just “wrong” data; they carry information.
- Point Outliers: Individual measurements far from the distribution.
- Contextual Outliers: Values that are normal globally but anomalous in their local context (e.g., a plausible reading occurring at the wrong time).
- Collective Outliers: Groups of points that deviate together.
Crucial Decision: Is it an artifact of the instrument or a rare physical event (e.g., crack initiation)?
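A minimal sketch of point-outlier detection using a robust z-score (median and MAD instead of mean and standard deviation, so the outlier itself cannot mask the statistics); the signal and the injected spike are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(0.0, 1.0, 200)
signal[50] = 12.0  # injected point outlier (e.g., a sensor spike)

# Robust z-score: median absolute deviation (MAD), scaled so that
# robust_z is comparable to a standard z-score under normality.
median = np.median(signal)
mad = np.median(np.abs(signal - median))
robust_z = 0.6745 * (signal - median) / mad

# Flag points more than 3 robust standard deviations from the median.
point_outliers = np.where(np.abs(robust_z) > 3)[0]
# index 50 should be among the flagged indices
```

Whether a flagged point is discarded or kept is exactly the crucial decision above: the detector only localizes the anomaly, domain knowledge decides if it is an artifact or physics.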
3. Labeling Challenges
Microstructure = Latent Truth?
- Hand-labeling: Subjective and slow.
- Inter-Annotator Variance: Two experts rarely agree 100% on grain boundaries or phase masks.
- Uncertainty: Labels should be treated as probabilistic; models should report confidence (e.g., softmax probabilities).
Takeaway: Ground truth in materials science is often a consensus, not an absolute.
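One way to act on probabilistic labels is to convert model scores into softmax confidences and route low-confidence predictions back to experts; the 3-class logits and the 0.6 threshold below are illustrative assumptions:

```python
import numpy as np

def softmax(logits):
    """Convert raw model scores into a probability distribution per row."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical 3-class logits for two pixels:
# the first is clear-cut, the second is ambiguous (annotators may disagree too).
logits = np.array([[8.0, 0.5, 0.1],
                   [1.1, 1.0, 0.9]])
probs = softmax(logits)
confidence = probs.max(axis=1)

# Flag predictions below an (assumed) confidence threshold for expert review.
needs_review = confidence < 0.6
print(needs_review.tolist())  # → [False, True]
```

This mirrors the consensus view of ground truth: ambiguous pixels get probabilities, not forced hard labels.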
4. Robust Validation
Overfitting: The “Memorization” Trap
A model that fits noise perfectly will fail on new data.
Parsimony (Occam’s Razor): Prefer the simplest model that adequately explains the data.
Validation Strategies:
- K-Fold CV: Train and evaluate on each of K folds in turn, averaging scores to reduce split bias.
- LOOCV: Leave-one-out for very small datasets.
- Stratified Split: Maintaining class balance.
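A stratified K-fold split can be sketched with scikit-learn’s `StratifiedKFold`; the small imbalanced label array is a made-up example:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced dataset: 8 "no-defect" vs 4 "defect" samples.
y = np.array([0] * 8 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
fold_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Each test fold preserves the 2:1 class ratio of the full dataset.
    fold_counts.append(np.bincount(y[test_idx]).tolist())

print(fold_counts)  # → [[2, 1], [2, 1], [2, 1], [2, 1]]
```

A plain random split on a dataset this small could easily put all four defect samples into one fold; stratification rules that out by construction.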
Data Leakage: The Silent Killer
Leakage occurs when test information “infects” the training process.
- Temporal Leakage: Using future data to predict the past.
- Group Leakage: Splitting multiple images from the same sample between train and test.
Solution: Group-based and time-aware splitting strategies.
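Group-aware splitting can be sketched with scikit-learn’s `GroupKFold`; the micrograph counts and sample IDs below are hypothetical:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical: 9 micrographs taken from 3 physical samples (3 images each).
X = np.arange(9).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # sample IDs

gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(X, y, groups):
    # No sample ID appears in both train and test,
    # so group leakage between images of the same sample is impossible.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```

The same idea applies to temporal leakage: `TimeSeriesSplit` keeps the test fold strictly after the training fold.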
5. Summary & Error Measures
How do we measure success?
- Regression: MAE, MSE, RMSE, \(R^2\).
- Classification: Precision, Recall, F1 (Dice), IoU (Jaccard).
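These error measures can be computed by hand with NumPy; the predictions and masks are toy values:

```python
import numpy as np

# Regression metrics on toy predictions.
y_true = np.array([2.0, 3.0, 5.0, 7.0])
y_pred = np.array([2.5, 3.0, 4.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))              # → 0.625
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))     # → 0.75
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                            # fraction of variance explained

# Segmentation overlap on toy binary masks: IoU (Jaccard) and Dice (= F1).
a = np.array([1, 1, 0, 0, 1], dtype=bool)
b = np.array([1, 0, 0, 1, 1], dtype=bool)
iou = (a & b).sum() / (a | b).sum()                 # → 2/4 = 0.5
dice = 2 * (a & b).sum() / (a.sum() + b.sum())      # → 4/6 ≈ 0.667
```

Note that Dice and IoU are monotonically related, but Dice is always the larger of the two for partial overlap, so the two scores are not interchangeable when comparing published results.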
Key Takeaway:
Robust ML requires domain knowledge in preprocessing and rigorous skepticism in validation.
Questions?
Use the chalkboard!