Machine Learning in Materials Processing & Characterization
Unit 3: Data Quality, Preprocessing, and Robust Validation
FAU Erlangen-Nürnberg
Note
Unit 3 focuses on everything that happens before and after modeling — the steps most often skipped.
By the end of this unit, you will be able to:
Today we focus on the gold boxes: Preprocessing and Evaluation.
These are where 80% of the work happens — and where 80% of the mistakes hide.
Transforming raw, potentially messy data into a structured format suitable for algorithms:
Slides 05–11
Why data goes missing:
Detection: df.isnull().sum() in Pandas — always the first thing to check.
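A minimal sketch of this first check, on a hypothetical measurement table (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical measurement table with gaps (NaN) in the hardness column
df = pd.DataFrame({
    "temperature_C": [450.0, 500.0, 550.0, 600.0],
    "hardness_HV":   [210.0, np.nan, 198.0, np.nan],
})

# Count missing entries per column: always the first check
missing_counts = df.isnull().sum()
print(missing_counts)
```

`missing_counts` is a Series indexed by column name, so the gaps are immediately localized to specific features.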
Fix at source (ideal):
Digital repair (if source fix impossible):
Numerical markers: Using “impossible” values (e.g., -1000°C) to track NaNs without losing record count.
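A sketch of how such a sentinel can later be converted back to NaN without dropping any rows; the -1000.0 marker mirrors the -1000°C example above, and the data are invented:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor log where -1000.0 marks a failed temperature reading
df = pd.DataFrame({"temperature_C": [450.0, -1000.0, 550.0]})

# Replace the impossible marker with NaN; the record count stays the same
df["temperature_C"] = df["temperature_C"].replace(-1000.0, np.nan)
```

The marker keeps the record count intact during collection, and the explicit conversion makes the missingness visible to downstream tools again.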
Three types of outliers:
Scenario: You find a hardness value 3× higher than all others in your dataset.
Answer: Always (C). Is it a cosmic ray on the detector? A typo? Or a rare but real physical event (crack initiation, phase transformation)?
Removing real outliers destroys the most interesting data points.
Note
If you remove 20% of your data without documenting why, your results are not reproducible.
Slides 12–20
Centering changes the origin but not the shape or spread of the distribution.
\[x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \in [0, 1]\]
Good for: Neural network inputs (bounded activations like sigmoid expect [0,1]).
Bad for: Noisy lab data with occasional extreme values.
\[x' = \frac{x - \mu}{\sigma}\]
\[x' = \frac{x - \text{median}}{\text{IQR}}\]
Implementation: sklearn.preprocessing.RobustScaler

Caution: Log-transform requires strictly positive values. Check for zeros first!
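The three scalers above can be compared directly on a feature with one extreme value, the situation flagged as problematic for min-max scaling (the data are synthetic):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic feature with one extreme value, as in noisy lab data
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax = MinMaxScaler().fit_transform(X)      # uses x_min, x_max
standard = StandardScaler().fit_transform(X)  # uses mean, std
robust = RobustScaler().fit_transform(X)      # uses median, IQR

# The outlier defines x_max, so min-max squeezes the four normal
# points into a tiny range near 0; RobustScaler is unaffected
# because the median (3.0) and IQR ignore the extreme value.
```

This is exactly why the robust variant is recommended for lab data with occasional extreme values: its statistics are computed from the bulk of the distribution.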

Critical rule for all transformations: Scalers and transforms must be “fit” on training data only and “applied” to test data. Otherwise: preprocessing leakage.
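A minimal sketch of the fit-on-train-only rule with a standard scaler (the data and the 80/20 split are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 1))
X_train, X_test = X[:80], X[80:]

# Correct: the scaler sees only the 80 training rows...
scaler = StandardScaler().fit(X_train)
# ...and is then applied, unchanged, to both splits
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

Calling `fit` (or `fit_transform`) on the full dataset before splitting would embed test-set statistics in the transform, which is the preprocessing leakage discussed later in this unit.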
Slides 21–24

Best practice: Have multiple annotators. Report the inter-annotator agreement. Your model should aim for human-level performance, not perfect accuracy.
The Bayesian perspective: treat model parameters as distributions, not point estimates. Posterior ∝ Likelihood × Prior.
Slides 25–37
Underfitting (High Bias):
Overfitting (High Variance):

\[\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}\]
The sweet spot: enough complexity to capture the physics, not so much that you memorize the noise.
Occam’s Razor: Prefer the simplest model that explains the data.
“Entities must not be multiplied beyond necessity.”
Regularization adds a penalty for complexity:
Risk for small datasets: An unlucky split puts all “hard” cases in the test set → pessimistic estimate. Or all “easy” cases → optimistic estimate.
More stable than holdout: every sample gets to be in the test set exactly once.
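The "every sample exactly once" property can be verified directly with scikit-learn's KFold (10 synthetic samples, 5 folds, illustrative settings):

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 hypothetical samples
X = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
test_indices = []
for train_idx, test_idx in kf.split(X):
    test_indices.extend(test_idx)

# Across the 5 folds, every sample lands in the test set exactly once
assert sorted(test_indices) == list(range(10))
```

Averaging the per-fold scores then gives a far more stable estimate than any single 80/20 split.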
Definition: When information from the test set “leaks” into training, producing over-optimistic results that vanish during real deployment.
Three types of leakage in materials ML:
Note
“If your accuracy is too good to be true, it probably is.”
Scenario: A sample is cut into 100 image patches. 80 go to train, 20 to test.
Problem: Patches from the same physical sample are highly correlated. The model “recognizes” the specific sample instead of learning general physics.
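One common remedy (named here as an assumption, not taken from the slide text) is grouped splitting, e.g. scikit-learn's GroupKFold, which keeps all patches of one physical sample on the same side of every split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: 4 physical samples, 3 image patches each
groups = np.repeat([0, 1, 2, 3], 3)
X = np.arange(12).reshape(-1, 1)

# GroupKFold never places patches of one sample in both train and test
folds = list(GroupKFold(n_splits=4).split(X, groups=groups))
```

With this split, the model can no longer score well by "recognizing" a specific sample: it must generalize across samples.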
Scenario: Predict the quality of a weld based on sensor logs.
Problem: Using data from \(t = 50\) min to predict a property measured at \(t = 10\) min.
Solution: “Walk-forward” validation — only use data available at the time of prediction.
In time-series: never shuffle! The arrow of time must be respected in your splits.
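Walk-forward splitting is available directly as scikit-learn's TimeSeriesSplit; a minimal sketch on 10 chronologically ordered readings (synthetic data):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 10 sensor readings in chronological order
t = np.arange(10).reshape(-1, 1)

# Each split trains only on readings that precede the test window
folds = list(TimeSeriesSplit(n_splits=3).split(t))
```

Every training index precedes every test index, so no information from the future leaks into the past.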
Scenario: Calculate mean and standard deviation of the entire dataset before splitting.
Problem: Information about the test set distribution is now embedded in the training features.
Solution: Use Pipeline objects to encapsulate all preprocessing.
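A sketch of such a Pipeline under cross-validation (the Ridge model, alpha value, and synthetic data are illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# The Pipeline re-fits the scaler inside each CV fold, so test-fold
# statistics never reach the training step
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(model, X, y, cv=5)
```

Because the scaler is part of the estimator, `cross_val_score` cannot accidentally standardize with full-dataset statistics: each fold fits its own scaler on its own training portion.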
Setup: You have 20 steel samples. Each is imaged at 5 locations. You standardize all features, then randomly split the 100 images into 80 train / 20 test. You achieve R² = 0.95.
How many leakage errors are present?
Answer: Three!
Computationally expensive (\(5 \times 5 = 25\) model fits) but the gold standard for small datasets.
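Nested CV can be sketched in scikit-learn by placing a GridSearchCV (inner loop) inside cross_val_score (outer loop); the 5 outer × 5 inner folds match the structure above, though with a parameter grid the actual number of model fits is larger. Ridge and the alpha grid are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Inner loop: hyperparameter selection on the training folds only
inner = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5)

# Outer loop: evaluates the whole tuning procedure on held-out folds
scores = cross_val_score(inner, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
```

The outer score estimates the performance of the *procedure* (including tuning), not of one lucky hyperparameter choice, which is why this is the gold standard for small datasets.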
Slides 38–47
| Task | Common Metrics |
|---|---|
| Regression | MAE, MSE, RMSE, R² |
| Classification | Accuracy, Precision, Recall, F1, AUC |
| Segmentation | IoU (Jaccard), Dice, Pixel Accuracy |
\[\text{MAE} = \frac{1}{N}\sum_{i=1}^{N} |y_i - \hat{y}_i|\]
\[\text{MSE} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \qquad \text{RMSE} = \sqrt{\text{MSE}}\]
\[R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}\]
Pitfall: High R² doesn’t mean a useful model. Always check residual plots.
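The three regression metrics on a small worked example (hand-picked numbers so the results are easy to check):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

mae = mean_absolute_error(y_true, y_pred)   # errors 0.5, 0, 0.5, 0 -> 0.25
mse = mean_squared_error(y_true, y_pred)    # (0.25 + 0.25) / 4 = 0.125
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)               # 1 - 0.5/20 = 0.975
```

Note how MSE punishes the two 0.5 errors quadratically while MAE treats them linearly; this is why MSE/RMSE are more sensitive to large residuals.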
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | TP (True Positive) | FN (False Negative) |
| Actually Negative | FP (False Positive) | TN (True Negative) |
\[\text{Precision} = \frac{TP}{TP + FP}\]
“Of all predicted positives, how many were actually positive?”
Materials example: Automated defect detection in a production line.
High precision = few false alarms → operators trust the system.
\[\text{Recall} = \frac{TP}{TP + FN}\]
“Of all actually positive cases, how many did we find?”
Materials example: Safety-critical inspection of turbine blades.
High recall = we find (almost) every crack → no catastrophic failures.
\[F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]
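Precision, recall, and F1 on a small invented defect-detection example (1 = defective), chosen so the confusion-matrix counts are easy to read off:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = defective part, 0 = good part
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
# Counts: TP = 3, FN = 1, FP = 1, TN = 5

precision = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)                # harmonic mean = 0.75
```

Here precision and recall happen to coincide; in practice, tuning the decision threshold trades one against the other, and F1 summarizes the balance.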
\[\text{IoU} = \frac{|A \cap B|}{|A \cup B|} = \frac{TP}{TP + FP + FN}\]
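IoU computed directly from two small binary masks (the masks are invented for illustration):

```python
import numpy as np

# Two hypothetical binary segmentation masks (prediction vs. ground truth)
a = np.array([[1, 1, 0],
              [0, 1, 0]], dtype=bool)
b = np.array([[1, 0, 0],
              [0, 1, 1]], dtype=bool)

intersection = np.logical_and(a, b).sum()  # pixels both masks mark: 2
union = np.logical_or(a, b).sum()          # pixels either mask marks: 4
iou = intersection / union                 # 2 / 4 = 0.5
```

Counting the pixels confirms the set formulation: the intersection holds the true positives, the union adds false positives and false negatives.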
\[L = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)\]
Connection: Cross-entropy is the negative log-likelihood under a categorical distribution — it’s the Bayesian-correct loss for classification.
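A minimal numeric example of the cross-entropy formula: with a one-hot label, the sum collapses to the negative log-probability of the true class (the probabilities are invented):

```python
import numpy as np

# One-hot true label and predicted class probabilities (C = 3 classes)
y = np.array([0.0, 1.0, 0.0])
p = np.array([0.2, 0.7, 0.1])

# L = -sum_c y_c * log(p_c) reduces to -log(p of the true class)
loss = -np.sum(y * np.log(p))
```

Minimizing this loss therefore maximizes the likelihood the model assigns to the correct class, which is the Bayesian connection stated above.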
Slides 48–50
Reading:
Next Week: Unit 4 — From Classical Microstructure Metrics to Learned Representations

© Philipp Pelz - Machine Learning in Materials Processing & Characterization