Mathematical Foundations of AI & ML
Unit 8: Tree Ensembles for Tabular Learning
FAU Erlangen-Nürnberg
By the end of this lecture, students can:
\[\mathbb{E}[(y-\hat f(x))^2] = \underbrace{\mathrm{Bias}[\hat f]^2}_{\text{too rigid}} + \underbrace{\mathrm{Var}[\hat f]}_{\text{too sensitive}} + \underbrace{\sigma^2}_{\text{irreducible}}\]
You are a farmer with a new plot of land. For each tree you can measure only two things — the trunk’s diameter and its height — and you must decide: Apple, Cherry, or Oak?



A node is pure if it holds one class, impure if it is a mix. Entropy (also Gini) puts a number on it: 0 = pure, larger = more mixed.


A decision tree recursively partitions input space by axis-aligned splits.
At each node, pick the (feature, threshold) that maximizes impurity reduction:
\[ \Delta = I(\text{parent}) - \frac{N_L}{N} I(\text{left}) - \frac{N_R}{N} I(\text{right}). \]
Regression
Classification
All three peak at \(p=0.5\) and vanish at pure nodes — but misclassification error is piecewise-linear, so it cannot reward a split that purifies a child without flipping its majority. Gini/entropy are strictly concave and can.
This is precisely the regime where averaging helps. → Ensembles.
Strengths
Limitations
tree_data = {
tree_reseed;
const n = 60;
const rnd = d3.randomNormal(0, tree_noise);
const pts = [];
for (let i = 0; i < n; i++) {
const x = (i / (n - 1)) * 6;
pts.push({ x, y: Math.sin(2 * x) + 0.3 * x + rnd() });
}
return pts;
}
function buildRegTree(rows, depth, maxDepth) {
const ys = rows.map(d => d.y);
const mean = d3.mean(ys);
if (depth >= maxDepth || rows.length < 4) return { leaf: true, value: mean };
const xs = Array.from(new Set(rows.map(d => d.x))).sort((a, b) => a - b);
let best = null;
for (let i = 1; i < xs.length; i++) {
const thr = (xs[i - 1] + xs[i]) / 2;
const L = rows.filter(d => d.x <= thr);
const R = rows.filter(d => d.x > thr);
if (L.length < 2 || R.length < 2) continue;
const sse =
L.length * (d3.variance(L.map(d => d.y)) || 0) +
R.length * (d3.variance(R.map(d => d.y)) || 0);
if (best === null || sse < best.sse) best = { thr, sse, L, R };
}
if (best === null) return { leaf: true, value: mean };
return {
leaf: false,
thr: best.thr,
left: buildRegTree(best.L, depth + 1, maxDepth),
right: buildRegTree(best.R, depth + 1, maxDepth)
};
}
function predictTree(node, x) {
return node.leaf ? node.value : (x <= node.thr ? predictTree(node.left, x) : predictTree(node.right, x));
}
tree_fitted = {
const t = buildRegTree(tree_data, 0, tree_depth);
return d3.range(0, 6.001, 0.02).map(x => ({ x, y: predictTree(t, x) }));
}
Plot.plot({
width: 820, height: 430,
marginLeft: 60, marginBottom: 52, marginTop: 20,
style: {
background: "transparent",
color: "#e5edf7",
fontSize: "20px",
fontFamily: "Inter, sans-serif"
},
x: { domain: [0, 6], label: "x →", grid: true },
y: { domain: [-2, 3], label: "↑ y", grid: true },
marks: [
Plot.line(d3.range(0, 6.01, 0.05), { x: d => d, y: d => Math.sin(2 * d) + 0.3 * d, stroke: "#94a3b8", strokeDasharray: "4,4", strokeWidth: 2 }),
Plot.dot(tree_data, { x: "x", y: "y", r: 3.5, fill: "#60a5fa", fillOpacity: 0.55 }),
Plot.line(tree_fitted, { x: "x", y: "y", stroke: "#4ade80", strokeWidth: 3, curve: "step-after" })
]
})Bootstrap aggregating (Breiman 1996):
\[\mathrm{Var}\!\left(\tfrac1B\sum_b \hat f_b\right)=\rho\,\sigma^2+\frac{1-\rho}{B}\,\sigma^2.\]
As \(B\to\infty\) the second term vanishes; variance floors at \(\rho\sigma^2\).
RandomForestRegressor / RandomForestClassifier (Breiman 2001).With \(B\approx 500\) fully grown, feature-subsampled trees, RF is a strong, low-tuning baseline that usually crushes a single tuned tree.

ens_var = (rho, B) => rho + (1 - rho) / B;
var_curves = {
const rows = [];
for (let B = 1; B <= 200; B++) {
rows.push({ B, v: ens_var(rho_a, B), kind: `Bagging (ρ=${rho_a.toFixed(2)})` });
rows.push({ B, v: ens_var(rho_b, B), kind: `Random forest (ρ=${rho_b.toFixed(2)})` });
}
return rows;
}
Plot.plot({
width: 1000, height: 700,
marginLeft: 80, marginBottom: 64, marginTop: 50,
style: {
background: "transparent",
color: "#e5edf7",
fontSize: "21px",
fontFamily: "Inter, sans-serif"
},
x: { domain: [1, 200], label: "Number of trees B →", grid: true },
y: { domain: [0, 1], label: "↑ Ensemble variance (σ²=1)", grid: true },
color: { legend: true, range: ["#f87171", "#60a5fa"] },
marks: [
Plot.ruleY([rho_a], { stroke: "#f87171", strokeDasharray: "4,4", strokeWidth: 1.5 }),
Plot.ruleY([rho_b], { stroke: "#60a5fa", strokeDasharray: "4,4", strokeWidth: 1.5 }),
Plot.line(var_curves, { x: "B", y: "v", stroke: "kind", strokeWidth: 3.5 })
]
})MDI = Mean Decrease in Impurity. Reuse the per-split gain \(\Delta\) from “how splits are chosen”: credit each feature with the impurity it removed, weighted by how many samples passed through that split, summed over every node and averaged across all \(B\) trees.
\[ \text{MDI}_j \;=\; \frac{1}{B}\sum_{b=1}^{B}\;\sum_{\substack{\text{nodes in tree }b\\\text{that split on }j}} \frac{N_{\text{node}}}{N}\,\Delta_{\text{node}}, \qquad \text{then normalized so } \sum_j \text{MDI}_j = 1. \]
feature_importances_ in scikit-learn — free, computed during training, no extra passes.Impurity (Gini/MDI) importance
Better alternatives
For a materials paper claiming “feature X drives the property,” report permutation or SHAP, not raw impurity importance.
It answers one question: how much does the trained model actually rely on this feature?
predict and a score.
Bagging averages independent learners; boosting composes dependent ones.
For loss \(\mathcal{L}\), at iteration \(t\):
\(\eta\) = learning rate (shrinkage). Small \(\eta\) + many trees generalizes better than large \(\eta\) + few.

Squared error \(\mathcal{L}=\tfrac12 (y-\hat f)^2\)
Logistic / other losses
Boosting can and will overfit — it drives training loss toward zero. Controls:
gb_all = {
const n = 70;
const rndTr = d3.randomNormal(0, 0.3);
const rndVa = d3.randomNormal(0, 0.3);
const tr = [], va = [];
for (let i = 0; i < n; i++) {
const x = (i / (n - 1)) * 6;
tr.push({ x, y: Math.sin(2 * x) + 0.3 * x + rndTr() });
const xv = ((i + 0.5) / n) * 6;
va.push({ x: xv, y: Math.sin(2 * xv) + 0.3 * xv + rndVa() });
}
return { tr, va };
}
bestStump = function (rows) {
const xs = Array.from(new Set(rows.map(d => d.x))).sort((a, b) => a - b);
let best = null;
for (let i = 1; i < xs.length; i++) {
const thr = (xs[i - 1] + xs[i]) / 2;
const L = rows.filter(d => d.x <= thr);
const R = rows.filter(d => d.x > thr);
if (!L.length || !R.length) continue;
const cL = d3.mean(L, d => d.r), cR = d3.mean(R, d => d.r);
let sse = 0;
for (const d of L) sse += (d.r - cL) ** 2;
for (const d of R) sse += (d.r - cR) ** 2;
if (best === null || sse < best.sse) best = { thr, cL, cR, sse };
}
return best;
}
gb_curves = {
const { tr, va } = gb_all;
const stumps = [];
let predTr = tr.map(() => 0);
let predVa = va.map(() => 0);
const rmse = (arr, pred) => Math.sqrt(d3.mean(arr.map((d, i) => (d.y - pred[i]) ** 2)));
const rows = [];
for (let t = 1; t <= gb_T; t++) {
const resid = tr.map((d, i) => ({ x: d.x, r: d.y - predTr[i] }));
const s = bestStump(resid);
if (!s) break;
stumps.push(s);
predTr = tr.map((d, i) => predTr[i] + gb_eta * (d.x <= s.thr ? s.cL : s.cR));
predVa = va.map((d, i) => predVa[i] + gb_eta * (d.x <= s.thr ? s.cL : s.cR));
rows.push({ t, e: rmse(tr, predTr), kind: "Train RMSE" });
rows.push({ t, e: rmse(va, predVa), kind: "Validation RMSE" });
}
return rows;
}
Plot.plot({
width: 820, height: 430,
marginLeft: 64, marginBottom: 52, marginTop: 44,
style: {
background: "transparent",
color: "#e5edf7",
fontSize: "20px",
fontFamily: "Inter, sans-serif"
},
x: { domain: [1, gb_T], label: "Boosting iteration →", grid: true },
y: { domain: [0, 1.2], label: "↑ RMSE", grid: true },
color: { legend: true, range: ["#60a5fa", "#f87171"] },
marks: [
Plot.line(gb_curves, { x: "t", y: "e", stroke: "kind", strokeWidth: 3.5 })
]
})This is why XGBoost, not plain gradient boosting, is the general-purpose tabular workhorse — and the most documented one.
Three ideas, all aimed at categorical-heavy, modest-size tabular data (Prokhorenkova et al. 2018):
For the typical materials problem, start with CatBoost. Reach for XGBoost as the general-purpose workhorse, LightGBM when \(N>10^6\).
num_leaves cap / larger min_data_in_leaf.n_estimators.n_estimators for the final model.This recipe is mainly for XGBoost/LightGBM. CatBoost is usually near-optimal at its defaults — fit it first, only tune if it underperforms.
Trees / boosting win when
Neural networks win when
For most materials projects with tabular features: try gradient boosting first.

Global (whole model)
Local (one prediction)
Ensembles trade a single tree’s readability for accuracy — SHAP/PDP buy back the interpretability that materials science and regulators require.

Typical pattern: tree ensembles add 10–20 points of \(R^2\) over linear models at essentially no engineering cost.
| Method | Bias | Variance | Mechanism |
|---|---|---|---|
| Single deep tree | low | high | flexible greedy partition |
| Bagging | low | reduced | average bootstrap trees |
| Random Forest | low | strongly reduced | + decorrelate via feature subsets |
| Gradient Boosting | reduced | medium | sequentially fit residuals |
| XGBoost (regularized) | reduced | controlled | + leaf penalty, shrinkage, early stop |
Week 8: Tree Ensembles — RF & XGBoost on alloy regression

© Philipp Pelz - Mathematical Foundations of AI & ML