Mathematical Foundations of AI & ML
Unit 8: Tree Ensembles for Tabular Learning
FAU Erlangen-Nürnberg
By the end of this lecture, students can:
- decompose test error into bias, variance, and noise, and place trees and ensembles on that map,
- grow a regression tree by greedy impurity-reducing splits,
- explain why bagging and feature subsampling (random forests) reduce variance,
- describe gradient boosting as sequential fitting of pseudo-residuals, and control its overfitting,
- choose between XGBoost, LightGBM, and CatBoost, and report trustworthy feature importances.
The expected test error of a fitted model \(\hat f\) decomposes as
\[\mathbb{E}\big[(y-\hat f(x))^2\big] = \underbrace{\mathrm{Bias}[\hat f]^2}_{\text{too rigid}} + \underbrace{\mathrm{Var}[\hat f]}_{\text{too sensitive}} + \underbrace{\sigma^2}_{\text{irreducible}}.\]
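A minimal Monte-Carlo sketch of this decomposition in the notebook's own style, reusing the buildRegTree / predictTree helpers defined further down (cell order does not matter in Observable); the cell name biasVarDemo and the test point are ours:

biasVarDemo = {
  // Refit a tree on many resampled datasets; split its error at x0 into bias² and variance.
  const truth = x => Math.sin(2 * x) + 0.3 * x;
  const x0 = 3.0, sigma = 0.5, reps = 200, maxDepth = 6;
  const preds = [];
  for (let r = 0; r < reps; r++) {
    const rnd = d3.randomNormal(0, sigma);
    const rows = d3.range(60).map(i => ({ x: (i / 59) * 6, y: truth((i / 59) * 6) + rnd() }));
    preds.push(predictTree(buildRegTree(rows, 0, maxDepth), x0));
  }
  const m = d3.mean(preds);
  return {
    bias2: (m - truth(x0)) ** 2,   // squared systematic error of the average fit
    variance: d3.variance(preds),  // spread of the fit across resampled datasets
    irreducible: sigma ** 2        // noise floor σ²
  };
}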
A decision tree recursively partitions input space by axis-aligned splits.
At each node, pick the (feature, threshold) that maximizes impurity reduction:
\[ \Delta = I(\text{parent}) - \frac{N_L}{N} I(\text{left}) - \frac{N_R}{N} I(\text{right}). \]
Regression: impurity \(I\) = node variance (squared error around the node mean).
Classification: Gini \(I = 1 - \sum_k p_k^2\) or entropy \(I = -\sum_k p_k \log p_k\) over the class proportions \(p_k\) (sketch below).
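A minimal sketch of these impurity measures and the split gain \(\Delta\), in the same d3 style as the cells below (the helper names are ours, not from the lecture code):

impurity = {
  // Classification impurity: Gini = 1 - Σ p_k²
  const gini = labels => {
    const n = labels.length;
    const counts = d3.rollup(labels, v => v.length, d => d);
    return 1 - d3.sum(counts.values(), c => (c / n) ** 2);
  };
  // Regression impurity: mean squared error around the node mean
  const mse = ys => {
    const m = d3.mean(ys);
    return d3.mean(ys, y => (y - m) ** 2) || 0;
  };
  // Δ = I(parent) − (N_L/N)·I(left) − (N_R/N)·I(right)
  const gain = (I, parent, left, right) =>
    I(parent) - (left.length / parent.length) * I(left)
              - (right.length / parent.length) * I(right);
  return { gini, mse, gain };
}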
A single deep tree has low bias but high variance: this is precisely the regime where averaging helps. → Ensembles.
Strengths: no feature scaling needed, handles mixed numeric/categorical inputs, captures nonlinearities and interactions, cheap to fit and easy to read off.
Limitations: greedy axis-aligned splits, piecewise-constant predictions (no extrapolation beyond the training range), high variance: small perturbations of the data can change the whole tree.
tree_data = {
  tree_reseed; // reactive dependency: re-sample whenever the reseed control changes
  const n = 60;
  const rnd = d3.randomNormal(0, tree_noise); // noise level set by the tree_noise slider
  const pts = [];
  for (let i = 0; i < n; i++) {
    const x = (i / (n - 1)) * 6; // evenly spaced x in [0, 6]
    pts.push({ x, y: Math.sin(2 * x) + 0.3 * x + rnd() }); // noisy sin(2x) + 0.3x target
  }
  return pts;
}
function buildRegTree(rows, depth, maxDepth) {
  const ys = rows.map(d => d.y);
  const mean = d3.mean(ys);
  // Stop: depth limit reached or node too small -> leaf predicts the node mean.
  if (depth >= maxDepth || rows.length < 4) return { leaf: true, value: mean };
  const xs = Array.from(new Set(rows.map(d => d.x))).sort((a, b) => a - b);
  let best = null;
  // Candidate thresholds: midpoints between consecutive distinct x values.
  for (let i = 1; i < xs.length; i++) {
    const thr = (xs[i - 1] + xs[i]) / 2;
    const L = rows.filter(d => d.x <= thr);
    const R = rows.filter(d => d.x > thr);
    if (L.length < 2 || R.length < 2) continue;
    // Within-child SSE; d3.variance is the sample variance, so SSE = (n - 1) · variance.
    const sse =
      (L.length - 1) * (d3.variance(L.map(d => d.y)) || 0) +
      (R.length - 1) * (d3.variance(R.map(d => d.y)) || 0);
    if (best === null || sse < best.sse) best = { thr, sse, L, R };
  }
  if (best === null) return { leaf: true, value: mean };
  // Recurse: greedy top-down induction, no pruning.
  return {
    leaf: false,
    thr: best.thr,
    left: buildRegTree(best.L, depth + 1, maxDepth),
    right: buildRegTree(best.R, depth + 1, maxDepth)
  };
}
// Route x down the tree until a leaf; return the leaf's mean value.
function predictTree(node, x) {
  return node.leaf ? node.value : (x <= node.thr ? predictTree(node.left, x) : predictTree(node.right, x));
}
tree_fitted = {
  // Fit a tree at the selected depth and evaluate it on a dense grid for plotting.
  const t = buildRegTree(tree_data, 0, tree_depth);
  return d3.range(0, 6.001, 0.02).map(x => ({ x, y: predictTree(t, x) }));
}
Plot.plot({
  width: 820, height: 430,
  x: { domain: [0, 6], label: "x" },
  y: { domain: [-2, 3], label: "y" },
  marks: [
    // True function (dashed gray), noisy training points (blue), tree fit (orange step curve).
    Plot.line(d3.range(0, 6.01, 0.05), { x: d => d, y: d => Math.sin(2 * d) + 0.3 * d, stroke: "gray", strokeDasharray: "4,4" }),
    Plot.dot(tree_data, { x: "x", y: "y", r: 3, fill: "steelblue", fillOpacity: 0.5 }),
    Plot.line(tree_fitted, { x: "x", y: "y", stroke: "orange", strokeWidth: 2.5, curve: "step-after" })
  ]
})

Bootstrap aggregating (Breiman 1996): train \(B\) trees on bootstrap resamples of the data and average their predictions. If each tree has variance \(\sigma^2\) and the trees have pairwise correlation \(\rho\),
\[\mathrm{Var}\!\left(\tfrac1B\sum_b \hat f_b\right)=\rho\,\sigma^2+\frac{1-\rho}{B}\,\sigma^2.\]
As \(B\to\infty\) the second term vanishes; variance floors at \(\rho\sigma^2\).
RandomForestRegressor / RandomForestClassifier (Breiman 2001): additionally subsampling features at each split decorrelates the trees (lower \(\rho\)). With \(B\approx 500\) fully grown, feature-subsampled trees, RF is a strong, low-tuning baseline that usually crushes a single tuned tree (bagging sketch below).
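A minimal bagging sketch in the notebook's own style, reusing buildRegTree / predictTree from above (the bootstrap helper and cell name are ours; with a single feature there is nothing to feature-subsample, so this is plain bagging rather than a full random forest):

bagged = {
  // Bagging: average B trees, each grown on a bootstrap resample of tree_data.
  const B = 50, maxDepth = 6;
  const bootstrap = rows =>
    d3.range(rows.length).map(() => rows[Math.floor(Math.random() * rows.length)]);
  const trees = d3.range(B).map(() => buildRegTree(bootstrap(tree_data), 0, maxDepth));
  // Ensemble prediction: mean of the B individual tree predictions.
  const predict = x => d3.mean(trees, t => predictTree(t, x));
  return d3.range(0, 6.001, 0.02).map(x => ({ x, y: predict(x) }));
}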
ens_var = (rho, B) => rho + (1 - rho) / B; // variance of the B-tree average, in units of σ² (= 1)
var_curves = {
const rows = [];
for (let B = 1; B <= 200; B++) {
rows.push({ B, v: ens_var(rho_a, B), kind: `Bagging (ρ=${rho_a.toFixed(2)})` });
rows.push({ B, v: ens_var(rho_b, B), kind: `Random forest (ρ=${rho_b.toFixed(2)})` });
}
return rows;
}
Plot.plot({
width: 1000, height: 700,
x: { domain: [1, 200], label: "Number of trees B" },
y: { domain: [0, 1], label: "Ensemble variance (σ²=1)" },
color: { legend: true },
marks: [
Plot.ruleY([rho_a], { stroke: "#d62728", strokeDasharray: "3,3" }),
Plot.ruleY([rho_b], { stroke: "#1f77b4", strokeDasharray: "3,3" }),
Plot.line(var_curves, { x: "B", y: "v", stroke: "kind", strokeWidth: 2.5 })
]
})

Impurity (Gini/MDI) importance: summed training-set impurity reductions per feature. It is cheap, but biased toward high-cardinality and continuous features, and it says nothing about held-out predictive value.
Better alternatives: permutation importance (shuffle one feature on held-out data and measure the performance drop; sketch below) and SHAP values.
For a materials paper claiming “feature X drives the property,” report permutation or SHAP, not raw impurity importance.
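A minimal permutation-importance sketch, assuming a hypothetical multi-feature dataset rows (objects with feature fields and a label y) and a fitted model.predict(row); neither name comes from the lecture code:

// Permutation importance: RMSE increase when one feature column is shuffled.
function permutationImportance(model, rows, feature) {
  const rmse = data =>
    Math.sqrt(d3.mean(data, d => (d.y - model.predict(d)) ** 2));
  const base = rmse(rows);
  const shuffledVals = d3.shuffle(rows.map(d => d[feature]));
  const permuted = rows.map((d, i) => ({ ...d, [feature]: shuffledVals[i] }));
  return rmse(permuted) - base; // large increase => the model relies on this feature
}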
Bagging averages independent learners; boosting composes dependent ones.
For loss \(\mathcal{L}\), at iteration \(t\): fit a small tree \(h_t\) to the pseudo-residuals (the negative gradient of the loss at the current fit), then take a shrunken step,
\[ r_i^{(t)} = -\left.\frac{\partial \mathcal{L}(y_i, f)}{\partial f}\right|_{f=\hat f_{t-1}(x_i)}, \qquad \hat f_t(x) = \hat f_{t-1}(x) + \eta\, h_t(x). \]
\(\eta\) = learning rate (shrinkage). Small \(\eta\) + many trees generalizes better than large \(\eta\) + few.
Squared error \(\mathcal{L}=\tfrac12 (y-\hat f)^2\): the pseudo-residual is the ordinary residual \(r_i = y_i - \hat f(x_i)\), so each tree literally fits what is left over.
Logistic / other losses: the same recipe applies with the corresponding negative gradient (e.g. \(r_i = y_i - p_i\) for log loss with predicted probability \(p_i\)).
Boosting can and will overfit: it drives training loss toward zero. Controls: small learning rate \(\eta\), shallow trees, row/feature subsampling, and early stopping on a validation set (early-stopping sketch below).
gb_all = {
  // Train and validation sets from the same noisy sin(2x) + 0.3x target,
  // with validation x values offset by half a grid step.
  const n = 70;
  const rndTr = d3.randomNormal(0, 0.3);
  const rndVa = d3.randomNormal(0, 0.3);
  const tr = [], va = [];
  for (let i = 0; i < n; i++) {
    const x = (i / (n - 1)) * 6;
    tr.push({ x, y: Math.sin(2 * x) + 0.3 * x + rndTr() });
    const xv = ((i + 0.5) / n) * 6;
    va.push({ x: xv, y: Math.sin(2 * xv) + 0.3 * xv + rndVa() });
  }
  return { tr, va };
}
bestStump = function (rows) {
  // Depth-1 regression tree on the residual field r: one threshold, two leaf means.
  const xs = Array.from(new Set(rows.map(d => d.x))).sort((a, b) => a - b);
  let best = null;
  for (let i = 1; i < xs.length; i++) {
    const thr = (xs[i - 1] + xs[i]) / 2;
    const L = rows.filter(d => d.x <= thr);
    const R = rows.filter(d => d.x > thr);
    if (!L.length || !R.length) continue;
    const cL = d3.mean(L, d => d.r), cR = d3.mean(R, d => d.r);
    let sse = 0;
    for (const d of L) sse += (d.r - cL) ** 2;
    for (const d of R) sse += (d.r - cR) ** 2;
    if (best === null || sse < best.sse) best = { thr, cL, cR, sse };
  }
  return best;
}
gb_curves = {
  const { tr, va } = gb_all;
  const stumps = [];
  let predTr = tr.map(() => 0); // boosting starts from the zero function
  let predVa = va.map(() => 0);
  const rmse = (arr, pred) => Math.sqrt(d3.mean(arr.map((d, i) => (d.y - pred[i]) ** 2)));
  const rows = [];
  for (let t = 1; t <= gb_T; t++) {
    // Pseudo-residuals for squared error are the plain residuals y - f̂.
    const resid = tr.map((d, i) => ({ x: d.x, r: d.y - predTr[i] }));
    const s = bestStump(resid);
    if (!s) break;
    stumps.push(s);
    // Shrunken update: f̂ ← f̂ + η · stump(x).
    predTr = tr.map((d, i) => predTr[i] + gb_eta * (d.x <= s.thr ? s.cL : s.cR));
    predVa = va.map((d, i) => predVa[i] + gb_eta * (d.x <= s.thr ? s.cL : s.cR));
    rows.push({ t, e: rmse(tr, predTr), kind: "Train RMSE" });
    rows.push({ t, e: rmse(va, predVa), kind: "Validation RMSE" });
  }
  return rows;
}
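A minimal early-stopping sketch on top of gb_curves: pick the iteration where validation RMSE bottoms out (the cell name bestIter is ours):

bestIter = {
  // Early stopping: the {t, e} row with the lowest validation RMSE.
  const val = gb_curves.filter(d => d.kind === "Validation RMSE");
  return val.reduce((best, d) => (d.e < best.e ? d : best), val[0]);
}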
Plot.plot({
  width: 820, height: 430,
  x: { domain: [1, gb_T], label: "Boosting iteration" },
  y: { domain: [0, 1.2], label: "RMSE" },
  color: { legend: true },
  marks: [
    Plot.line(gb_curves, { x: "t", y: "e", stroke: "kind", strokeWidth: 2.5 })
  ]
})

This is why XGBoost, not plain gradient boosting, is the general-purpose tabular workhorse, and the most documented one: it builds these controls in (regularized objective, shrinkage, early stopping).
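For reference, a sketch of what "regularized" means here (following Chen & Guestrin 2016; the notation is ours): each tree \(h_t\) with \(T\) leaves and leaf weights \(w_j\) pays an explicit complexity penalty,
\[ \mathcal{L}_{\text{reg}} = \sum_i \mathcal{L}\big(y_i, \hat f(x_i)\big) + \sum_t \Omega(h_t), \qquad \Omega(h) = \gamma T + \tfrac{\lambda}{2} \sum_{j=1}^{T} w_j^2. \]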
Three ideas, all aimed at categorical-heavy, modest-size tabular data (Prokhorenkova et al. 2018):
- Ordered target statistics: encode categorical features by target averages computed only from "earlier" rows in a random permutation, avoiding target leakage.
- Ordered boosting: compute each pseudo-residual with a model not fitted on that row, countering prediction shift.
- Oblivious (symmetric) trees: the same split at every node of a level, giving fast, well-regularized base learners.
For the typical materials problem, start with CatBoost. Reach for XGBoost as the general-purpose workhorse, LightGBM when \(N>10^6\).
A minimal tuning recipe (parameter sketch below):
- Cap tree complexity: a num_leaves cap and/or a larger min_data_in_leaf.
- Fix a small learning rate; let early stopping on a validation set choose n_estimators.
- Refit on the full training data with the chosen n_estimators for the final model.

This recipe is mainly for XGBoost/LightGBM. CatBoost is usually near-optimal at its defaults: fit it first, only tune if it underperforms.
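An illustrative starting configuration as a plain JS object: num_leaves, min_data_in_leaf, and learning_rate are documented LightGBM options, n_estimators follows the scikit-learn wrapper naming; the values are assumptions, not recommendations from the lecture:

// Hypothetical starting point for the recipe above.
const params = {
  num_leaves: 31,          // complexity cap
  min_data_in_leaf: 50,    // raise on small/noisy data
  learning_rate: 0.05,     // small η; compensate with more trees
  n_estimators: 5000       // generous upper bound; early stopping picks the real value
};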
Trees / boosting win when: data is tabular with heterogeneous, hand-engineered features, \(N\) is modest, and you need strong results with little tuning.
Neural networks win when: inputs are raw signals (images, spectra, text, structures), \(N\) is large, or you need representation learning and transfer.
For most materials projects with tabular features: try gradient boosting first.
Global (whole model): permutation importance; partial dependence plots (PDP) showing the average effect of one feature (sketch below).
Local (one prediction): SHAP values, which split an individual prediction into additive per-feature contributions.
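A minimal partial-dependence sketch, again assuming the hypothetical multi-feature rows dataset and fitted model.predict(row) from the permutation-importance cell:

// Partial dependence of `feature`: average prediction with that feature
// clamped to each grid value while all other features keep their data values.
function partialDependence(model, rows, feature, grid) {
  return grid.map(v => ({
    [feature]: v,
    pd: d3.mean(rows, d => model.predict({ ...d, [feature]: v }))
  }));
}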
Ensembles trade a single tree’s readability for accuracy — SHAP/PDP buy back the interpretability that materials science and regulators require.
Typical pattern: tree ensembles add 10–20 points of \(R^2\) over linear models at essentially no engineering cost.
| Method | Bias | Variance | Mechanism |
|---|---|---|---|
| Single deep tree | low | high | flexible greedy partition |
| Bagging | low | reduced | average bootstrap trees |
| Random Forest | low | strongly reduced | + decorrelate via feature subsets |
| Gradient Boosting | reduced | medium | sequentially fit residuals |
| XGBoost (regularized) | reduced | controlled | + leaf penalty, shrinkage, early stop |
Week 8: Tree Ensembles — RF & XGBoost on alloy regression

© Philipp Pelz - Mathematical Foundations of AI & ML