Regularization in gradient boosted regression trees is applied to the leaf values, not to the feature coefficients as in lasso/ridge regression. For this blog, I will break down the explanation into three steps:
- Lasso & Ridge Regression
  - A brief recap of lasso and ridge regression
- Gradient Boosted Regression Trees
  - Leaf values in boosted regression trees
- L1 & L2 in XGBoost
  - Adding penalties for residual leaves
Lasso & Ridge Regression
Please recall that lasso and ridge regression apply an additional penalty term to the loss function. (A standard loss function for regression is the squared error, and I'll be using it throughout this blog.) The regularization's objective is to counter overfitting by lowering variance at the cost of some added bias. Lasso (L1) adds the sum of the absolute values of the beta coefficients, and Ridge (L2) adds the sum of the squared beta coefficients.
In linear regression, the predicted output is the sum of each feature coefficient (beta) multiplied by its X input variable.
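To make that concrete, here is one common way to write the prediction and the two penalized losses, using alpha for the L1 strength and lambda for the L2 strength (other texts use different symbols):

```latex
\hat{y}_i = \sum_{j} \beta_j x_{ij}

L_{\text{lasso}} = \sum_{i} (y_i - \hat{y}_i)^2 + \alpha \sum_{j} |\beta_j|

L_{\text{ridge}} = \sum_{i} (y_i - \hat{y}_i)^2 + \lambda \sum_{j} \beta_j^2
```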
Gradient Boosted Regression Trees
Contrary to linear regression, the output of a boosted regression tree model is a linear (additive) combination of M trees' leaf scores. The model is initialized with a tree containing the mean of the dependent variable. The leaves of the later trees are residual (actual minus predicted) estimations, where 'predicted' is the sum of the previous trees' outputs. Therefore, if the boosted model consisted of 100 trees, the predicted output would be the mean plus 99 residual estimations.
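Written out (and ignoring the learning rate, which scales each tree's contribution), the prediction from an M-tree boosted model is roughly:

```latex
\hat{y}_i = f_0 + \sum_{m=1}^{M-1} f_m(x_i), \qquad f_0 = \bar{y}
```

Here f_0 is the initial mean and each f_m(x_i) is the leaf score (residual estimate) observation i falls into in tree m.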
Like other decision trees, splits are chosen based on the gain (reduction in loss) they provide, and the leaf values (residual estimations) are calculated so that the loss is minimized for each value added.
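As a quick sanity check before adding penalties: with plain squared error and no regularization, the loss-minimizing value for a leaf containing residuals r_1, ..., r_n is simply their mean, since

```latex
\frac{\partial}{\partial w} \sum_{i=1}^{n} (r_i - w)^2 = -2 \sum_{i=1}^{n} (r_i - w) = 0
\;\;\Rightarrow\;\;
w^{*} = \frac{1}{n} \sum_{i=1}^{n} r_i
```

The L1 and L2 terms below modify this value.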
L1 & L2 in XGBoost
Unlike linear regression, I need to solve for the leaf values rather than the coefficients of the features. Also, XGBoost uses a second-order Taylor approximation of the loss (here, the sum of squared errors). Below, I will solve for O_value (the leaf value) by taking partial derivatives to see how the L1 and L2 terms penalize regression trees.
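For reference, the approximate objective for a single leaf with output value O_value can be written as below, where g_i and h_i are the first and second derivatives of the loss with respect to the previous prediction, lambda is the L2 parameter, and alpha is the L1 parameter (I'm omitting the per-tree complexity term to keep the focus on the leaf value):

```latex
\text{Obj}(O_{\text{value}}) \approx \sum_{i \in \text{leaf}} \left[ g_i \, O_{\text{value}} + \tfrac{1}{2} h_i \, O_{\text{value}}^2 \right] + \tfrac{1}{2} \lambda \, O_{\text{value}}^2 + \alpha \, |O_{\text{value}}|
```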
To find the output value (O_value) with minimal loss, I take the partial derivative of the loss with respect to the output value and set that derivative equal to 0. After solving for the output value, I then work out g and h (the first and second derivatives of the squared-error loss) and substitute them in.
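Carrying that through (this is a sketch of the derivation, not XGBoost's exact source code): setting the derivative to zero and handling the |O_value| term piecewise gives a soft-thresholded solution. For squared error, g_i = -(y_i - predicted_i) = -r_i and h_i = 1, so with n residuals in the leaf:

```latex
O_{\text{value}} =
\begin{cases}
\dfrac{\sum_i r_i - \alpha}{n + \lambda} & \text{if } \sum_i r_i > \alpha \\[2ex]
\dfrac{\sum_i r_i + \alpha}{n + \lambda} & \text{if } \sum_i r_i < -\alpha \\[2ex]
0 & \text{otherwise}
\end{cases}
```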
To put it simply: if the sum of the residuals is greater than alpha (the L1 parameter), the numerator of the output value decreases by alpha. If the sum of the residuals is less than negative alpha, the numerator increases by alpha. In both cases, the numerator is adjusted toward 0. If the sum of the residuals is between negative alpha and alpha, the output value is 0. Therefore, L1 regularization decreases the output value and promotes sparsity by forcing some output values to be exactly 0.
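Here is a minimal Python sketch of that piecewise rule, assuming squared-error loss (so the hessian per point is 1); `leaf_output` is a name I'm using for illustration, not an XGBoost function:

```python
import numpy as np

def leaf_output(residuals, alpha=0.0, lam=0.0):
    """Soft-thresholded leaf value for squared-error loss.

    residuals: array of (actual - predicted) values falling in the leaf
    alpha:     L1 penalty (reg_alpha in XGBoost)
    lam:       L2 penalty (reg_lambda in XGBoost)
    """
    r_sum = np.sum(residuals)
    n = len(residuals)
    if r_sum > alpha:            # numerator shrunk toward 0 by alpha
        numerator = r_sum - alpha
    elif r_sum < -alpha:         # numerator pushed up toward 0 by alpha
        numerator = r_sum + alpha
    else:                        # |sum of residuals| <= alpha -> leaf is zeroed out
        return 0.0
    return numerator / (n + lam)

# The same residuals under increasing regularization
residuals = np.array([1.0, 2.0, -0.5])          # sums to 2.5
print(leaf_output(residuals))                   # 2.5 / 3 ~ 0.833 (no penalty)
print(leaf_output(residuals, alpha=1.0))        # 1.5 / 3 = 0.5   (L1 shrinks the numerator)
print(leaf_output(residuals, lam=2.0))          # 2.5 / 5 = 0.5   (L2 grows the denominator)
print(leaf_output(residuals, alpha=3.0))        # 0.0             (sparsity: |2.5| <= 3)
```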
The lambda (L2 parameter) decreases the output value in a smoother way: it increases the denominator, which shrinks the leaf value toward 0 without forcing it to be exactly 0.
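In practice, these are the knobs you turn when you set `reg_alpha` (L1) and `reg_lambda` (L2) on an XGBoost model. A quick example with toy data (the specific parameter values are placeholders for illustration, not recommendations):

```python
import numpy as np
from xgboost import XGBRegressor

# Toy data just to show where the parameters go
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] * 3.0 + rng.normal(scale=0.5, size=500)

model = XGBRegressor(
    n_estimators=100,
    reg_alpha=1.0,    # L1: soft-thresholds leaf values, can zero some leaves out
    reg_lambda=5.0,   # L2: shrinks leaf values by inflating the denominator
)
model.fit(X, y)
print(model.predict(X[:3]))
```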