
Extreme Gradient Boosting (XGBoost)


XGBoost stands for eXtreme Gradient Boosting. Compared with traditional gradient boosting, it introduces several significant improvements.

Newton Boosting

At the $t$-th step, we want to minimize the objective

\[\text{Obj}^{(t)} = \sum_{i=1}^{n} L(y_i, f^{(t-1)}(x_i)+f_t(x_i)) + \Omega(f^{(t-1)}) + \Omega(f_t).\]

In the formula above, \(f^{(t-1)}(x)\) is the sum of the first \(t-1\) weak learners, i.e. \(f^{(t-1)}(x) = f^{(0)}(x) + \sum_{s=1}^{t-1} f_s(x)\), and \(\Omega(f_t)\) is the regularization term which measures the complexity of \(f_t\).

Then we find the regression tree \(f_t\) that minimizes the objective function:

\[f_t = \operatorname*{argmin}_{f_t} \text{Obj}^{(t)}.\]

XGBoost uses Newton’s method to minimize the objective function iteratively. As shown in my earlier post, we use a second-order Taylor expansion to approximate the objective function. Recall the Taylor expansion: \(f(x+\Delta x) \approx f(x)+f'(x)\Delta x + \frac{1}{2} f''(x)\Delta x^2\). Applying it to \(L(y_i, f^{(t-1)}(x_i)+f_t(x_i))\), we have

\[\begin{align*} \text{Obj}^{(t)} &= \sum_{i=1}^{n} L\big(y_i, f^{(t-1)}(x_i) + f_t(x_i)\big) + \Omega(f^{(t-1)}) + \Omega(f_{t}) \\ & \approx \sum_{i=1}^{n} \left[L(y_i, f^{(t-1)}(x_i)) + g_{it} f_t(x_i) + \frac{1}{2}h_{it}f_t^2(x_i)\right] + \Omega(f^{(t-1)}) + \Omega(f_{t}), \end{align*}\]

where

\[g_{it} = \frac{\partial L(y_i,f^{(t-1)}(x_i))}{\partial f^{(t-1)}(x_i)},\quad h_{it} = \frac{\partial^2 L(y_i,f^{(t-1)}(x_i))}{\partial {f^{(t-1)}}^2(x_i)}.\]
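
For example, for the squared-error loss \(L(y_i,\hat{y}_i) = \frac{1}{2}(y_i-\hat{y}_i)^2\), these derivatives are \(g_{it} = f^{(t-1)}(x_i) - y_i\) and \(h_{it} = 1\); for the logistic loss in binary classification, \(g_{it} = p_i - y_i\) and \(h_{it} = p_i(1-p_i)\), where \(p_i = \sigma(f^{(t-1)}(x_i))\) is the currently predicted probability.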

We can remove the constant terms \(L(y_i, f^{(t-1)}(x_i))\) and \(\Omega(f^{(t-1)})\), which do not depend on \(f_t\), to obtain the following simplified objective at step \(t\):

\[\tilde{\text{Obj}}^{(t)} := \sum_{i=1}^{n} \left[g_{it} f_t(x_i) + \frac{1}{2}h_{it}f_t^2(x_i)\right] + \Omega(f_{t}).\]

In the following we omit the subscript \(t\) of \(g_{it}\) and \(h_{it}\), as \(g_i\) and \(h_i\), for simplicity.

Here is a brief description of the XGBoost algorithm:

[1] Initialize \(f^{(0)} = \operatorname*{argmin}_{f} \sum_{i=1}^{n} L(y_i, f(x_i)) + \Omega(f)\).

[2] For $t = 1,…,M$, execute (a) and (b):

(a) \(f_t = \operatorname*{argmin}_{f_t} \tilde{\text{Obj}}^{(t)} .\)

(b) \(f^{(t)}(x) = f^{(t-1)}(x) + \eta f_t(x)\).

[3] Output $\hat{f}(x) = f^{(M)}(x)$.

Here $\eta$ is called the step-size or learning rate, and it is usually set around $0.1$.

If there is no regularization term, Newton’s method gives the per-instance update \(f^{(t)}(x_i) = f^{(t-1)}(x_i) - \eta\frac{g_{i}}{h_{i}}\), so we can simply fit a regression tree \(f_t\) to the targets \(-\frac{g_{i}}{h_{i}}\) at each step \(t\).
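
As a concrete illustration of this unregularized case, here is a minimal Python sketch of Newton boosting for binary classification with the logistic loss, fitting each tree to the targets \(-g_i/h_i\). The helper names and the use of scikit-learn’s DecisionTreeRegressor are illustrative choices for this post, not XGBoost’s actual implementation; weighting the fit by \(h_i\) makes each leaf output \(-\sum_{i \in I_j} g_i / \sum_{i \in I_j} h_i\), the Newton step for that leaf.

```python
# Minimal Newton-boosting sketch (no regularization): f^(t) = f^(t-1) + eta * f_t,
# where each tree is fit to the targets -g_i/h_i. Illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_boost(X, y, n_rounds=100, eta=0.1, max_depth=3):
    """Newton boosting with the logistic loss; returns the fitted trees."""
    y = np.asarray(y, dtype=float)
    raw = np.zeros(len(y))            # f^(0)(x) = 0 (raw score, i.e. log-odds)
    trees = []
    for t in range(n_rounds):
        p = sigmoid(raw)              # current predicted probabilities
        g = p - y                     # first derivative of the logistic loss
        h = p * (1.0 - p)             # second derivative of the logistic loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        # Weighting by h makes each leaf predict -sum(g)/sum(h), the Newton step.
        tree.fit(X, -g / np.maximum(h, 1e-12), sample_weight=h)
        raw += eta * tree.predict(X)
        trees.append(tree)
    return trees
```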

Boosting with Trees

Model Complexity

We refine the definition of a tree \(f_t\) as

\[f_t(x)=w_{q_t(x)}, \quad w \in \mathbb{R}^{T_t}.\]

Here \(w \in \mathbb{R}^{T_t}\) is the vector of scores on the leaves, \(q_t: \mathbb{R}^p \to \{1,2,\cdots, T_t\}\) is a function assigning each data point to its leaf, and \(T_t\) is the number of leaves in \(f_t\).

In the following we omit the subscript \(t\) of \(q_t(x)\) and \(T_t\), as \(q(x)\) and \(T\), for simplicity.

In terms of the objective function, XGBoost adds a regularization term, defined as

\[\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2,\]

where $w_j$ is the prediction score of leaf $j$, and $\sum_{j=1}^{T} w_j^2$ is the squared L2 norm of the leaf scores.
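
As a quick illustration with made-up numbers: a tree with \(T=3\) leaves and leaf scores \(w=(0.5, -0.3, 0.2)\), under \(\gamma=1\) and \(\lambda=1\), has

\[\Omega(f_t) = 3\gamma + \frac{\lambda}{2}\left(0.5^2 + 0.3^2 + 0.2^2\right) = 3 + 0.19 = 3.19,\]

so both growing more leaves and assigning more extreme leaf scores increase the penalty.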

Why penalize $\sum_{j=1}^{T} w_j^2$ to control the complexity? A large value of $w_j$ corresponds to a leaf that gives a very large, significant update to the previous model, whereas the idea of gradient boosting is to reduce the bias of the model carefully and slowly by adding trees one at a time.

The Structure Score

Here is the magical part of the derivation.

Denote \(I_j = \{i \mid q(x_i)=j\}\) as the set of indices of data points assigned to the \(j\)-th leaf.

After re-formulating the tree model, since all the data points on the same leaf get the same score, we can rewrite

\[\sum_{i=1}^n g_{i}w_{q(x_i)} = \sum_{j=1}^T \sum_{i \in I_j} g_{i}w_{q(x_i)} = \sum_{j=1}^T \left(\sum_{i \in I_j} g_{i}\right) w_j, \\ \sum_{i=1}^n h_{i}w^2_{q(x_i)} = \sum_{j=1}^T \sum_{i \in I_j} h_{i}w^2_{q(x_i)} = \sum_{j=1}^T \left(\sum_{i \in I_j} h_{i}\right) w^2_j.\]

Then we can write the objective value with the \(t\)-th tree as

\[\begin{align*} \tilde{\text{Obj}}^{(t)} &= \sum_{i=1}^{n} \left[ g_{i} w_{q(x_i)} + \frac{1}{2} h_{i} w^2_{q(x_i)} \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \\ &= \sum_{j=1}^T \left[ \left(\sum_{i \in I_j} g_{i}\right) w_j + \frac{1}{2} \left(\sum_{i \in I_j} h_{i} + \lambda\right) w^2_j \right] + \gamma T. \end{align*}\]

We can further compress the expression by defining \(G_j = \sum_{i\in I_j} g_{i}\) and \(H_j = \sum_{i\in I_j} h_{i}\):

\[\tilde{\text{Obj}}^{(t)} = \sum_{j=1}^T \left[ G_j w_j + \frac{1}{2} \left(H_j + \lambda\right) w^2_j \right] + \gamma T.\]

In this equation the \(w_j\)’s are independent of each other. We can get the best \(w_j\) by taking the derivative of \(\tilde{\text{Obj}}^{(t)}\) with respect to \(w_j\) and setting it to $0$, i.e. \(\frac{\partial \tilde{\text{Obj}}^{(t)}}{\partial w_j} = 0\). Then we have

\[w_j^\ast = -\frac{G_j}{H_j+\lambda},\\ \text{Obj}^{(t)\ast} = -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j+\lambda} + \gamma T.\]

Here \(\text{Obj}^{(t)\ast}\) is called the structure score, which measures how good a tree structure \(q(x)\) is. The smaller the score is, the better the structure is. This score \(\text{Obj}^{(t)\ast}\) is like the impurity measure in a decision tree, except that it also takes the model complexity into account.
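
Once the per-leaf sums \(G_j\) and \(H_j\) are available, the optimal leaf scores and the structure score are straightforward to compute. The following Python helper is a small sketch under that assumption (the function name and inputs are illustrative, not part of the xgboost library):

```python
import numpy as np

def leaf_weights_and_structure_score(G, H, lam=1.0, gamma=0.0):
    """Given per-leaf sums G_j and H_j (arrays of length T), return the optimal
    leaf scores w_j* = -G_j/(H_j + lambda) and the structure score
    -1/2 * sum_j G_j^2/(H_j + lambda) + gamma * T."""
    G = np.asarray(G, dtype=float)
    H = np.asarray(H, dtype=float)
    w_star = -G / (H + lam)
    score = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * len(G)
    return w_star, score
```

For instance, `leaf_weights_and_structure_score([2.0, -1.0], [3.0, 2.0], lam=1.0)` gives \(w^\ast = (-0.5, \tfrac{1}{3})\) and a structure score of \(-\frac{1}{2}\left(\frac{4}{4}+\frac{1}{3}\right) = -\frac{2}{3}\).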

Node Splitting Criteria

The score after splitting is \(-\frac{1}{2} \left( \frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} \right) + 2\gamma\), where the subscript $L$ refers to the left child node and \(R\) refers to the right child node. The score before splitting is \(-\frac{1}{2} \frac{(G_L+G_R)^2}{H_L+H_R+\lambda} + \gamma\). We use the score gain as the node splitting criterion:

\[\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda} \right] - \gamma.\]

We select the splitting point that maximizes the score gain. We can see an important fact here: if the reduction in loss (the bracketed term) is smaller than \(\gamma\), i.e. the gain is negative, we would do better not to add that branch. This is exactly the pruning technique in tree-based models.
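
This formula translates directly into the exact greedy split search: within a node, sort the instances by a feature, sweep from left to right accumulating \(G_L\) and \(H_L\), and evaluate the gain at each candidate threshold. Below is a rough Python sketch of that procedure for a single feature (the function name is illustrative):

```python
import numpy as np

def best_split_gain(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy split search on one feature within a node.
    x: feature values; g, h: per-instance gradients and Hessians.
    Returns (best_gain, best_threshold)."""
    x, g, h = (np.asarray(a, dtype=float) for a in (x, g, h))
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(x) - 1):
        GL += g[i]
        HL += h[i]
        if x[i] == x[i + 1]:          # cannot split between identical values
            continue
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2.0
    return best_gain, best_thr
```

The actual library also offers approximate and histogram-based variants of this search for large data, but the gain evaluated at each candidate split is the same quantity.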

Summary

XGBoost algorithm:

[1] Set the hyper-parameters: the step-size \(\eta\) and the regularization strengths \(\gamma\) and \(\lambda\).

[2] Initialize a tree \(f^{(0)}(x)\) that fits the original data, i.e. \(f^{(0)} = \operatorname*{argmin}_{f^{(0)}} \sum_{i=1}^n L(y_i,f^{(0)}(x_i)) + \Omega(f^{(0)})\).

[3] For $t = 1,…,M$, execute following steps to find \(f_t = \operatorname*{argmin}_{f_t} \tilde{\text{Obj}}^{(t)}\) and update \(f^{(t)}(x)\):

(a) Calculate the derivatives \(g_{i} = \frac{\partial L(y_i,f^{(t-1)}(x_i))}{\partial f^{(t-1)}(x_i)},\ h_{i} = \frac{\partial^2 L(y_i,f^{(t-1)}(x_i))}{\partial {f^{(t-1)}}^2(x_i)}\) for \(i=1,2,\cdots,n\).

(b) Build a regression tree with the node splitting criteria \(\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda} \right] - \gamma\), where \(G_j = \sum_{i\in I_j} g_{i}\) and \(H_j = \sum_{i\in I_j} h_{i}\).

(c) Set the leaf node prediction scores as \(w_j^\ast = -\frac{G_j}{H_j+\lambda}\).

(d) Update \(f^{(t)}(x) = f^{(t-1)}(x) + \eta f_t(x)\).

[4] Output $\hat{f}(x) = f^{(M)}(x)$.


Besides the regularization term and the learning rate, XGBoost also applies column (feature) subsampling to prevent overfitting, as in random forests: column subsampling reduces the correlation between the weak learners and thus reduces the variance of the whole model.
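
In the xgboost library these quantities correspond to hyperparameters: \(\eta\) is `learning_rate` (alias `eta`), \(\lambda\) is `reg_lambda`, \(\gamma\) is `gamma` (alias `min_split_loss`), and column subsampling is controlled by `colsample_bytree` and related parameters. A minimal configuration sketch using the scikit-learn wrapper (the values here are arbitrary placeholders, not recommendations):

```python
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=100,       # M, the number of boosting rounds
    learning_rate=0.1,      # eta, the step-size
    reg_lambda=1.0,         # lambda, L2 penalty on leaf scores
    gamma=0.0,              # gamma, minimum gain required to make a split
    colsample_bytree=0.8,   # column (feature) subsampling per tree
    max_depth=6,            # depth limit of each weak learner
)
# model.fit(X_train, y_train); model.predict(X_test)
```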

XGBoost has some advantages over GBDT:

  • Regularization term to reduce overfitting;
  • Second-order Taylor expansion of the objective function instead of only the first order;
  • A user-defined loss function is allowed, provided it is twice differentiable (see the sketch after this list);
  • A choice of weak learner $f_t(\cdot)$, such as a tree or a linear model.
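
To illustrate the custom-loss point above: xgboost’s training API accepts a user-supplied objective that returns the per-instance gradients \(g_i\) and Hessians \(h_i\). Below is a minimal sketch using the squared-error loss, which xgboost already ships as `reg:squarederror`; it is shown only to demonstrate the `(grad, hess)` interface, and the callable name is my own.

```python
import numpy as np
import xgboost as xgb

def squared_error_obj(preds, dtrain):
    """Return g_i = dL/df and h_i = d^2L/df^2 for L = 1/2 (y - f)^2."""
    y = dtrain.get_label()
    grad = preds - y                 # g_i
    hess = np.ones_like(preds)       # h_i
    return grad, hess

# Usage sketch (assumes X_train, y_train exist):
# dtrain = xgb.DMatrix(X_train, label=y_train)
# booster = xgb.train({"eta": 0.1, "lambda": 1.0}, dtrain,
#                     num_boost_round=100, obj=squared_error_obj)
```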


References:

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).

“Introduction to Boosted Trees.” XGBoost Documentation, https://xgboost.readthedocs.io/en/latest/tutorials/model.html.


