8 minute read

Derivation of XGBoost

Consider a dataset of \(n\) observations \(\mathcal{D} = \{ (x_i, y_i) \}_{i=1}^n\) where \(x_i \in \mathbb{R}^d\) are the observed features and \(y_i \in \mathbb{R}\) are the targets.

An additive tree model produces a prediction by summing the predictions of \(K\) trees:

\[\hat{y_i} = \sum_{k=1}^K f_k(x_i) \\ f(x) = w_{q(x)} \quad q : \mathbb{R}^T \rightarrow T \quad w \in \mathbb{R}^T\]

Where \(w\) is a vector of \(T\) leaf scores and \(q(x)\) maps data to a leaf node index.

The objective used to learn the model consists of a loss function that measures the difference between predictions \(\hat{y_i}\) and targets \(y_i\) combined with a regularization term \(\Omega\) that penalizes model complexity:

\[\mathcal{L} = \sum_i l(y_i, \hat{y_i}) + \sum_k \Omega(f_k)\]

The regularization term for a single tree is defined as:

\[\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2\]

The model is learned in a greedy, additive fashion, learning a new tree at each iteration. The prediction and corresponding objective at iteration \(t\) is as follows:

\[\begin{align*} \hat{y}^{(t)} &= \hat{y}_i^{(t-1)} + f_t(x_i) \\ \mathcal{L}^{(t)} &= \sum_i l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) \end{align*}\]

The objective is approximated with 2nd order Taylor expansion:

\[\mathcal{L}^{(t)} \simeq \sum_i \Bigl[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Bigr] + \Omega(f_k) \\ g_i = \frac{\partial l(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}} \quad h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}}\]

Remove constants with respect to the tree being learned \(f_t\):

\[\tilde{\mathcal{L}}^{(t)} = \sum_i \Bigl[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Bigr] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2\]

Next, let’s define \(I_j = \{ i \lvert q(x_i) = j \}\) as the set of instance indices belonging to leaf node \(j\) and rearrange the objective to loop over leaf nodes instead of instances:

\[\tilde{\mathcal{L}}^{(t)} = \sum_j \Bigl[ (\sum_{i \in I_j} g_i) w_j + \frac{1}{2} (\sum_{i \in I_j} h_i + \lambda) w_j^2 \Bigr] + \gamma T\]

From here, we want to solve for optimal leaf scores \(w_j\) that minimize the objective. The loss contributed by a particular leaf \(j\) is:

\[\mathcal{L}_j = (\sum_{i \in I_j} g_i) w_j + \frac{1}{2} (\sum_{i \in I_j} h_i + \lambda) w_j^2\]

And the optimal leaf score for that leaf is the value that minimizes that leaf’s contribution to the overall objective \(w_j^* = \arg \min_{w_j} \: \mathcal{L}_j\).

Set derivative equal to zero and solve for \(w_j^*\)

\[\frac{\partial \mathcal{L}_j}{\partial w_j^*} = 0 = \sum_{i \in I_j} g_i + (\sum_{i \in I_j} h_i + \lambda) w_j^* \\ w_j^* = - \frac{\sum_{i \in I_j} g_i} {\sum_{i \in I_j} h_i + \lambda}\]

We can plug in \(w_j^*\) into the objective to obtain a scoring function to measure the quality of a tree. This score is similar to an impurity measure, but more generalized:

\[\begin{align*} \tilde{\mathcal{L}}^{(t)} &= \sum_j \Bigl[ (\sum_{i \in I_j} g_i) w_j + \frac{1}{2} (\sum_{i \in I_j} h_i + \lambda) w_j^2 \Bigr] + \gamma T \\ &= \sum_j \Bigl[ (\sum_{i \in I_j} g_i) \Biggl(- \frac{\sum_{i \in I_j} g_i} {\sum_{i \in I_j} h_i + \lambda} \Biggr) + \frac{1}{2} (\sum_{i \in I_j} h_i + \lambda) {\Biggl(- \frac{\sum_{i \in I_j} g_i} {\sum_{i \in I_j} h_i + \lambda} \Biggr)}^2 \Bigr] + \gamma T \\ &= \sum_j \Bigl[ -\frac{(\sum_{i \in I_j} g_i)^2} {\sum_{i \in I_j} h_i + \lambda} + \frac{1}{2} \frac{(\sum_{i \in I_j} g_i)^2} {\sum_{i \in I_j} h_i + \lambda} \Bigr] + \gamma T \\ &= - \frac{1}{2} \sum_{j=1}^T \frac{(\sum_{i \in I_j} g_i)^2} {\sum_{i \in I_j} h_i + \lambda} + \gamma T \end{align*}\]

It is impossible to enumerate all possible tree structures, so we use a greedy algorithm that builds tree \(f_t\) by iteratively adding branches (i.e., chooses a feature and a corresponding split value) according to the magnitude of loss reduction:

\[\mathcal{L}_{split} = \mathcal{L}_{left} + \mathcal{L}_{right} - \mathcal{L}_{root}\]

XGBoost for Binary Classification

The equations above are general; we are interested in deriving them for the binary classifcation task. Equations of interest are the optimal leaf score and the tree quality score. For these, we first need gradient statistics \(g_i\) and \(h_i\) for a particular datapoint.

Shorten the notation for this section by dropping the \(i\) subscripts.

We will use negative log likelihood (a.k.a. cross entropy) as our loss function for binary classifcation

\[J = -[y log(p) + (1-y) log(1-p)]\]

The gradient statistics are the first and second derivatives of the loss function with respect to the previous iteration’s model out. An important note is that the model is learned in log odds space rather than probability space \(\hat{y} = log(\frac{p}{1-p})\). Also note that inversely, \(p = \frac{e^{\hat{y}}}{1+e^{\hat{y}}} = \frac{1}{1+e^{-\hat{y}}}\).

First, we must re-express the loss as a function of log odds:

\[\begin{align*} l(y, \hat{y}) &= -[y log(p) + (1-y) log(1-p)] \\ &= -y \hat{y} + log(1 + e^{\hat{y}}) \end{align*}\]

Now, take its derivative:

\[\begin{align*} g &= \frac{dJ}{d \hat{y}} \\ &= \frac{d}{d \hat{y}} (-y \hat{y}) + \frac{d}{d \hat{y}} log(1+e^{\hat{y}}) \\ &= -y + \frac{1}{1+e^{\hat{y}}}\frac{d}{d \hat{y}}(1 + e^{\hat{y}}) \\ &= -y + \frac{e^{\hat{y}}}{1+e^{\hat{y}}} \\ &= p - y \end{align*}\]

Now, calculate the second derivative:

\[\begin{align*} h &= \frac{d^2 J}{d \hat{y}^2} \\ &= \frac{d g}{d \hat{y}} \\ &= \frac{e^{\hat{y}}}{1 + e^{\hat{y}}} \frac{1}{1 + e^{\hat{y}}} \\ &= p (1 - p) \end{align*}\]

Tree quality score intuition

The tree quality score for binary classification can be expressed as:

\[\tilde{\mathcal{L}}^{(t)} = - \frac{1}{2} \sum_{j=1}^T \frac{(\sum_{i \in I_j} (p_i^{t-1} - y_i))^2} {\sum_{i \in I_j} p_i^{t-1} (1 - p_i^{t-1}) + \lambda} + \gamma T\]

When the algorithm is evaluating ways to split a parent node into two child nodes, it will choose the split that minimizes the above expression. That is, maximizing the numerator and minimizing the denominator

Gradient intuition

It will want to define the split such that the numerator (gradient) term \((\sum_{i \in I_j} (p_i^{t-1} - y_i))^2\) is maximized for a child node (technically, maximizing the sum of this term for both children). Given that the \(p_i\)’s are all somewhere \(\in [0, 1]\), this term is maximized when all instances in a node are of the same class (i.e., all \(y_i = 1\) or all \(y_i = 0\)). This makes intuitive sense because we know a decision tree should be built in such a way that leaf nodes become more and more pure.

Hessian intuition

Similarly, to minimize the denominator (Hessian), all \(p_i^{t-1}\)’s should be very near \(0\) or \(1\) (the max of this function is at \(p_i^{t-1} = 0.5\)). Such extreme instance probabilities can be considered “mature” since the previous iteration of the model was quite certain of its predictions. The instance-level Hessians are also sometimes viewed as “weights” – if the model is already quite certain of its prediction for a particular instance, that instance should not be given as much consideration for the remaining training iterations. If the model is already certain about a set of instances, the algorithm has no problem with blowing up the logits even further to minimize the loss. This behavior can lead to overfitting, however. The lambda \(\lambda\) term in the denominator is usually set to \(1\) prevents the term from exploding to a very large number (nearing infinity), and the xgboost package also supplies a min_child_weight hyperparameter that limits splitting to nodes that have a hessian sum greater than the specified value.

Optimal leaf score intuition

The optimal leaf score for leaf \(j\) in the binary classification setting can now be expressed as:

\[w_j^* = - \frac{\sum_{i \in I_j} (p_i^{t-1} - y_i)} {\sum_{i \in I_j} p_i^{t-1} (1 - p_i^{t-1}) + \lambda}\]

Remember that \(w_j^*\) is a logit \(\in (-\inf, \inf)\), not a probability.

The numerator (gradient) is basically saying to set the score to the average residual of the previous model’s predictions on that set of instances, which makes intuitive sense.

The denominator (hessian) is saying to blow this score up if the previous model was already certain about these instances, or step a little more carefully if it was uncertain.