Recall that we use step functions, where the range of predictor $X$ is broken into several bins and in each bin the average of the observed response, certainly a constant, is used as the prediction. Step functions can be viewed as a special case of tree-based methods, the latter of which could be applied to multiple predictors.
Decision Trees
A Simple Example
The idea of a decision tree is quite straight and here a simple example is given. Suppose we have to predictors $X_1, X_2$ in a data set which we use to predict the response $Y$. A decision tree would produce some result like:
- For data in region $R_1 = \{ X \mid X_1 < s_1 \}$, the predicted response is $\hat y_{R_1}$;
- For data in region $R_2 = \{ X \mid X_1 \ge s_1, X_2 < s_2 \}$, the predicted response is $\hat y_{R_2}$;
- For data in region $R_3 = \{X \mid X_1 \ge s_1, X_2 \ge s_2 \}$, the predicted response is $\hat y_{R_3}$.
In this tree, there are two splits:
- For the full data set, there is a split at $X_1 = s_1$;
- For data satisfying $X_1 \ge s_1$, there is a split at $X_2 = s_2$.
These two splits are internal nodes of the tree. The three resultant regions $R_1, R_2, $ and $R_3$ are terminal nodes or leaves.
Process of Building a Decision Tree
In general, we build a decision tree in a process of two steps, described as below:
- We divide the predictor space into $J$ distinct and non-overlapping regions, $R_1, R_2, \dots, R_J$.
- For each observation that falls into the region $R_j$, we make the same prediction, which is the mean of the response values for the training observations in $R_j$.
In practice, for simplicity and for ease of interpretation, we divide the predictor space into high-dimensional rectangles, or boxes. (Notice that a certain predictor can act as the criterion of different splits. Hence, we can easily use a decision tree to construct step functions when there is only a single predictor.) How do we obtain the splits to construct the regions? With the aim of decreasing the bias of our model, we want to minimize the RSS: $$ \sum_{j=1}^J\sum_{i:x_i \in R_j} (y_i - \hat y _{R_j})^2 $$ The most naive idea is to try every possible region division that produces $J$ boxes and select the best one, which, however, faces the problem of computational feasibility. We instead take a top-down, greedy approach that is known as recursive binary splitting. The procedure is as below: Step 1: Initially we have the region $R$ of all observations in the training set. Step 2: For region $R$, we consider all predictors $X_1, \dots, X_p$ and all possible values of the cutpoint $s$ for each of the predictors, and choose the predictor and cutpoint such that the resulting tree has the lowest RSS. In other words, we wish to find $j$ and $s$ which define the pair of half-planes $$ R_1(j, s) = \{X\mid X_j <s\} \text{ and }R_2(j, s) = \{X\mid X_j \ge s\} $$ and which minimize $$ \sum _ {i: x _ i \in R _ 1(j, s)} (y _ i-\hat y _ {R _ 1})^2 +\sum _ {i: x _ i \in R _ 2(j, s)} (y _ i-\hat y _ {R _ 2})^2 $$ We then split $R$ into $R_1$ and $R_2$. Step 3: If some criterion is reached, for example if no regions contains more than 5 observations, go to the next step. Otherwise, from all the regions that we have now, select one to be $R$ and repeat step 2. Step 4: For all the $J$ regions that we have created, use the mean of training observations in the region as our prediction.
Tree Pruning
Using the aforementioned procedure, we may end up with a model with high variance due to the possible high complexity of the resultant tree. It is almost apparent, since we only considered reducing RSS (which indicates the bias) in our model, imposing no restriction on the variance. Now we need a smaller tree. One possible solution is to build the tree so long as the decrease in the RSS due to each split exceeds some (high) threshold. This strategy will yield smaller trees but is too short-sighted since (we are using a greedy algorithm and) a seemingly worthless split could bring about a very good split following. A better solution is to first grow a very large tree $T_0$ and then prune it back in order to obtain a subtree. How do we know what is the best way to prune a large tree? Again, we could consider a naive solution where every possible subtree is enumerated and its cross-validation error estimated. Too much computation. We use cost complexity pruning, which somewhat resembles Lasso, to find the best subtree. For each nonnegative value of the tuning parameter $\alpha$, it turns out that we can (quite easily) find a subtree $T \subset T _ 0$ such that $$ \sum _ {m=1}^{T}\sum_{i:x_i \in R_m} (y_i - \hat y_{R_m})^2 + \alpha T $$ is minimized, where $T$ represents the number of terminal nodes of $T$. Notice that in the above formula the number of terminal nodes, which indicates the variance of the model, is penalized. Thus, for a sequence of $\alpha$, we can accordingly produce a sequence of $T$ indexed by $\alpha$, and then select among the sequence the best $T$ that minimizes the cross-validation error.
Classification Trees
If the response is qualitative rather than quantitative, we make use of a classification tree. Compared with a regression tree, there are three major differences:
- In each region (terminal node of the tree), we choose the most commonly occurring class in the training observations as our prediction.
- When building a tree, we don’t use RSS as our criterion when deciding a split. Rather, we may make use of the following three alternatives (where $\hat p_{mk}$ represents the proportion of training observations in the $m$th region that are from the $k$th class):
- The classification error rate, $E = 1 - \max\limits_{k}(\hat p_{mk}) $. It turns out not to be sufficiently sensitive for tree-growing and hence is typically less preferable than the following two.
- The Gini index, $G = \sum_{k=1}^K \hat p_{mk}(1-\hat p_{mk})$, which is a measure of total variance across the $K$ classes. If in this region, one class has a proportion close to 1 and all others close to 0, the Gini index is quite small, representing the “purity” of data in this region. On such occasion, the classification is certainly preferred.
- The cross-entropy, $D = -\sum_{k=1}^K \hat p_{mk} \log \hat p_{mk}$, which behaves quite similarly to the Gini index. (It not hard to see that functions $f_1(x)= x(1-x)$ and $f_2(x) = -x \log x$ has similar shapes on $[0, 1]$.)
- When pruning the tree, we could still use the above three criteria. Here the classification error rate is preferred if prediction accuracy is important for the final tree.
When we use the Gini index or the cross-entropy when building a classification tree, we may end up with some splits yielding two terminal nodes that have the same prediction. Such splits, albeit seemingly useless in classification, can contribute to node purity. As a result, one terminal node may have more confidence in yielding the prediction than the other. When the predictor takes qualitative values, we don’t split it by a cutpoint but select a certain subset of it as the criteria.
Trees Versus Linear Models
One could outperform the other in different scenarios, in terms of test error. It depends on the particular problem which one we should choose. Advantages of trees may be preferred if the test error is not the only criterion taken into consideration.
Advantages and Disadvantages of Trees
Advantages:
- Ease of interpretation.
- Some people believe it better mirrors the human decision-making process.
- Ease of visualization.
- Trees can better handle qualitative predictors.
Disadvantages:
- Trees have lower predictive accuracy compared with other models. However, this can be improved dramatically by bagging, random forests, and boosting.
Bagging, Random Forests, and Boosting
Bagging
Recall that we could use pruning to reduce the variance of our prediction. However, bagging offers another different approach to achieve this effect based on what we know from (a simplified version of) the central limit theorem: if we have $n$ independent observations, $Z_1, \dots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar Z$ is reduced to $\sigma^2 / n$. In other words, averaging a set of observations reduces variance. One practical problem is that we often do not have the access to a set of training observations. We often only have one. Fortunately, we can generate multiple bootstrapped training sets from one training set and the problem gets solved. Based on each bootstrapped training set, we build a different tree. For any value $x$ we wish to predict its response, we average the responses of all trees: $$ \hat f_\text{bag}(x)= \frac 1 B \sum _ {b=1}^B \hat f^{* b}(x) $$ where $B$ is the number of the bootstrapped training sets and $\hat f^{*b}(x)$ is the estimated response to $x$ of the tree built based on the $b$th bootstrapped training set. This is called bagging. (Bagging can be generalized beyond decision trees.) In bagging, each tree is not pruned and has low bias and high variance. Averaging the $B$ trees reduces the variance. In a classification problem, instead of taking the average of the $B$ tress, we take a majority vote: the overall prediction is the most commonly occurring class among the $B$ predictions. Notice that we can use a very large value of $B$, which would not cause overfitting. When $B$ is sufficiently large, the test error just fluctuates and basically stops decreasing.
Out-of-Bag Error Estimation
We may use cross validation or validation set approach to estimate the test error of bagging. However, there is a simpler alternative: out-of-bag error estimation. It can be shown with calculus that in each bootstrapped training set, about one third ($1/e$ to be exact) observations are absent and are referred to as out-of-bag observations. Therefore, on average any observation is out-of-bag in about one third of the $B$ bootstrapped training sets. We can directly use the average prediction of the trees built upon these training sets to estimate the test error (by OOB MSE or classification error). It turns out that when $B$ is large enough, OOB error is virtually equivalent to LOOCV error.
Variable Importance Measures
Bagging improves the prediction accuracy with the expense of interpretability. However, we could measure the variable importance of different predictors by recording the decrease of RSS or the Gini index due to the splits of a given predictor and averaging them across the $B$ trees.
Random Forests
Recall it is necessary in the central limit theorem that the $n$ observations are independent. If they are somehow correlated, the reduction in variance of the prediction can be quite limited. Random forests, which improves bagging in this aspect, aims to decorrelate the trees built on the bootstrapped training sets. How? The simple idea is actually reasonable yet seemingly crazy: in the process of building a tree, only a random sample of predictors could be considered. Typically, for each tree, only about $m \approx \sqrt p$ predictors are randomly selected and then considered when deciding each split. This could prevent some dominating predictors from largely affecting the trees we build many times. Clearly, when $m = p$, this simply amounts to bagging. Selecting a small value of $m$ can be useful when many of the predictors are correlated.
Boosting
Like bagging, boosting is another general statistical learning approach that can have wide applications. Here we restrict our discussion to decision trees. Boosting constructs a sequence of trees, each aimed at improving the performance of previously grown trees. The procedure is as follows:
- Initially set $\hat f(x) \leftarrow 0$ and $r_i \leftarrow y_i$ for all $i$ in the training set.
- For $b$ in $1,\dots, B$, repeat:
- Fit a tree $\hat f^b$ with $d$ splits to the training data $(X, r)$.
- Update $\hat f$ with by adding in a shrunken version of the new tree: $\hat f(x) \leftarrow \hat f(x) + \lambda \hat f ^ b (x)$.
- Update the residuals: $r_i \leftarrow r_i - \lambda \hat f ^ b (x)$.
- The resultant boosted model, which can be written as $\hat f(x) = \sum _{b=1}^B \lambda \hat f^b(x)$, is the output.
In this process, by restricting $d$ to be small, we each time build a small tree. Thus, the boosting approach learns slowly by each time making improvements on areas it previously performances badly. Here we have three tuning parameters:
- The number of trees, $B$. When $B$ is too large, the model tends to (slowly) overfit the data. We use cross validation to select an appropriate value of $B$.
- The shrinkage parameter, $\lambda$. A small value leads the model to learn slowly. Typical values include $0.01$ and $0.001$. An appropriate value depends on the particular problem.
- The number $d$ of splits in each tree. Often $d = 1$ works well, where each tree is a stump and the boosted stump ensemble is fitting an additive model.