Tree-Based Methods

There are several approaches to regression and classification problems that can improve performance over linear regression. Classification and regression trees (CART) segment the predictor space into a number of simple regions; to predict for a new observation, we use the mean of the training observations in the region to which it belongs.

Decision trees alone are intuitive and easy to interpret, but they tend to be weak on prediction performance. We’ll look at several ensemble methods that improve on the results of a single decision tree:

  • bagging
  • boosting
  • random forests

Regression Trees

To prepare the data (the baseball salary data), we need to remove observations with a missing dependent variable (salary) and log-transform the response so that its distribution has a more typical bell shape. The top split then assigns observations to two branches using the predictor and cutpoint that produce the greatest reduction in RSS; the predicted value in each branch is the mean response of the observations it contains.

Salaries are split first at 4.5 years of experience. Of those with more than 4.5 years, players who hit more than 117.5 times last season earn more on average.

In the above example of baseball salary using years played and the number of hits in the last season, the tree stratifies or segments the players into 3 regions of predictor space:

  • players with less than 4.5 years of experience
  • players with more than 4.5 years of experience and fewer than 117.5 hits last year
  • players with more than 4.5 years and more than 117.5 hits.

The numbers at the ends of the tree (5.11, 6.00, 6.74) are the predicted values (mean log salary) in each region; those endpoints are known as terminal nodes or leaves. The points along the tree where the predictor space is split are referred to as internal nodes, and branches are the segments of the tree that connect the nodes. The length of a branch reflects how large a drop in RSS that split produces, i.e., how important the split is.

Prediction via Stratification of the Feature Space

The regions could have any shape, but we divide the predictor space into high-dimensional rectangles (boxes) for simplicity and for ease of interpretation of the resulting predictive model. The goal is to find boxes R_1, ..., R_J that minimize the RSS:

(8.1)  RSS = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2, where \hat{y}_{R_j} is the mean response of the training observations within the j-th box.

Because we cannot consider every possible partition of the feature space into J boxes, one approach is a top-down, greedy procedure known as recursive binary splitting. We start at the top of the tree and successively split the predictor space into two halves (binary). It is greedy because the best split is made at each step; it does not look ahead to see whether a different split now would lead to a better tree in some later step. At each step we consider all predictors and all possible values of the cutpoint s for each predictor, and then choose the predictor and cutpoint such that the resulting tree has the lowest RSS.

Concretely, each candidate split defines the pair of half-planes R_1(j, s) = {X | X_j < s} and R_2(j, s) = {X | X_j >= s}, and we seek the (j, s) that gives the lowest total RSS across the two half-planes.

We repeat the process (hence recursive), looking for the best predictor and cutpoint with which to split the data further so as to minimize the RSS within each of the resulting regions. Each cut is made only within the region currently being split, and we continue splitting until a stopping criterion is reached (e.g., continue until no region contains more than five observations).
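To make the greedy search concrete, here is a minimal sketch in Python (my own illustration, not from the text; names such as best_split are made up). It scans every predictor j and every candidate cutpoint s and returns the pair with the lowest total RSS, assuming X is an n x p NumPy array and y a length-n array:

```python
import numpy as np

def rss(y):
    """RSS of a region when the prediction is the region's mean response."""
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_split(X, y):
    """Greedy search: try every predictor j and cutpoint s, and return the
    (j, s) whose split into {X_j < s} and {X_j >= s} gives the lowest RSS."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left = X[:, j] < s
            if not left.any() or left.all():      # skip splits with an empty region
                continue
            total = rss(y[left]) + rss(y[~left])
            if total < best_rss:
                best_j, best_s, best_rss = j, s, total
    return best_j, best_s, best_rss
```

Recursive binary splitting simply calls this repeatedly on each resulting region until the stopping criterion (e.g., at most five observations per region) is reached.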

Tree Pruning

This simple approach is interpretable, but it may not have the best predictive performance on the test set because the resulting tree can become too complex (too large). A smaller tree with fewer splits might lead to lower variance and better interpretation at the cost of a little bias. In our example, instead of growing the tree until each node has no more than five observations, we could keep splitting only so long as the reduction in RSS from each split exceeds some (high) threshold.

The drawback of that strategy is that it is short-sighted in the same way the greedy search is: we don’t look ahead to see whether a seemingly poor split now would enable a much better split further down the tree, one that significantly reduces RSS. A better strategy is to grow a very large tree T_0 and then prune it back to obtain a subtree. Given a subtree, we can estimate its test error using cross-validation.

Cost complexity pruning (aka weakest link pruning) gives us a way to avoid evaluating every possible subtree of T_0. We introduce a tuning parameter α that trades off the subtree’s fit (RSS) against its complexity (number of terminal nodes). When α = 0, the subtree is simply the full tree T_0; as we increase α, the price/cost of having more terminal nodes increases.
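For reference, the quantity minimized for each value of α (equation 8.4 in ISLR), over subtrees T of T_0, is

\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|

where |T| is the number of terminal nodes of T, R_m is the region corresponding to the m-th terminal node, and \hat{y}_{R_m} is the mean training response in that region.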

We are trying to get the smallest error without paying too much for complexity, so as α grows the penalized quantity tends to be minimized by smaller subtrees (similar in spirit to the lasso penalty). Since branches are pruned in a nested, predictable way as α increases, it is easy to obtain the whole sequence of subtrees as a function of α and then select the right α by cross-validation on the full training set.

How do we choose α (our tuning parameter)? We evaluate a range of values by cross-validation. The higher the α, the more we penalize complex models, so the selected subtree has fewer leaves; if α is small, a subtree with a larger number of leaves can be picked.

The full, unpruned decision tree. We perform 6-fold cross-validation on these data over a range of values of the tuning parameter.
The resulting cross-validation estimate of the test MSE. Rather than plotting against the tuning parameter itself, the x-axis shows tree size (number of terminal nodes), which is easier to interpret than α.

As we increase the tuning parameter (thus penalizing complexity), the cross-validation MSE is minimized at a tree with three terminal nodes and rises beyond that. We display the number of leaves rather than α because each value of α corresponds to a particular tree size: the higher the α, the smaller the number of leaves.
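A hedged sketch of the same workflow in scikit-learn (my assumption about tooling; the book’s lab uses R): cost_complexity_pruning_path yields the nested sequence of α values, and cross-validation picks the subtree with the lowest estimated test MSE.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def prune_by_cv(X, y, cv=6):
    """Grow a large tree, recover the nested sequence of cost-complexity
    alphas, and refit using the alpha with the best cross-validated MSE."""
    path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
    best_alpha, best_mse = 0.0, np.inf
    for alpha in path.ccp_alphas:
        tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
        mse = -cross_val_score(tree, X, y, cv=cv,
                               scoring="neg_mean_squared_error").mean()
        if mse < best_mse:
            best_alpha, best_mse = alpha, mse
    return DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```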

Classification Trees

Classification trees predict a qualitative response rather than a quantitative one. The prediction for an observation is the most commonly occurring class among the training observations in the region to which it belongs, rather than the mean of the terminal node. We are interested in both the class prediction and the class proportions among the training observations that fall into that region.

We still use recursive binary splitting to grow the tree, but RSS cannot serve as the splitting criterion for qualitative data. A natural alternative is the classification error rate: the fraction of the training observations in a region that do not belong to the most common class.

(8.5)  E = 1 - \max_k(\hat{p}_{mk}), where \hat{p}_{mk} is the proportion of training observations in the m-th region that are from the k-th class. A small E (i.e., a \max_k \hat{p}_{mk} close to 1) means the region is dominated by a single class, which indicates a good split.

The Gini index is a measure of total variance across the K classes. A small Gini index value means that a node contains predominantly observations from a single class, so it serves as a measure of node purity and tells us how good a split is. Entropy is an alternative to the Gini index: it involves the logarithm of the \hat{p}_{mk}, and when the \hat{p}_{mk} are all close to 0 or 1 (as with fitted probabilities in logistic regression) the entropy is near zero, indicating a pure node. In short, if the Gini index or entropy is small, one class dominates; we call a node pure when it contains observations from essentially a single class.

(8.6)  Gini index: G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})
(8.7)  Entropy: D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}
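As a small sketch (my own helpers, not from the text), both measures computed from a node’s class proportions \hat{p}_{mk}:

```python
import numpy as np

def gini(p):
    """Gini index G = sum_k p_k * (1 - p_k) for a node's class proportions p."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def entropy(p):
    """Entropy D = -sum_k p_k * log(p_k), treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

print(gini([1.0, 0.0]), entropy([1.0, 0.0]))   # pure node: 0.0 0.0
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # maximally mixed node: 0.5 ~0.69
```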

A nice property of decision trees (DTs) is that they can still be constructed in the presence of qualitative predictor variables. In the heart data, a split on one of the qualitative variables (‘sex’, ‘thallium stress test’, or ‘chest pain’) assigns some of the qualitative values to one branch and the remaining values to the other branch. When a variable has more than two categories (e.g., ‘chest pain’), the split still produces two branches, each containing a subset of the categories; for example, the left branch might contain the 1st and 3rd categories while the right branch contains the 2nd, 4th, and 5th.

There is a special case in which a split produces two leaves with the same predicted value (e.g., both leaves predict 0, or both predict True). Such a split is still kept because it improves node purity:

  • In the right node, all 9 observations are 0
  • In the left node, 7 of the 11 observations are 0

In this case, the right node has better predictive certainty. Node purity is important because, for an observation landing in the right node, we can be quite confident that the answer truly is 0; if the observation falls in the left node, we are less confident.
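Plugging the leaf proportions from this example into the Gini index makes the difference in purity explicit:

```python
# Right node: all 9 observations are class 0 -> proportions (1, 0).
p_right = (9 / 9, 0 / 9)
# Left node: 7 of 11 observations are class 0 -> proportions (7/11, 4/11).
p_left = (7 / 11, 4 / 11)

gini = lambda p: sum(pk * (1 - pk) for pk in p)
print(gini(p_right))   # 0.0    (perfectly pure)
print(gini(p_left))    # ~0.463 (far less pure)
```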

Trees Versus Linear Models

Which model is better? It depends on the problem at hand. If the relationship between the features and the response is linear, then linear regression will suffice; if the relationship is non-linear and complex, then decision trees will tend to work better.

Examples of when to use linear regression and when to use decision trees, which bisect the data into boxes (non-overlapping regions).

Advantages and Disadvantages of Trees

  • Trees are easy to explain
  • Decision trees more closely mirror human decision-making than other approaches
  • Trees can be displayed graphically.
  • Qualitative Data doesn’t need to be put into dummy variables

Cons

  • DTs generally do not have the same level of predictive accuracy as other regression and classification approaches
  • Trees can be non-robust: a small change in data can lead to a large change in the estimated tree.

Bagging, Random Forests, and Boosting

Bagging

Regular decision trees suffer from high variance: a small change in the data can lead to a large change in the estimated model. For example, if we split the training data into two halves at random and fit a DT to each half, the two results could be very different; a low-variance method would produce similar models no matter how we split. For bagging, we don’t prune. We let the trees grow large (bushy) because such trees have low bias but high variance, and we then average many bagged trees to lower the variance significantly.

Averaging a set of observations reduces variance, so a natural way to reduce variance and increase predictive accuracy is to take many training sets from the population, build a separate prediction model on each training set, and average the resulting predictions. Because we don’t have access to many training sets, we bootstrap instead by taking repeated samples from the single training data set. We generate B different bootstrapped training sets, train our method on the b-th bootstrapped set to get \hat{f}^{*b}(x), and average all the predictions: \hat{f}_{bag}(x) = (1/B) \sum_{b=1}^{B} \hat{f}^{*b}(x). This is bagging.

When we use bagging, we let the trees grow deep and don’t prune. Each individual tree has high variance and low bias; averaging the B trees lowers the variance significantly while affecting bias only minimally. For classification, for a given test observation we record the class predicted by each of the B trees and take a majority vote: the overall prediction is the most commonly occurring class among the B predictions.
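A minimal bagging sketch (assuming NumPy arrays and scikit-learn trees; this is my own illustration rather than the book’s lab code): draw B bootstrap samples, grow a deep unpruned tree on each, and average their predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, B=100, random_state=0):
    """Fit B deep (unpruned) regression trees, each on a bootstrap sample."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)        # sample n rows with replacement
        tree = DecisionTreeRegressor()          # unpruned: low bias, high variance
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X_new):
    """Average the B tree predictions (use a majority vote for classification)."""
    return np.mean([t.predict(X_new) for t in trees], axis=0)
```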

Out-of-Bag Error Estimation

This is a way to estimate the test error of a bagged model without cross-validation or a validation set. On average, each bagged tree makes use of only about two-thirds of the observations; the remaining third not used to fit that tree are its out-of-bag (OOB) observations. To obtain a single OOB prediction for the i-th observation, we average the predicted responses (regression) or take a majority vote (classification). The OOB error is a valid estimate of the test error for the bagged model because the response for each observation is predicted using only trees that were not fit using that observation.

In other words, we predict the response for the i-th observation using only the trees for which that observation was OOB, which yields around B/3 predictions that we then average (or vote over). The OOB error is then computed across all observations from these OOB predictions.
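In scikit-learn (again my own mapping, not the R lab), the OOB idea is exposed directly via oob_score=True; the default base estimator of BaggingRegressor is already a regression tree. X_train and y_train below are placeholder names for the training data.

```python
from sklearn.ensemble import BaggingRegressor

bag = BaggingRegressor(
    n_estimators=500,   # B bagged trees (the default base estimator is a decision tree)
    oob_score=True,     # estimate test performance from each tree's held-out OOB samples
    random_state=0,
)
# After bag.fit(X_train, y_train):
#   bag.oob_prediction_  -> the OOB prediction for each training observation
#   bag.oob_score_       -> OOB R^2, a test-error estimate requiring no cross-validation
```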

Variable Importance Measures

When we average trees, we get better prediction results but less interpretable ones. We can still obtain an overall summary of the importance of each predictor using the RSS (for regression trees) or the Gini index (for classification trees).

We record the total amount by which the RSS (or Gini index) decreases due to splits on a given predictor, averaged over all B bagged trees. A large value means that the predictor is important.

Variable Importance Plot for bagged trees.
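Continuing the bagging sketch above: each fitted scikit-learn tree already records its per-predictor impurity decrease, so an ensemble-level importance (my own helper) is just the average over the B trees.

```python
import numpy as np

def bagged_importance(trees):
    """Average each predictor's impurity decrease across the bagged trees;
    large values correspond to tall bars in the variable importance plot."""
    return np.mean([t.feature_importances_ for t in trees], axis=0)
```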

Random Forests

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. At each split, a random sample of m predictors is chosen as split candidates from the full set of p predictors, the split is allowed to use only one of those m predictors, and a fresh sample of m predictors is taken at the next split. So in building a random forest (RF), at each split the algorithm is not even allowed to consider a majority of the available predictors. To see why this helps, suppose there is one very strong predictor in the data set along with several moderately strong ones: if every tree could always choose the strongest predictor, all the trees would look very similar, and averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities.

Random forests overcome this problem by forcing each split to consider only a subset of the predictors. For a fraction (p − m)/p of the splits, the RF will not even consider the strongest predictor, which has the effect of decorrelating the trees; the average of the trees is therefore less variable and more reliable. The only difference between bagging and RF is the choice of the predictor subset size m: if m equals the number of predictors p, RF is simply bagging. A small value of m (typically m ≈ √p) is especially helpful when we have a large number of correlated predictors.

In the gene-expression example, the roughly 20,000 genes (potential predictors) are first reduced to the 500 genes with the largest variance in the training set, and an RF is then applied to that subset. We choose the predictors with the largest variance because low-variance predictors carry little signal to split on, so it is hard to tell whether any change attributed to them is meaningful. Like bagging, RF will not overfit as we increase the number of trees B; in practice we use a value of B large enough for the error rate to settle down.
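A hedged scikit-learn sketch of the bagging-versus-random-forest distinction (parameter names are sklearn’s, not the book’s): max_features plays the role of m.

```python
from sklearn.ensemble import RandomForestRegressor

# Each split considers only a random subset of m predictors.
# max_features="sqrt" gives m ~ sqrt(p); max_features=None (m = p) recovers bagging.
forest = RandomForestRegressor(n_estimators=500, max_features="sqrt", random_state=0)

# Increasing n_estimators (B) does not cause overfitting; it only stabilizes the
# averaged predictions, so B is simply chosen large enough for the error to settle.
```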

Boosting

Boosting is another approach for improving the predictions resulting from a decision tree. Like bagging, boosting is a general approach that can be applied to many statistical learning methods. Recall that bagging involves creating multiple copies of the original training set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees into a single predictive model. Boosting works similarly, except that the trees are grown sequentially: each tree is grown using information from previously grown trees, and each tree is fit on a modified version of the original data set rather than on a bootstrap sample.

[?] When can RF be used? Can we use RF for classification and regression? Can we get feature importance as well?

Unlike fitting a single decision tree (DT) to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach learns slowly, which usually leads to better predictive performance. Given the current model, we fit a DT to the residuals from that model, i.e., we fit a tree using the current residuals as the response instead of the actual outcome Y. We then add this DT into the fitted function and update the residuals.

These DTs are kept small; fitting small trees to the residuals slowly improves \hat{f} in the areas where it does not perform well. Unlike bagging, boosting makes each tree dependent on the trees that have already been grown.

Unlike bagging and random forests, boosting can overfit if B is too large, although the overfitting sets in slowly; we use cross-validation to select B. The shrinkage parameter is typically 0.01 or 0.001, depending on the problem. A very small shrinkage parameter requires a very large value of B in order to achieve good performance.
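The learn-slowly idea can be written down directly. Below is a sketch of the boosting-for-regression procedure described above (my own translation into Python using scikit-learn trees; lam is the shrinkage parameter and d the number of splits per tree):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_trees(X, y, B=1000, lam=0.01, d=1):
    """Boosting for regression: fit a small tree to the current residuals,
    add a shrunken copy of it to the model, update the residuals, repeat."""
    trees, residuals = [], y.astype(float)
    for _ in range(B):
        tree = DecisionTreeRegressor(max_leaf_nodes=d + 1)  # d splits; d=1 is a stump
        tree.fit(X, residuals)
        residuals = residuals - lam * tree.predict(X)       # learn slowly
        trees.append(tree)
    return trees

def boost_predict(trees, X_new, lam=0.01):
    """The boosted prediction is the sum of the shrunken tree predictions."""
    return lam * np.sum([t.predict(X_new) for t in trees], axis=0)
```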

The number of splits d in each tree controls the complexity of the boosted ensemble. Often d = 1 works well (a tree with a single split is called a stump); since each such tree uses only one variable, the resulting ensemble is an additive model. More generally, d is the interaction depth and controls the interaction order of the boosted model, since d splits can involve at most d variables. Because the growth of a particular tree takes into account the trees that have already been grown, smaller trees are typically sufficient in boosting, and using smaller trees aids interpretability as well.

Test error for boosted models with different tree depths. Boosting with stumps performs slightly better than boosting with trees of depth 2, but the standard errors are around 0.02, so these differences are not significant.
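For reference, scikit-learn’s built-in boosting implementation exposes the same three tuning knobs (this mapping is my own, not the book’s R lab code): n_estimators corresponds to B, learning_rate to the shrinkage parameter, and max_depth to the interaction depth d.

```python
from sklearn.ensemble import GradientBoostingRegressor

boost = GradientBoostingRegressor(
    n_estimators=5000,   # B: number of trees (choose by cross-validation; too many can overfit)
    learning_rate=0.01,  # shrinkage, typically 0.01 or 0.001
    max_depth=1,         # d: interaction depth; 1 means stumps, i.e. an additive model
    random_state=0,
)
```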

Notes on the practice lab

Classification Trees

  • A small deviance indicates that a tree provides a good fit to the training data.
  • We use cross-validation in order to determine the optimal level of tree complexity; cost complexity pruning is used to select a sequence of trees for consideration.
  • Not only has pruning produced a more interpretable tree, but it also has improved the classification accuracy.

Regression Trees

Bagging and Random Forests

  • Two measures of variable importance are reported. The former is based upon the mean decrease of accuracy in predictions on the out-of-bag samples when a given variable is excluded from the model.
  • The latter is a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees (node impurity is measured by the training RSS for regression trees and by the deviance for classification trees). See the sketch below.
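Rough scikit-learn analogues of these two measures (the lab itself is in R; also note that sklearn’s permutation importance is computed on a supplied data set rather than on the OOB samples):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rf = RandomForestRegressor(n_estimators=500, random_state=0)
# After rf.fit(X_train, y_train):
#   rf.feature_importances_                       # total decrease in node impurity,
#                                                 # averaged over all trees
#   permutation_importance(rf, X_test, y_test,    # drop in score when a predictor is
#                          n_repeats=10)          # randomly shuffled (accuracy-based measure)
```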

Boosting

Questions For Future:

  [ ] What are the best stopping-criterion parameters?