How Trees are Pruned
Pruning Decision Trees
One of the classic problems in building decision trees is deciding how large a tree to build. Early programs such as AID (Automatic Interaction Detection) used stopping criteria, such as the improvement gained from a split, to decide when to stop growing the tree. This is known as forward pruning or prepruning. But analysis of the trees generated by these programs showed that they often were not of the optimal size.

DTREG does not use its stopping criteria as the primary means of deciding how large a tree should be. Instead, it uses relaxed stopping criteria and builds an overly large tree. It then analyzes the tree using V-fold cross-validation and prunes it back to the optimal size. This is known as backward pruning. Backward pruning requires significantly more computation than forward pruning, but the resulting tree sizes are determined much more accurately.
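DTREG's exact procedure is proprietary, but the general idea of backward pruning can be sketched with open-source tools. The following Python fragment is an illustrative stand-in, not DTREG's implementation: it grows an overly large tree with scikit-learn, enumerates candidate pruned subtrees using cost-complexity pruning (a standard form of backward pruning), and uses 10-fold cross-validation to choose the pruning level. The dataset and parameter settings are arbitrary choices for the example.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Step 1: grow an overly large tree by using relaxed stopping criteria
    # (no depth limit), then enumerate the pruning levels it supports.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

    # Step 2: score each candidate pruned tree with 10-fold cross-validation
    # and keep the pruning level that predicts the held-out folds best.
    best_alpha, best_score = 0.0, -np.inf
    for alpha in path.ccp_alphas:
        score = cross_val_score(
            DecisionTreeClassifier(random_state=0, ccp_alpha=alpha), X, y, cv=10
        ).mean()
        if score > best_score:
            best_alpha, best_score = alpha, score

    # Step 3: prune the tree back to the size chosen by cross-validation.
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
    print(f"chosen alpha={best_alpha:.5f}, leaves={pruned.get_n_leaves()}")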
Why Tree Size Is Important
There are two reasons why it is desirable to generate trees of the optimal size.
First, if a situation can be described and explained equally well by two descriptions, the description that is simpler and more concise is generally preferred. The same is true with decision trees: if two trees provide equivalent predictive accuracy, the simpler tree is preferred because it is easier to understand and faster to use for making predictions.
Second, and more importantly, a smaller tree may predict unseen data more accurately than a larger tree. This is counterintuitive and warrants explanation.
When creating a decision tree, a learning dataset is used. This dataset contains a set of rows that are a representative sample of the overall population. The process used to build the decision tree selects optimal splits to fit the tree to the learning dataset. Once the tree has been built, the records in the learning dataset can be run through the tree to see how well the tree fits the data. The rate of classification errors measured when running the learning dataset through a tree built using that dataset is known as the “resubstitution cost” for the tree. (It is called resubstitution because the same data is being rerun through the tree that was built from it.)
For the learning dataset, the accuracy of the fit always improves (the resubstitution cost decreases) as the tree is grown larger. It is always possible to grow a tree large enough to predict the learning dataset with 100% accuracy, provided no two rows have identical predictor values but different target values. In the extreme case, the tree might be grown so large that every row of the learning dataset ends up in its own terminal node. Obviously, such a tree predicts the exact target value for every row.
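As a concrete illustration (again using scikit-learn as a stand-in for DTREG, with an arbitrary sample dataset), the resubstitution cost is simply the error rate obtained by rerunning the learning data through the fitted tree:

    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # A tree grown with no stopping limits fits the learning data perfectly...
    full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    resub_cost = (full_tree.predict(X) != y).mean()
    print(f"full tree resubstitution cost:    {resub_cost:.3f}")  # typically 0.000

    # ...while a deliberately small tree leaves some learning rows misclassified.
    small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(f"shallow tree resubstitution cost: {(small_tree.predict(X) != y).mean():.3f}")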
However, it is desirable that a decision tree not only accurately reflect the learning dataset from which it was built, but also be able to predict the values of new cases presented to it after construction. This ability is known as generalization.
While a large tree may fit the learning dataset with extreme accuracy, its size may reduce its generalization accuracy. As an analogy, consider fitting a suit of clothes. Manufactured clothes sold in stores are made to fit various sizes, but they are designed with some slack and leeway around each specified size. In contrast, a custom-tailored suit is made to fit one specific individual precisely. While the custom-tailored suit fits that person extremely well, it will not fit other people in the same size range as well as a generic suit does. In the same way, adding extra nodes to a tree to “custom tailor” it to the learning dataset may introduce misclassifications when the tree is later applied to a different dataset.
Another way to understand why large trees can be inferior to smaller trees is that the large trees fit and model minor “noise” in the data, whereas smaller trees model only the significant data factors.
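The noise-fitting effect is easy to demonstrate. In the hypothetical sketch below, a fraction of the learning labels is deliberately flipped to simulate noise; the unpruned tree memorizes the corrupted labels, while a deliberately small tree typically predicts the unseen rows more accurately. The noise level and depth limit are arbitrary choices.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_learn, X_unseen, y_learn, y_unseen = train_test_split(
        X, y, test_size=0.5, random_state=0
    )

    # Flip 15% of the learning labels to simulate noise in the data.
    rng = np.random.default_rng(0)
    noisy = y_learn.copy()
    flip = rng.random(len(noisy)) < 0.15
    noisy[flip] = 1 - noisy[flip]

    # The large tree models the noise; the small tree models only the strong
    # patterns, so it typically predicts the unseen rows better.
    large = DecisionTreeClassifier(random_state=0).fit(X_learn, noisy)
    small = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_learn, noisy)
    print("large tree, accuracy on unseen rows:", large.score(X_unseen, y_unseen))
    print("small tree, accuracy on unseen rows:", small.score(X_unseen, y_unseen))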
The primary goal of the pruning process is to generate a tree of the optimal size: one that generalizes well to data beyond the learning dataset.