Building and Using Decision Trees
There are two steps to making productive use of decision trees: (1) building a decision tree model, and (2) using the decision tree to draw inferences and make predictions.
The Tree Building Process
The first step in building a decision tree is to collect a set of data values that DTREG can analyze. This data is called the "learning" or "training" dataset because it is used by DTREG to learn how the value of a target variable is related to the values of predictor variables. This dataset must have instances for which you know the actual value of the target variable as well as the associated predictor variables. You might have to perform a controlled study to collect this data, or you might be able to obtain it from previously collected historical records.
The data you provide to DTREG for an analysis is called a “dataset”. It consists of one entry for each case to be analyzed; each entry supplies the target and predictor values for a specific customer, patient, company, etc. An entry is also known as a “case,” “row,” “record,” or “observation.”
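For illustration, a small learning dataset for a product-sales model might look like the sketch below. The column names and values are hypothetical, and pandas is used here only as a convenient way to hold a table of cases; DTREG reads its data through its own interface.

```python
# A hypothetical learning dataset for a product-sales model.
# Each row is one case; "purchased" is the target variable and the
# remaining columns are predictors. Names and values are illustrative.
import pandas as pd

learning_data = pd.DataFrame({
    "age":       [34, 52, 23, 47, 61, 29],
    "sex":       ["M", "F", "M", "F", "M", "F"],
    "region":    ["South", "North", "South", "South", "North", "North"],
    "income":    [48000, 61000, 32000, 75000, 58000, 41000],
    "purchased": ["Yes", "No", "Yes", "Yes", "No", "No"],  # target variable
})
print(learning_data)
```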
The question often arises: “How much data is required for the learning dataset?” There is no simple answer other than “as much as possible.” In general, DTREG will not split a node that has fewer than 10 rows in it, so a tree with three levels and four terminal nodes requires an absolute minimum of 20 records (a 20-row root can split into two 10-row children, each of which can split once more into two terminal nodes), but predictive accuracy is greatly improved by having four or more times that many records. DTREG is designed to handle a virtually unlimited number of records; it is quite feasible to analyze datasets with millions of records, although the computation time may be lengthy.
Once you have obtained enough data for the learning dataset, the data is fed into DTREG, which performs a complex analysis on it and builds a decision tree that models the data.
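DTREG performs this analysis internally, so there is no code to write. As a rough analogy only, the sketch below fits a CART-style classification tree to the toy dataset above with scikit-learn and prints the resulting if/then rules; the library, parameters, and data are illustrative assumptions, not DTREG's actual algorithm or interface.

```python
# Illustrative sketch only: fitting a CART-style tree with scikit-learn
# to suggest the kind of model DTREG builds; this is not DTREG's
# algorithm or API. The toy dataset is rebuilt so the snippet runs
# on its own.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

learning_data = pd.DataFrame({
    "age": [34, 52, 23, 47, 61, 29],
    "sex": ["M", "F", "M", "F", "M", "F"],
    "region": ["South", "North", "South", "South", "North", "North"],
    "income": [48000, 61000, 32000, 75000, 58000, 41000],
    "purchased": ["Yes", "No", "Yes", "Yes", "No", "No"],
})
X = pd.get_dummies(learning_data.drop(columns="purchased"))  # encode categories
y = learning_data["purchased"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# A fitted tree can be rendered as readable if/then rules, which is
# part of what makes decision trees so easy to interpret.
print(export_text(tree, feature_names=list(X.columns)))
```

On this toy data a single split on region separates buyers from non-buyers, which is exactly the kind of “big picture” reading described in the next section.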
Using Decision Trees
Once DTREG has created a decision tree, you can put the tree to several uses.
- You can use the tree to make inferences that help you understand the “big picture” of the model. This is one of the great advantages of decision trees over classical regression and neural networks – decision trees are easy to interpret, even for non-technical people. For example, if the decision tree models product sales, a quick glance might tell you that men in the South buy more of your product than women in the North. If you are developing a model of health risks for insurance policies, a quick glance might tell you that smoking and age are important predictors of health.
- You can use the decision tree to identify target groups. For example, if you are looking for the best potential customers for a product, you can identify the terminal nodes in the tree that have the highest percentage of sales, and then focus your sales effort on individuals described by those nodes.
- You can predict the target value for specific cases where you know only the predictor variable values. This is known as “scoring” (see the sketch after this list).
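To make the last two bullets concrete, the sketch below continues the hypothetical scikit-learn analogy: it scores a new case and ranks terminal nodes by the fraction of buyers they contain. DTREG exposes the equivalent operations through its own interface, and every name here is an assumption for illustration.

```python
# Illustrative sketch only: "scoring" new cases and locating high-rate
# terminal nodes with scikit-learn; DTREG performs the equivalent steps
# through its own interface. The toy model is rebuilt so the snippet
# runs on its own.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

learning_data = pd.DataFrame({
    "age": [34, 52, 23, 47, 61, 29],
    "sex": ["M", "F", "M", "F", "M", "F"],
    "region": ["South", "North", "South", "South", "North", "North"],
    "income": [48000, 61000, 32000, 75000, 58000, 41000],
    "purchased": ["Yes", "No", "Yes", "Yes", "No", "No"],
})
X = pd.get_dummies(learning_data.drop(columns="purchased"))
y = learning_data["purchased"]
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Scoring: predict the target for a case where only the predictor
# values are known.
new_case = pd.DataFrame(
    {"age": [40], "sex": ["M"], "region": ["South"], "income": [50000]}
)
new_X = pd.get_dummies(new_case).reindex(columns=X.columns, fill_value=0)
print(tree.predict(new_X))  # predicted target value for the new case

# Target groups: apply() reports the terminal node (leaf) each training
# case falls into; leaves with the highest proportion of "Yes" outcomes
# describe the most promising prospects.
leaves = tree.apply(X)
leaf_rates = pd.Series((y == "Yes").values).groupby(leaves).mean()
print(leaf_rates.sort_values(ascending=False))
```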