36 Classification
Dr R. Baskaran
In this session, we continue our examination of classification methods for data mining. One attractive classification method involves the construction of a decision tree, a collection of decision nodes, connected by branches, extending downward from the root node until terminating in leaf nodes. Beginning at the root node, which by convention is placed at the top of the decision tree diagram, attributes are tested at the decision nodes, with each possible outcome resulting in a branch. Each branch then leads either to another decision node or to a terminating leaf node. Figure 1 provides an example of a simple decision tree. The target variable for the decision tree in Figure 1 is credit risk, with potential customers being classified as either good or bad credit risks.
The predictor variables are savings (low, medium, and high), assets (low or not low), and income (≤$30,000 or >$30,000). Here, the root node represents a decision node, testing whether each record has a low, medium, or high savings level (as defined by the analyst or domain expert). The data set is partitioned, or split, according to the values of this attribute.
Those records with low savings are sent via the leftmost branch (savings = low) to another decision node. The records with high savings are sent via the rightmost branch to a different decision node. The records with medium savings are sent via the middle branch directly to a leaf node, indicating the termination of this branch. Why a leaf node and not another decision node? Because, in the data set (not shown), all of the records with medium savings levels have been classified as good credit risks. There is no need for another decision node, because our knowledge that the customer has medium savings predicts good credit with 100% accuracy in the data set. For customers with low savings, the next decision node tests whether the customer has low assets. Those with low assets are then classified as bad credit risks; the others are classified as good credit risks. For customers with high savings, the next decision node tests whether the customer has an income of at most $30,000.
Customers with incomes of $30,000 or less are then classified as bad credit risks, with the others classified as good credit risks. When no further splits can be made, the decision tree algorithm stops growing new nodes. For example, suppose that all of the branches terminate in “pure” leaf nodes, where the target variable is unary for the records in that node (e.g., each record in the leaf node is a good credit risk). Then no further splits are necessary, so no further nodes are grown. However, there are instances when a particular node contains “diverse” attributes (with nonunary values for the target attribute), and yet the decision tree cannot make a split. For example, suppose that we consider the records from Figure 1 with high savings and low income (≤$30,000).
Suppose that there are five records with these values, all of which also have low assets. Finally, suppose that three of these five customers have been classified as bad credit risks and two as good credit risks, as shown in Table 1. In the real world, one often encounters situations such as this, with varied values for the response variable, even for exactly the same values for the predictor variables.
In this case, the decision tree can grow no further below this node, and it may simply report that the classification for such customers is “bad,” with 60% confidence, as determined by the three-fifths of customers in this node who are bad credit risks. Note that not all attributes are tested for all records. Customers with low savings and low assets, for example, are not tested with regard to income in this example.
Table 1: Sample records that cannot lead to a pure leaf node
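For readers who prefer to see the logic spelled out, the tree of Figure 1 can be written as a short set of nested rules, together with the majority-vote calculation that yields the 60% confidence figure. The sketch below (in Python, used here purely for illustration) is not part of the original example; in particular, the dictionary-style record format is an assumption made for this sketch.

```python
# A minimal sketch of the decision tree from Figure 1, written as nested
# if/else rules. The dictionary-based record format is an assumption made
# for illustration; it is not part of the original figure.

def classify_credit_risk(record):
    """Return the predicted credit risk ('good' or 'bad') for one record."""
    if record["savings"] == "low":
        # Low savings: the next test is on assets.
        return "bad" if record["assets"] == "low" else "good"
    elif record["savings"] == "medium":
        # Medium savings: pure leaf node, always classified as good.
        return "good"
    else:  # high savings
        # High savings: the next test is on income.
        return "bad" if record["income"] <= 30_000 else "good"

# The five records of Table 1 all reach the same leaf (high savings,
# income <= $30,000) yet carry different target values: three bad, two good.
leaf_records = ["bad", "bad", "good", "bad", "good"]

# Majority vote with confidence: 3 of 5 records are bad credit risks,
# so this leaf predicts "bad" with 60% confidence.
prediction = max(set(leaf_records), key=leaf_records.count)
confidence = leaf_records.count(prediction) / len(leaf_records)
print(prediction, confidence)   # bad 0.6
```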
Certain requirements must be met before decision tree algorithms may be applied:
- Decision tree algorithms represent supervised learning, and as such require pre-classified target variables. A training data set must be supplied which provides the algorithm with the values of the target variable.
- This training data set should be rich and varied, providing the algorithm with a healthy cross section of the types of records for which classification may be needed in the future. Decision trees learn by example, and if examples are systematically lacking for a definable subset of records, classification and prediction for this subset will be problematic or impossible.
- The target attribute classes must be discrete. That is, one cannot apply decision tree analysis to a continuous target variable. Rather, the target variable must take on values that are clearly demarcated as either belonging to a particular class or not belonging.
Why, in the example above, did the decision tree choose the savings attribute for the root node split? Why did it not choose assets or income instead? Decision trees seek to create a set of leaf nodes that are as “pure” as possible, that is, where each of the records in a particular leaf node has the same classification. In this way, the decision tree may provide classification assignments with the highest measure of confidence available.
However, how does one measure uniformity, or conversely, how does one measure heterogeneity? We shall examine two of the many methods for measuring leaf node purity, which lead to the two leading algorithms for constructing decision trees:
- Classification and regression trees (CART) algorithm
- C4.5 algorithm
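The two algorithms differ chiefly in the purity measure used to evaluate candidate splits: CART typically uses the Gini index, whereas C4.5 uses entropy-based information gain. The short sketch below (again in Python, for illustration only) computes both measures for a perfectly pure node and for the diverse node of Table 1.

```python
import math

def gini(counts):
    """Gini index used by CART: 1 - sum(p_i^2) over the class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy used by C4.5's information gain: -sum(p_i * log2(p_i))."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# A pure node (e.g., five good credit risks) scores 0 under both measures...
print(gini([5, 0]), entropy([5, 0]))   # 0.0 0.0

# ...while the diverse node of Table 1 (3 bad, 2 good) does not.
print(gini([3, 2]), entropy([3, 2]))   # 0.48 0.971 (approximately)
```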
Comparison of the C5.0 and CART Algorithms Applied to Real Data
Next, we apply decision tree analysis using Clementine on a real-world data set. The data set adult was abstracted from U.S. census data by Kohavi and is available online from the University of California at Irvine Machine Learning Repository.
Here, we are interested in classifying whether or not a person’s income is less than $50,000, based on the following set of predictor fields:
- Numerical variables: age, years of education, capital gains, capital losses, hours worked per week
- Categorical variables: race, gender, work class, marital status
The numerical variables were normalized so that all values ranged between zero and 1. Some collapsing of low-frequency classes was carried out on the work class and marital status categories. Clementine was used to compare the C5.0 algorithm (an update of the C4.5 algorithm) with CART, examining a training set of 24,986 records. We consider first the decision tree produced by the CART algorithm.
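Clementine is a graphical workbench, but a roughly analogous experiment can be sketched with scikit-learn, whose DecisionTreeClassifier grows CART-style trees; switching the splitting criterion from Gini to entropy only approximates the behaviour of C4.5/C5.0, which scikit-learn does not implement. The download URL, column names, and preprocessing choices below are assumptions made for illustration and do not reproduce the Clementine session described above.

```python
# A rough analogue of the C5.0-versus-CART comparison, sketched with
# scikit-learn rather than Clementine. The URL and column names are
# assumptions for illustration; preprocessing follows the text loosely.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

columns = ["age", "workclass", "fnlwgt", "education", "education-num",
           "marital-status", "occupation", "relationship", "race", "sex",
           "capital-gain", "capital-loss", "hours-per-week",
           "native-country", "income"]
df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    names=columns, skipinitialspace=True)

numeric = ["age", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
categorical = ["race", "sex", "workclass", "marital-status"]

# Normalize the numerical predictors to the [0, 1] range, as in the text.
df[numeric] = MinMaxScaler().fit_transform(df[numeric])

# One-hot encode the categorical predictors for the tree learner.
X = pd.get_dummies(df[numeric + categorical])
y = df["income"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for criterion in ("gini", "entropy"):   # CART-style vs. C4.5/C5.0-style splits
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=5, random_state=0)
    tree.fit(X_train, y_train)
    print(criterion, tree.score(X_test, y_test))
```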