14 Decision Tree Algorithm ID3
Welcome to the e-PG Pathshala Lecture Series on Machine Learning. In this module we will be discussing the ID3 heuristic for choosing the attributes of a Decision Tree.
Learning Objectives:
The learning objectives of this module are as follows:
• To explain the greedy algorithm for Decision tree induction
• To outline the ID3 heuristic for choosing attributes
• To explain the concepts of entropy, impurity and information gain
• To illustrate with an example the building of a Decision tree using ID3
14.1 Introduction
A decision tree can be defined as a tree whose decision nodes partition the examples into subsets based on the value of the attribute associated with each node. The leaves of the tree indicate the classification of an example. The class of a new sample is determined by starting at the root and following, at each decision node, the branch corresponding to the sample's value of that attribute, until a leaf node indicating the value of the target variable is reached.
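As a minimal illustration of this traversal, the Python sketch below shows one possible representation of a decision tree and the classification procedure; the node classes and the small example tree are hypothetical, not taken from the module.

class Leaf:
    """A leaf node holding a value of the target variable."""
    def __init__(self, label):
        self.label = label

class DecisionNode:
    """A decision node that tests one attribute and has one child per value."""
    def __init__(self, attribute, children):
        self.attribute = attribute   # name of the attribute tested at this node
        self.children = children     # dict: attribute value -> subtree

def classify(node, sample):
    # Start at the root and follow the branch matching the sample's
    # attribute value until a leaf (the classification) is reached.
    while isinstance(node, DecisionNode):
        node = node.children[sample[node.attribute]]
    return node.label

# Hypothetical example tree and sample.
tree = DecisionNode("outlook", {
    "sunny": DecisionNode("windy", {"yes": Leaf("no"), "no": Leaf("yes")}),
    "rainy": Leaf("no"),
})
print(classify(tree, {"outlook": "sunny", "windy": "no"}))   # prints "yes"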
14.2 Decision Tree Algorithms
The basic idea behind any decision tree algorithm is choosing the best attribute(s) to split the remaining instances and make that attribute a decision node. We repeat this process recursively for each child. The stopping criterion is generally one of the following, either all the instances have the same target attribute value, or there are no more attributes or there are no more instances to handle.
14.2.1 Basic Algorithm – Decision Tree Induction
The basic algorithm for the construction of a decision tree is greedy in nature. In this method the decision tree is constructed in a top-down, recursive, divide-and-conquer manner. At the start, all the training examples are at the root. Attributes are categorical, and the examples are partitioned recursively based on the selected attributes. The test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain). Let us understand this algorithm using the example training set given in Table 14.1. The decision attributes age, income, whether the person is a student or not, and credit rating are used to classify people based on whether they would buy a computer or not. Figure 14.1 shows one sample decision tree for the table. Remember that we can construct more than one decision tree for the table, depending on the order in which the decision attributes are chosen. In Figure 14.1, the first decision attribute chosen is age.
14.3 Criterion for Attribute Selection
Now the question is how to choose the best attribute. The best choice of attribute will result in the smallest decision tree. The basic heuristic is to choose the attribute that produces the “purest” partitions; we will discuss what purity means later on. One popular purity criterion is information gain: as the average purity of the partitions obtained from the chosen attribute increases, the information gain increases. Therefore the strategy to be adopted is to choose the attribute that results in the greatest information gain. We now need a good measure of purity, that is, a way of knowing when purity is maximal and when it is minimal.
14.3.1 Choosing Attributes
The basic methodology of creating a decision tree is the same for most of the decision tree algorithms. The crucial difference lies in how we select the attributes for the tree, or the order in which we select the attributes for the construction of the decision tree.
We will first focus on ID3 (Iterative Dichotomiser 3), an algorithm developed by Ross Quinlan (Figure 14.2) in 1975. The algorithm is used to generate a decision tree from a dataset using Shannon entropy.
In the decision tree given in Figure 14.3, we need to classify the length of the travel as short, medium or long based on when we leave, whether there is an accident on the way, and whether the traffic is stalled. How did we know that we first need to split on the Leave At attribute, and then split on Stall on the left side and on Accident on the right? This is the question we need to answer.
14.3.3 Construction of Decision Trees
The algorithm works in a top-down, recursive, divide-and-conquer fashion: an attribute is selected for the root node and a branch is created for each possible value of that attribute. The instances are then split into disjoint subsets, one for each branch extending from the decision node. This partitioning of the instances is repeated recursively for each branch of the decision node. The recursion stops when all instances that belong to a subset have the same class.
14.3.4 ID3 Heuristic
ID3 employs top-down induction of decision trees. Attribute selection is the fundamental step in constructing a decision tree. ID3 employs a top-down greedy search through the space of possible decision trees. The algorithm is called greedy because at each step the attribute with the highest score is picked and there is no backtracking. The idea is to select the attribute that is, at that point, most useful for classifying the examples.
To determine the best attribute, we look at the ID3 heuristic. ID3 splits on attributes based on their entropy. Entropy, which comes from information theory, is a measure of uncertainty or impurity in the data: the higher the entropy, the more information is needed to describe the data. In other words, we select the attribute that has the highest information gain. We will go into the details of entropy and information gain a little later.
14.3.5 Steps of the ID3 Algorithm
1. The first step of the algorithm is the selection of the attributes that will become nodes of the decision tree. As already discussed there are two terms entropy and information gain that are used as the basis for attribute selection.
2. Once the attribute is selected for the current node, the child nodes are generated, one for each possible value of the selected attribute.
3. The examples are then partitioned using the possible values of this attribute, and these partitioned subsets of the examples are assigned to the appropriate child nodes.
4. Steps 1 to 3 are repeated for each child node until all examples associated with a node are either all positive or all negative (see the sketch after this list).
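The following Python sketch is a minimal illustration of these four steps, under the assumption that each training example is a dictionary of categorical attribute values with the target class stored under the key "label"; it uses the entropy and information-gain measures defined in Section 14.4 below, and the function names are illustrative rather than part of the module.

import math
from collections import Counter

# Minimal ID3 sketch (illustrative). Each example is assumed to be a dict
# of categorical attribute values with the target class under "label".

def entropy(examples):
    counts = Counter(e["label"] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    total = len(examples)
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e for e in examples if e[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, attributes):
    labels = [e["label"] for e in examples]
    if len(set(labels)) == 1:          # all examples have the same class
        return labels[0]
    if not attributes:                 # no attributes left: return majority class
        return Counter(labels).most_common(1)[0][0]
    # Step 1: select the attribute with the greatest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a))
    tree = {best: {}}
    # Steps 2 and 3: one child per value of the chosen attribute,
    # holding the examples that have that value.
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        # Step 4: repeat recursively for each child node.
        tree[best][value] = id3(subset, [a for a in attributes if a != best])
    return tree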
14.3.6 ID3 in Gaming
Black & White, a game developed by Lionhead Studios and released in 2001, used ID3. The algorithm was used to predict a player’s reaction to a certain creature’s action. In this model, a greater feedback value means the creature should attack.
14.4 Information Gain
As we have discussed, we need to determine which attribute best classifies the data. Figure 14.4 shows an example of classifying whether a particular person is likely to pay their insurance or not. The figure shows splits based on two attributes, Balance and whether the applicant is Employed. The split of Balance into ≤ 50K and > 50K shows that the examples remain mixed and are not clearly demarcated. However, the split on Employed produces subsets that are not mixed. This shows that the test based on Employed is more informative.
Figure 14.5 (a) shows two partitions of samples: one that contains both types of samples and is therefore impure, and one that contains only one type of sample and is therefore pure. Figure 14.5 (b) similarly shows partitions that are very impure, less impure, and of minimum impurity.
Information gain is the statistical quantity measuring how well an attribute classifies the data. In other words, in order to select the best attribute we first calculate the information gain for each attribute and then choose the attribute with the greatest information gain.
14.4.1 Entropy
One common way to measure impurity was given by Claude Shannon. He defined a mathematical function, Entropy, which measures the information content of a random process. Entropy has the largest value when events are equiprobable and the smallest value when only one event has non-zero probability. Entropy comes from information theory: the higher the value of entropy, the more information is needed to describe the data. Entropy can be defined as given below:
Entropy(S) = -Σi pi log2(pi)
where the sum runs over the classes i and pi is the proportion of examples belonging to class i in the set. Now let us understand entropy and learning from examples. Entropy is minimized when all values of the target attribute are the same. For example, if we know that the attribute commute time will always be short, then its entropy is zero. Entropy is maximized when there is an equal chance of all values for the target attribute (i.e. the result is random). For example, if commute time is short in 3 instances, medium in 3 instances and long in 3 instances, entropy is maximized.
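As a quick numerical check of this commute-time example (a small Python calculation added here for illustration, not part of the module):

import math

# Commute time is short in 3 instances, medium in 3 and long in 3,
# so each of the three classes has probability 3/9 = 1/3.
p = 1 / 3
print(-3 * p * math.log2(p))   # log2(3), about 1.585: the maximum entropy for 3 classes

# If commute time were always short, the single class has probability 1:
print(-1 * math.log2(1))       # 0.0 (Python prints -0.0): the minimum entropy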
In other words, Entropy H(S) is a measure of the amount of uncertainty in the (data) set S. Essentially, Entropy measures the impurity of an arbitrary collection S of examples: the greater the entropy, the greater the uncertainty about the outcome of the random process (Figure 14.7).
For a collection S having positive and negative examples (Figure 14.8), the entropy is given as:
Entropy(S) = -p+ log2(p+) - p- log2(p-)
– where p+ is the proportion of positive examples, and
– p- is the proportion of negative examples
Entropy(S) = 0 if all members of S belong to the same class, and Entropy(S) = 1 (its maximum value) when the members are split equally between the two classes.
14.4.2 Two-Class Cases
Let us consider the example given in Figure 14.9. What is the entropy of a group in which all examples belong to the same class?
– entropy = -1 log2(1) = 0
However, such a group is not a good training set for learning. What is the entropy of a group with 50% of the examples in either class?
– entropy = -0.5 log2(0.5) - 0.5 log2(0.5) = 1
This group is a good training set for learning.
14.4.3 Entropy and Information Gain
As we have discussed before, our objective is to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned. Information gain tells us how important a given attribute of the feature vector is. We will use this information to decide the ordering of attributes in the nodes of a decision tree. The information gain of an attribute can be described in terms of entropy: it is the information gained by partitioning the original data set into subsets using that attribute. On this basis, information gain is defined as the difference in entropy from before to after the set is split on an attribute.
IG(A, S) = H(S) - Σt∈T p(t) H(t)
Here H(S) is the entropy of the set S before splitting. T is the collection of subsets created by splitting S on attribute A, so that S is the union of the subsets t ∈ T. p(t) is the proportion of the number of elements in t to the number of elements in the whole set S, and H(t) is the entropy of each subset t.
We can calculate information gain as a measure of the expected reduction in entropy. The higher the information gain, the greater the expected reduction in entropy.
Gain(S, A) = Entropy(S) - Σv∈values(A) (|Sv| / |S|) Entropy(Sv)
Here values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v.
Figure 14.9 shows an example with 30 instances or samples, of which 14 belong to one class and the rest (16) belong to the second class. The overall information gain is the entropy of the parent minus the weighted average entropy of its children. The parent entropy is given below:
Entropy(parent) = -(14/30) log2(14/30) - (16/30) log2(16/30) ≈ 0.996
14.5 Illustrative Example for ID3 Algorithm
Let us consider the sample training data given in Table 14.2 to determine whether an animal lays eggs. The training data set consists of 6 samples, each having four attributes, namely Warm-blooded, Feathers, Fur and Swims, and we need to find out whether the animal lays eggs.
The first step in the algorithm is to examine the decision values in the sample set S and calculate the entropy of S. We see that of the 6 samples, 4 are Yes and 2 are No. Hence Entropy(S) is calculated as given in Figure 14.10. Next we need to find the decision values associated with each of the four attributes (Figure 14.11) in order to find the entropy corresponding to each subset induced by that attribute. From this we can find the Gain obtained by splitting on each attribute, using the equation given in Figure 14.12:
Now, we have to find the Information Gain (IG) for all four attributes Warm-blooded, Feathers, Fur & Swims.
Figure 14.13 shows the calculation of Gain(S,Warm-Blooded) and Gain(S,Feathers). From the example we know that the set S has 4 Y and 2 N. Of the 5 animals with Warm-blooded = Y, 3 lay eggs and 2 do not; therefore Entropy(SYes) can be calculated. Similarly we can calculate Entropy(SNo). From these calculations and Entropy(S) (already found) we can find Gain(S,Warm-Blooded) = 0.10916 using the equation given in Figure 14.12. Similarly we can find Gain(S,Feathers) = 0.45914. In the same way, as shown in Figure 14.14, we can find Gain(S,Fur) = 0.3167 and Gain(S,Swims) = 0.04411.
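These two gains can be checked numerically from the class counts stated above (a short Python calculation added for illustration; since 4 animals lay eggs in total and 3 of the warm-blooded ones do, the single non-warm-blooded animal must be one that lays eggs; the counts for Fur and Swims are not restated here, so those gains are simply quoted from the figures):

import math

# Two-class entropy from (lays eggs, does not lay eggs) counts.
def H(pos, neg):
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

entropy_S = H(4, 2)                                 # 4 lay eggs, 2 do not: ~0.918

# Warm-blooded: 5 animals are warm-blooded (3 lay eggs, 2 do not);
# the remaining animal is not warm-blooded and lays eggs.
gain_warm_blooded = entropy_S - (5/6) * H(3, 2) - (1/6) * H(1, 0)
print(round(gain_warm_blooded, 4))                  # ~0.1092, i.e. Gain(S,Warm-Blooded)

# Feathers: 3 animals have feathers and all lay eggs;
# of the 3 without feathers, 1 lays eggs and 2 do not.
gain_feathers = entropy_S - (3/6) * H(3, 0) - (3/6) * H(1, 2)
print(round(gain_feathers, 4))                      # ~0.4591, i.e. Gain(S,Feathers)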
By comparing the Gains of the four attributes we see that Gain(S,Feathers) is the maximum and this is chosen as the root node of the decision tree.
Studying the part of the table where Feathers = Y, we see that the animals Ostrich, Raven and Albatross lay eggs, without having to check any other attribute. However, for the case Feathers = N, we cannot unambiguously determine whether the animal lays eggs. For this we need to consider the reduced table and calculate the corresponding entropy of the new set S (Figure 14.17). We then calculate the gain of the three remaining attributes Warm-Blooded, Fur and Swims as shown in Figure 14.18.
We now find that Gain(S,Warm-Blooded) is the highest, and that attribute is chosen as the next decision node. We then continue constructing the tree. At this stage all the samples are accounted for, with a single class in each subset. Thus we have obtained the final tree (Figure 14.19).
14.6 Try Out Example – Factors affecting sunburn
Table 14.3 shows an example that you can try out yourself.
Summary
- Explained the greedy algorithm for Decision tree induction
- Outlined the ID3 heuristic for choosing attributes
- Explained the concepts of entropy, impurity and information gain
- Used an example to illustrate the building of a Decision tree using ID3
You can view the video on Decision Tree Algorithm ID3.