# Decision Tree: Important Things to Know

A decision tree organizes a series of decision rules in a tree structure. It is one of the most practical methods for non-parametric, supervised learning. Our goal in this article is to demonstrate how to build a decision tree that predicts the value of a target by learning decision rules inferred from the training data. Let's first look at some important decision tree terminology.

The root node is at the top of the tree and represents the entire population being analyzed. Splitting is the process of dividing a node into sub-nodes; a node whose sub-nodes can be split further is called a decision node. Leaf nodes are nodes that are not split any further. Removing the sub-nodes of a decision node is called pruning.

Pruning is the opposite of splitting and is used to prevent overfitting. Now let's walk through the process of creating a decision tree from data.

First, let's inspect the dataset. In this example, there are two classes, and each training record has two features: size and color. We can construct decision rules from these features.

For example, here are two simple decision rules: the first is whether the size is greater than or equal to two, and the second is whether the color is yellow. In real applications, there are usually many candidate decision rules, so we need a way to choose the best rule to split the current node, starting from the root node.

We iterate through all the decision rules and calculate the information gain of each candidate split. Calculating information gain is the centerpiece of this step. So what is information gain? Let's look more closely at each group produced by a decision rule. A group is pure when all of its records belong to the same class.

Here we use Gini impurity to measure how pure the current node is. When a group is pure, its Gini impurity is zero; when it is a half-and-half mixture of two classes, its Gini impurity is 0.5. For any decision split, a parent group is divided into two child groups. The information gain of the split is the impurity of the parent group minus the weighted average impurity of the child groups.
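These two quantities are straightforward to compute. A minimal sketch (function names are my own choice):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

print(gini(["a", "a"]))  # pure group -> 0.0
print(gini(["a", "b"]))  # half-and-half mixture -> 0.5
```

A perfectly mixed parent split into two pure children gives the maximum possible gain of 0.5, while a split that leaves the children as mixed as the parent gives a gain of zero.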

Once we have calculated the information gain of each candidate decision rule, we select the rule with the largest information gain to split the current node. We then split the child nodes recursively, considering only the decision rules not yet used in the current branch. We stop splitting when no decision rules are left or the group's impurity is zero.
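The recursive procedure above can be sketched in a few lines. This is a simplified illustration under my own assumptions (rules as named predicates, leaves stored as class counts, toy data invented), not a production implementation:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build(records, rules, label=lambda r: r[-1]):
    """Greedily split on the rule with the largest information gain.
    A leaf is a Counter of class labels; an inner node is a dict."""
    labels = [label(r) for r in records]
    if gini(labels) == 0.0 or not rules:        # pure group or no rules left
        return Counter(labels)

    def gain(rule):
        _, test = rule
        left = [label(r) for r in records if test(r)]
        right = [label(r) for r in records if not test(r)]
        n = len(records)
        return gini(labels) - (len(left) / n * gini(left)
                               + len(right) / n * gini(right))

    name, test = max(rules, key=gain)           # best rule for this node
    left = [r for r in records if test(r)]
    right = [r for r in records if not test(r)]
    if not left or not right:                   # degenerate split: stop here
        return Counter(labels)
    rest = [r for r in rules if r[0] != name]   # each rule used once per branch
    return {name: {True: build(left, rest, label),
                   False: build(right, rest, label)}}

# Hypothetical toy data: (size, color, class)
records = [(1, "yellow", "square"), (1, "green", "square"),
           (3, "yellow", "circle"), (3, "yellow", "square")]
rules = [("size >= 2", lambda r: r[0] >= 2),
         ("color == yellow", lambda r: r[1] == "yellow")]
print(build(records, rules))
```

On this toy data the root splits on `size >= 2`, since that rule has the larger information gain.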

Pruning and early-stopping conditions are sometimes used to limit the number of splits and prevent overfitting. The finished decision tree predicts the probability of each class as the ratio of that class within the leaf node. In this case, for a record with size greater than or equal to two and yellow color, there is a 50 percent chance that the record's class is a circle. That is a brief demonstration of how to construct a decision tree from data.
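In practice you would use a library rather than building the tree by hand. A minimal sketch with scikit-learn (assuming it is installed; the numeric encoding of the toy data is my own choice, with color encoded as 1 for yellow and 0 otherwise):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data from the running example: [size, is_yellow]
X = [[1, 1], [1, 0], [3, 1], [3, 1]]
y = ["square", "square", "circle", "square"]

clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# predict_proba returns the class ratios of the leaf the record lands in.
# The two size-3 yellow training records split 1 circle / 1 square,
# so the leaf probability is 50/50.
print(dict(zip(clf.classes_, clf.predict_proba([[3, 1]])[0])))
# -> {'circle': 0.5, 'square': 0.5}
```

This matches the hand-built result: the leaf reached by "size ≥ 2 and yellow" contains one circle and one square, so each class gets probability 0.5.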

Decision trees have many advantages: they are easy to interpret and straightforward to visualize. However, they have a big disadvantage: a single decision tree is prone to overfitting, especially when the tree is deep. One way to combat this issue is to set a maximum depth for the tree.

This limits the risk of overfitting, but at the expense of increased bias. A random forest is a good way to reduce overfitting without paying that bias penalty. It is simply a collection of decision trees whose results are aggregated into one final result. It is a strong ensemble technique and much more robust than a single decision tree.
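The comparison is easy to run with scikit-learn (assumed installed; the synthetic dataset and parameter values here are illustrative choices, not from the original article):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree grows deep and tends to overfit the training set.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Capping depth curbs overfitting at the cost of some bias.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# A forest aggregates many trees, reducing variance without that depth cap.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

for name, model in [("deep tree", deep_tree),
                    ("shallow tree", shallow_tree),
                    ("random forest", forest)]:
    print(f"{name}: test accuracy = {model.score(X_te, y_te):.3f}")
```

On most datasets the forest's test accuracy meets or beats the single tree's, because averaging many decorrelated trees reduces variance.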

So in real-world applications, people often go directly to random forests for modeling. I hope you enjoyed this article.