In the vast forest of machine learning, the decision tree (Decision Tree) is a unique and powerful "wisdom tree".
It is a supervised learning algorithm that can be used for both classification and regression tasks, and it simulates the human decision-making process through a tree structure.
Starting from the basic concepts, this article will explain step by step how a decision tree is built, the commonly used splitting algorithms, and its practical applications.
1. Overview
A decision tree (Decision Tree) is a classic supervised learning algorithm that simulates the human decision-making process through a tree structure.
It can handle both classification problems (such as determining whether an email is spam) and regression problems (such as predicting housing prices).
Through a series of decision rules, a decision tree divides the data into different categories or predicts a target value.
Its structure is like an inverted tree: at the top is the root node, representing the entire data set; the branches then fork out through a series of feature tests, finally reaching the leaf nodes, each of which represents a decision result.
The biggest advantage of decision trees is their strong interpretability. The decision process resembles human reasoning, making judgments through a series of "if ... then ..." rules.
For example, in a simple loan approval decision tree, it may first determine whether the applicant's income is above a certain threshold, and if it is above, then determine whether the credit history is good, and finally decide whether to approve the loan.
This intuitive decision-making process makes decision trees very popular in many scenarios where interpretability is needed, such as medical diagnosis, financial risk assessment and other fields.
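As an illustration only (the income threshold and feature names below are hypothetical, not taken from any real approval model), the loan example above can be written as a short chain of if/then rules in Python:

# A hand-written sketch of the "if ... then ..." rules such a loan-approval
# tree might encode; the 50_000 income threshold is purely hypothetical.
def approve_loan(income: float, good_credit_history: bool) -> bool:
    if income >= 50_000:            # root node: income check
        if good_credit_history:     # internal node: credit-history check
            return True             # leaf: approve
        return False                # leaf: reject
    return False                    # leaf: reject

print(approve_loan(income=60_000, good_credit_history=True))   # True
print(approve_loan(income=30_000, good_credit_history=True))   # False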
2. Basic Process
The basic process of building a decision tree consists of 4 steps:
- Feature selection: the starting point of tree construction. From all available features, select the one that best separates the data as the basis for splitting the current node.
- Node creation: create a decision node based on the selected feature. This node divides the data into several subsets according to the different values of the feature.
- Recursive splitting: for each subset, repeat the process above, continuing to select features and create nodes until a stopping condition is met.
- Pruning (optional): to prevent an overly complex decision tree from overfitting, the tree is usually pruned.
The pruning step will be covered in detail in a later article; this article focuses on the first 3 steps.
Among those 3 steps, the most important is the first one, feature selection, that is, how to split the data set at each fork of the decision tree.
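To make the first three steps concrete, here is a minimal, purely didactic sketch of the recursive procedure. It is not how scikit-learn is implemented; for simplicity it selects the feature whose split leaves the fewest misclassified samples, standing in for the information-based criteria discussed in Section 3.

from collections import Counter

# A toy recursive tree builder: step 1 picks a feature, step 2 creates a node,
# step 3 recurses on each subset until the node is pure or no features remain.
def build_tree(rows, labels, features):
    # Stopping condition: pure node or no features left -> leaf with majority label
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]

    def errors_after_split(f):
        # Misclassifications if every branch predicts its own majority label
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[f], []).append(y)
        return sum(len(ys) - Counter(ys).most_common(1)[0][1] for ys in groups.values())

    best = min(features, key=errors_after_split)               # step 1: feature selection
    node = {}                                                  # step 2: create a node
    for value in set(row[best] for row in rows):               # step 3: recursive split
        pairs = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*pairs)
        node[(best, value)] = build_tree(list(sub_rows), list(sub_labels),
                                         [f for f in features if f != best])
    return node

# Toy data: feature 0 = income level, feature 1 = credit history OK
rows = [("high", "yes"), ("high", "no"), ("low", "yes"), ("low", "no")]
labels = ["approve", "reject", "reject", "reject"]
print(build_tree(rows, labels, features=[0, 1]))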
3. Common algorithms for dividing data sets
In the process of generating a decision tree, there are three common algorithms (splitting criteria) for dividing the data set.
3.1. Based on information gain (ID3)
Information gain is a concept from information theory. It measures how much the uncertainty in the data is reduced after splitting on a given feature.
Specifically, information gain is the difference between the entropy of the data before the split (a measure of how disordered the data is) and the weighted average of the entropies of the subsets after the split.
The larger a feature's information gain, the more the uncertainty of the data is reduced by splitting on it, and the more valuable that feature is.
Its calculation formula is: $ \text{Gain}(D,A) = H(D) - \sum_{v \in \text{Values}(A)} \frac{|D^v|}{|D|} H(D^v) $
where $ H(D) = -\sum_i p_i \log p_i $ is the information entropy of $D$, with $p_i$ the proportion of class $i$.
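Before turning to scikit-learn, these two quantities are easy to verify by hand. The following is a small sketch with made-up toy data (the outlook/play values are illustrative only):

import numpy as np

def entropy(labels):
    # H(D) = -sum(p_i * log2(p_i)) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    # Gain(D, A) = H(D) - sum(|D^v|/|D| * H(D^v)) over the values v of feature A
    total = entropy(labels)
    weighted = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        weighted += len(subset) / len(labels) * entropy(subset)
    return total - weighted

# Toy example: does "outlook" help predict whether we play outside?
play = np.array(["yes", "yes", "no", "no", "yes", "no"])
outlook = np.array(["sunny", "sunny", "rain", "rain", "sunny", "rain"])
print(information_gain(play, outlook))  # 1.0: outlook perfectly separates the two classes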
It's very simple to use with the scikit-learn library:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target
# Divide the training set and the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Build a decision tree using information gain (entropy criterion)
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(X_train, y_train)
# Predict and calculate accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy of decision tree based on information gain: {accuracy}')
## Output result:
'''
Accuracy of decision tree based on information gain: 0.9777777777777777
'''
Here the parameter criterion='entropy' means the splits are chosen with the entropy-based criterion, i.e. the decision tree is built in the style of the ID3 algorithm.
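Because interpretability was highlighted in Section 1, it is worth a quick aside that the rules learned by the clf fitted above can be printed as plain text with scikit-learn's export_text helper:

from sklearn.tree import export_text

# Print the learned "if ... then ..." rules of the fitted tree in plain text
print(export_text(clf, feature_names=list(iris.feature_names)))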
3.2. Based on gain ratio (C4.5)
The gain ratio is an improved version of information gain that takes into account the effect of the number of branches produced by splitting on a feature.
If a feature has many distinct values, it may obtain a large information gain even when the entropy after splitting is not reduced by much, which can cause the decision tree to prefer features with many values.
The gain ratio corrects this bias by dividing the information gain by the information entropy produced by the split itself (called the split information, or intrinsic value).
Its formula is: $ \text{Gain\_Ratio}(D,A) = \frac{\text{Gain}(D,A)}{\text{IV}(A)} $
where $ \text{IV}(A) = -\sum_{v} \frac{|D^v|}{|D|} \log \frac{|D^v|}{|D|} $ is the split information of feature $A$.
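Although scikit-learn only approximates C4.5 (see below), the gain ratio itself is straightforward to compute by hand. This sketch builds on the entropy/information_gain helpers and the toy play/outlook arrays from the Section 3.1 sketch above (again, illustrative values only):

import numpy as np

def split_information(feature_values):
    # IV(A) = -sum(|D^v|/|D| * log2(|D^v|/|D|)) over the values v of feature A
    _, counts = np.unique(feature_values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(labels, feature_values):
    # Gain_Ratio(D, A) = Gain(D, A) / IV(A); information_gain is the helper
    # defined in the Section 3.1 sketch above
    return information_gain(labels, feature_values) / split_information(feature_values)

# Reusing the toy play / outlook arrays from the Section 3.1 sketch
print(gain_ratio(play, outlook))  # 1.0: gain of 1.0 divided by an IV of 1.0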
scikit-learn does not provide an option that directly implements the gain ratio, but it can be approximated by adjusting the parameters.
The C4.5 algorithm adds the concept of gain ratio on top of ID3 and also supports handling continuous values.
# Use C4.5 algorithm (approximate implementation)
clf_c45 = DecisionTreeClassifier(criterion='entropy', splitter='best', random_state=42)
clf_c45.fit(X_train, y_train)
# Predict and calculate accuracy
y_pred_c45 = clf_c45.predict(X_test)
accuracy_c45 = accuracy_score(y_test, y_pred_c45)
print(f'Gain ratio-based decision tree accuracy: {accuracy_c45}')
## Output result:
'''
Gain ratio-based decision tree accuracy: 0.9777777777777777
'''
The accuracy obtained by this approximate implementation is the same as that of the ID3 example, which is expected here: the parameters passed to DecisionTreeClassifier are effectively identical (splitter='best' is already the default).
3.3. Based on Gini index (CART)
The Gini index is another way to measure the impurity of data, and it is relatively simple to compute. The smaller the Gini index, the higher the purity of the data.
For a candidate feature, we calculate the weighted average of the Gini indices of the subsets obtained by splitting on it, and then select the feature that reduces the Gini index the most as the splitting feature.
The Gini index tends to favor features that split the data into "purer" subsets.
Its calculation formula is: $ \text{Gini}(D) = 1 - \sum_i p_i^2 $, where $p_i$ is the proportion of class $i$ in $D$.
The feature whose split minimizes the weighted Gini index is selected.
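As with entropy, the formula can be checked in a few lines of NumPy on made-up labels (illustrative values only) before handing the work to scikit-learn:

import numpy as np

def gini(labels):
    # Gini(D) = 1 - sum(p_i^2) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(np.array(["yes", "yes", "no", "no"])))  # 0.5: maximally mixed two-class set
print(gini(np.array(["yes", "yes", "yes"])))       # 0.0: perfectly pure set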
The implementation example is as follows:
# Build a decision tree using Gini Index
clf_cart = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_cart.fit(X_train, y_train)
# Predict and calculate accuracy
y_pred_cart = clf_cart.predict(X_test)
accuracy_cart = accuracy_score(y_test, y_pred_cart)
print(f'Gini index-based decision tree accuracy: {accuracy_cart}')
## Output result:
'''
Gini index-based decision tree accuracy: 1.0
'''
Judging from these results, the CART criterion achieves higher accuracy than the first two on this particular data set.
4. Comparison of different algorithms
Each of the three algorithms has its own advantages, disadvantages, and applicable scenarios; choose according to the actual characteristics of the data.
| Algorithm | Advantages | Disadvantages | Applicable scenarios |
|---|---|---|---|
| Information gain (ID3) | Strong interpretability | Biased toward features with many values | Data sets whose features have few distinct values |
| Gain ratio (C4.5) | Corrects the bias toward many-valued features, helping to curb overfitting | Comparatively complex to compute | Classification tasks with many-valued features or many categories |
| Gini index (CART) | High computational efficiency | Can be oversensitive in some cases, leading to overfitting | Large-scale data / scenarios that require fast training |
5. Summary
A decision tree is a powerful and intuitive machine learning algorithm that classifies data or performs regression through a series of decision rules.
Information gain, gain ratio, and the Gini index are three commonly used feature selection criteria; each has its advantages and disadvantages and suits different application scenarios.
With a powerful machine learning library such as scikit-learn, we can easily implement decision tree models based on these criteria and apply them to practical problems.
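Finally, since regression (such as the housing-price example mentioned at the beginning) was promised as well, here is a minimal hedged sketch of the regression counterpart of the same API; the diabetes dataset is used only because it ships with scikit-learn, not because it appears elsewhere in this article:

from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Decision trees also handle regression: each leaf predicts a numeric value
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
reg = DecisionTreeRegressor(max_depth=3, random_state=42)  # shallow tree to limit overfitting
reg.fit(X_train, y_train)
print(mean_squared_error(y_test, reg.predict(X_test)))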