In the vast forest of machine learning, the decision tree (Decision Tree) is a unique and powerful "wisdom tree".
It is a supervised learning algorithm that can be used for both classification and regression tasks, and it simulates the human decision-making process through a tree structure.
Starting from the basic concepts, this article will explain step by step how a decision tree is built, the commonly used splitting algorithms, and its practical applications.
1. Overview
A decision tree (Decision Tree) is a classic supervised learning algorithm that simulates the human decision-making process through a tree structure.
It can handle both classification problems (such as determining whether an email is spam) and regression problems (such as predicting housing prices).
Through a series of decision rules, a decision tree divides the data into different categories or predicts a target value.
Its structure is like an inverted tree: at the top is the root node, representing the entire data set; the branches then fork out through a series of feature tests, finally reaching the leaf nodes, each of which represents a decision result.
The biggest advantage of decision trees is their strong interpretability. The decision process resembles human reasoning, making judgments through a series of "if ... then ..." rules.
For example, in a simple loan approval decision tree, it may first determine whether the applicant's income is above a certain threshold, and if it is above, then determine whether the credit history is good, and finally decide whether to approve the loan.
This intuitive decision-making process makes decision trees very popular in many scenarios where interpretability is needed, such as medical diagnosis, financial risk assessment and other fields.
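As an illustration only (the income threshold and feature names below are hypothetical, not taken from any real approval model), the loan example above can be written as a short chain of if/then rules in Python:

# A hand-written sketch of the "if ... then ..." rules such a loan-approval
# tree might encode; the 50_000 income threshold is purely hypothetical.
def approve_loan(income: float, good_credit_history: bool) -> bool:
    if income >= 50_000:            # root node: income check
        if good_credit_history:     # internal node: credit-history check
            return True             # leaf: approve
        return False                # leaf: reject
    return False                    # leaf: reject

print(approve_loan(income=60_000, good_credit_history=True))   # True
print(approve_loan(income=30_000, good_credit_history=True))   # False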
2. Basic Process
The basic process of building a decision tree consists of 4 steps:
- Feature selection: the starting point of tree construction. From all available features, select the one that best separates the data as the basis for splitting the current node.
- Node creation: create a decision node based on the selected feature. This node divides the data into several subsets according to the different values of the feature.
- Recursive splitting: for each subset, repeat the process above, continuing to select features and create nodes until a stopping condition is met.
- Pruning (optional): to prevent an overly complex decision tree from overfitting, the tree is usually pruned.
The pruning step will be covered in detail in a later article; this article focuses on the first 3 steps.
Among those 3 steps, the most important is the first one, feature selection, that is, how to split the data set at each fork of the decision tree.
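To make the first three steps concrete, here is a minimal, purely didactic sketch of the recursive procedure. It is not how scikit-learn is implemented; for simplicity it selects the feature whose split leaves the fewest misclassified samples, standing in for the information-based criteria discussed in Section 3.

from collections import Counter

# A toy recursive tree builder: step 1 picks a feature, step 2 creates a node,
# step 3 recurses on each subset until the node is pure or no features remain.
def build_tree(rows, labels, features):
    # Stopping condition: pure node or no features left -> leaf with majority label
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]

    def errors_after_split(f):
        # Misclassifications if every branch predicts its own majority label
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[f], []).append(y)
        return sum(len(ys) - Counter(ys).most_common(1)[0][1] for ys in groups.values())

    best = min(features, key=errors_after_split)               # step 1: feature selection
    node = {}                                                  # step 2: create a node
    for value in set(row[best] for row in rows):               # step 3: recursive split
        pairs = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*pairs)
        node[(best, value)] = build_tree(list(sub_rows), list(sub_labels),
                                         [f for f in features if f != best])
    return node

# Toy data: feature 0 = income level, feature 1 = credit history OK
rows = [("high", "yes"), ("high", "no"), ("low", "yes"), ("low", "no")]
labels = ["approve", "reject", "reject", "reject"]
print(build_tree(rows, labels, features=[0, 1]))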
3. Common algorithms for dividing data sets
In the process of generating a decision tree, there are three common algorithms (splitting criteria) for dividing the data set.
3.1. Based on information gain (ID3)
Information gain is a concept from information theory. It measures how much the uncertainty in the data is reduced after splitting on a given feature.
Specifically, information gain is the difference between the entropy of the data before the split (a measure of how disordered the data is) and the weighted average of the entropies of the subsets after the split.
The larger a feature's information gain, the more the uncertainty of the data is reduced by splitting on it, and the more valuable that feature is.
Its calculation formula is: $ \text{Gain}(D,A) = H(D) - \sum_{v \in \text{Values}(A)} \frac{|D^v|}{|D|} H(D^v) $
where $ H(D) = -\sum_i p_i \log p_i $ is the information entropy of $D$, with $p_i$ the proportion of class $i$.
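Before turning to scikit-learn, these two quantities are easy to verify by hand. The following is a small sketch with made-up toy data (the outlook/play values are illustrative only):

import numpy as np

def entropy(labels):
    # H(D) = -sum(p_i * log2(p_i)) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    # Gain(D, A) = H(D) - sum(|D^v|/|D| * H(D^v)) over the values v of feature A
    total = entropy(labels)
    weighted = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        weighted += len(subset) / len(labels) * entropy(subset)
    return total - weighted

# Toy example: does "outlook" help predict whether we play outside?
play = np.array(["yes", "yes", "no", "no", "yes", "no"])
outlook = np.array(["sunny", "sunny", "rain", "rain", "sunny", "rain"])
print(information_gain(play, outlook))  # 1.0: outlook perfectly separates the two classes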
It's very simple to use with the scikit-learn library:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target
# Divide the training set and the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Build a decision tree using information gain (entropy criterion)
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(X_train, y_train)
# Predict and calculate accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy of decision tree based on information gain: {accuracy}')
## Output result:
'''
Accuracy of decision tree based on information gain: 0.9777777777777777
'''
Here the parameter criterion='entropy' means the splits are chosen with the entropy-based criterion, i.e. the decision tree is built in the style of the ID3 algorithm.
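Because interpretability was highlighted in Section 1, it is worth a quick aside that the rules learned by the clf fitted above can be printed as plain text with scikit-learn's export_text helper:

from sklearn.tree import export_text

# Print the learned "if ... then ..." rules of the fitted tree in plain text
print(export_text(clf, feature_names=list(iris.feature_names)))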
3.2. Based on gain ratio (C4.5)
The gain ratio is an improved version of information gain that takes into account the effect of the number of branches produced by splitting on a feature.
If a feature has many distinct values, it may obtain a large information gain even when the entropy after splitting is not reduced by much, which can cause the decision tree to prefer features with many values.
The gain ratio corrects this bias by dividing the information gain by the information entropy produced by the split itself (called the split information, or intrinsic value).
Its formula is: $ \text{Gain\_Ratio}(D,A) = \frac{\text{Gain}(D,A)}{\text{IV}(A)} $
where $ \text{IV}(A) = -\sum_{v} \frac{|D^v|}{|D|} \log \frac{|D^v|}{|D|} $ is the split information of feature $A$.
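Although scikit-learn only approximates C4.5 (see below), the gain ratio itself is straightforward to compute by hand. This sketch builds on the entropy/information_gain helpers and the toy play/outlook arrays from the Section 3.1 sketch above (again, illustrative values only):

import numpy as np

def split_information(feature_values):
    # IV(A) = -sum(|D^v|/|D| * log2(|D^v|/|D|)) over the values v of feature A
    _, counts = np.unique(feature_values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(labels, feature_values):
    # Gain_Ratio(D, A) = Gain(D, A) / IV(A); information_gain is the helper
    # defined in the Section 3.1 sketch above
    return information_gain(labels, feature_values) / split_information(feature_values)

# Reusing the toy play / outlook arrays from the Section 3.1 sketch
print(gain_ratio(play, outlook))  # 1.0: gain of 1.0 divided by an IV of 1.0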
scikit-learn does not provide an option that directly implements the gain ratio, but it can be approximated by adjusting the parameters.
The C4.5 algorithm adds the concept of gain ratio on top of ID3 and also supports handling continuous values.
# Use C4.5 algorithm (approximate implementation)
clf_c45 = DecisionTreeClassifier(criterion='entropy', splitter='best', random_state=42)
clf_c45.fit(X_train, y_train)
# Predict and calculate accuracy
y_pred_c45 = clf_c45.predict(X_test)
accuracy_c45 = accuracy_score(y_test, y_pred_c45)
print(f'Gain ratio-based decision tree accuracy: {accuracy_c45}')
## Output result:
'''
Gain ratio-based decision tree accuracy: 0.9777777777777777
'''
The accuracy obtained by this approximate implementation is the same as that of the ID3 example, which is expected here: the parameters passed to DecisionTreeClassifier are effectively identical (splitter='best' is already the default).
3.3. Based on Gini index (CART)
The Gini index is another way to measure the impurity of data, and it is relatively simple to compute. The smaller the Gini index, the higher the purity of the data.
For a candidate feature, we calculate the weighted average of the Gini indices of the subsets obtained by splitting on it, and then select the feature that reduces the Gini index the most as the splitting feature.
The Gini index tends to favor features that split the data into "purer" subsets.
Its calculation formula is: $ \text{Gini}(D) = 1 - \sum_i p_i^2 $, where $p_i$ is the proportion of class $i$ in $D$.
The feature whose split minimizes the weighted Gini index is selected.
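As with entropy, the formula can be checked in a few lines of NumPy on made-up labels (illustrative values only) before handing the work to scikit-learn:

import numpy as np

def gini(labels):
    # Gini(D) = 1 - sum(p_i^2) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(np.array(["yes", "yes", "no", "no"])))  # 0.5: maximally mixed two-class set
print(gini(np.array(["yes", "yes", "yes"])))       # 0.0: perfectly pure set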
The implementation example is as follows:
# Build a decision tree using Gini Index
clf_cart = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_cart.fit(X_train, y_train)
# Predict and calculate accuracy
y_pred_cart = clf_cart.predict(X_test)
accuracy_cart = accuracy_score(y_test, y_pred_cart)
print(f'Gini index-based decision tree accuracy: {accuracy_cart}')
## Output result:
'''
Gini index-based decision tree accuracy: 1.0
'''
Judging from these results, the CART criterion achieves higher accuracy than the first two on this particular data set.
4. Comparison of different algorithms
Each of the three algorithms has its own advantages, disadvantages, and applicable scenarios; choose according to the actual characteristics of the data.
| Algorithm | Advantages | Disadvantages | Applicable scenarios |
|---|---|---|---|
| Information gain (ID3) | Strong interpretability | Biased toward features with many values | Data sets whose features have few distinct values |
| Gain ratio (C4.5) | Corrects the bias toward many-valued features, helping to curb overfitting | Comparatively complex to compute | Classification tasks with many-valued features or many categories |
| Gini index (CART) | High computational efficiency | Can be oversensitive in some cases, leading to overfitting | Large-scale data / scenarios that require fast training |
5. Summary
A decision tree is a powerful and intuitive machine learning algorithm that classifies data or performs regression through a series of decision rules.
Information gain, gain ratio, and the Gini index are three commonly used feature selection criteria; each has its advantages and disadvantages and suits different application scenarios.
With a powerful machine learning library such as scikit-learn, we can easily implement decision tree models based on these criteria and apply them to practical problems.
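Finally, since regression (such as the housing-price example mentioned at the beginning) was promised as well, here is a minimal hedged sketch of the regression counterpart of the same API; the diabetes dataset is used only because it ships with scikit-learn, not because it appears elsewhere in this article:

from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Decision trees also handle regression: each leaf predicts a numeric value
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
reg = DecisionTreeRegressor(max_depth=3, random_state=42)  # shallow tree to limit overfitting
reg.fit(X_train, y_train)
print(mean_squared_error(y_test, reg.predict(X_test)))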