
When decision trees encounter dirty data: handling continuous and missing values


In machine learning, decision tree algorithms are widely used because of their simplicity and comprehensibility.

However, real-world data is often messy, and two issues in particular pose challenges for decision trees: continuous values and missing values.

  • Continuous values (such as age or income) cannot be used directly as discrete split points in a decision tree and must first be converted into discrete intervals.
  • Missing values (such as questionnaire items a user did not fill in) can cause information loss or bias the model.

This article analyzes in depth how decision trees handle continuous and missing values, and compares different approaches through worked examples.

1. Continuous value processing

1.1. What is continuous value processing

A decision tree is a model built on feature splits; its core idea is to partition the data into different regions.

However, continuous features cannot be used directly as discrete split points.

For example, an age feature cannot simply be divided into "age" and "not age"; it has to be converted into discrete intervals, such as "≤30 years old" and ">30 years old".

That conversion is what continuous value processing means.

1.2. Processing strategies

There are three common strategies for processing continuous values:

  1. Binary splitting (dichotomy)

Binary splitting is the most common way to handle continuous values.

The idea is to traverse all candidate split points, select the optimal threshold, and divide the continuous feature into two intervals (a minimal sketch of this threshold search follows this list).

For example, for an age feature we can choose 30 as the split point and divide it into "≤30 years old" and ">30 years old".

  2. CART algorithm

In the CART algorithm, the split point is chosen by minimizing the Gini index (classification trees) or the mean squared error (regression trees).

  3. Multi-way splitting (C4.5 extension)

Some decision tree algorithms (such as C4.5) support dividing a continuous feature into more than two intervals, i.e., multi-way splitting.

This allows a finer-grained partition of continuous features, but may increase the complexity of the tree.
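To make the binary-splitting idea concrete, here is a minimal, self-contained sketch (not scikit-learn's actual implementation): candidate thresholds are taken as the midpoints between sorted unique values, and the one with the lowest weighted Gini index (the CART criterion for classification) wins. The gini and best_binary_split helpers and the toy age data are invented purely for illustration.

import numpy as np

def gini(labels):
    """Gini impurity of a 1-D label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_binary_split(x, y):
    """Enumerate candidate thresholds (midpoints between sorted unique
    values of x) and return the one with the lowest weighted Gini."""
    values = np.unique(x)
    thresholds = (values[:-1] + values[1:]) / 2
    best_t, best_score = None, np.inf
    for t in thresholds:
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy "age" feature with binary labels
age = np.array([22, 25, 28, 31, 35, 40, 45, 52])
label = np.array([0, 0, 0, 1, 1, 1, 1, 1])
print(best_binary_split(age, label))  # (29.5, 0.0): a perfect split at ~30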

1.3. A worked example

Below is a simple example using the scikit-learn library that demonstrates how a decision tree model handles continuous values.

The example uses the classic iris dataset, whose features are all continuous values (petal and sepal length and width), and builds a decision tree classifier to classify the irises.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X = iris.data    # feature matrix, all continuous-valued features
y = iris.target  # class labels

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create a decision tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Train the model on the training set
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")

# Print the split information of the decision tree's internal nodes
n_nodes = clf.tree_.node_count
children_left = clf.tree_.children_left
children_right = clf.tree_.children_right
feature = clf.tree_.feature
threshold = clf.tree_.threshold

for i in range(n_nodes):
    if children_left[i] != children_right[i]:  # internal node (leaves have no children)
        print(f"Node {i}: Feature {iris.feature_names[feature[i]]} <= {threshold[i]}")

 ## Output result:
 '''
 Model accuracy: 1.00
 Node 0: Feature petal length (cm) <= 2.449999988079071
 Node 2: Feature petal length (cm) <= 4.75
 Node 3: Feature petal width (cm) <= 1.600000023841858
 Node 6: Feature petal width (cm) <= 1.75
 Node 7: Feature petal length (cm) <= 4.950000047683716
 Node 9: Feature petal width (cm) <= 1.550000011920929
 Node 11: Feature petal length (cm) <= 5.450000047683716
 Node 14: Feature petal length (cm) <= 4.8500001430511475
 Node 15: Feature sepal width (cm) <= 3.100000023841858
 '''

The output shows that the decision tree automatically handles continuous features: it learns a threshold split point for each internal node (for example, petal length (cm) <= 2.45) and uses those thresholds to classify the samples.
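If you want to see the whole tree with its learned thresholds rather than a flat node list, scikit-learn's export_text helper prints a readable text rendering. A small follow-up to the example above, reusing the fitted clf and the iris object:

from sklearn.tree import export_text

# Each "<=" line in the printout is a learned threshold on a continuous feature
print(export_text(clf, feature_names=list(iris.feature_names)))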

2. Missing value processing

2.1. What is missing value processing

Missing values in the data can lead to sparsity, information loss, and even model bias.

How a decision tree deals with missing values is therefore a key question.

2.2. Processing strategies

Missing values can be handled either during the data preprocessing stage or by the algorithm's own built-in mechanisms.

In the preprocessing stage, the common options are (see the sketch after this list):

  • Deleting samples with missing values: simple and direct, but it can discard too much data, especially when many values are missing.
  • Filling (imputation): fill missing values with a statistic such as the mean, median, or mode. This preserves the size of the dataset but may introduce bias.
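A minimal sketch of these two preprocessing options, using NumPy only; the toy feature matrix is made up for illustration:

import numpy as np

# Toy feature matrix with NaN marking missing entries
X = np.array([[25.0,   3.0],
              [np.nan, 4.5],
              [31.0,   np.nan],
              [40.0,   5.2]])

# Option 1: delete samples (rows) that contain any missing value
rows_complete = ~np.isnan(X).any(axis=1)
X_dropped = X[rows_complete]          # keeps only the fully observed rows

# Option 2: fill missing entries with each column's mean
col_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_means, X)

print(X_dropped)
print(X_filled)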

As for handling missing values inside the algorithms themselves, the built-in mechanisms include:

  • CART's weighting strategy: during a split, samples whose split feature is missing are taken into account through weights; for example, they can be distributed to the left and right child nodes in proportion.
  • C4.5's probability-weight method: samples with missing values are assigned probability weights and still participate in the information gain calculation.
  • Sparsity-aware splitting in XGBoost/LightGBM: missing values are detected automatically and routed along an optimized default branch at each split (see the sketch below).
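As an illustration of built-in missing value support, the sketch below uses scikit-learn's HistGradientBoostingClassifier, a histogram-based gradient boosting model inspired by LightGBM that accepts NaN inputs directly. It is not one of the three algorithms listed above, but its native handling of missing values follows the same sparsity-aware idea: no imputation step is needed.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data.copy(), iris.target

# Randomly blank out roughly 10% of the entries
rng = np.random.RandomState(42)
X[rng.rand(*X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# The model routes samples with NaN to a learned default branch at each split
clf = HistGradientBoostingClassifier(random_state=42)
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")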

2.3. A worked example

In scikit-learn, the decision tree model itself does not directly handle missing values, so we first preprocess the data containing missing values and then fit the decision tree.

Here is an example showing how to preprocess data with missing values and then use a decision tree model for classification.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Artificially introduce missing values (~10% of all entries)
np.random.seed(42)
missing_mask = np.random.rand(*X.shape) < 0.1
X_with_missing = X.copy()
X_with_missing[missing_mask] = np.nan

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_with_missing, y, test_size=0.3, random_state=42)

# Fill missing values with the column means (fit on the training set only)
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Create a decision tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Train the model on the imputed training set
clf.fit(X_train_imputed, y_train)

# Make predictions on the imputed test set
y_pred = clf.predict(X_test_imputed)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")

 ## Running results:
 '''
 Model accuracy: 0.91
 '''

In this example, we first load the iris dataset and then randomly replace about 10% of the entries with missing values, to simulate the missing data found in practice.

Before training the model, SimpleImputer(strategy='mean') fills the missing values with the column means; the resulting model still reaches 91% accuracy on the test set.
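SimpleImputer also supports the other fill strategies mentioned in section 2.2; any of these lines is a drop-in replacement for the imputer in the code above:

# Median fill: more robust to outliers than the mean
imputer = SimpleImputer(strategy='median')

# Mode fill: use the most frequent value in each column
# (also applicable to categorical features)
imputer = SimpleImputer(strategy='most_frequent')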

3. Limitations of continuous and missing value handling

Decision tree handling of continuous and missing values is not a cure-all; it has its own limitations.

Limitations of continuous value processing:

  • Low efficiency on high-dimensional sparse data: traversing all candidate split points makes the computation significantly more expensive.
  • Split point selection can be affected by outliers: outliers may pull the chosen threshold away from the optimal value, hurting model performance.

Limitations of missing value processing:

  • Built-in methods may increase computational cost: mechanisms such as CART's weighting strategy add overhead to training.
  • Performance degrades at extreme missing rates: when too large a proportion of the data is missing, no processing method can fully recover the lost information, and model performance drops.

4. Summary

When decision trees handle continuous and missing values, the core ideas are flexibility and robustness.

With appropriate continuous value handling (such as binary or multi-way splitting) and missing value strategies (such as imputation during preprocessing, or the algorithm's built-in mechanisms), we can significantly improve the performance of decision tree models.

However, these methods also have their limitations and need to be selected and optimized according to specific data characteristics and business needs.