
Pipeline in scikit-learn: building efficient, maintainable machine learning processes


When we use scikit-learn to train machine learning models, the data we use and the algorithm parameters are adjusted according to the specific situation.

However, the overall process of model training is largely the same, generally consisting of loading data, data preprocessing, feature selection, model training, and a few other steps.

If the training results are unsatisfactory, we go back to data preprocessing and retrain from there.

Today's post introduces Pipeline, a tool that chains multiple machine learning steps together.

It helps us simplify the machine learning workflow.

1. What is Pipeline

In scikit-learn, a Pipeline works like an industrial assembly line, connecting steps such as data preprocessing, feature selection, and model training in sequence.

For example, a typical machine learning workflow might include data normalization, principal component analysis (PCA) for feature extraction, and finally classification with a classifier such as a support vector machine.

Without Pipeline, you need to handle each step separately, manually passing the output of one step to the next. Pipeline lets you encapsulate these steps into a single object and handle the entire machine learning process in a more concise and efficient way.

From a code perspective, a pipeline is made up of a series of (key, value) pairs.

Here, key is a custom name that identifies the step;

value is either a scikit-learn transformer that implements the fit_transform method (for data preprocessing, feature extraction, etc.) or an estimator that implements only the fit method (for model training and prediction).
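
As a minimal sketch of this structure (the step names here are arbitrary), a pipeline matching the workflow described above might be built like this:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Each (key, value) pair names one step: two transformers followed by a final estimator
pipe = Pipeline([
    ('scaler', StandardScaler()),  # transformer: implements fit_transform
    ('pca', PCA(n_components=2)),  # transformer: implements fit_transform
    ('svc', SVC())                 # estimator: implements fit
])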

2. Role and advantages of Pipeline

2.1 Streamlining the training process

The biggest benefit of using Pipeline is that it streamlines the machine learning model training workflow.

We no longer have to manually invoke data preprocessing, feature engineering, and model training one by one every time we train a model or make a prediction.

For example, here is what the code looks like without Pipeline:

from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np

# Generate some simulated data
X = np.random.rand(100, 1)
y = 2 * X + 1 + 0.1 * np.random.randn(100, 1)

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Expand polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_scaled)
# Train the linear regression model
model = LinearRegression()
model.fit(X_poly, y)

Using a pipeline instead, the code can be simplified as follows:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np

# Generate some simulated data
X = np.random.rand(100, 1)
y = 2 * X + 1 + 0.1 * np.random.randn(100, 1)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', LinearRegression())
])
pipeline.fit(X, y)

This not only reduces the amount of code, but also makes the code structure clearer.

2.2 Avoiding data leakage

In machine learning, data leakage is a serious problem.

For example, when performing data preprocessing and model selection, accidentally leaking information from the test data into the processing of the training data leads to overly optimistic results when the model is evaluated on the test set.

Pipeline helps ensure that each step uses only the data it should. Within a Pipeline, the training data is processed through the steps in order, and the test data is then processed in the same order and manner, which goes a long way toward avoiding data leakage.

During cross-validation, Pipeline automatically processes the data in each fold through the steps in the correct order.

If the individual steps are handled manually, it is easy to mistakenly preprocess all of the data before cross-validation, which leads to data leakage.
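
To make the difference concrete, here is a minimal sketch contrasting the two approaches (the iris data and 5-fold setup are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Leaky: the scaler is fitted on ALL rows before the folds are split,
# so validation-fold statistics leak into every training fold
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(SVC(), X_leaky, y, cv=5)

# Safe: the pipeline refits the scaler on each training fold only
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
safe_scores = cross_val_score(pipe, X, y, cv=5)

print("leaky:", leaky_scores.mean(), "safe:", safe_scores.mean())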

2.3 Facilitating model tuning

The entire Pipeline can be treated as a single model whose parameters are tuned together.

For example, for a Pipeline containing a data preprocessor and a classifier, methods such as grid search (Grid Search) or random search (Random Search) can tune the parameters of the preprocessing step and the classifier at the same time.

As another example, for a Pipeline containing standardization and a support vector machine classifier, we can simultaneously adjust the standardization parameters (such as with_mean and with_std) and the support vector machine parameters (such as C and gamma) to find the optimal model configuration.
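
A minimal sketch of that idea with GridSearchCV (the grid values and the iris data are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])

# Parameter names combine the step name and the parameter name with '__'
param_grid = {
    'scaler__with_mean': [True, False],
    'svm__C': [0.1, 1, 10],
    'svm__gamma': ['scale', 0.1],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)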

3. Examples of Pipeline Usage

Examples are the best learning material. Below, datasets from the scikit-learn library are used to build Pipeline examples for regression, classification, and clustering problems.

3.1 Diabetes prediction example

This example first standardizes the diabetes data and then uses a linear regression model to predict disease progression.

from sklearn.datasets import load_diabetes
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Split the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

# Train the model on the training set
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Calculate the mean squared error (MSE) to evaluate the model's performance on the test set
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error (MSE):", mse)

# Calculate the coefficient of determination (R² score) to further assess the model's goodness of fit
r2 = r2_score(y_test, y_pred)
print("Coefficient of determination (R² score):", r2)

Finally, the mean squared error (MSE) and the coefficient of determination (R² score), two common regression metrics, measure the model's performance on the test set and help us understand how accurately the model predicts diabetes-related outcomes and how well it fits the data.

3.2 Iris classification example

The iris data is first standardized, and a support vector machine classifier is then used to classify the iris species.

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC())
])

# Train the model on the training set
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Calculate the accuracy to evaluate the model's performance on the test set
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

3.3 Handwritten digit clustering example

The data is first standardized, and the K-Means algorithm then clusters the handwritten digit image data; here we simply assume 10 clusters, since the digits are 0-9.

from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Load the handwritten digits dataset
digits = load_digits()
X = digits.data

# Split the training and test sets (for clustering, this split is mainly illustrative;
# real cluster analysis should be organized according to actual needs)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clusterer', KMeans(n_clusters=10))  # assume 10 clusters, since the digits are 0-9
])

# Fit the clustering pipeline on the training set
pipeline.fit(X_train)

# Get the cluster labels assigned to the training data
cluster_labels = pipeline['clusterer'].labels_

# Print the cluster labels for the first few training samples
print("Cluster labels for the first 10 training samples:")
print(cluster_labels[:10])

Note: all of the examples above run successfully on my local scikit-learn 1.5.2 installation.

4. Summary

Pipeline brings great convenience to our model training.

However, there are some things we need to pay special attention to when using Pipeline.

First, the order of the steps matters: the data is processed sequentially, following the order in which the steps are listed.

For example, if you want to perform feature selection first and then normalize the data, you need to place the feature selection step before the normalization step. If the order is wrong, model performance may degrade or the pipeline may not work at all.
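
As a small sketch of such an ordering (SelectKBest and k=2 are arbitrary choices for illustration):

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Steps run top to bottom: feature selection first, then scaling, then the classifier
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=2)),
    ('scaler', StandardScaler()),
    ('svc', SVC())
])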

Second, the interface compatibility of the individual steps is also important: each step in a Pipeline must meet certain interface requirements.

Data preprocessing steps (transformers) must implement the fit and transform (or fit_transform) methods;

model training steps (estimators) must implement the fit method.

If a custom step does not implement these methods correctly, the pipeline will fail when it runs.
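
As a minimal sketch of a custom step that satisfies these requirements (the class name and its doubling logic are made up purely for illustration), a transformer can inherit from BaseEstimator and TransformerMixin:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

class DoubleFeatures(BaseEstimator, TransformerMixin):
    """Toy transformer that multiplies every feature by 2."""

    def fit(self, X, y=None):
        return self  # nothing to learn, but fit must return self

    def transform(self, X):
        return X * 2

# TransformerMixin derives fit_transform from fit + transform,
# so the custom step plugs straight into a Pipeline
pipe = Pipeline([
    ('double', DoubleFeatures()),
    ('model', LinearRegression())
])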

Finally, when tuning parameters with a Pipeline, pay attention to how the parameters are named.

In a Pipeline, a parameter's name is a combination of the step name and the actual parameter name, joined by a double underscore.

For example, if a normalization step is named scaler and has a parameter with_mean, then during tuning the parameter should be referred to as scaler__with_mean.

This naming scheme ensures that the parameters in each step are correctly adjusted.
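
For instance, a short sketch of this convention in use (the pipeline here mirrors the scaler example above):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

# Step name + '__' + parameter name reaches inside the step
pipeline.set_params(scaler__with_mean=False)

# The same convention is used in a grid-search parameter grid
param_grid = {'scaler__with_mean': [True, False]}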