A practical tutorial on credit card fraud detection: from data preprocessing to model optimization and analysis

Introduction: Why is credit card fraud detection required?

According to the Nilson Report, global losses from credit card fraud exceed US$25 billion each year, and financial institutions need to complete a transaction risk assessment within 0.1 seconds. This article walks you through building a machine-learning credit card fraud detection system from scratch, with complete code and visual analysis, so that you can master the core skills of handling imbalanced data, tuning model parameters, and evaluating models.

1. Project preparation: Tools and data

(I) Technology stack list

Python 3.8+
Core library: pandas, numpy, matplotlib, seaborn
Machine Learning: scikit-learn, imbalanced-learn, xgboost
Evaluation metrics: classification_report, roc_auc_score

(II) Dataset Description

We use Kaggle's publicly available credit card transaction dataset. It contains 284,807 transaction records, of which fraudulent transactions account for only 0.172% (a typical imbalanced dataset). The features have already been transformed by PCA: 28 anonymized features, plus the transaction amount and the transaction time.

2. Data exploration: Understanding fraud patterns

import pandas as pd
import matplotlib.pyplot as plt

# Load the data (set the path to the downloaded Kaggle CSV)
df = pd.read_csv('')

# View the class distribution
print(df['Class'].value_counts())
# Output: 0    284315
#         1       492

# Visualize the class distribution
plt.figure(figsize=(6, 4))
df['Class'].value_counts().plot(kind='bar')
plt.title('Transaction Class Distribution')
plt.xlabel('Class (0: Normal, 1: Fraud)')
plt.ylabel('Count')
plt.show()

Key observations:

Fraudulent transactions account for only 0.17% of the records, so the dataset is severely imbalanced;
Special handling is required to keep the model from biasing toward the majority class;
The transaction amount (Amount) feature needs to be standardized.
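To see why the imbalance matters, consider a trivial baseline that labels every transaction as normal. The sketch below uses only the class counts printed above; the function name is illustrative and not part of any library.

```python
# Class counts reported for the dataset (0 = normal, 1 = fraud).
n_normal = 284315
n_fraud = 492

def always_normal_metrics(n_normal, n_fraud):
    """Accuracy and fraud recall of a model that predicts 'normal' for everything."""
    total = n_normal + n_fraud
    accuracy = n_normal / total   # every normal transaction counts as correct
    recall = 0 / n_fraud          # no fraudulent transaction is ever caught
    return accuracy, recall

accuracy, recall = always_normal_metrics(n_normal, n_fraud)
print(f'Accuracy: {accuracy:.4%}, fraud recall: {recall:.0%}')
```

The baseline reaches roughly 99.8% accuracy while catching no fraud at all, which is why accuracy alone is not a useful target here.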

3. Data preprocessing: building a balanced training set

(I) Step 1: Standardize the transaction amount

from sklearn.preprocessing import StandardScaler

# Standardize the Amount feature on its own
scaler = StandardScaler()
df['Amount'] = scaler.fit_transform(df['Amount'].values.reshape(-1, 1)).ravel()

(II) Step 2: Process the time feature

# Time is the number of seconds since the first transaction in the dataset,
# so this gives the hour within each 24-hour cycle (fraud often clusters in specific periods)
df['Hour'] = (df['Time'] // 3600 % 24).astype(int)

(III) Step 3: Sampling technology comparison

Sampling method	Advantage	Disadvantage
Simple oversampling	Simple to implement	Prone to overfitting
SMOTE	Generates synthetic samples	Higher computational cost
Cluster-based sampling	Preserves the data distribution	Requires choosing the right number of clusters
Undersampling	Reduces computation	May discard important information

Chosen approach: combine SMOTE oversampling with random undersampling.

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Initialize the samplers
smote = SMOTE(sampling_strategy=0.5, random_state=42)
# Note: SMOTE already brings the minority/majority ratio to 0.5, so an undersampler
# targeting 0.5 removes almost nothing; raise its ratio to shrink the majority class
under_sampler = RandomUnderSampler(sampling_strategy=0.5, random_state=42)

# Split features and label
X = df.drop(['Class', 'Time'], axis=1)
y = df['Class']

# Combined sampling
X_resampled, y_resampled = smote.fit_resample(X, y)
X_resampled, y_resampled = under_sampler.fit_resample(X_resampled, y_resampled)
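To make "generates synthetic samples" concrete, the core step of SMOTE is interpolation between a minority sample and one of its nearest minority-class neighbors. The following is a minimal NumPy sketch of that single step, not imbalanced-learn's implementation; the function name and the two sample points are hypothetical.

```python
import numpy as np

def smote_like_sample(x, neighbor, rng):
    """Create one synthetic point on the line segment between x and a neighbor."""
    lam = rng.uniform(0.0, 1.0)          # random position along the segment
    return x + lam * (neighbor - x)

rng = np.random.default_rng(42)
fraud_a = np.array([1.0, 2.0])           # hypothetical minority-class sample
fraud_b = np.array([3.0, 6.0])           # one of its nearest minority neighbors
synthetic = smote_like_sample(fraud_a, fraud_b, rng)
print(synthetic)
```

Because every synthetic point lies between two real fraud samples, SMOTE adds variety compared with duplicating rows, but it can also blur the class boundary when fraud and normal samples overlap.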

4. Feature Engineering: Building Effective Features

Feature selection methods

Variance filtering: remove features with variance < 0.8;
Correlation analysis: keep the features most strongly associated with the label (for example, correlation > 0.1);
Recursive feature elimination: use a model to rank features.

from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

# Variance filtering
var_threshold = VarianceThreshold(threshold=0.8)
X_var = var_threshold.fit_transform(X_resampled)

# Relevance selection (top 15 features by ANOVA F-score)
selector = SelectKBest(score_func=f_classif, k=15)
X_selected = selector.fit_transform(X_var, y_resampled)
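The third method in the list, recursive feature elimination, is not covered by the code above. A minimal sketch with scikit-learn's RFE follows; it runs on synthetic stand-in data, and the feature counts and the choice of a random forest as the ranking model are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the resampled data (assumption: 20 features, 5 informative).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42)

# Repeatedly fit the model and drop the two weakest features until 10 remain.
estimator = RandomForestClassifier(n_estimators=50, random_state=42)
rfe = RFE(estimator=estimator, n_features_to_select=10, step=2)
X_rfe = rfe.fit_transform(X_demo, y_demo)

print(X_rfe.shape)     # (500, 10)
print(rfe.support_)    # boolean mask of the columns that were kept
```

RFE refits the model at every elimination step, so it is noticeably slower than the filter methods; a larger step value trades ranking precision for speed.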

5. Model construction: Random forest baseline model

(I) Model training

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y_resampled, test_size=0.2, stratify=y_resampled, random_state=42)

# Initialize the model
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,
    class_weight='balanced',
    random_state=42
)

# Train the model
rf.fit(X_train, y_train)

(II) Model evaluation

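import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, roc_auc_score, roc_curve

# Predict probabilities for the fraud class
y_pred_proba = rf.predict_proba(X_test)[:, 1]

# Per-class precision, recall, and F1
print(classification_report(y_test, rf.predict(X_test)))

# Calculate AUC
auc = roc_auc_score(y_test, y_pred_proba)
print(f'Baseline AUC: {auc:.4f}')

# Draw the ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr, label=f'RF (AUC = {auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend()
plt.show()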

6. Model optimization: XGBoost hyperparameter tuning in practice

(I) Parameter grid design

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'scale_pos_weight': [1, 5, 10]
}
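Before launching a search over this grid, it helps to estimate its cost. The sketch below counts the candidate combinations and the number of model fits, assuming five-fold cross-validation as in the grid search step.

```python
from math import prod

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'scale_pos_weight': [1, 5, 10]
}

# Every combination of values is one candidate model.
n_candidates = prod(len(values) for values in param_grid.values())
cv_folds = 5                                  # assumption: five-fold cross-validation
total_fits = n_candidates * cv_folds
print(n_candidates, total_fits)               # 324 1620
```

If 1,620 fits is too slow on the resampled data, scikit-learn's RandomizedSearchCV can sample a fixed number of candidates from the same grid.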

(II) Grid search + cross-validation

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
xgb = XGBClassifier(eval_metric='auc')
grid = GridSearchCV(xgb, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
grid.fit(X_train, y_train)

# Best parameters
print(f'Best Parameters: {grid.best_params_}')

# Evaluate the best model
best_xgb = grid.best_estimator_
y_pred_proba = best_xgb.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
print(f'Optimized AUC: {auc:.4f}')

7. Controlling overfitting: key techniques

Early stopping: set early_stopping_rounds;
Regularization: adjust the lambda and alpha parameters (reg_lambda and reg_alpha in XGBClassifier);
Feature selection: rank features by model feature importance;
Cross-validation: hold out a larger share of the data for validation.

8. Model deployment: Production environment optimization

(I) Performance optimization techniques

Model acceleration: use ONNX Runtime to speed up inference;
Batch prediction: score transactions in batches with a configurable batch size;
Caching: cache features that are requested repeatedly;
Monitoring: set up a mechanism for detecting model drift.
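As one possible form of drift monitoring, the sketch below computes the population stability index (PSI) between a reference sample and a recent sample of a feature or model score. PSI is a swapped-in illustration, not a method named in this tutorial, and the simulated data is hypothetical.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference sample and a recent sample of one feature or score."""
    # Interior bin edges come from the reference distribution's quantiles.
    inner_edges = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1))[1:-1]
    expected_pct = np.bincount(
        np.searchsorted(inner_edges, expected), minlength=n_bins) / len(expected)
    actual_pct = np.bincount(
        np.searchsorted(inner_edges, actual), minlength=n_bins) / len(actual)
    # A small floor avoids log(0) for empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)   # values seen at training time
same = rng.normal(0.0, 1.0, 10_000)        # recent values, no drift
shifted = rng.normal(0.5, 1.0, 10_000)     # recent values, simulated drift

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
print(f'PSI without drift: {psi_same:.4f}')
print(f'PSI with drift:    {psi_shifted:.4f}')
```

An often-quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but alert thresholds are best calibrated on your own traffic.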

(II) Code example (using ONNX acceleration)

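import numpy as np
import onnxruntime as rt

# Convert the model to ONNX. convert_model is not defined in this tutorial; it stands in
# for a converter helper (for example, one built on onnxmltools' convert_xgboost).
onnx_model = convert_model(best_xgb, 'xgboost', ['input'], ['output_probability'])

# Create an inference session
sess = rt.InferenceSession(onnx_model.SerializeToString())

# Accelerated prediction
def onnx_predict(data):
    input_name = sess.get_inputs()[0].name
    # Converted tree models typically take float32 input
    pred_onx = sess.run(None, {input_name: np.asarray(data, dtype=np.float32)})[0]
    # Assumes the first output holds per-class probabilities
    return pred_onx[:, 1]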

9. In-depth analysis of evaluation indicators

Metric	Formula	Meaning for fraud detection
Accuracy	(TP+TN)/(TP+TN+FP+FN)	Overall share of correct predictions
Recall	TP/(TP+FN)	Ability to catch fraudulent transactions
Precision	TP/(TP+FP)	How trustworthy a fraud prediction is
F1 score	2 × Precision × Recall / (Precision + Recall)	Balance between precision and recall
AUC-ROC	Area under the ROC curve	Overall ranking performance of the classifier

Business advice: In financial scenarios, recall should generally take priority over precision, so that as many fraudulent transactions as possible are caught.
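In practice, favoring recall usually means lowering the decision threshold applied to the predicted fraud probability. The sketch below shows the trade-off on ten hypothetical transactions; the labels, scores, and function name are made up for illustration.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall for the fraud class at a given score threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels (1 = fraud) and model scores for ten transactions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.30, 0.45, 0.70, 0.90]

for threshold in (0.5, 0.3):
    precision, recall = precision_recall_at(y_true, scores, threshold)
    print(f'threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}')
# threshold=0.5: precision=0.67, recall=0.50
# threshold=0.3: precision=0.57, recall=1.00
```

Lowering the threshold from 0.5 to 0.3 catches every fraud case in this toy data at the cost of more false alarms, which is the trade-off the business has to price.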

10. Conclusion: The importance of continuous optimization

Fraud patterns keep evolving, so it is recommended to:

Retrain the model every month;
Monitor changes in feature importance;
Combine the model with a rules engine for hybrid decisions;
Explore deep learning models (such as autoencoders).

Through this walkthrough, you have covered the core skills of handling imbalanced data and building fraud detection models. Start practicing hands-on now and build your own intelligent risk control system!