
The practical tutorial on credit card fraud detection: from data preprocessing to model optimization and analysis


Introduction: Why is credit card fraud detection required?

According to the Nilson Report, global losses from credit card fraud exceed US$25 billion each year, and financial institutions need to complete a transaction risk assessment within 0.1 seconds. This article walks you through building a machine-learning-based credit card fraud detection system from scratch, with complete code and visual analysis, so you can master the core skills of handling imbalanced data, tuning models, and evaluating them.

1. Project preparation: Tools and data

(I) Technology stack list

  • Python 3.8+
  • Core library: pandas, numpy, matplotlib, seaborn
  • Machine Learning: scikit-learn, imbalanced-learn, xgboost
  • Evaluation indicator: classification_report

(II) Dataset Description

We use Kaggle's publicly available credit card transaction dataset, which contains 284,807 transaction records, of which fraudulent transactions account for only 0.172% (a typical imbalanced dataset). The features have already been transformed with PCA: 28 anonymized features plus the transaction amount (Amount) and transaction time (Time).

2. Data exploration: Understanding fraud patterns

import pandas as pd
import matplotlib.pyplot as plt

# Load the data (file name as distributed in the Kaggle dataset)
df = pd.read_csv('creditcard.csv')

# View the class distribution
print(df['Class'].value_counts())
# Output: 0    284315
#         1       492

# Visualize the class distribution
plt.figure(figsize=(6, 4))
df['Class'].value_counts().plot(kind='bar')
plt.title('Transaction Class Distribution')
plt.xlabel('Class (0: Normal, 1: Fraud)')
plt.ylabel('Count')
plt.show()

Key Observations

  1. Fraudulent transactions account for only 0.17% of records, so the data is severely imbalanced;
  2. Special handling is needed to keep the model from simply favoring the majority class;
  3. The transaction amount (Amount) feature needs to be standardized.

3. Data preprocessing: building a balanced training set

(I) Step 1: Standardize the transaction amount

from sklearn.preprocessing import StandardScaler

# Standardize the Amount feature on its own
scaler = StandardScaler()
df['Amount'] = scaler.fit_transform(df['Amount'].values.reshape(-1, 1))

(II) Step 2: Process the time feature

# Extract an hour-of-day feature (fraud transactions often occur in specific periods)
df['Hour'] = df['Time'].apply(lambda x: x // 3600 % 24)
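
To sanity-check the assumption behind this feature, the fraud rate per hour can be plotted directly from the df loaded earlier (a quick sketch):

# Fraction of fraudulent transactions in each hour of the day
fraud_rate_by_hour = df.groupby('Hour')['Class'].mean()

plt.figure(figsize=(8, 4))
fraud_rate_by_hour.plot(kind='bar')
plt.title('Fraud Rate by Hour of Day')
plt.xlabel('Hour')
plt.ylabel('Fraud rate')
plt.show()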

(III) Step 3: Sampling technique comparison

Sampling method       Advantage                        Shortcoming
Simple oversampling   Easy to implement                Prone to overfitting
SMOTE                 Generates synthetic samples      Higher computational cost
Cluster sampling      Preserves the data distribution  Requires choosing the right number of clusters
Undersampling         Reduces computation              May discard important information

Chosen approach: combine SMOTE oversampling with random undersampling.

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Initialize the samplers
smote = SMOTE(sampling_strategy=0.5, random_state=42)
under_sampler = RandomUnderSampler(sampling_strategy=0.5, random_state=42)

# Split features and labels (drop the raw Time column, keep the derived Hour)
X = df.drop(['Class', 'Time'], axis=1)
y = df['Class']

# Combined sampling: oversample the minority class, then undersample the majority
X_resampled, y_resampled = smote.fit_resample(X, y)
X_resampled, y_resampled = under_sampler.fit_resample(X_resampled, y_resampled)

4. Feature Engineering: Building Effective Features

Feature selection methods

  1. Variance analysis: remove features with variance < 0.8;
  2. Correlation analysis: keep features whose correlation with the label is > 0.1;
  3. Recursive feature elimination: rank features with a model (see the sketch after the code below).

from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

# Variance filtering
var_threshold = VarianceThreshold(threshold=0.8)
X_var = var_threshold.fit_transform(X_resampled)

# Correlation-based selection (ANOVA F-test)
selector = SelectKBest(score_func=f_classif, k=15)
X_selected = selector.fit_transform(X_var, y_resampled)
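
Method 3 in the list above, recursive feature elimination, is not shown in the code; a minimal sketch could look like this (the choice of a small random forest as the ranking estimator is an assumption, not specified in the article):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Recursively drop the weakest features according to the estimator's importances.
# On the full resampled set this can be slow; the step parameter removes
# several features per iteration to speed it up.
rfe = RFE(estimator=RandomForestClassifier(n_estimators=50, random_state=42),
          n_features_to_select=15)
X_rfe = rfe.fit_transform(X_resampled, y_resampled)
print(rfe.support_)  # Boolean mask of the retained features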

5. Model construction: Random Forest Baseline Model

(I) Model training

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y_resampled, test_size=0.2, stratify=y_resampled, random_state=42)

# Initialize the model
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,
    class_weight='balanced',
    random_state=42
)

# Train the model
rf.fit(X_train, y_train)

(II) Model evaluation

from sklearn.metrics import classification_report, roc_auc_score, roc_curve

# Predicted probabilities for the positive (fraud) class
y_pred_proba = rf.predict_proba(X_test)[:, 1]

# Per-class precision/recall/F1
print(classification_report(y_test, rf.predict(X_test)))

# Calculate AUC
auc = roc_auc_score(y_test, y_pred_proba)
print(f'Baseline AUC: {auc:.4f}')

# Draw the ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr, label=f'RF (AUC = {auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend()
plt.show()

6. Model optimization: XGBoost parameter adjustment practice

(I) Parameter grid design

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'scale_pos_weight': [1, 5, 10]
}
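
For the scale_pos_weight candidates, a useful reference point is the ratio of negative to positive samples recommended in the xgboost documentation; on the resampled training labels it can be computed like this (a quick sketch, not part of the original grid):

import numpy as np

# Ratio of negative to positive samples in the training labels,
# the usual starting point for scale_pos_weight
neg, pos = np.bincount(y_train)
print(f'scale_pos_weight reference: {neg / pos:.2f}')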

(II) Grid search + cross-validation

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

xgb = XGBClassifier(use_label_encoder=False, eval_metric='auc')
grid = GridSearchCV(xgb, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
grid.fit(X_train, y_train)

# Best parameters
print(f'Best Parameters: {grid.best_params_}')

# Evaluate the best model
best_xgb = grid.best_estimator_
y_pred_proba = best_xgb.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
print(f'Optimized AUC: {auc:.4f}')

7. Overfitting control: key techniques

  1. Early stopping: set early_stopping_rounds (see the sketch after this list);
  2. Regularization: tune the lambda (L2) and alpha (L1) parameters;
  3. Feature selection: use the model's feature importance ranking;
  4. Cross-validation: increase the proportion of data held out for validation.
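
A minimal sketch of points 1 and 2, assuming a held-out validation split and an xgboost version (>= 1.6) that accepts early_stopping_rounds in the constructor:

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hold out a validation set to drive early stopping
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42)

xgb_es = XGBClassifier(
    n_estimators=500,           # upper bound; early stopping picks the actual count
    learning_rate=0.1,
    max_depth=5,
    reg_lambda=1.0,             # L2 regularization
    reg_alpha=0.1,              # L1 regularization
    early_stopping_rounds=30,   # stop when validation AUC stops improving
    eval_metric='auc',
)
xgb_es.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print(f'Best iteration: {xgb_es.best_iteration}')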

8. Model deployment: Production environment optimization

(I) Performance optimization tips

  1. Model compression: use ONNX Runtime to accelerate inference;
  2. Batch prediction: set a suitable batch_size for scoring;
  3. Caching: cache repeated feature computations;
  4. Monitoring: set up model drift detection (see the PSI sketch after this list).
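
For point 4, one common approach (not detailed in the article) is the Population Stability Index (PSI), which compares the distribution of recent production scores against a training-time baseline. A minimal sketch:

import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline score distribution and a recent one.
    Rule of thumb: < 0.1 stable, 0.1-0.25 monitor, > 0.25 investigate."""
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf          # cover the full score range
    e_frac = np.histogram(expected, cuts)[0] / len(expected)
    a_frac = np.histogram(actual, cuts)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)         # avoid division by zero
    a_frac = np.clip(a_frac, 1e-6, None)
    return np.sum((a_frac - e_frac) * np.log(a_frac / e_frac))

# Example: compare training-time scores with last week's production scores
# psi = population_stability_index(train_scores, recent_scores)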

(II) Code example (using ONNX acceleration)

import numpy as np
import onnxruntime as rt
from onnxmltools import convert_xgboost
from onnxmltools.convert.common.data_types import FloatTensorType

# Convert the trained model to ONNX (onnxmltools is one common converter for XGBoost)
initial_types = [('input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_xgboost(best_xgb, initial_types=initial_types)

# Create an inference session
sess = rt.InferenceSession(onnx_model.SerializeToString())

# Accelerated prediction (data: 2-D numpy array of selected features)
def onnx_predict(data):
    input_name = sess.get_inputs()[0].name
    # Classifier outputs are [labels, probabilities]; with the default ZipMap
    # option the probabilities come back as a list of {class: probability} dicts
    probs = sess.run(None, {input_name: data.astype(np.float32)})[1]
    return np.array([p[1] for p in probs])

9. In-depth analysis of evaluation indicators

Metric       Formula                                     Meaning for fraud detection
Accuracy     (TP+TN)/(TP+TN+FP+FN)                       Overall prediction correctness
Recall       TP/(TP+FN)                                  Ability to catch fraudulent transactions
Precision    TP/(TP+FP)                                  Credibility of the fraud predictions
F1 score     2*(Precision*Recall)/(Precision+Recall)     Balance between precision and recall
AUC-ROC      Area under the ROC curve                    Overall ranking ability of the classifier

Business advice: In financial scenarios, recall should be prioritized over precision, ensuring that as many fraudulent transactions as possible are caught, since a missed fraud usually costs more than a false alarm.
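
One practical way to act on this advice is to lower the decision threshold until a target recall is reached, using the test-set probabilities computed earlier (a sketch; the 0.90 recall target is an arbitrary example):

import numpy as np
from sklearn.metrics import precision_recall_curve

# Precision/recall at every candidate threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)

# Highest threshold that still achieves the target recall
target_recall = 0.90
viable = np.where(recall[:-1] >= target_recall)[0]
best_idx = viable[-1]
print(f'Threshold: {thresholds[best_idx]:.3f}, '
      f'precision: {precision[best_idx]:.3f}, recall: {recall[best_idx]:.3f}')

# Apply the chosen threshold instead of the default 0.5
y_pred_custom = (y_pred_proba >= thresholds[best_idx]).astype(int)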

10. Conclusion: The importance of continuous optimization

Fraud patterns evolve continuously, so it is recommended to:

  1. Retrain the model every month;
  2. Monitor changes in feature importance;
  3. Combine the model with a rules engine for hybrid decisions;
  4. Explore deep learning models (such as autoencoders).

Through this practice, you have mastered the core skills of dealing with imbalanced data and building fraud detection models. Start hands-on practice now and build your own intelligent risk control system!