Introduction: Why is credit card fraud detection required?
According to the Nilson Report, global losses from credit card fraud exceed US$25 billion each year, and financial institutions have to complete a transaction risk assessment within 0.1 second. This article builds a machine-learning credit card fraud detection system from scratch, with complete code and visual analysis, and covers the core skills involved: handling imbalanced data, tuning model hyperparameters, and evaluating the results.
1. Project preparation: Tools and data
(I) Technology stack list
- Python 3.8+
- Core library: pandas, numpy, matplotlib, seaborn
- Machine Learning: scikit-learn, imbalanced-learn, xgboost
- Evaluation metrics: classification_report, ROC AUC
(II) Dataset Description
The data comes from Kaggle's public credit card transaction dataset: 284,807 transactions, of which only 0.172% are fraudulent (a typical imbalanced dataset). The features have already been transformed with PCA and consist of 28 anonymized components (V1 to V28) plus the transaction amount (Amount) and transaction time (Time).
2. Data exploration: Understanding fraud patterns
import pandas as pd
import matplotlib.pyplot as plt
# Load the data (set the path to your local copy of the Kaggle CSV)
df = pd.read_csv('')
# View the class distribution
print(df['Class'].value_counts())
# Output: 0    284315
#         1       492
# Visualize the class distribution
plt.figure(figsize=(6, 4))
df['Class'].value_counts().plot(kind='bar')
plt.title('Transaction Class Distribution')
plt.xlabel('Class (0: Normal, 1: Fraud)')
plt.ylabel('Count')
plt.show()
Key observations:
- Fraudulent transactions make up only 0.17% of the data, so the classes are severely imbalanced;
- Special handling is needed to keep the model from being biased towards the majority class;
- The transaction amount feature (Amount) needs to be standardized.
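To make the first two observations concrete, the short sketch below takes the class counts printed above and scores a trivial "classifier" that labels every transaction as normal. It is illustrative arithmetic only, not part of the pipeline, and the variable names are chosen here.

```python
# Class counts taken from the value_counts() output above
n_normal, n_fraud = 284315, 492
total = n_normal + n_fraud

# A "model" that predicts class 0 (normal) for every transaction
true_negatives = n_normal   # every normal transaction is labelled correctly
false_negatives = n_fraud   # every fraudulent transaction is missed
true_positives = 0

accuracy = (true_positives + true_negatives) / total
recall = true_positives / (true_positives + false_negatives)

print(f"Fraud share: {n_fraud / total:.4%}")
print(f"Accuracy:    {accuracy:.4%}")
print(f"Recall:      {recall:.4%}")
```

Accuracy comes out above 99.8% even though not a single fraud is caught, which is why the rest of the article relies on resampling and on metrics other than accuracy.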
3. Data preprocessing: building a balanced training set
(I) Step 1: Standardize the transaction amount
from sklearn.preprocessing import StandardScaler
# Standardize the amount feature on its own
scaler = StandardScaler()
df['Amount'] = scaler.fit_transform(df['Amount'].values.reshape(-1, 1)).ravel()
(II) Step 2: Processing time characteristics
# Extract an hour-of-day feature (fraudulent transactions often cluster in specific periods)
df['Hour'] = (df['Time'] // 3600 % 24).astype(int)
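As a quick check of the arithmetic, the sketch below applies the same expression to a few hand-picked values. It assumes, as the Kaggle dataset description states, that Time counts the seconds elapsed since the first transaction in the file, so Hour is an offset from that starting point rather than a guaranteed wall-clock hour. The helper name seconds_to_hour is hypothetical.

```python
def seconds_to_hour(seconds):
    """Map elapsed seconds to an hour bucket in the range 0..23."""
    return int(seconds // 3600 % 24)

# 3599 s is still hour 0, 3600 s starts hour 1, and 86400 s wraps back to 0
samples = [0, 3599, 3600, 86399, 86400, 90000]
hours = [seconds_to_hour(s) for s in samples]
print(hours)  # [0, 0, 1, 23, 0, 1]
```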
(III) Step 3: Sampling technology comparison
Sampling method | Advantage | Disadvantage |
---|---|---|
Random oversampling | Simple to implement | Prone to overfitting |
SMOTE | Generates synthetic samples | Higher computational cost |
Cluster-based sampling | Preserves the data distribution | Requires choosing a suitable number of clusters |
Undersampling | Reduces the amount of computation | Important information may be lost |
Chosen approach: combine SMOTE oversampling with random undersampling.
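from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Initialize the samplers (sampling_strategy is the minority/majority ratio after resampling)
smote = SMOTE(sampling_strategy=0.5, random_state=42)
under_sampler = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
# Separate the features from the label
X = df.drop(['Class', 'Time'], axis=1)
y = df['Class']
# Combined sampling: SMOTE first, then random undersampling
# (SMOTE already reaches the 0.5 ratio, so raise the undersampler's ratio if the majority class should shrink further)
X_resampled, y_resampled = smote.fit_resample(X, y)
X_resampled, y_resampled = under_sampler.fit_resample(X_resampled, y_resampled)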
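For intuition about what SMOTE generates, the sketch below reproduces its core interpolation step in plain Python: pick a minority sample, pick one of its k nearest minority neighbours, and place a synthetic point at a random position on the segment between them. This is a simplified illustration, not imbalanced-learn's implementation; the function name smote_sketch and the toy data are assumptions made here.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Toy minority class with two features
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=5)
for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

Because every synthetic point lies between two real minority samples, SMOTE fills in the minority region instead of duplicating rows, which is the advantage listed in the table above.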
4. Feature Engineering: Building Effective Features
Feature selection methods
- Variance filtering: remove features with variance < 0.8;
- Relevance analysis: keep the features most strongly associated with the label (the code below keeps the top 15 by ANOVA F-score);
- Recursive feature elimination: use a model to rank the features and drop the weakest ones step by step.
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
# Variance filtering
var_threshold = VarianceThreshold(threshold=0.8)
X_var = var_threshold.fit_transform(X_resampled)
# Relevance selection: keep the 15 features with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=15)
X_selected = selector.fit_transform(X_var, y_resampled)
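Recursive feature elimination, the third method listed above, does not appear in the code. A minimal sketch with scikit-learn's RFE might look like the following. It runs on synthetic data from make_classification rather than the resampled transactions, and the logistic-regression estimator and the target of five features are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the resampled data: 20 features, 5 of them informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    n_redundant=2, random_state=42,
)

# Repeatedly fit the model and drop the weakest feature until 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=5, step=1)
rfe.fit(X_demo, y_demo)

selected = [i for i, keep in enumerate(rfe.support_) if keep]
print("Selected feature indices:", selected)
print("Ranking (1 = kept):", rfe.ranking_.tolist())
```

RFE refits the estimator once per eliminated feature, so it is noticeably slower than the univariate filters above on a dataset of this size.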
5. Model construction: Random Forest baseline model
(I) Model training
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y_resampled, test_size=0.2, stratify=y_resampled, random_state=42)
# Initialize the model
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,
    class_weight='balanced',
    random_state=42
)
# Train the model
rf.fit(X_train, y_train)
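One caveat about the split above: X_selected and y_resampled come from the resampled data, so the held-out 20% contains synthetic SMOTE samples and has a class ratio of roughly 1:2 instead of the original 0.17%. Scores measured on it therefore tend to be optimistic. A common alternative is to split first and resample only the training portion. The sketch below shows that ordering in plain Python, with simple duplication standing in for SMOTE; all names and the toy data are illustrative assumptions.

```python
import random

rng = random.Random(42)

# Toy labelled data: 1000 normal rows (label 0) and 20 fraud rows (label 1)
rows = [(i, 0) for i in range(1000)] + [(i, 1) for i in range(1000, 1020)]

def stratified_split(data, test_fraction):
    """Hold out the same fraction of each class."""
    train, test = [], []
    for label in (0, 1):
        group = [r for r in data if r[1] == label]
        rng.shuffle(group)
        cut = int(len(group) * test_fraction)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

# 1) Split first, so the test set keeps the original class ratio
train, test = stratified_split(rows, test_fraction=0.2)

# 2) Oversample the minority class in the training set only
train_fraud = [r for r in train if r[1] == 1]
train_normal = [r for r in train if r[1] == 0]
target_fraud = len(train_normal) // 2  # minority/majority ratio of 0.5
extra = [rng.choice(train_fraud) for _ in range(target_fraud - len(train_fraud))]
train_balanced = train + extra

print("train fraud share:", sum(label for _, label in train_balanced) / len(train_balanced))
print("test fraud share: ", sum(label for _, label in test) / len(test))
```

With imbalanced-learn, the same ordering can be obtained by placing the samplers and the classifier in an imblearn.pipeline.Pipeline, which applies resampling only while fitting.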
(II) Model evaluation
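from sklearn.metrics import classification_report, roc_auc_score, roc_curve
# Predict fraud probabilities
y_pred_proba = rf.predict_proba(X_test)[:, 1]
# Per-class precision, recall and F1 at the default 0.5 threshold
print(classification_report(y_test, rf.predict(X_test)))
# Calculate AUC
auc = roc_auc_score(y_test, y_pred_proba)
print(f'Baseline AUC: {auc:.4f}')
# Draw the ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr, label=f'RF (AUC = {auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend()
plt.show()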
6. Model optimization: XGBoost hyperparameter tuning in practice
(I) Parameter grid design
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'scale_pos_weight': [1, 5, 10]
}
(II) Grid search + cross-validation
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
# Evaluate each boosting round with AUC (the deprecated use_label_encoder flag is not needed)
xgb = XGBClassifier(eval_metric='auc')
grid = GridSearchCV(xgb, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
grid.fit(X_train, y_train)
# Best parameters
print(f'Best Parameters: {grid.best_params_}')
# Evaluate the best model
best_xgb = grid.best_estimator_
y_pred_proba = best_xgb.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
print(f'Optimized AUC: {auc:.4f}')
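Before launching the search it helps to know how much work it implies. The sketch below counts the candidate combinations in the grid above and multiplies by the five cross-validation folds; the variable names are chosen here for illustration.

```python
from math import prod

# Same grid as in the article
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'scale_pos_weight': [1, 5, 10],
}
cv_folds = 5  # matches cv=5 in GridSearchCV

n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 324 1620
```

If 1,620 fits (plus the final refit) take too long on the resampled data, scikit-learn's RandomizedSearchCV with a fixed n_iter is one way to cap the cost.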
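7. Controlling overfitting: key techniques
- Early stopping: set early_stopping_rounds so that training stops once the validation metric no longer improves;
- Regularization: tune the L2 and L1 penalties (lambda and alpha, exposed as reg_lambda and reg_alpha in XGBClassifier);
- Feature selection: rank features by model feature importance and drop the weakest;
- Cross-validation: increase the share of data held out for validation to get a more reliable estimate.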
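The early-stopping rule in the first bullet can be sketched without any library. The function below walks through a list of per-round validation scores and stops once the best score has not improved for a given number of rounds, which is the logic that early_stopping_rounds configures. The function name and the AUC values are made up for illustration.

```python
def best_round_with_patience(val_scores, patience):
    """Return (best_round, stopped_at) for a higher-is-better metric."""
    best_score = float("-inf")
    best_round = -1
    for round_idx, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here
            return best_round, round_idx
    return best_round, len(val_scores) - 1

# Validation AUC per boosting round (made-up numbers)
val_auc = [0.90, 0.93, 0.95, 0.96, 0.958, 0.959, 0.957, 0.955, 0.97]
best, stopped = best_round_with_patience(val_auc, patience=3)
print(best, stopped)  # 3 6
```

Training stops at round 6 and keeps round 3, so the late improvement at round 8 is never seen. A larger patience value trades extra training time for a lower chance of stopping too early.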
8. Model deployment: Production environment optimization
(I) Performance optimization tips
- Model conversion: export the model to ONNX and serve it with ONNX Runtime to speed up inference;
- Batch prediction: score transactions in batches (for example via a batch_size setting) instead of one at a time;
- Caching: cache features that are requested repeatedly;
- Monitoring: set up a model drift detection mechanism.
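The article does not say how drift should be detected. One widely used option is the Population Stability Index (PSI), which compares how a feature or the model score is distributed at training time and in production. The sketch below is a plain-Python version under that assumption; the bin edges, the toy scores, and the function name are illustrative.

```python
import math

def population_stability_index(expected, actual, bin_edges):
    """PSI between two samples of one numeric feature, using fixed bin edges."""
    def bin_shares(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            idx = sum(v > edge for edge in bin_edges)  # index of the bin v falls into
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e_shares, a_shares = bin_shares(expected), bin_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))

# Toy model scores: a training-time sample and two production samples
train_scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.40, 0.60, 0.80]
same_scores = list(train_scores)
shifted_scores = [0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.80, 0.90, 0.95]
edges = [0.25, 0.5, 0.75]

psi_same = population_stability_index(train_scores, same_scores, edges)
psi_shifted = population_stability_index(train_scores, shifted_scores, edges)
print(round(psi_same, 4))
print(round(psi_shifted, 4))
```

A frequently quoted rule of thumb treats a PSI above about 0.25 as a significant shift that warrants investigation or retraining, though the right threshold depends on the feature and the business context.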
(II) Code example (using ONNX acceleration)
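import numpy as np
import onnxruntime as rt
from onnxmltools import convert_xgboost
from onnxmltools.convert.common.data_types import FloatTensorType
# Convert the model (assumes the onnxmltools converter; one float input with X_train.shape[1] columns)
initial_types = [('input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_xgboost(best_xgb, initial_types=initial_types)
# Create a session
sess = rt.InferenceSession(onnx_model.SerializeToString())
# Accelerated prediction
def onnx_predict(data):
    input_name = sess.get_inputs()[0].name
    # ONNX Runtime expects float32; the outputs are assumed to be [labels, probabilities]
    pred_onx = sess.run(None, {input_name: np.asarray(data, dtype=np.float32)})[1]
    return pred_onx[:, 1]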
9. In-depth analysis of evaluation indicators
Metric | Formula | Meaning for fraud detection |
---|---|---|
Accuracy | (TP+TN)/(TP+TN+FP+FN) | Overall share of correct predictions |
Recall | TP/(TP+FN) | Ability to identify fraudulent transactions |
Precision | TP/(TP+FP) | Reliability of a fraud prediction |
F1 score | 2 × Precision × Recall / (Precision + Recall) | Balance between precision and recall |
AUC-ROC | Area under the ROC curve | Overall ranking performance of the classifier |
Business advice: In financial scenarios, recall should take priority over precision, so that as many fraudulent transactions as possible are caught.
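The formulas in the table can be checked with a few lines of Python. The confusion-matrix counts below are hypothetical and chosen only to show how the metrics diverge on imbalanced data; the function name is likewise an illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the table's metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

# Hypothetical test-set counts: 100 frauds among 50,000 transactions
metrics = classification_metrics(tp=80, fp=40, fn=20, tn=49860)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy stays near 99.9% here while recall is 0.80 and precision about 0.67, which shows why accuracy alone says little about how well fraud is being caught.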
10. Conclusion: The importance of continuous optimization
Fraud patterns keep evolving, so it is recommended to:
- Retrain the model every month;
- Monitor changes in feature importance;
- Combine the model with a rules engine to make hybrid decisions;
- Explore deep learning models (such as autoencoders).
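As an illustration of the hybrid-decision idea in the list above, the sketch below lets a hard business rule take precedence and otherwise maps the model score to an action. The thresholds, the amount limit, and the function name are all hypothetical placeholders, not recommendations.

```python
def hybrid_decision(amount, model_score, block_threshold=0.9,
                    review_threshold=0.5, amount_limit=10_000):
    """Combine a hard rule with the model score; all thresholds are illustrative."""
    if amount > amount_limit:
        return "review"   # rule: very large amounts always get a manual check
    if model_score >= block_threshold:
        return "block"
    if model_score >= review_threshold:
        return "review"
    return "approve"

print(hybrid_decision(amount=120.0, model_score=0.03))     # approve
print(hybrid_decision(amount=120.0, model_score=0.95))     # block
print(hybrid_decision(amount=25_000.0, model_score=0.03))  # review
```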
Having worked through this exercise, you have covered the core skills of handling imbalanced data and building a fraud detection model. Now try it hands-on and build your own intelligent risk control system!