PyOD is a comprehensive and easy-to-use Python library specialized in detecting anomalies (outliers) in multivariate data. Outliers are data points that differ significantly from the majority of the data and may indicate errors, noise, or genuinely interesting phenomena. Whether you are working on a small project or a large dataset, PyOD offers more than 50 algorithms to meet your needs. Features of PyOD include:
- Unified and user-friendly interface for multiple algorithms.
- A rich selection of models, from classic techniques to the latest PyTorch deep learning methods.
- High performance and efficiency, using numba and joblib for just-in-time (JIT) compilation and parallel processing.
- Fast training and prediction, realized with the SUOD framework.
The official PyOD repository is: pyod, and the official documentation is: pyod-doc. The PyOD installation command is as follows:
pip install pyod
Contents:
- 1 Instructions for use
  - 1.1 Background on PyOD
  - 1.2 Description of use
    - 1.2.1 Anomaly detection based on KNN
    - 1.2.2 Model combinations
    - 1.2.3 Thresholding
    - 1.2.4 Model saving and loading
- 2 Reference
1 Instructions for use
1.1 Background on PyOD
The PyOD authors have released a 45-page preprint titled ADBench: Anomaly Detection Benchmark, together with the open-source ADBench repository, which compares the performance of 30 anomaly detection algorithms on 57 benchmark datasets. The ADBench structure is shown below:
PyOD provides interface classes implementing these algorithms; for the interface corresponding to each algorithm, see: pyod-implemented-algorithms. PyOD also exposes a unified API across all of these algorithms:
- .fit(): trains the model; for unsupervised methods the target variable y is ignored.
- .decision_function(): predicts the anomaly score of the input data using the trained detector.
- .predict(): predicts whether a particular sample is an outlier using the trained detector.
- .predict_proba(): predicts the probability that a sample is an outlier using the trained detector.
- .predict_confidence(): the model's confidence in its prediction for each sample (available in predict and predict_proba).
- .decision_scores_: anomaly score of the training data. The higher the score, the more anomalous it is.
- .labels_: binary labels of the training data. 0 means normal samples, 1 means abnormal samples.
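As a quick illustration before the full walkthrough below, the following minimal sketch exercises this unified API end to end; it assumes PyOD is installed and uses the IForest detector purely as an example (any PyOD detector class exposes the same interface):
from pyod.models.iforest import IForest  # example detector; any PyOD detector works the same way
from pyod.utils.data import generate_data
X_train, X_test, y_train, y_test = generate_data(n_train=200, n_test=100, contamination=0.1, random_state=42)
clf = IForest()
clf.fit(X_train)  # train; y is ignored for unsupervised methods
train_scores = clf.decision_scores_  # anomaly scores of the training data
train_labels = clf.labels_  # binary labels of the training data (0: normal, 1: outlier)
test_scores = clf.decision_function(X_test)  # anomaly scores of new data
test_labels = clf.predict(X_test)  # binary labels for new data
test_proba = clf.predict_proba(X_test)  # outlier probability per sample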
PyOD additionally publishes benchmark comparison results for the different algorithms; see: benchmark. The following figure shows the detections of various algorithms against the ground truth and indicates the number of misidentified samples:
1.2 Description of use
1.2.1 Anomaly detection based on KNN
This article takes KNN as an example to illustrate the general workflow of anomaly detection with PyOD. KNN (K-Nearest Neighbors) is a widely used machine learning method whose core idea is simple and intuitive: in the feature space, if most of the K nearest neighbors of a data point belong to a particular category, then that data point most likely belongs to the same category.
In anomaly detection, the KNN algorithm needs no assumption about the distribution of the data; it decides whether a sample is anomalous by computing the distances between that sample and the other samples. Anomalies are typically the points that lie far away from most of the samples. The following sample code demonstrates building a KNN model for anomaly detection with the PyOD library:
Creating Data Sets
The following code creates a two-dimensional dataset of coordinate points, with normal data generated by a multivariate Gaussian distribution and outliers generated by a uniform distribution.
from pyod.models.knn import KNN
from pyod.utils.data import generate_data
# Set the percentage of outliers and the number of training and testing samples
contamination = 0.1 # Percentage of outliers
n_train = 200 # Number of training samples
n_test = 100 # Number of test samples
# Generate training and test dataset with normal data and outliers, default input data feature dimension is 2, labels are binary labels (0: normal points, 1: outliers)
# random_state is a random seed, guaranteed to reproduce the results
X_train, X_test, y_train, y_test = generate_data(n_train=n_train, n_test=n_test, contamination=contamination, random_state=42)
X_train.shape
(200, 2)
Training the KNN detector
# Train the KNN detector
clf_name = 'KNN' # Set the name of the classifier
clf = KNN() # Create kNN model instance
clf.fit(X_train) # Fit the model using the training data
# Get predictive labels and anomaly scores for the training data
y_train_pred = clf.labels_ # Binary labels (0: normal points, 1: abnormal points)
y_train_scores = clf.decision_scores_ # Anomaly scores for training data
# Prediction on test data
y_test_pred = clf.predict(X_test) # anomaly labels for test data (0 or 1)
y_test_scores = clf.decision_function(X_test) # anomaly scores for test data
# Get the confidence level of the prediction
y_test_pred, y_test_pred_confidence = clf.predict(X_test, return_confidence=True) # return prediction labels and confidence (range [0,1])
Assessment results
from pyod.utils.data import evaluate_print # Import evaluation tool
# Evaluate and print the results
print("\nOn Training Data:") # Print evaluation results for training data
evaluate_print(clf_name, y_train, y_train_scores) # Evaluate training data
print("\nOn Test Data:") # Print the evaluation results of the test data
evaluate_print(clf_name, y_test, y_test_scores) # Evaluate the test data
On Training Data:
KNN ROC:0.9992, precision @ rank n:0.95
On Test Data:
KNN ROC:1.0, precision @ rank n:1.0
Visualization results
The following code shows the anomaly label prediction results of the model on the training and test sets, where inliers denote normal points and outliers denote anomalies.
from pyod.utils.example import visualize
# Visualization results
visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
y_test_pred, show_figure=True, save_figure=False) # Display visualization images
Model Replacement
As mentioned in Section 1.1, PyOD provides a unified API for the different anomaly detection algorithms, with links to the interface documentation for each algorithm family. The detection workflow for the other algorithms in PyOD mirrors that of KNN, much like the way models are constructed in sklearn. Taking PCA as an example, the model can be swapped in simply by changing the model initialization:
from pyod.models.pca import PCA
# Train the PCA detector
clf_name = 'PCA' # Set the name of the classifier
clf = PCA() # Create PCA model instance
clf.fit(X_train) # Fit the model using the training data
PCA(contamination=0.1, copy=True, iterated_power='auto', n_components=None,
n_selected_components=None, random_state=None, standardization=True,
svd_solver='auto', tol=0.0, weighted=True, whiten=False)
The rest of the code is the same:
# Get predictive labels and anomaly scores for the training data
y_train_pred = clf.labels_ # Binary labels (0: normal points, 1: abnormal points)
y_train_scores = clf.decision_scores_ # Anomaly scores for training data
# Prediction on test data
y_test_pred = clf.predict(X_test) # anomaly labels for test data (0 or 1)
y_test_scores = clf.decision_function(X_test) # anomaly scores for test data
# Get the confidence level of the prediction
y_test_pred, y_test_pred_confidence = clf.predict(X_test, return_confidence=True) # return prediction labels and confidence (range [0,1])
from pyod.utils.data import evaluate_print # Import the evaluation tool
# Evaluate and print the results
print("\nOn Training Data:") # Print evaluation results for training data
evaluate_print(clf_name, y_train, y_train_scores) # Evaluate training data
print("\nOn Test Data:") # Print the evaluation results of the test data
evaluate_print(clf_name, y_test, y_test_scores) # Evaluate the test data
# Visualize the results
visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
y_test_pred, show_figure=True, save_figure=False) # Display visualization image
On Training Data:
PCA ROC:0.8964, precision @ rank n:0.8
On Test Data:
PCA ROC:0.9033, precision @ rank n:0.8
1.2.2 Model combinations
Because anomaly detection is usually unsupervised, individual models tend to be unstable. Combining the outputs of different detectors (e.g., by averaging) has therefore been proposed to improve robustness.
This example demonstrates four score combination mechanisms (a small numeric sketch follows the list):
- Average: the average score of all detectors.
- Maximize: the highest score of all detectors.
- Average of Maximum (AOM): the base detectors are divided into subgroups and the highest score of each subgroup is taken. The final score is the average of all subgroup scores.
- Maximum of Average (MOA): the base detectors are divided into subgroups and the average score of each subgroup is taken. The final score is the highest of all subgroup scores.
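As a quick numeric illustration of these four mechanisms, the following sketch applies the combination functions from pyod.models.combination to a small score matrix; the numbers are made up purely for illustration:
import numpy as np
from pyod.models.combination import average, maximization, aom, moa
# Scores from 4 base detectors (columns) for 3 samples (rows); values are illustrative
scores = np.array([[0.1, 0.3, 0.2, 0.4],
                   [0.9, 0.8, 0.7, 0.6],
                   [0.2, 0.1, 0.5, 0.3]])
print(average(scores))  # row-wise mean over all detectors
print(maximization(scores))  # row-wise maximum over all detectors
print(aom(scores, n_buckets=2))  # maximum within each of 2 subgroups, then averaged
print(moa(scores, n_buckets=2))  # mean within each of 2 subgroups, then the maximum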
The code for the above combination mechanisms is provided by the combo library, a Python toolkit for machine learning model combination (ensemble learning). It offers a variety of combination methods, including simple averaging, weighted averaging, median, and majority voting, as well as more complex approaches such as Dynamic Classifier Selection and Stacking. combo supports a range of scenarios, such as combining classifiers, raw results, clusterings, and anomaly detectors. The official repository of the combo library is located at: combo. The installation command is as follows:
pip install combo
The following sample code demonstrates anomaly detection with a combined model using the PyOD and combo libraries:
Creating Data Sets
# Requires the combo library; install with: pip install combo
from pyod.models.combination import aom, moa, median, average, maximization
from pyod.utils.data import generate_data, evaluate_print
from pyod.utils.utility import standardizer
from sklearn.model_selection import train_test_split
import numpy as np
# Import the model and generate sample data
# n_train: number of training samples, n_features: feature dimension of X, train_only: whether to return only a training set
X, y = generate_data(n_train=5000, n_features=2, train_only=True, random_state=42) # Load data
# test_size: proportion of the test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4) # Divide the training set and test set
# Standardized data for processing
X_train_norm, X_test_norm = standardizer(X_train, X_test)
Creating Detectors
Initialize 10 KNN anomaly detectors with different values of k and collect their anomaly scores. k determines how many nearest neighbors are considered when scoring a point; smaller k values can make the detector sensitive to noise, while larger k values can over-smooth the model and lose detail. The same code could equally combine different types of detectors before collecting the anomaly scores.
from pyod.models.knn import KNN
n_clf = 10 # number of base detectors
# Initialize n_clf base detectors for combination
k_list = list(range(1, 100, n_clf))  # neighbor counts: 1, 11, ..., 91
train_scores = np.zeros([X_train.shape[0], n_clf])  # array of training set scores
test_scores = np.zeros([X_test.shape[0], n_clf])  # array of test set scores
print('Combining {n_clf} kNN detectors'.format(n_clf=n_clf))  # Output the number of combined KNN detectors
for i in range(n_clf):
    k = int(k_list[i])  # number of neighbors for the current detector
    clf = KNN(n_neighbors=k, method='largest')  # initialize a KNN detector
    clf.fit(X_train_norm)  # fit the training data
    train_scores[:, i] = clf.decision_scores_  # record training scores
    test_scores[:, i] = clf.decision_function(X_test_norm)  # record test scores
Combining 10 kNN detectors
Standardizing the detector scores
The scores of the individual detectors need to be standardized to zero mean and unit standard deviation. When combining model outputs, widely differing score ranges across models can bias the result if combined directly; standardization ensures that all models' scores are on the same scale so they can be combined effectively:
# Before combining, the test results need to be standardized
train_scores_norm, test_scores_norm = standardizer(train_scores, test_scores)
Combined results
Combine the standardized scores using the combination functions:
# Combine using the average
y_by_average = average(test_scores_norm)
evaluate_print('Combination by Average', y_test, y_by_average) # Output the evaluation of the combination by average
# Combine by using the maximum value
y_by_maximization = maximization(test_scores_norm)
evaluate_print('Combination by Maximization', y_test, y_by_maximization) # Output evaluation results for combinations by maxima
# Combination by using median
y_by_median = median(test_scores_norm)
evaluate_print('Combination by Median', y_test, y_by_median) # Output evaluation results for combination by median
# Combination by AOM. n_buckets is the number of subgroups
y_by_aom = aom(test_scores_norm, n_buckets=5)
evaluate_print('Combination by AOM', y_test, y_by_aom) # Output evaluation results for combinations by AOM
# Combination by MOA, n_buckets is the number of subgroups
y_by_moa = moa(test_scores_norm, n_buckets=5)
evaluate_print('Combination by MOA', y_test, y_by_moa) # Output evaluation results for MOA combinations
Combination by Average ROC:0.9899, precision @ rank n:0.9497
Combination by Maximization ROC:0.9866, precision @ rank n:0.9447
Combination by Median ROC:0.99, precision @ rank n:0.9548
Combination by AOM ROC:0.9896, precision @ rank n:0.9447
Combination by MOA ROC:0.9884, precision @ rank n:0.9447
1.2.3 Thresholding
PyOD scores the data with the model and screens out anomalies based on a set threshold. The choice of threshold therefore has an important impact on the accuracy of the anomaly detection results.
PyThresh is a comprehensive and extensible Python toolkit for automatically setting and handling probability scores in anomaly detection on univariate or multivariate data. It is compatible with, but not limited to, the PyOD library, and uses similar syntax and data structures. PyThresh contains over 30 thresholding algorithms, covering techniques from simple statistical analysis (such as Z-scores) to more sophisticated graph-theoretic and topological approaches. The official repository of the PyThresh library is located at: pythresh. The installation command is as follows:
pip install pythresh
For details on using PyThresh, see its official documentation: pythresh-doc. The following sample code shows a simple thresholding example using the PyOD and PyThresh libraries:
Use of thresholding algorithms
Using PyThresh together with the PyOD library to select thresholds automatically can improve detection accuracy. Note, however, that thresholds determined automatically by PyThresh's algorithms are not guaranteed to give ideal results in every case.
# Import the KNN model, evaluation tools, and data generation function
from pyod.models.knn import KNN
from pyod.utils.data import generate_data
from sklearn.metrics import accuracy_score
# Import the KARCH thresholding method from the pythresh library
from pythresh.thresholds.karch import KARCH
# Set the contamination rate, i.e. the percentage of outliers
contamination = 0.1 # percentage of outliers
# Set the number of training samples
n_train = 500 # number of training points
# Set the number of test samples
n_test = 1000 # number of testing points
# Generate sample data, return training and test data with their labels
X_train, X_test, y_train, y_test = generate_data(n_train=n_train,
n_features=2, # number of features
contamination=contamination, # proportion of outliers
random_state=42) # random seed to ensure repeatable results
# Initialize the KNN anomaly detector
clf_name = 'KNN' # Classifier name
clf = KNN() # Create KNN model instance
clf.fit(X_train) # Fit the model using the training data
thres = KARCH() # Create an instance of the KARCH algorithm to create a thresholding process
# Make predictions on the test data
y_test_scores = clf.decision_function(X_test) # Calculate anomaly scores for the test set
# Labels from the model's built-in threshold clf.threshold_
y_test_pred = clf.predict(X_test) # Get test set predictions
y_test_pred_thre = thres.eval(y_test_scores) # Re-threshold the anomaly scores with KARCH
# Calculate the accuracy
accuracy = accuracy_score(y_test, y_test_pred)
print(f"Accuracy before thresholding: {accuracy:.4f}")
accuracy = accuracy_score(y_test, y_test_pred_thre)
print(f"Accuracy after thresholding: {accuracy:.4f}")
Accuracy before thresholding: 0.9940
Accuracy after thresholding: 0.9950
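For reference, predict() with the default contamination setting is equivalent to comparing the anomaly scores against the detector's fitted threshold_ attribute. Below is a minimal sketch of that equivalence, reusing the variables from the example above; under PyOD's standard behavior the two results should match:
import numpy as np
# predict() applies the threshold derived from the contamination rate during fit
manual_pred = (y_test_scores > clf.threshold_).astype(int)
print(np.array_equal(manual_pred, y_test_pred))  # expected: True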
The contamination parameter
In addition to creating a PyThresh algorithm instance explicitly, a threshold selection algorithm can also be passed through the contamination parameter when initializing a PyOD model:
from pyod.models.kde import KDE # Import the KDE model
from pythresh.thresholds.filter import FILTER # Import the FILTER thresholding method
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
contamination = 0.1 # Percentage of outliers
n_train = 200 # Number of training data points
n_test = 100 # Number of test data points
# Generate sample data
X_train, X_test, y_train, y_test = generate_data(n_train=n_train,
                                                 n_test=n_test,
                                                 n_features=2,
                                                 contamination=contamination,
                                                 random_state=42) # random seed
# Train the KDE detector
clf_name = 'KDE' # model name
clf = KDE(contamination=FILTER()) # Add threshold selection algorithm
clf.fit(X_train) # Fit model using training data
# Get prediction labels and anomaly scores for training data
y_train_pred = clf.labels_ # binary labels (0: normal points, 1: abnormal points)
y_train_scores = clf.decision_scores_ # anomaly scores for training data
# Get prediction results for test data
y_test_pred = clf.predict(X_test)
y_test_scores = clf.decision_function(X_test)
# Evaluate and print the results
print("\n on training data:")
evaluate_print(clf_name, y_train, y_train_scores) # evaluate on training data
print("\n on test data:")
evaluate_print(clf_name, y_test, y_test_scores) # evaluate test data
On Training Data:
KDE ROC:0.9992, precision @ rank n:0.95
On Test Data:
KDE ROC:1.0, precision @ rank n:1.0
1.2.4 Model saving and loading
PyOD models can be saved and loaded with joblib or pickle, as shown below:
from pyod.models.lof import LOF
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize
from joblib import dump, load # import model saving and loading tools from joblib library
contamination = 0.3 # Percentage of anomalies
n_train = 200 # Number of training data points
n_test = 100 # Number of test data points
# Generate sample data
X_train, X_test, y_train, y_test = generate_data(n_train=n_train,
                                                 n_test=n_test, # number of test points
                                                 n_features=2, # number of features is 2
                                                 contamination=contamination, # proportion of anomalies
                                                 random_state=42) # random state setting
# Train the LOF detector
clf_name = 'LOF' # classifier name
clf = LOF() # Instantiate LOF model
clf.fit(X_train) # Fit model on training data
# Get predictive labels and anomaly scores for the training data
y_train_pred = clf.labels_ # Binary labels (0:normal points, 1:abnormal points)
y_train_scores = clf.decision_scores_ # raw anomaly scores
# Save the model
dump(clf, 'clf_lof.joblib') # save model to file (illustrative filename)
# Load the model
clf_load = load('clf_lof.joblib') # load model from file
# Get the prediction for the test data
y_test_pred = clf_load.predict(X_test) # anomaly label for test data (0 or 1)
y_test_scores = clf_load.decision_function(X_test) # Anomaly scores for test data
# Evaluate and print the results
print("\n results on training data:")
evaluate_print(clf_name, y_train, y_train_scores) # evaluate results on training data
print("\n results on test data:")
evaluate_print(clf_name, y_test, y_test_scores) # evaluate results on test data
# Visualize the results
visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
y_test_pred, show_figure=True, save_figure=False) # Visualize training and test results
Results on training data:
LOF ROC:0.5502, precision @ rank n:0.3333
Results on test data:
LOF ROC:0.4829, precision @ rank n:0.3333
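The text above also mentions pickle as an alternative to joblib; below is a minimal sketch of the same save/load flow using the standard-library pickle module, with an illustrative filename:
import pickle
# Save the trained detector with pickle (illustrative filename)
with open('clf_lof.pkl', 'wb') as f:
    pickle.dump(clf, f)
# Load it back and use it exactly like the original detector
with open('clf_lof.pkl', 'rb') as f:
    clf_load = pickle.load(f)
y_test_pred = clf_load.predict(X_test)  # anomaly labels for the test data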
2 Reference
- joblib
- pyod
- pyod-doc
- pyod-implemented-algorithms
- benchmark
- combo
- pythresh
- pythresh-doc
- pickle