CleanVision is an open source Python library that automatically detects common problems in image datasets that may affect machine learning projects. It is intended as a preliminary tool for computer vision projects, used to identify and resolve problems in a dataset before any machine learning is applied. Its core functionality is detecting problematic images, including exact duplicates, near duplicates, blurry images, images with low information content, images that are too dark or too bright, grayscale images, images with irregular aspect ratios, and images with unusual sizes. The open source repository is CleanVision, and the official documentation is CleanVision-docs.
Install the basic version of CleanVision with:
pip install cleanvision
Install the full version with:
pip install "cleanvision[all]"
Check the installed CleanVision version:
# View version
import cleanvision
cleanvision.__version__
'0.3.6'
Library versions required by the code in this article:
# for tabular display
import tabulate
# tabulate version needs to be 0.8.10 or higher.
tabulate.__version__
'0.9.0'
- 1 Instructions for use
- 1.1 Introduction to CleanVision Features
- 1.2 Basic use
- 1.3 Customized detection
- 1.4 Running CleanVision on the Torchvision dataset
- 1.5 Running CleanVision on the Hugging Face dataset
- 2 Reference
1 Instructions for use
1.1 Introduction to CleanVision Features
CleanVision supports a wide range of image file formats and can detect the following types of data problems:
Type of problem | Description | Keyword |
---|---|---|
Exact duplicate | Exactly identical images | exact_duplicates |
Near duplicate | Visually almost identical images | near_duplicates |
Blurry | Blurred image details (out of focus) | blurry |
Low information content | Images lacking content (small entropy of pixel values) | low_information |
Too dark | Unusually dark images (underexposed) | dark |
Too bright | Unusually bright images (overexposed) | light |
Grayscale | Images lacking color | grayscale |
Abnormal aspect ratio | Images with abnormal aspect ratios | odd_aspect_ratio |
Abnormal size | Images with unusual dimensions compared to other images in the dataset | odd_size |
CleanVision detects these problems with a variety of statistical methods. The keywords listed in the table are the names used to specify each problem type in CleanVision code. CleanVision is compatible with Linux, macOS, and Windows, and runs on Python 3.7 and above.
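To make the "low information" description concrete, the sketch below computes the Shannon entropy of a list of pixel values using only the standard library. It illustrates the statistic the table refers to and is not CleanVision's actual scoring code; the function name pixel_entropy and the toy pixel lists are assumptions made for this example.

```python
import math
from collections import Counter


def pixel_entropy(pixels):
    """Shannon entropy (in bits) of a flat sequence of pixel values."""
    total = len(pixels)
    counts = Counter(pixels)
    # Sum p * log2(1 / p) over every distinct pixel value.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# A uniformly gray image carries no information; a varied one carries more.
flat_image = [128] * 64
varied_image = list(range(64))

print(pixel_entropy(flat_image))
print(pixel_entropy(varied_image))
```

An image made of a single value has zero entropy, which is why such images would rank as the most severe low_information cases under this kind of measure.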
1.2 Basic use
This section describes how to read the images in a folder and check them for problems. The following example runs a quality check on a folder containing 607 images. During detection, CleanVision automatically uses multiple processes to speed up processing:
Basic use
from cleanvision import Imagelab
# Example data: https://cleanlab-public./CleanVision/image_files.zip
# Read the example image
dataset_path = "./image_files/"
# Instantiate the Imagelab class for subsequent processing
imagelab = Imagelab(data_path=dataset_path)
# find_issues uses multiprocessing; n_jobs sets the number of processes
# n_jobs defaults to None, which means the number of processes is determined automatically
# Processing first checks the image properties of each image
# After all images are processed, duplicates are detected
imagelab.find_issues(verbose=False, n_jobs=2)
Reading images from D:/cleanvision/image_files
If you run CleanVision on Windows, the code must be placed under an if __name__ == "__main__": guard so that the multiprocessing module loads correctly. Alternatively, set n_jobs to 1 to use a single process:
from cleanvision import Imagelab
if __name__ == "__main__":
    # Sample data: https://cleanlab-public./CleanVision/image_files.zip
    # Read the sample images
    dataset_path = "./image_files/"
    imagelab = Imagelab(data_path=dataset_path)
    imagelab.find_issues(verbose=False)
Reading images from D:/cleanvision/image_files
The report function reports the number of images found for each problem type in the dataset and shows the most severe instances of each problem type:
imagelab.report()
Issues found in images in order of severity in the dataset
| | issue_type | num_images |
|---:|:-----------------|-------------:|
| 0 | odd_size | 109 |
| 1 | grayscale | 20 |
| 2 | near_duplicates | 20 |
| 3 | exact_duplicates | 19 |
| 4 | odd_aspect_ratio | 11 |
| 5 | dark | 10 |
| 6 | blurry | 6 |
| 7 | light | 5 |
| 8 | low_information | 5 |
--------------------- odd_size images ----------------------
Number of examples with this issue: 109
Examples representing most severe instances of this issue:
--------------------- grayscale images ---------------------
Number of examples with this issue: 20
Examples representing most severe instances of this issue:
------------------ near_duplicates images ------------------
Number of examples with this issue: 20
Examples representing most severe instances of this issue:
Set: 0
Set: 1
Set: 2
Set: 3
----------------- exact_duplicates images ------------------
Number of examples with this issue: 19
Examples representing most severe instances of this issue:
Set: 0
Set: 1
Set: 2
Set: 3
----------------- odd_aspect_ratio images ------------------
Number of examples with this issue: 11
Examples representing most severe instances of this issue:
----------------------- dark images ------------------------
Number of examples with this issue: 10
Examples representing most severe instances of this issue:
---------------------- blurry images -----------------------
Number of examples with this issue: 6
Examples representing most severe instances of this issue:
----------------------- light images -----------------------
Number of examples with this issue: 5
Examples representing most severe instances of this issue:
------------------ low_information images ------------------
Number of examples with this issue: 5
Examples representing most severe instances of this issue:
To create a custom issue type, see custom_issue_manager.
The main way to interact with the results is through the Imagelab class. It can be used to understand the issues in the dataset at the macro level (global overview) and at the micro level (issues and quality scores for each image). It has three main attributes:
- Imagelab.issue_summary: summary of issues
- Imagelab.issues: list of issues
- Imagelab.info: dataset information, including information on similar images
Analyzing the detected issues
The issue_summary attribute allows you to show the number of images in the dataset for different issue categories:
# return the result as a dataframe in pandas
res = imagelab.issue_summary
type(res)
View summary results:
res
 | issue_type | num_images |
---|---|---|
0 | odd_size | 109 |
1 | grayscale | 20 |
2 | near_duplicates | 20 |
3 | exact_duplicates | 19 |
4 | odd_aspect_ratio | 11 |
5 | dark | 10 |
6 | blurry | 6 |
7 | light | 5 |
8 | low_information | 5 |
The issues attribute allows you to display the quality scores of various issues and their presence in each image. These quality scores range from 0 to 1, with lower scores indicating a higher severity of issues:
imagelab.issues.head()
 | odd_size_score | is_odd_size_issue | odd_aspect_ratio_score | is_odd_aspect_ratio_issue | low_information_score | is_low_information_issue | light_score | is_light_issue | grayscale_score | is_grayscale_issue | dark_score | is_dark_issue | blurry_score | is_blurry_issue | exact_duplicates_score | is_exact_duplicates_issue | near_duplicates_score | is_near_duplicates_issue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
D:/cleanvision/image_files/image_0.png | 1.0 | False | 1.0 | False | 0.806332 | False | 0.925490 | False | 1 | False | 1.000000 | False | 0.980373 | False | 1.0 | False | 1.0 | False |
D:/cleanvision/image_files/image_1.png | 1.0 | False | 1.0 | False | 0.923116 | False | 0.906609 | False | 1 | False | 0.990676 | False | 0.472314 | False | 1.0 | False | 1.0 | False |
D:/cleanvision/image_files/image_10.png | 1.0 | False | 1.0 | False | 0.875129 | False | 0.995127 | False | 1 | False | 0.795937 | False | 0.470706 | False | 1.0 | False | 1.0 | False |
D:/cleanvision/image_files/image_100.png | 1.0 | False | 1.0 | False | 0.916140 | False | 0.889762 | False | 1 | False | 0.827587 | False | 0.441195 | False | 1.0 | False | 1.0 | False |
D:/cleanvision/image_files/image_101.png | 1.0 | False | 1.0 | False | 0.779338 | False | 0.960784 | False | 0 | True | 0.992157 | False | 0.507767 | False | 1.0 | False | 1.0 | False |
Since a Pandas data table is returned, it can be filtered for specific types of data:
# The smaller the score, the more severe the issue
dark_images = imagelab.issues[imagelab.issues["is_dark_issue"] == True].sort_values(
    by=["dark_score"]
)
dark_images_files = dark_images.index.tolist()
dark_images_files
['D:/cleanvision/image_files/image_417.png',
'D:/cleanvision/image_files/image_350.png',
'D:/cleanvision/image_files/image_605.png',
'D:/cleanvision/image_files/image_177.png',
'D:/cleanvision/image_files/image_346.png',
'D:/cleanvision/image_files/image_198.png',
'D:/cleanvision/image_files/image_204.png',
'D:/cleanvision/image_files/image_485.png',
'D:/cleanvision/image_files/image_457.png',
'D:/cleanvision/image_files/image_576.png']
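The same filter-and-sort logic can be sketched without pandas, which may help when checking what the expression above does. The file names and scores below are invented for illustration; only the column names is_dark_issue and dark_score come from the issues table.

```python
# Hypothetical per-image results keyed by file name (values invented for this sketch).
issues = {
    "a.png": {"dark_score": 0.91, "is_dark_issue": False},
    "b.png": {"dark_score": 0.12, "is_dark_issue": True},
    "c.png": {"dark_score": 0.05, "is_dark_issue": True},
}

# Keep only flagged images, then sort ascending so the most severe come first.
dark_files = sorted(
    (name for name, row in issues.items() if row["is_dark_issue"]),
    key=lambda name: issues[name]["dark_score"],
)
print(dark_files)
```

Sorting in ascending order puts the lowest scores first, matching the convention that a lower score means a more severe issue.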
Visualize some of these problem images:
imagelab.visualize(image_files=dark_images_files[:4])
A cleaner way to accomplish the same task is to pass the issue_types parameter directly to the visualize function, which displays the images with a particular issue sorted in order of severity:
# issue_types: issue types, num_images: number of images to display, cell_size: size of each image in the grid
imagelab.visualize(issue_types=["low_information"], num_images=3, cell_size=(3, 3))
View image information and similar images
The info attribute allows you to view information about the dataset:
# View items that exist
imagelab.info.keys()
dict_keys(['statistics', 'dark', 'light', 'odd_aspect_ratio', 'low_information', 'blurry', 'grayscale', 'odd_size', 'exact_duplicates', 'near_duplicates'])
# View statistics
imagelab.info["statistics"].keys()
dict_keys(['brightness', 'aspect_ratio', 'entropy', 'blurriness', 'color_space', 'size'])
# View statistics for the dataset
imagelab.info["statistics"]["size"]
count 607.000000
mean 280.830152
std 215.001908
min 32.000000
25% 256.000000
50% 256.000000
75% 256.000000
max 4666.050578
Name: size, dtype: float64
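As a rough guide to reading these statistics, the sketch below assumes that the size statistic is the square root of the image area (which is consistent with values such as 32 and 256 for square images) and that the aspect ratio is the shorter side divided by the longer side. Both definitions and both function names are assumptions made for illustration, not code taken from CleanVision.

```python
import math


def size_stat(width, height):
    """Square root of the image area (assumed definition of 'size')."""
    return math.sqrt(width * height)


def aspect_ratio_stat(width, height):
    """Ratio of the shorter side to the longer side, in (0, 1]."""
    return min(width / height, height / width)


print(size_stat(256, 256))          # a 256x256 image
print(aspect_ratio_stat(200, 100))  # a 2:1 image
```

Under these assumptions, an image is an odd_size candidate when its size lies far from the bulk of the distribution, and an odd_aspect_ratio candidate when its ratio is far below 1.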
View the number of sets of exact duplicates in the dataset:
imagelab.info["exact_duplicates"]["num_sets"]
9
View the sets of near-duplicate images in the dataset:
imagelab.info["near_duplicates"]["sets"]
[['D:/cleanvision/image_files/image_103.png',
'D:/cleanvision/image_files/image_408.png'],
['D:/cleanvision/image_files/image_109.png',
'D:/cleanvision/image_files/image_329.png'],
['D:/cleanvision/image_files/image_119.png',
'D:/cleanvision/image_files/image_250.png'],
['D:/cleanvision/image_files/image_140.png',
'D:/cleanvision/image_files/image_538.png'],
['D:/cleanvision/image_files/image_25.png',
'D:/cleanvision/image_files/image_357.png'],
['D:/cleanvision/image_files/image_255.png',
'D:/cleanvision/image_files/image_43.png'],
['D:/cleanvision/image_files/image_263.png',
'D:/cleanvision/image_files/image_486.png'],
['D:/cleanvision/image_files/image_3.png',
'D:/cleanvision/image_files/image_64.png'],
['D:/cleanvision/image_files/image_389.png',
'D:/cleanvision/image_files/image_426.png'],
['D:/cleanvision/image_files/image_52.png',
'D:/cleanvision/image_files/image_66.png']]
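The idea of grouping images into duplicate sets by hash can be sketched as follows. This is a minimal illustration that assumes a SHA-256 digest of the raw bytes stands in for the image hash; the function name find_exact_duplicate_sets and the toy byte strings are hypothetical. Near-duplicate detection relies on perceptual hashes and is not covered by this sketch.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicate_sets(images):
    """Group names whose raw bytes produce the same digest."""
    by_digest = defaultdict(list)
    for name, data in images.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(name)
    # Only groups with more than one member count as duplicate sets.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Toy byte strings standing in for image files.
images = {
    "image_0.png": b"\x00\x01\x02",
    "image_1.png": b"\x00\x01\x02",
    "image_2.png": b"\xff\xff\xff",
}
duplicate_sets = find_exact_duplicate_sets(images)
print(duplicate_sets, len(duplicate_sets))
```

The list of groups plays the role of the "sets" entry, and its length plays the role of "num_sets".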
1.3 Customized detection
Specify the type of detection
from cleanvision import Imagelab
# sample data:https://cleanlab-public./CleanVision/image_files.zip
dataset_path = "./image_files/"
# Specify the type of detection
issue_types = {"blurry":{}, "dark": {}}
imagelab = Imagelab(data_path=dataset_path)
imagelab.find_issues(issue_types=issue_types, verbose=False)
imagelab.report()
Reading images from D:/cleanvision/image_files
Issues found in images in order of severity in the dataset
| | issue_type | num_images |
|---:|:-------------|-------------:|
| 0 | dark | 10 |
| 1 | blurry | 6 |
----------------------- dark images ------------------------
Number of examples with this issue: 10
Examples representing most severe instances of this issue:
---------------------- blurry images -----------------------
Number of examples with this issue: 6
Examples representing most severe instances of this issue:
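Because issue_types is a plain dictionary keyed by the keywords from section 1.1, a typo in a keyword is easy to make. A small hypothetical helper such as build_issue_types below (not part of CleanVision) could validate the keywords before they are passed to find_issues:

```python
# Keywords taken from the table in section 1.1.
KNOWN_ISSUE_TYPES = {
    "exact_duplicates", "near_duplicates", "blurry", "low_information",
    "dark", "light", "grayscale", "odd_aspect_ratio", "odd_size",
}


def build_issue_types(keywords):
    """Return an issue_types dict with default (empty) settings per keyword."""
    unknown = sorted(set(keywords) - KNOWN_ISSUE_TYPES)
    if unknown:
        raise ValueError(f"Unknown issue types: {unknown}")
    return {keyword: {} for keyword in keywords}


issue_types = build_issue_types(["blurry", "dark"])
print(issue_types)
```

The resulting dictionary has the same shape as the one written by hand above, with an empty settings dict meaning that the defaults are used.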
If find_issues has already been run, running it again with a new detection type merges the new results with the previous ones:
issue_types = {"light": {}}
imagelab.find_issues(issue_types)
# Report the results for all three types
imagelab.report()
Checking for light images ...
Issue checks completed. 21 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
Issues found in images in order of severity in the dataset
| | issue_type | num_images |
|---:|:-------------|-------------:|
| 0 | dark | 10 |
| 1 | blurry | 6 |
| 2 | light | 5 |
----------------------- dark images ------------------------
Number of examples with this issue: 10
Examples representing most severe instances of this issue:
---------------------- blurry images -----------------------
Number of examples with this issue: 6
Examples representing most severe instances of this issue:
----------------------- light images -----------------------
Number of examples with this issue: 5
Examples representing most severe instances of this issue:
Saving results
The following code shows how to save and load the results. When loading, the data path and the dataset must be the same as when the results were saved:
save_path = "./results"
# Save the results
# force indicates whether to overwrite existing files
imagelab.save(save_path, force=True)
# Load the results
imagelab = Imagelab.load(save_path, dataset_path)
Successfully loaded Imagelab
Threshold setting
CleanVision controls each detection with hyperparameters. exact_duplicates and near_duplicates are detected using image hashes (provided by the imagehash library), while the other detection types use a threshold between 0 and 1. If an image's score for a particular problem type falls below the threshold, the image is considered to have that problem, so a higher threshold flags more images. The hyperparameters are listed below:
 | Keyword | Hyperparameters |
---|---|---|
1 | light | threshold |
2 | dark | threshold |
3 | odd_aspect_ratio | threshold |
4 | exact_duplicates | N/A |
5 | near_duplicates | hash_size(int),hash_types(whash,phash,ahash,dhash,chash) |
6 | blurry | threshold |
7 | grayscale | threshold |
8 | low_information | threshold |
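The effect of the threshold can be sketched in a few lines. The helper flag_issues and the scores are invented for illustration; the only rule taken from the text is that an image is flagged when its score falls below the threshold.

```python
def flag_issues(scores, threshold):
    """Return the names whose score falls below the threshold."""
    return [name for name, score in scores.items() if score < threshold]


# Invented dark scores for three images (lower means darker).
dark_scores = {"a.png": 0.20, "b.png": 0.45, "c.png": 0.90}

print(flag_issues(dark_scores, threshold=0.3))
print(flag_issues(dark_scores, threshold=0.5))
```

Raising the threshold from 0.3 to 0.5 flags one more image in this toy data, which mirrors why a higher threshold reports more images.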
For a single detection type, the threshold setting code is as follows:
imagelab = Imagelab(data_path=dataset_path)
issue_types = {"dark": {"threshold": 0.5}}
imagelab.find_issues(issue_types)
imagelab.report()
Reading images from D:/cleanvision/image_files
Checking for dark images ...
Issue checks completed. 20 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
Issues found in images in order of severity in the dataset
| | issue_type | num_images |
|---:|:-------------|-------------:|
| 0 | dark | 20 |
----------------------- dark images ------------------------
Number of examples with this issue: 20
Examples representing most severe instances of this issue:
If a certain type of problem is expected in a dataset, such as overly dark images in astronomical datasets, a maximum prevalence (max_prevalence) can be set. If the fraction of images with a particular problem exceeds max_prevalence, that problem is treated as normal and is not reported. With the default threshold used earlier, 10 of the 607 images have the dark problem, a fraction of about 0.016. If max_prevalence is set to 0.015, the dark problem is therefore removed from the report:
imagelab.report(max_prevalence=0.015)
Removing dark from potential issues in the dataset as it exceeds max_prevalence=0.015
Please specify some issue_types to check for in imagelab.find_issues().
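The prevalence check can be sketched as a simple filter. The helper drop_prevalent_issues is hypothetical, and whether CleanVision treats a fraction exactly equal to max_prevalence as exceeding it is an assumption here; the counts are those of the earlier three-type report on 607 images.

```python
def drop_prevalent_issues(issue_counts, total_images, max_prevalence):
    """Keep only issue types whose share of images does not exceed max_prevalence."""
    return {
        issue: count
        for issue, count in issue_counts.items()
        if count / total_images <= max_prevalence
    }


counts = {"dark": 10, "blurry": 6, "light": 5}
reported = drop_prevalent_issues(counts, total_images=607, max_prevalence=0.015)
print(round(10 / 607, 3))
print(reported)
```

In this sketch only dark is dropped, because 10 / 607 is above 0.015 while the shares of blurry and light images are below it.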
1.4 Running CleanVision on the Torchvision dataset
CleanVision supports problem detection on Torchvision datasets with the following code:
Preparing the dataset
from torchvision.datasets import CIFAR10
from torch.utils.data import ConcatDataset
from cleanvision import Imagelab
# Prepare the CIFAR10 dataset from torchvision
train_set = CIFAR10(root="./", download=True)
test_set = CIFAR10(root="./", train=False, download=True)
Files already downloaded and verified
Files already downloaded and verified
# View the number of samples in the training set and test set
len(train_set), len(test_set)
(50000, 10000)
If you want to merge the training and test sets, you can use the following code:
dataset = ConcatDataset([train_set, test_set])
len(dataset)
60000
View image:
dataset[0][0]
Running CleanVision
To work with a Torchvision dataset, simply pass it as the torchvision_dataset parameter when creating the Imagelab instance. The subsequent steps are the same as when reading images from a folder:
imagelab = Imagelab(torchvision_dataset=dataset)
imagelab.find_issues()
# View the results
# imagelab.report()
Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...
Issue checks completed. 173 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
# Summary of results
imagelab.issue_summary
 | issue_type | num_images |
---|---|---|
0 | blurry | 118 |
1 | near_duplicates | 40 |
2 | dark | 11 |
3 | light | 3 |
4 | low_information | 1 |
5 | grayscale | 0 |
6 | odd_aspect_ratio | 0 |
7 | odd_size | 0 |
8 | exact_duplicates | 0 |
1.5 Running CleanVision on the Hugging Face dataset
CleanVision also supports problem detection on Hugging Face datasets (provided the dataset can be downloaded), with the following code:
# The datasets library is used to download Hugging Face datasets
from datasets import load_dataset
from cleanvision import Imagelab
# Take /datasets/mah91/cat as an example.
# To download a particular Hugging Face dataset, set the path parameter to the part of its address after /datasets/
# split selects the train or test split; if no split is given, the complete dataset is returned
dataset = load_dataset(path="mah91/cat", split="train")
Repo card metadata block was not found. Setting CardData to empty.
# View the dataset: it has 800 images, provided without annotations
dataset
Dataset({
features: ['image'],
num_rows: 800
})
# features lists the columns in the dataset and the type of each column, e.g. image, audio
dataset.features
{'image': Image(mode=None, decode=True, id=None)}
Specify the hf_dataset parameter to load the Hugging Face dataset:
# Load the data into CleanVision; image_key names the column that contains the images
imagelab = Imagelab(hf_dataset=dataset, image_key="image")
Run the detection as follows:
imagelab.find_issues()
# Summary of results
imagelab.issue_summary
Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...
Issue checks completed. 4 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
 | issue_type | num_images |
---|---|---|
0 | blurry | 3 |
1 | odd_size | 1 |
2 | dark | 0 |
3 | grayscale | 0 |
4 | light | 0 |
5 | low_information | 0 |
6 | odd_aspect_ratio | 0 |
7 | exact_duplicates | 0 |
8 | near_duplicates | 0 |
2 Reference
- CleanVision
- CleanVision-docs
- custom_issue_manager
- imagehash