[Image Processing] Cleaning image datasets with the CleanVision library

CleanVision is an open-source Python library that automatically detects common problems in image datasets that can affect machine learning projects. It is intended as a first step in computer vision work: identify and resolve dataset problems before training a model. Its core functionality is the detection of problematic images, including exact duplicates, near duplicates, blurry images, low-information images, images that are too dark or too bright, grayscale images, irregular aspect ratios, and unusual sizes. The open-source repository is CleanVision, and the official documentation is CleanVision-docs.

The basic version of CleanVision is installed with:

pip install cleanvision

The full version is installed with:

pip install "cleanvision[all]"

Check the installed CleanVision version:

# View version
import cleanvision
cleanvision.__version__

'0.3.6'

The code in this article also requires the following library version:

# for tabular display
import tabulate
# tabulate version needs to be 0.8.10 or higher.
tabulate.__version__

'0.9.0'

1 Instructions for use
- 1.1 Introduction to CleanVision Features
- 1.2 Basic use
- 1.3 Customized detection
- 1.4 Running CleanVision on the Torchvision dataset
- 1.5 Running CleanVision on the Hugging Face dataset
2 Reference

1 Instructions for use

1.1 Introduction to CleanVision Features

CleanVision supports a wide range of image file formats and can detect the following types of data issues:

Issue type	Description	Keyword
exact duplicate	Exactly identical images	exact_duplicates
near duplicate	Visually almost identical images	near_duplicates
blurry	Images with blurred details (out of focus)	blurry
low information	Images lacking content (low entropy of pixel values)	low_information
too dark	Unusually dark images (underexposed)	dark
too bright	Unusually bright images (overexposed)	light
grayscale	Images lacking color	grayscale
odd aspect ratio	Images with unusual aspect ratios	odd_aspect_ratio
odd size	Images with unusual dimensions compared to other images in the dataset	odd_size

CleanVision relies on a variety of statistical methods to detect the issues in the table above. The keywords in the last column are the names used to specify each issue type in CleanVision code. CleanVision is compatible with Linux, macOS, and Windows, and runs on Python 3.7 and above.
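
As one concrete example of such a statistic, the low_information check is described in the table as a small entropy of pixel values. A minimal sketch of an entropy score scaled to the range 0 to 1 follows. The function name and the normalisation by 256 grey levels are assumptions made for illustration, not CleanVision's implementation:

```python
import math
from collections import Counter


def normalized_entropy(pixels, levels=256):
    """Shannon entropy of the pixel values, scaled to the range [0, 1]."""
    counts = Counter(pixels)
    total = len(pixels)
    # Sum p * log2(1/p) over the distinct pixel values
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    return entropy / math.log2(levels)


flat = [128] * 64          # a single-colour image carries no information
varied = list(range(256))  # every grey level used exactly once

print(normalized_entropy(flat))    # 0.0
print(normalized_entropy(varied))  # 1.0
```

An image whose value is close to 0 would be a natural candidate for a low-information flag.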

1.2 Basic use

This section describes how to run issue detection on the images in a folder. The following example checks a folder containing 607 images. During detection, CleanVision automatically uses multiple processes to speed up processing:

Basic use

from cleanvision import Imagelab

# Example data: https://cleanlab-public./CleanVision/image_files.zip
# Read the example image
dataset_path = "./image_files/"

# Instantiate the Imagelab class for subsequent processing
imagelab = Imagelab(data_path=dataset_path)

# find_issues can run in parallel; n_jobs sets the number of processes
# n_jobs defaults to None, which means the number of processes is chosen automatically
# Image properties are checked for every image first;
# duplicates are detected after all images have been processed
imagelab.find_issues(verbose=False, n_jobs=2)

Reading images from D:/cleanvision/image_files

If you run CleanVision on Windows, the code that calls it must sit under a main guard (the if '__main__' == __name__ check used below) so that the multiprocessing module loads correctly. Alternatively, set n_jobs to 1 to use a single process:

from cleanvision import Imagelab

if '__main__' == __name__:
    # Sample data: https://cleanlab-public./CleanVision/image_files.zip
    # Read the sample images
    dataset_path = "./image_files/"
    imagelab = Imagelab(data_path=dataset_path)
    imagelab.find_issues(verbose=False)

Reading images from D:/cleanvision/image_files

The report function prints the number of images found for each issue type in the dataset and shows the most severe examples of each type:

imagelab.report()

Issues found in images in order of severity in the dataset

|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  0 | odd_size         |          109 |
|  1 | grayscale        |           20 |
|  2 | near_duplicates  |           20 |
|  3 | exact_duplicates |           19 |
|  4 | odd_aspect_ratio |           11 |
|  5 | dark             |           10 |
|  6 | blurry           |            6 |
|  7 | light            |            5 |
|  8 | low_information  |            5 | 

--------------------- odd_size images ----------------------

Number of examples with this issue: 109
Examples representing most severe instances of this issue:

png

--------------------- grayscale images ---------------------

Number of examples with this issue: 20
Examples representing most severe instances of this issue:

png

------------------ near_duplicates images ------------------

Number of examples with this issue: 20
Examples representing most severe instances of this issue:

Set: 0

png

Set: 1

png

Set: 2

png

Set: 3

png

----------------- exact_duplicates images ------------------

Number of examples with this issue: 19
Examples representing most severe instances of this issue:

Set: 0

png

Set: 1

png

Set: 2

png

Set: 3

png

----------------- odd_aspect_ratio images ------------------

Number of examples with this issue: 11
Examples representing most severe instances of this issue:

png

----------------------- dark images ------------------------

Number of examples with this issue: 10
Examples representing most severe instances of this issue:

png

---------------------- blurry images -----------------------

Number of examples with this issue: 6
Examples representing most severe instances of this issue:

png

----------------------- light images -----------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

png

------------------ low_information images ------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

png

To create a custom issue type, refer to custom_issue_manager.

The main way to interact with the data results is through the Imagelab class. This class can be used to understand the issues in the dataset at the macro level (global overview) and at the micro level (issues and quality scores for each image). It contains three main properties:

Imagelab.issue_summary: summary of issues
Imagelab.issues: list of issues, with flags and quality scores for each image
Imagelab.info: dataset information, including information on similar images

Analyzing the detection results

The issue_summary attribute allows you to show the number of images in the dataset for different issue categories:

# return the result as a dataframe in pandas
res = imagelab.issue_summary
type(res)

View summary results:

res

	issue_type	num_images
0	odd_size	109
1	grayscale	20
2	near_duplicates	20
3	exact_duplicates	19
4	odd_aspect_ratio	11
5	dark	10
6	blurry	6
7	light	5
8	low_information	5

The issues attribute allows you to display the quality scores of various issues and their presence in each image. These quality scores range from 0 to 1, with lower scores indicating a higher severity of issues:

imagelab.issues.head()

	odd_size_score	is_odd_size_issue	odd_aspect_ratio_score	is_odd_aspect_ratio_issue	low_information_score	is_low_information_issue	light_score	is_light_issue	grayscale_score	is_grayscale_issue	dark_score	is_dark_issue	blurry_score	is_blurry_issue	exact_duplicates_score	is_exact_duplicates_issue	near_duplicates_score	is_near_duplicates_issue
D:/cleanvision/image_files/image_0.png	1.0	False	1.0	False	0.806332	False	0.925490	False	1	False	1.000000	False	0.980373	False	1.0	False	1.0	False
D:/cleanvision/image_files/image_1.png	1.0	False	1.0	False	0.923116	False	0.906609	False	1	False	0.990676	False	0.472314	False	1.0	False	1.0	False
D:/cleanvision/image_files/image_10.png	1.0	False	1.0	False	0.875129	False	0.995127	False	1	False	0.795937	False	0.470706	False	1.0	False	1.0	False
D:/cleanvision/image_files/image_100.png	1.0	False	1.0	False	0.916140	False	0.889762	False	1	False	0.827587	False	0.441195	False	1.0	False	1.0	False
D:/cleanvision/image_files/image_101.png	1.0	False	1.0	False	0.779338	False	0.960784	False	0	True	0.992157	False	0.507767	False	1.0	False	1.0	False

Since a Pandas data table is returned, it can be filtered for specific types of data:

# Sort by score; the lower the score, the more severe the issue
dark_images = imagelab.issues[imagelab.issues["is_dark_issue"] == True].sort_values(
    by=["dark_score"]
)
dark_images_files = dark_images.index.tolist()
dark_images_files

['D:/cleanvision/image_files/image_417.png',
 'D:/cleanvision/image_files/image_350.png',
 'D:/cleanvision/image_files/image_605.png',
 'D:/cleanvision/image_files/image_177.png',
 'D:/cleanvision/image_files/image_346.png',
 'D:/cleanvision/image_files/image_198.png',
 'D:/cleanvision/image_files/image_204.png',
 'D:/cleanvision/image_files/image_485.png',
 'D:/cleanvision/image_files/image_457.png',
 'D:/cleanvision/image_files/image_576.png']

Visualize some of these problem images:

imagelab.visualize(image_files=dark_images_files[:4])

png

A simpler way to achieve the same result is to pass the issue_types parameter directly to the visualize function, which displays the images for a given issue type sorted by severity:

# issue_types: issue types to show, num_images: number of images to display, cell_size: size of each image in the grid
imagelab.visualize(issue_types=["low_information"], num_images=3, cell_size=(3, 3))

png
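
Because the issues table is an ordinary pandas DataFrame, a common follow-up is to collect every image flagged by at least one check. The sketch below does this on a small hand-built table that mimics the column naming shown above. The file names, scores, and the helper flagged_images are made up for illustration; in real use the table would be imagelab.issues:

```python
import pandas as pd

# Hypothetical stand-in for imagelab.issues; the values are illustrative only.
issues = pd.DataFrame(
    {
        "dark_score": [0.99, 0.12, 0.80],
        "is_dark_issue": [False, True, False],
        "blurry_score": [0.98, 0.47, 0.20],
        "is_blurry_issue": [False, False, True],
    },
    index=["image_0.png", "image_1.png", "image_2.png"],
)


def flagged_images(issues_df):
    """Return the index labels of rows where any is_*_issue column is True."""
    flag_columns = [
        c for c in issues_df.columns if c.startswith("is_") and c.endswith("_issue")
    ]
    mask = issues_df[flag_columns].any(axis=1)
    return issues_df.index[mask].tolist()


print(flagged_images(issues))  # ['image_1.png', 'image_2.png']
```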

View image information and similar images

The info attribute allows you to view information about the dataset:

# View items that exist
imagelab.info.keys()

dict_keys(['statistics', 'dark', 'light', 'odd_aspect_ratio', 'low_information', 'blurry', 'grayscale', 'odd_size', 'exact_duplicates', 'near_duplicates'])

# View statistics
imagelab.info["statistics"].keys()

dict_keys(['brightness', 'aspect_ratio', 'entropy', 'blurriness', 'color_space', 'size'])

# View statistics for the dataset
imagelab.info["statistics"]["size"]

count     607.000000
mean      280.830152
std       215.001908
min        32.000000
25%       256.000000
50%       256.000000
75%       256.000000
max      4666.050578
Name: size, dtype: float64

View the number of sets of exact duplicate images in the dataset:

imagelab.info["exact_duplicates"]["num_sets"]

View the sets of near-duplicate images in the dataset:

imagelab.info["near_duplicates"]["sets"]

[['D:/cleanvision/image_files/image_103.png',
  'D:/cleanvision/image_files/image_408.png'],
 ['D:/cleanvision/image_files/image_109.png',
  'D:/cleanvision/image_files/image_329.png'],
 ['D:/cleanvision/image_files/image_119.png',
  'D:/cleanvision/image_files/image_250.png'],
 ['D:/cleanvision/image_files/image_140.png',
  'D:/cleanvision/image_files/image_538.png'],
 ['D:/cleanvision/image_files/image_25.png',
  'D:/cleanvision/image_files/image_357.png'],
 ['D:/cleanvision/image_files/image_255.png',
  'D:/cleanvision/image_files/image_43.png'],
 ['D:/cleanvision/image_files/image_263.png',
  'D:/cleanvision/image_files/image_486.png'],
 ['D:/cleanvision/image_files/image_3.png',
  'D:/cleanvision/image_files/image_64.png'],
 ['D:/cleanvision/image_files/image_389.png',
  'D:/cleanvision/image_files/image_426.png'],
 ['D:/cleanvision/image_files/image_52.png',
  'D:/cleanvision/image_files/image_66.png']]
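
Once the duplicate sets are known, one way to clean the dataset is to keep a single file from each set and mark the rest for removal. The sketch below shows that idea on two of the sets listed above. Keeping the alphabetically first file is an arbitrary policy chosen for illustration, and files_to_remove is a hypothetical helper, not a CleanVision function; the actual deletion is left to the caller:

```python
def files_to_remove(duplicate_sets):
    """Keep the alphabetically first file of every set and return the others."""
    removable = []
    for group in duplicate_sets:
        _keeper, *extras = sorted(group)
        removable.extend(extras)
    return removable


# Two of the near-duplicate sets listed above
near_duplicate_sets = [
    [
        "D:/cleanvision/image_files/image_103.png",
        "D:/cleanvision/image_files/image_408.png",
    ],
    [
        "D:/cleanvision/image_files/image_255.png",
        "D:/cleanvision/image_files/image_43.png",
    ],
]

print(files_to_remove(near_duplicate_sets))
```

Here the result contains image_408.png and image_43.png, one file from each set.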

1.3 Customized detection

Specify the type of detection

from cleanvision import Imagelab

# Sample data: https://cleanlab-public./CleanVision/image_files.zip
dataset_path = "./image_files/"

# Specify the type of detection
issue_types = {"blurry":{}, "dark": {}}

imagelab = Imagelab(data_path=dataset_path)

imagelab.find_issues(issue_types=issue_types, verbose=False)
imagelab.report()

Reading images from D:/cleanvision/image_files


Issues found in images in order of severity in the dataset

|    | issue_type   |   num_images |
|---:|:-------------|-------------:|
|  0 | dark         |           10 |
|  1 | blurry       |            6 | 

----------------------- dark images ------------------------

Number of examples with this issue: 10
Examples representing most severe instances of this issue:

png

---------------------- blurry images -----------------------

Number of examples with this issue: 6
Examples representing most severe instances of this issue:

png

If find_issues has already been run, running it again with a new issue type merges the new results with the previous ones:

issue_types = {"light": {}}
imagelab.find_issues(issue_types)
# Report the results for all three types
imagelab.report()

Checking for light images ...
Issue checks completed. 21 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
Issues found in images in order of severity in the dataset

|    | issue_type   |   num_images |
|---:|:-------------|-------------:|
|  0 | dark         |           10 |
|  1 | blurry       |            6 |
|  2 | light        |            5 | 

----------------------- dark images ------------------------

Number of examples with this issue: 10
Examples representing most severe instances of this issue:

png

---------------------- blurry images -----------------------

Number of examples with this issue: 6
Examples representing most severe instances of this issue:

png

----------------------- light images -----------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

png

Results Saving

The following code shows how to save and load the results. When loading, the data path and the dataset must be the same as when the results were saved:

save_path = "./results"
# Save the results
# force controls whether an existing file is overwritten
imagelab.save(save_path, force=True)

# Load the results
imagelab = Imagelab.load(save_path, dataset_path)

Successfully loaded Imagelab

Threshold setting

CleanVision controls each check through hyperparameters. exact_duplicates and near_duplicates are detected with image hashes (computed by the imagehash library), while the other checks use a threshold between 0 and 1. If an image's score for a given issue type falls below the threshold, the image is considered to have that issue, so a higher threshold flags more images. The available hyperparameters are listed below:

	Keyword	Hyperparameters
1	light	threshold
2	dark	threshold
3	odd_aspect_ratio	threshold
4	exact_duplicates	N/A
5	near_duplicates	hash_size (int), hash_types (whash, phash, ahash, dhash, chash)
6	blurry	threshold
7	grayscale	threshold
8	low_information	threshold
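
The hash types in the table are image hashing schemes in which visually similar images should map to identical or nearly identical bit strings. As a rough illustration of the idea behind an average hash (ahash), the sketch below thresholds each pixel against the image mean. It is a simplification written for this article: it works on a tiny hand-made grayscale grid instead of a resized image, and it is neither imagehash's nor CleanVision's actual code. Whether two images must have identical hashes or merely a small distance to count as duplicates is a design choice that this sketch does not settle:

```python
def average_hash(gray_rows):
    """Return a tuple of bits: 1 where a pixel is brighter than the image mean."""
    pixels = [p for row in gray_rows for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(int(p > mean) for p in pixels)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


original = [[10, 200], [220, 30]]
brighter = [[20, 210], [230, 40]]   # same picture, uniformly brightened
different = [[200, 10], [30, 220]]  # bright and dark regions swapped

print(hamming_distance(average_hash(original), average_hash(brighter)))   # 0
print(hamming_distance(average_hash(original), average_hash(different)))  # 4
```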

For a single detection type, the threshold setting code is as follows:

imagelab = Imagelab(data_path=dataset_path)
issue_types = {"dark": {"threshold": 0.5}}
imagelab.find_issues(issue_types)

imagelab.report()

Reading images from D:/cleanvision/image_files
Checking for dark images ...

Issue checks completed. 20 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
Issues found in images in order of severity in the dataset

|    | issue_type   |   num_images |
|---:|:-------------|-------------:|
|  0 | dark         |           20 | 

----------------------- dark images ------------------------

Number of examples with this issue: 20
Examples representing most severe instances of this issue:

png
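
The increase from 10 to 20 dark images when the threshold is raised to 0.5 follows from the rule described earlier: an image is flagged when its score is below the threshold. A minimal sketch of that rule, using the blurry_score values from the issues table shown earlier (file names shortened); the helper flag_below is hypothetical and not part of CleanVision:

```python
# blurry_score values copied from the issues table shown earlier
blurry_scores = {
    "image_0.png": 0.980373,
    "image_1.png": 0.472314,
    "image_10.png": 0.470706,
    "image_100.png": 0.441195,
    "image_101.png": 0.507767,
}


def flag_below(scores, threshold):
    """Return the names whose score is strictly below the threshold, worst first."""
    flagged = [name for name, score in scores.items() if score < threshold]
    return sorted(flagged, key=scores.get)


print(flag_below(blurry_scores, 0.3))   # []
print(flag_below(blurry_scores, 0.45))  # ['image_100.png']
print(flag_below(blurry_scores, 0.5))   # ['image_100.png', 'image_10.png', 'image_1.png']
```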

If a certain type of issue is expected in a dataset, such as very dark images in astronomical data, a maximum prevalence (max_prevalence) can be set. If the fraction of images with a given issue exceeds max_prevalence, that issue is treated as normal and is dropped from the report. With the default settings used earlier, 10 of the 607 images are flagged as dark, a fraction of about 0.016 (with the threshold of 0.5 used just above, 20 images, or about 0.033). Either fraction exceeds a max_prevalence of 0.015, so the dark issue is no longer reported:

imagelab.report(max_prevalence=0.015)

Removing dark from potential issues in the dataset as it exceeds max_prevalence=0.015 
Please specify some issue_types to check for in imagelab.find_issues().
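
The prevalence check can be reproduced by hand. The sketch below applies the same comparison to the counts from the three-type run shown earlier (10 dark, 6 blurry, 5 light out of 607 images). The helper split_by_prevalence is hypothetical and only illustrates the arithmetic; it is not how CleanVision implements the option:

```python
def split_by_prevalence(issue_counts, total_images, max_prevalence):
    """Separate issue types into reported ones and ones dropped as 'normal'."""
    reported, dropped = {}, {}
    for issue_type, count in issue_counts.items():
        prevalence = count / total_images
        # An issue whose prevalence exceeds the limit is treated as normal
        target = dropped if prevalence > max_prevalence else reported
        target[issue_type] = round(prevalence, 4)
    return reported, dropped


counts = {"dark": 10, "blurry": 6, "light": 5}
reported, dropped = split_by_prevalence(counts, 607, max_prevalence=0.015)

print(reported)  # {'blurry': 0.0099, 'light': 0.0082}
print(dropped)   # {'dark': 0.0165}
```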

1.4 Running CleanVision on the Torchvision dataset

CleanVision supports issue detection on Torchvision datasets, using the following code:

Preparing the dataset

from torchvision.datasets import CIFAR10
from torch.utils.data import ConcatDataset
from cleanvision import Imagelab

# Prepare the CIFAR10 dataset from torchvision
train_set = CIFAR10(root="./", download=True)
test_set = CIFAR10(root="./", train=False, download=True)

Files already downloaded and verified
Files already downloaded and verified

# View the number of samples in the training set and test set
len(train_set), len(test_set)

(50000, 10000)

If you want to merge the training and test sets, you can use the following code:

dataset = ConcatDataset([train_set, test_set])
len(dataset)

View image:

dataset[0][0]

png

Running CleanVision

To run CleanVision on a Torchvision dataset, simply pass it as the torchvision_dataset parameter when creating the Imagelab instance; the remaining steps are the same as when reading images from a folder:

imagelab = Imagelab(torchvision_dataset=dataset)
imagelab.find_issues()
# View the results
# imagelab.report()

Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...

Issue checks completed. 173 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().

# Summary of results
imagelab.issue_summary

	issue_type	num_images
0	blurry	118
1	near_duplicates	40
2	dark	11
3	light	3
4	low_information	1
5	grayscale	0
6	odd_aspect_ratio	0
7	odd_size	0
8	exact_duplicates	0

1.5 Running CleanVision on the Hugging Face dataset

CleanVision also supports issue detection on Hugging Face datasets (provided the dataset can be accessed), using the following code:

# The datasets library is used to download Hugging Face datasets
from datasets import load_dataset
from cleanvision import Imagelab
# Take /datasets/mah91/cat as an example.
# To download a Hugging Face dataset, set the path parameter to the part of its address that follows "datasets/".
# split selects the train or test portion; if no split is given, all of the data is returned
dataset = load_dataset(path="mah91/cat", split="train")

Repo card metadata block was not found. Setting CardData to empty.

# View the dataset, you can see that the dataset has 800 images, only the images are provided without annotations.
dataset

Dataset({
    features: ['image'],
    num_rows: 800
})

# features describes the columns in the dataset and the type of each column, e.g. image, audio
dataset.features

{'image': Image(mode=None, decode=True, id=None)}

Specify the hf_dataset parameter to load the hugging face dataset:

# Load the data into CleanVision; image_key names the column that holds the images
imagelab = Imagelab(hf_dataset=dataset, image_key="image")

The detection is run as follows:

imagelab.find_issues()
# Summary of results
imagelab.issue_summary

Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...

Issue checks completed. 4 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().

	issue_type	num_images
0	blurry	3
1	odd_size	1
2	dark	0
3	grayscale	0
4	light	0
5	low_information	0
6	odd_aspect_ratio	0
7	exact_duplicates	0
8	near_duplicates	0

2 Reference

CleanVision
CleanVision-docs
custom_issue_manager
imagehash