
[Image Processing] Clean image dataset based on CleanVision library

2024-10-24 10:42:09

CleanVision is an open source Python library that automatically detects common problems in image datasets that may affect machine learning projects. It is intended as a first step in computer vision projects, to identify and resolve problems in a dataset before any machine learning is applied. Its core functionality is detecting problematic images: exact duplicates, near duplicates, blurry images, low-information images, images that are too dark or too bright, grayscale images, irregular aspect ratios, and unusual sizes. The source code is in the CleanVision repository and the official documentation is CleanVision-docs; both are listed in the Reference section.

Install the basic version of CleanVision with:

pip install cleanvision

Install the full version with:

pip install "cleanvision[all]"

Check the installed CleanVision version:

# View version
import cleanvision
cleanvision.__version__
'0.3.6'

Versions of the other libraries required by the code in this article:

# for tabular display
import tabulate
# tabulate version needs to be 0.8.10 or higher.
tabulate.__version__
'0.9.0'
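
The version requirement above compares dotted version numbers, which cannot be compared safely as plain strings ('0.8.10' sorts before '0.8.9' when compared character by character). A minimal sketch of a numeric check follows. The helper names version_tuple and meets_minimum are hypothetical, and the sketch assumes plain X.Y.Z version strings without suffixes such as rc1:

```python
def version_tuple(version: str) -> tuple:
    """Turn a plain dotted version such as '0.8.10' into (0, 8, 10)."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if `installed` is at least `minimum`, compared numerically."""
    return version_tuple(installed) >= version_tuple(minimum)


print(meets_minimum("0.9.0", "0.8.10"))  # True
print("0.8.10" >= "0.8.9")               # False: string comparison is misleading
print(meets_minimum("0.8.10", "0.8.9"))  # True
```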

Contents
  • 1 Instructions for use
    • 1.1 Introduction to CleanVision Features
    • 1.2 Basic use
    • 1.3 Customized detection
    • 1.4 Running CleanVision on the Torchvision dataset
    • 1.5 Running CleanVision on the Hugging Face dataset
  • 2 Reference

1 Instructions for use

1.1 Introduction to CleanVision Features

CleanVision supports a wide range of image file formats and can detect the following types of data problems:

| Type of problem         | Description                                                              | Keyword          |
|:------------------------|:-------------------------------------------------------------------------|:-----------------|
| Exact duplicate         | Exactly the same image                                                   | exact_duplicates |
| Near duplicate          | Visually almost identical images                                         | near_duplicates  |
| Blurred                 | Blurred image details (out of focus)                                     | blurry           |
| Low information content | Images lacking content (small entropy of pixel values)                   | low_information  |
| Too dark                | Irregularly dark images (underexposed)                                   | dark             |
| Too bright              | Irregularly bright images (overexposed)                                  | light            |
| Grayscale               | Images lacking color                                                     | grayscale        |
| Abnormal aspect ratio   | Images with abnormal aspect ratios                                       | odd_aspect_ratio |
| Abnormal size           | Images with unusual dimensions compared to other images in the dataset   | odd_size         |

CleanVision detects the problems in the table above with a variety of statistical methods. The keyword column gives the name used for each problem type in CleanVision code. CleanVision is compatible with Linux, macOS, and Windows, and runs on Python 3.7 and above.
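
To illustrate what "small entropy of pixel values" means for the low_information type, the sketch below computes the Shannon entropy of a list of 8-bit grayscale pixel values. It is a simplified illustration and not CleanVision's actual implementation; the function name pixel_entropy and the sample pixel lists are made up for this example:

```python
import math
from collections import Counter


def pixel_entropy(pixels):
    """Shannon entropy (in bits) of a sequence of pixel intensities."""
    counts = Counter(pixels)
    total = len(pixels)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


flat_image = [128] * 64                 # a single uniform gray value
two_tone_image = [0] * 32 + [255] * 32  # half black, half white
varied_image = list(range(64))          # 64 distinct intensities

print(pixel_entropy(flat_image))      # 0.0
print(pixel_entropy(two_tone_image))  # 1.0
print(pixel_entropy(varied_image))    # 6.0
```

A uniform image carries no information (entropy 0), so images like it are the ones a low-information check is meant to surface.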

1.2 Basic use

This section describes how to run problem detection on the images in a folder. The following example checks a folder containing 607 images. During detection, CleanVision automatically uses multiple processes to speed up processing:


from cleanvision import Imagelab

# Example data: https://cleanlab-public./CleanVision/image_files.zip
# Read the example image
dataset_path = "./image_files/"

# Instantiate the Imagelab class for subsequent processing
imagelab = Imagelab(data_path=dataset_path)

# find_issues uses multiprocessing; n_jobs sets the number of processes
# n_jobs defaults to None, which means the number of processes is determined automatically
# The image properties of each image are checked first;
# duplicates are detected after all images have been processed
imagelab.find_issues(verbose=False, n_jobs=2)
Reading images from D:/cleanvision/image_files

If you run CleanVision on Windows, put the relevant code inside a main guard (if '__main__' == __name__:) so that the multiprocessing module loads correctly. Alternatively, set n_jobs to 1 to use a single process:

from cleanvision import Imagelab

if '__main__' == __name__:
    # sample data:https://cleanlab-public./CleanVision/image_files.zip
    # Read sample image
    dataset_path = "./image_files/"
    imagelab = Imagelab(data_path=dataset_path)
    imagelab.find_issues(verbose=False)
Reading images from D:/cleanvision/image_files

The report function reports the number of images for each problem type in the dataset and shows the images of the most severe instances of each problem type:

imagelab.report()
Issues found in images in order of severity in the dataset

|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  0 | odd_size         |          109 |
|  1 | grayscale        |           20 |
|  2 | near_duplicates  |           20 |
|  3 | exact_duplicates |           19 |
|  4 | odd_aspect_ratio |           11 |
|  5 | dark             |           10 |
|  6 | blurry           |            6 |
|  7 | light            |            5 |
|  8 | low_information  |            5 | 

--------------------- odd_size images ----------------------

Number of examples with this issue: 109
Examples representing most severe instances of this issue:

png

--------------------- grayscale images ---------------------

Number of examples with this issue: 20
Examples representing most severe instances of this issue:

png

------------------ near_duplicates images ------------------

Number of examples with this issue: 20
Examples representing most severe instances of this issue:

Set: 0

png

Set: 1

png

Set: 2

png

Set: 3

png

----------------- exact_duplicates images ------------------

Number of examples with this issue: 19
Examples representing most severe instances of this issue:

Set: 0

png

Set: 1

png

Set: 2

png

Set: 3

png

----------------- odd_aspect_ratio images ------------------

Number of examples with this issue: 11
Examples representing most severe instances of this issue:

png

----------------------- dark images ------------------------

Number of examples with this issue: 10
Examples representing most severe instances of this issue:

png

---------------------- blurry images -----------------------

Number of examples with this issue: 6
Examples representing most severe instances of this issue:

png

----------------------- light images -----------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

png

------------------ low_information images ------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

png

To create a custom issue type, refer to custom_issue_manager (listed in the Reference section).

The main way to interact with the results is through the Imagelab class. It can be used to understand the issues in the dataset at the macro level (a global overview) and at the micro level (issues and quality scores for each image). It has three main attributes:

  • Imagelab.issue_summary: summary of issues
  • Imagelab.issues: list of issues, with quality scores for each image
  • Imagelab.info: dataset information, including information on similar images

Analyzing the detection results

The issue_summary attribute shows the number of images in the dataset for each issue type:

# return the result as a dataframe in pandas
res = imagelab.issue_summary
type(res)

View summary results:

res
         issue_type  num_images
0          odd_size         109
1         grayscale          20
2   near_duplicates          20
3  exact_duplicates          19
4  odd_aspect_ratio          11
5              dark          10
6            blurry           6
7             light           5
8   low_information           5

The issues attribute allows you to display the quality scores of various issues and their presence in each image. These quality scores range from 0 to 1, with lower scores indicating a higher severity of issues:

imagelab.issues.head()
odd_size_score is_odd_size_issue odd_aspect_ratio_score is_odd_aspect_ratio_issue low_information_score is_low_information_issue light_score is_light_issue grayscale_score is_grayscale_issue dark_score is_dark_issue blurry_score is_blurry_issue exact_duplicates_score is_exact_duplicates_issue near_duplicates_score is_near_duplicates_issue
D:/cleanvision/image_files/image_0.png 1.0 False 1.0 False 0.806332 False 0.925490 False 1 False 1.000000 False 0.980373 False 1.0 False 1.0 False
D:/cleanvision/image_files/image_1.png 1.0 False 1.0 False 0.923116 False 0.906609 False 1 False 0.990676 False 0.472314 False 1.0 False 1.0 False
D:/cleanvision/image_files/image_10.png 1.0 False 1.0 False 0.875129 False 0.995127 False 1 False 0.795937 False 0.470706 False 1.0 False 1.0 False
D:/cleanvision/image_files/image_100.png 1.0 False 1.0 False 0.916140 False 0.889762 False 1 False 0.827587 False 0.441195 False 1.0 False 1.0 False
D:/cleanvision/image_files/image_101.png 1.0 False 1.0 False 0.779338 False 0.960784 False 0 True 0.992157 False 0.507767 False 1.0 False 1.0 False

Since a Pandas data table is returned, it can be filtered for specific types of data:

# The smaller the score, the more severe the issue
dark_images = imagelab.issues[imagelab.issues["is_dark_issue"] == True].sort_values(
    by=["dark_score"]
)
# The DataFrame index holds the image paths
dark_images_files = dark_images.index.tolist()
dark_images_files
['D:/cleanvision/image_files/image_417.png',
 'D:/cleanvision/image_files/image_350.png',
 'D:/cleanvision/image_files/image_605.png',
 'D:/cleanvision/image_files/image_177.png',
 'D:/cleanvision/image_files/image_346.png',
 'D:/cleanvision/image_files/image_198.png',
 'D:/cleanvision/image_files/image_204.png',
 'D:/cleanvision/image_files/image_485.png',
 'D:/cleanvision/image_files/image_457.png',
 'D:/cleanvision/image_files/image_576.png']
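
Once a list of flagged paths such as dark_images_files is available, one way to clean the dataset is to move those files out of the image folder rather than deleting them. The following is a minimal sketch using only the standard library. The quarantine helper and the folder names are hypothetical, and the demo runs on throwaway files in a temporary directory instead of the real dataset:

```python
import shutil
import tempfile
from pathlib import Path


def quarantine(files, target_dir):
    """Move each existing file in `files` into `target_dir`; return the new paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in files:
        source = Path(name)
        if source.is_file():
            destination = target / source.name
            shutil.move(str(source), str(destination))
            moved.append(destination)
    return moved


# Demo on temporary placeholder files
workdir = Path(tempfile.mkdtemp())
image_dir = workdir / "image_files"
image_dir.mkdir()
for i in range(3):
    (image_dir / f"image_{i}.png").write_bytes(b"placeholder")

flagged = [str(image_dir / "image_1.png")]
moved = quarantine(flagged, workdir / "flagged_dark")
remaining = sorted(p.name for p in image_dir.iterdir())
print(remaining)  # ['image_0.png', 'image_2.png']
shutil.rmtree(workdir)
```

Moving files keeps the step reversible. Note that this sketch keys on the file name only, so flagged files with the same name from different folders would overwrite each other.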

Visualize some of the flagged images:

imagelab.visualize(image_files=dark_images_files[:4])

png

A cleaner way to accomplish the above is to pass the issue_types parameter directly to the visualize function, which displays the images with a particular issue sorted by severity:

# issue_types: issue types, num_images: number of images to display, cell_size: size of images in each grid
imagelab.visualize(issue_types=["low_information"], num_images=3, cell_size=(3, 3))

png

View image information and similar images

The info attribute allows you to view information about the dataset:

# View the available keys
imagelab.info.keys()
dict_keys(['statistics', 'dark', 'light', 'odd_aspect_ratio', 'low_information', 'blurry', 'grayscale', 'odd_size', 'exact_duplicates', 'near_duplicates'])
# View the available statistics
imagelab.info["statistics"].keys()
dict_keys(['brightness', 'aspect_ratio', 'entropy', 'blurriness', 'color_space', 'size'])
# View the size statistics for the dataset
imagelab.info["statistics"]["size"]
count     607.000000
mean      280.830152
std       215.001908
min        32.000000
25%       256.000000
50%       256.000000
75%       256.000000
max      4666.050578
Name: size, dtype: float64

View the number of sets of exact duplicates in the dataset:

imagelab.info["exact_duplicates"]["num_sets"]
9

View the sets of near-duplicate images in the dataset:

imagelab.info["near_duplicates"]["sets"]
[['D:/cleanvision/image_files/image_103.png',
  'D:/cleanvision/image_files/image_408.png'],
 ['D:/cleanvision/image_files/image_109.png',
  'D:/cleanvision/image_files/image_329.png'],
 ['D:/cleanvision/image_files/image_119.png',
  'D:/cleanvision/image_files/image_250.png'],
 ['D:/cleanvision/image_files/image_140.png',
  'D:/cleanvision/image_files/image_538.png'],
 ['D:/cleanvision/image_files/image_25.png',
  'D:/cleanvision/image_files/image_357.png'],
 ['D:/cleanvision/image_files/image_255.png',
  'D:/cleanvision/image_files/image_43.png'],
 ['D:/cleanvision/image_files/image_263.png',
  'D:/cleanvision/image_files/image_486.png'],
 ['D:/cleanvision/image_files/image_3.png',
  'D:/cleanvision/image_files/image_64.png'],
 ['D:/cleanvision/image_files/image_389.png',
  'D:/cleanvision/image_files/image_426.png'],
 ['D:/cleanvision/image_files/image_52.png',
  'D:/cleanvision/image_files/image_66.png']]
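
The nested list above has one inner list per set of similar images. A common follow-up is to keep one image from each set and mark the rest for removal. The sketch below assumes that list-of-lists structure; the helper name files_to_drop is hypothetical, the file names are shortened, and the three-image set is made up to show that sets may contain more than two images:

```python
def files_to_drop(duplicate_sets):
    """Keep the first file of every set and return the rest, in order."""
    drop = []
    for group in duplicate_sets:
        drop.extend(group[1:])
    return drop


near_duplicate_sets = [
    ["image_103.png", "image_408.png"],
    ["image_109.png", "image_329.png"],
    ["image_3.png", "image_64.png", "image_70.png"],  # made-up three-image set
]
print(files_to_drop(near_duplicate_sets))
# ['image_408.png', 'image_329.png', 'image_64.png', 'image_70.png']
```

Keeping the first entry is an arbitrary choice; for near duplicates it may be preferable to inspect each set and keep the best-quality image.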

1.3 Customized detection

Specify the type of detection

from cleanvision import Imagelab

# sample data:https://cleanlab-public./CleanVision/image_files.zip
dataset_path = "./image_files/"

# Specify the type of detection
issue_types = {"blurry":{}, "dark": {}}

imagelab = Imagelab(data_path=dataset_path)

imagelab.find_issues(issue_types=issue_types, verbose=False)
imagelab.report()
Reading images from D:/cleanvision/image_files


Issues found in images in order of severity in the dataset

|    | issue_type   |   num_images |
|---:|:-------------|-------------:|
|  0 | dark         |           10 |
|  1 | blurry       |            6 | 

----------------------- dark images ------------------------

Number of examples with this issue: 10
Examples representing most severe instances of this issue:

png

---------------------- blurry images -----------------------

Number of examples with this issue: 6
Examples representing most severe instances of this issue:

png

If find_issues has already been run, running it again with a new issue type merges the new results with the previous ones:

issue_types = {"light": {}}
imagelab.find_issues(issue_types)
# Report the results of the three types
imagelab.report()
Checking for light images ...
Issue checks completed. 21 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
Issues found in images in order of severity in the dataset

|    | issue_type   |   num_images |
|---:|:-------------|-------------:|
|  0 | dark         |           10 |
|  1 | blurry       |            6 |
|  2 | light        |            5 | 

----------------------- dark images ------------------------

Number of examples with this issue: 10
Examples representing most severe instances of this issue:

png

---------------------- blurry images -----------------------

Number of examples with this issue: 6
Examples representing most severe instances of this issue:

png

----------------------- light images -----------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

png

Saving results

The following code shows how to save and load the results. When loading, the data path and the dataset must be the same as when the results were saved:

save_path = "./results"
# Save results
# force indicates whether to overwrite existing files
imagelab.save(save_path, force=True)
# Load the results
imagelab = Imagelab.load(save_path, dataset_path)
Successfully loaded Imagelab

Threshold setting

CleanVision controls each check through hyperparameters. exact_duplicates and near_duplicates are detected using image hashes (computed with the imagehash library), while the other checks use a threshold between 0 and 1. If an image scores below the threshold for a particular issue type, it is considered to have that issue, so the higher the threshold, the more images are flagged. The hyperparameters are listed below:

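|    | Keyword          | Hyperparameters                                                   |
|---:|:-----------------|:------------------------------------------------------------------|
|  1 | light            | threshold                                                         |
|  2 | dark             | threshold                                                         |
|  3 | odd_aspect_ratio | threshold                                                         |
|  4 | exact_duplicates | N/A                                                               |
|  5 | near_duplicates  | hash_size (int), hash_types (whash, phash, ahash, dhash, chash)   |
|  6 | blurry           | threshold                                                         |
|  7 | grayscale        | threshold                                                         |
|  8 | low_information  | threshold                                                         |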
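
To give an intuition for hash-based duplicate detection, the sketch below implements a simplified average hash (the idea behind the ahash option) on small grayscale grids and compares hashes by Hamming distance. It is an illustration only, not the imagehash implementation, which also resizes the image to hash_size x hash_size first; the function names and pixel values are made up:

```python
def average_hash(pixels):
    """Simplified average hash: one bit per pixel, set when the pixel is above the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Number of positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


image = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [10, 20, 200, 210],
    [15, 25, 205, 215],
]
brighter_copy = [[value + 5 for value in row] for row in image]
mirrored = [list(reversed(row)) for row in image]

print(hamming_distance(average_hash(image), average_hash(brighter_copy)))  # 0
print(hamming_distance(average_hash(image), average_hash(mirrored)))       # 16
```

A slightly brightened copy produces the same hash, which is why perceptual hashes can find near duplicates, while a structurally different image produces a distant hash.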

For a single detection type, the threshold setting code is as follows:

imagelab = Imagelab(data_path=dataset_path)
issue_types = {"dark": {"threshold": 0.5}}
imagelab.find_issues(issue_types)

imagelab.report()
Reading images from D:/cleanvision/image_files
Checking for dark images ...

Issue checks completed. 20 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
Issues found in images in order of severity in the dataset

|    | issue_type   |   num_images |
|---:|:-------------|-------------:|
|  0 | dark         |           20 | 

----------------------- dark images ------------------------

Number of examples with this issue: 20
Examples representing most severe instances of this issue:

png
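
The effect of raising a threshold can be sketched in a few lines: an image is flagged when its score is below the threshold, so a higher threshold flags more images. The scores, file names, threshold values, and the flag_issues helper below are all made up for illustration and are not CleanVision defaults:

```python
def flag_issues(scores, threshold):
    """Return the names whose score is below `threshold`, most severe first."""
    flagged = [name for name, score in scores.items() if score < threshold]
    return sorted(flagged, key=lambda name: scores[name])


# Made-up dark scores for four hypothetical images
dark_scores = {"a.png": 0.12, "b.png": 0.28, "c.png": 0.45, "d.png": 0.93}

print(flag_issues(dark_scores, threshold=0.3))  # ['a.png', 'b.png']
print(flag_issues(dark_scores, threshold=0.5))  # ['a.png', 'b.png', 'c.png']
```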

If a certain type of issue is normal for a dataset, such as dark images in astronomical datasets, a maximum prevalence (max_prevalence) can be set. If the fraction of images with a particular issue exceeds max_prevalence, that issue is considered normal for the dataset and is not reported. With the default threshold, 10 of the 607 images have the dark issue, a fraction of about 0.016 (the 20 images flagged with a threshold of 0.5 give about 0.033). If max_prevalence is set to 0.015, the dark issue is therefore no longer reported:

imagelab.report(max_prevalence=0.015)
Removing dark from potential issues in the dataset as it exceeds max_prevalence=0.015 
Please specify some issue_types to check for in imagelab.find_issues().
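
The max_prevalence rule described above amounts to a simple ratio check. The sketch below reproduces the arithmetic with the counts reported earlier in this article (dark 10, blurry 6, light 5 out of 607 images). The helper name reportable_issues is hypothetical and this is not CleanVision's internal code:

```python
def reportable_issues(issue_counts, num_images, max_prevalence):
    """Keep only issue types whose share of the dataset is at most max_prevalence."""
    return {
        issue: count
        for issue, count in issue_counts.items()
        if count / num_images <= max_prevalence
    }


counts = {"dark": 10, "blurry": 6, "light": 5}
print(round(10 / 607, 4))                     # 0.0165
print(reportable_issues(counts, 607, 0.015))  # {'blurry': 6, 'light': 5}
```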

1.4 Running CleanVision on the Torchvision dataset

CleanVision supports problem detection on Torchvision datasets with the following code:

Preparing the dataset

from torchvision.datasets import CIFAR10
from torch.utils.data import ConcatDataset
from cleanvision import Imagelab

# Prepare the CIFAR10 dataset from torchvision
train_set = CIFAR10(root="./", download=True)
test_set = CIFAR10(root="./", train=False, download=True)
Files already downloaded and verified
Files already downloaded and verified
# View the number of samples in the training set and test set
len(train_set), len(test_set)
(50000, 10000)

If you want to merge the training and test sets, you can use the following code:

dataset = ConcatDataset([train_set, test_set])
len(dataset)
60000
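
Conceptually, concatenating datasets means mapping one global index onto the right underlying dataset without copying any data. The pure-Python sketch below illustrates that idea with cumulative sizes and a binary search. SimpleConcat is a made-up class for illustration and is not the torch implementation:

```python
import bisect


class SimpleConcat:
    """Present several sequences as one, without copying their items."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        self.cumulative_sizes = []
        total = 0
        for dataset in self.datasets:
            total += len(dataset)
            self.cumulative_sizes.append(total)

    def __len__(self):
        return self.cumulative_sizes[-1] if self.cumulative_sizes else 0

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find the first dataset whose cumulative size exceeds the index
        dataset_idx = bisect.bisect_right(self.cumulative_sizes, index)
        start = self.cumulative_sizes[dataset_idx - 1] if dataset_idx else 0
        return self.datasets[dataset_idx][index - start]


train = ["train_0", "train_1", "train_2"]
test = ["test_0", "test_1"]
combined = SimpleConcat([train, test])
print(len(combined))  # 5
print(combined[3])    # test_0
```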

View image:

dataset[0][0]

png

Running CleanVision

To run CleanVision on a Torchvision dataset, simply pass the torchvision_dataset parameter when creating the Imagelab instance. The subsequent steps are the same as for reading images from a folder:

imagelab = Imagelab(torchvision_dataset=dataset)
imagelab.find_issues()
# View the results
# imagelab.report()
Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...

Issue checks completed. 173 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
# Summary of results
imagelab.issue_summary
         issue_type  num_images
0            blurry         118
1   near_duplicates          40
2              dark          11
3             light           3
4   low_information           1
5         grayscale           0
6  odd_aspect_ratio           0
7          odd_size           0
8  exact_duplicates           0

1.5 Running CleanVision on the Hugging Face dataset

CleanVision also supports problem detection on Hugging Face datasets, with the following code:

# The datasets library is used to download Hugging Face datasets
from datasets import load_dataset
from cleanvision import Imagelab
# Take /datasets/mah91/cat as an example.
# To download a particular Hugging Face dataset, set the path parameter to the part of its address after "datasets/".
# split selects the train or test portion; if no split is given, the complete data is returned
dataset = load_dataset(path="mah91/cat", split="train")
Repo card metadata block was not found. Setting CardData to empty.
# View the dataset, you can see that the dataset has 800 images, only the images are provided without annotations.
dataset
Dataset({
    features: ['image'],
    num_rows: 800
})
# features describes the columns in the dataset and the type of each column, e.g. image, audio
dataset.features
{'image': Image(mode=None, decode=True, id=None)}

Specify the hf_dataset parameter to load the hugging face dataset:

# Load data into CleanVision, image_key specifies data containing 'image'
imagelab = Imagelab(hf_dataset=dataset, image_key="image")

The code to perform the test is as follows:

imagelab.find_issues()
# Summary of results
imagelab.issue_summary
Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...

Issue checks completed. 4 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
         issue_type  num_images
0            blurry           3
1          odd_size           1
2              dark           0
3         grayscale           0
4             light           0
5   low_information           0
6  odd_aspect_ratio           0
7  exact_duplicates           0
8   near_duplicates           0

2 Reference

  • CleanVision
  • CleanVision-docs
  • custom_issue_manager
  • imagehash