
Notes on Getting Started with Deep Learning - Using the DataLoader


How to use a Dataset?

Before introducing the DataLoader, you need to understand how a Dataset is used. PyTorch integrates many datasets that have already been processed: modules such as torchvision and torchtext ship with typical datasets that can be downloaded and used with a bit of configuration.

Taking the CIFAR10 dataset as an example, the documentation describes it very clearly. One thing to pay attention to is the transform parameter, which can be used to transform each image into the desired format; for example, passing ToTensor() converts the images from PIL format to tensor format.

import torchvision

# Prepare the test dataset
test_data = torchvision.datasets.CIFAR10("dataset", train=False, transform=torchvision.transforms.ToTensor(), download=True)
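To see what the transform changes, here is a small check (a sketch that assumes the test_data object defined above):

# Without a transform, indexing returns a PIL image; with ToTensor() it returns a tensor
raw_data = torchvision.datasets.CIFAR10("dataset", train=False, download=True)

img_pil, _ = raw_data[0]
img_tensor, _ = test_data[0]
print(type(img_pil))     # <class 'PIL.Image.Image'>
print(img_tensor.shape)  # torch.Size([3, 32, 32])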

What is DataLoader?

We can understand it this way: if a Dataset is a container that stores all the data (images, audio, ...), then a DataLoader is another container with a better storage scheme. It is divided into many small compartments, and you can decide how many samples from the dataset make up each compartment, whether the source dataset should be shuffled before the compartments are filled, and so on.
In other words, given a dataset, we can decide how data is taken from it for training: how much data to take at a time as one object when partitioning the dataset, whether to shuffle the dataset before partitioning it, etc. Iterating over the DataLoader yields the partitioned dataset one object (one batch) at a time, and each object combines as many samples as you configured.
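Here is a minimal sketch of that batching behavior, using a toy TensorDataset (the numbers are made up purely for illustration):

import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 toy samples with 10 toy labels
samples = torch.arange(10).float().unsqueeze(1)  # shape [10, 1]
labels = torch.arange(10)
dataset = TensorDataset(samples, labels)

# batch_size=4: each object yielded by the DataLoader combines 4 samples
loader = DataLoader(dataset, batch_size=4, shuffle=False)
for batch_samples, batch_labels in loader:
    print(batch_labels)
# tensor([0, 1, 2, 3])
# tensor([4, 5, 6, 7])
# tensor([8, 9])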

How to use DataLoader?

__getitem__ method

First you need to understand the __getitem__ method. __getitem__ is known as a magic method: when defining a class in Python, if you want to get a value out of the class by key (that is, with s[key]), the class needs a __getitem__ method. When an instance is indexed, Python automatically runs the contents of __getitem__ and returns the result.

class Fib():  # define class Fib
    def __init__(self, start=0, step=1):
        self.start = start
        self.step = step

    def __getitem__(self, key):  # key is the index used when subscripting the instance
        a = key + self.step
        return a  # when a value is taken by key, the value returned is a

s = Fib()
s[1]  # returns 2: because the class has a __getitem__ method, the value can be fetched directly by key

If there is no __getitem__ method, then there is no way to get a return value by key:

class Fib():  # define class Fib, this time without __getitem__
    def __init__(self, start=0, step=1):
        self.start = start
        self.step = step

s = Fib()
s[1]
# Returns: TypeError: 'Fib' object does not support indexing

Using the CIFAR10 dataset in PyTorch as an example, the __getitem__ method in its source code looks like this:

    def __getitem__(self, index: int) -> Tuple[Any, Any]:
        """
        Args:
            index (int): Index

        Returns:
            tuple: (image, target) where target is index of the target class.
        """
        img, target = self.data[index], self.targets[index]

        # doing this so that it is consistent with all other datasets
        # to return a PIL Image
        img = Image.fromarray(img)

        if self.transform is not None:
            img = self.transform(img)

        if self.target_transform is not None:
            target = self.target_transform(target)

        return img, target

It can be understood as follows: when the dataset object is indexed, i.e. given an index (the key of the class), the __getitem__ method is called automatically and returns img and target, where img is the image in the dataset at that index, and target is the index into the list of label classes, indicating what the label is.
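For example, indexing the dataset directly triggers __getitem__; the class-name list used below (test_data.classes) is part of the torchvision CIFAR10 API:

img, target = test_data[0]        # calls __getitem__(0) under the hood
print(target)                     # e.g. 3
print(test_data.classes[target])  # e.g. 'cat' -- the label name at that index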

DataLoader Syntax

You can see how to use the DataLoader in PyTorch's documentation.
Here are a few of the more commonly used parameters; a short sketch that uses them follows the list:

  • dataset: our dataset; we build the dataset object and pass it in.

  • batch_size: how many samples are taken from the dataset container at a time; the samples taken together form one object (one batch) in the DataLoader. If that is still unclear, consider an example:
    with batch_size=4, each object in the DataLoader combines 4 samples; with batch_size=64, each object combines 64 samples.

    It is also clear from this that the test dataset holds 10,000 images/objects, so after setting batch_size=4 there are 2,500 objects in the DataLoader.

  • shuffle: whether to shuffle the dataset on each pass; usually set to True. A good way to understand it: if set to True, the small objects produced the first time you iterate over the DataLoader are not the same as the small objects produced the second time, because the dataset is shuffled anew on each pass, so the resulting batches naturally differ.

  • num_workers: the number of worker subprocesses used to fetch data; 0 means the data is loaded in the main process only. Generally speaking, multi-process loading can report errors on Windows, in which case setting it to 0 works.

  • drop_last: when the dataset size is not evenly divisible by batch_size, whether to keep the last remaining partial batch. If True, the last part is discarded; if False, it is kept.

For example, with batch_size=64 on the 10,000-image test set, the data is fetched as 156 full batches plus a remainder of 16 images. If I set drop_last=False, the remainder from the last fetch is not discarded, so there are 157 fetches in total.

If I set drop_last=True, the last remaining part is discarded, and the number of fetches is one less: 156 (see the sketch after this list).
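Here is a short sketch putting these parameters together and checking the drop_last arithmetic (it assumes the 10,000-image test_data object from earlier):

from torch.utils.data import DataLoader

loader_keep = DataLoader(test_data, batch_size=64, shuffle=True, num_workers=0, drop_last=False)
loader_drop = DataLoader(test_data, batch_size=64, shuffle=True, num_workers=0, drop_last=True)

print(len(loader_keep))  # 157 -> 156 full batches plus one final batch of 16 images
print(len(loader_drop))  # 156 -> the final partial batch is discarded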

Using DataLoader

The code used is as follows:

import torchvision
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter

# Prepare the test dataset
test_data = torchvision.datasets.CIFAR10("dataset", train=False, transform=torchvision.transforms.ToTensor())
# Load it into a DataLoader
test_dataloader = DataLoader(dataset=test_data, batch_size=4, shuffle=True, num_workers=0, drop_last=True)
# Set up TensorBoard
writer = SummaryWriter("logs")
# step: the serial number of each batch of images in TensorBoard
step = 0
for data in test_dataloader:
    images, targets = data
    writer.add_images("test_03", images, step)
    step = step + 1
writer.close()
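A quick check of what one batch holds (a sketch that assumes the batch_size=4 loader above):

images, targets = next(iter(test_dataloader))
print(images.shape)  # torch.Size([4, 3, 32, 32]) -> 4 images per batch
print(targets)       # 4 label indices, e.g. tensor([3, 8, 0, 5])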

Then run tensorboard --logdir=logs in the terminal and open TensorBoard to visualize how the images were batched~