
The Beauty of Data in Artificial Intelligence Model Training - Exploring TFRecord


Preface: In the training of AI models, efficiently managing and processing large amounts of data is an important topic. TensorFlow's TFRecord format provides a flexible and efficient solution for large-scale data storage and processing. In this section, we will introduce how to use TFRecord together with TensorFlow's Dataset API for data extraction, transformation, and loading (ETL) to better support the training and optimization of AI models. With TFRecord, you can store raw data in a lightweight, easy-to-handle binary format, which brings significant performance gains when loading and parsing large-scale datasets. We will explore in detail how to build an ETL data pipeline that makes data loading and model training smoother and faster through parallel processing, batching, and prefetching. This kind of data processing not only performs well on a single machine, but also scales to multi-core CPUs, GPUs, or TPUs for larger-scale AI model training. Whether you are working with images, text, or other types of large-scale data, understanding and mastering TFRecord and its optimization techniques will lay the foundation for building efficient data pipelines and enable you to train AI models faster and smarter.

Understanding TFRecord

When you use TFDS, the data is downloaded and cached to disk so you don't have to re-download it each time. TFDS uses the TFRecord format for this cache. You can see this if you watch the download process closely; for example, Figure 4-1 shows how the cnn_dailymail dataset is downloaded, shuffled, and written to a TFRecord file.

                        Figure 4-1. Downloading the cnn_dailymail dataset as a TFRecord file
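If you want to see this for yourself, the cache is created simply by loading the dataset. Below is a minimal sketch; depending on your TFDS version you may need to pin a dataset version or config, but the first call downloads cnn_dailymail, writes the TFRecord cache, and later calls reuse it:

import tensorflow_datasets as tfds

# The first call downloads the data and writes TFRecord shards to the local cache;
# subsequent calls read straight from those cached files.
data, info = tfds.load("cnn_dailymail", with_info=True)
print(info)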

In TensorFlow, TFRecord is the preferred format for storing and retrieving large amounts of data. It is a very simple file structure that is read sequentially to improve performance. On disk, the file structure is relatively straightforward: each record consists of an integer giving the length of the record, a cyclic redundancy check (CRC) of that length, a byte array of data, and a CRC of that byte array. The records are concatenated into a single file, or, if the dataset is large, sharded across several files.

                      For example, Figure 4-2 shows that the training set for cnn_dailymail was split into 16 files after downloading.
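To make the record structure more concrete, here is a minimal sketch (not how TFDS itself writes its cache, just an illustration of the format) that serializes one image/label pair into a tf.train.Example and appends it to a TFRecord file; the writer takes care of the length and CRC fields described above. The file name and the fake image bytes are only placeholders:

import tensorflow as tf

# Illustrative only: placeholder PNG bytes and a label of 2, echoing the MNIST example below.
example = tf.train.Example(features=tf.train.Features(feature={
    'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'<png bytes here>'])),
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[2])),
}))

# Each call to write() appends one length-prefixed, CRC-checked record.
with tf.io.TFRecordWriter('example.tfrecord') as writer:
    writer.write(example.SerializeToString())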

To visualize a simple example more closely, download the MNIST dataset and print its information:

import tensorflow_datasets as tfds

data, info = tfds.load("mnist", with_info=True)
print(info)

In the info, you will see that its features are stored like this:

features=FeaturesDict({
    'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
}),

Similar to the CNN/DailyMail example, the files are downloaded to the /root/tensorflow_datasets/mnist/<version>/ directory.
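A quick way to confirm where the shards ended up is to list that directory. The path below assumes the default TFDS data directory on Colab/Linux and the 3.0.0 version used in the next snippet; adjust it for your machine:

import os

mnist_dir = "/root/tensorflow_datasets/mnist/3.0.0"
# Lists the TFRecord shards plus the dataset_info.json metadata files.
print(os.listdir(mnist_dir))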

You can load the original record as a TFRecordDataset like this:

filename="/root/tensorflow_datasets/mnist/3.0.0/-00000-of-00001"

raw_dataset = (filename)

for raw_record in raw_dataset.take(1):

print(repr(raw_record))

Please note that the location of the filename may vary depending on your operating system.

<tf.Tensor: shape=(), dtype=string, numpy=b"\n\x85\x03\n\xf2\x02\n\x05image\x12\xe8\x02\n\xe5\x02\n\xe2\x02\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x1c\x00\x00\x00\x1c\x08\x00\x00\x00\x00Wf\x80H\x00\x00\x01)IDAT(\x91\xc5\xd2\xbdK\xc3P\x14\x05\xf0S(v\x13)\x04,.\x82\xc5Aq\xac\xedb\x1d\xdc\n.\x12\x87n\x0e\x82\x93\x7f@Q\xb2\x08\xba\tbQ0.\xe2\xe2\xd4\xb1\xa2h\x9c\x82\xba\x8a(\nq\xf0\x83Fh\x95\n6\x88\xe7R\x87\x88\xf9\xa8Y\xf5\x0e\x8f\xc7\xfd\xdd\x0b\x87\xc7\x03\xfe\xbeb\x9d\xadT\x927Q\xe3\xe9\x07:\xab\xbf\xf4\xf3\xcf\xf6\x8a\xd9\x14\xd29\xea\xb0\x1eKH\xde\xab\xea%\xaba\x1b=\xa4P/\xf5\x02\xd7\\x07\x00\xc4=,L\xc0,>\x01@2\xf6\x12\xde\x9c\xde[t/\xb3\x0e\x87\xa2\xe2\xc2\xe0A<\xca\xb26\xd5(\x1b\xa9\xd3\xe8\x0e\xf5\x86\x17\xceE\xdarV\xae\xb7_\xf3AR\r!I\xf7(\x06m\xaaE\xbb\xb6\xac\r*\x9b$e<\xb8\xd7\xa2\x0e\x00\xd0l\x92\xb2\xd5\x15\xcc\xae'\x00\xf4m\x08O'+\xc2y\x9f\x8d\xc9\x15\x80\xfe\x99[q\x962@CN|i\xf7\xa9!=\xd7\xab\x19\x00\xc8\xd6\xb8\xeb\xa1\xf0\xd8l\xca\xfb]\xee\xfb]\x9fV\xe1\x07\xb7\xc9\x8b55\xe7M\xef\xb0\x04\xc0\xfd&\x89\x01<\xbe\xf9\x03\x8a\xf5\x81\x7f\xaa/2y\x87ks\xec\x1e\xc1\x00\x00\x00\x00IEND\xaeB`\x82\n\x0e\n\x05label\x12\x05\x1a\x03\n\x01\x02">

It is a long string containing the details of the record, which also includes things like checksums. But if we already know the features, we can create a feature description and use it to parse the data. The code is as follows:

# Create a feature description
feature_description = {
    'image': tf.io.FixedLenFeature([], dtype=tf.string),
    'label': tf.io.FixedLenFeature([], dtype=tf.int64),
}

def _parse_function(example_proto):
    # Parse the input proto using the dictionary above
    return tf.io.parse_single_example(example_proto, feature_description)

parsed_dataset = raw_dataset.map(_parse_function)

for parsed_record in parsed_dataset.take(1):
    print(parsed_record)

This makes the output much friendlier! First, you can see that the image is a Tensor containing a PNG. PNG is a compressed image format whose header is defined by IHDR and whose image data lies between IDAT and IEND; if you look closely at the byte stream, you can spot them. There is also the label, which is stored as an integer and contains the value 2:

{

'image': <tf.Tensor: shape=(), dtype=string,

numpy=b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x1c\x00\x00\x00\x1c\x08\x00\x00\x00\x00Wf\x80H\x00\x00\x01)IDAT(\x91\xc5\xd2\xbdK\xc3P\x14\x05\xf0S(v\x13)\x04,.\x82\xc5Aq\xac\xedb\x1d\xdc\n.\x12\x87n\x0e\x82\x93\x7f@Q\xb2\x08\xba\tbQ0.\xe2\xe2\xd4\xb1\xa2h\x9c\x82\xba\x8a(\nq\xf0\x83Fh\x95\n6\x88\xe7R\x87\x88\xf9\xa8Y\xf5\x0e\x8f\xc7\xfd\xdd\x0b\x87\xc7\x03\xfe\xbeb\x9d\xadT\x927Q\xe3\xe9\x07:\xab\xbf\xf4\xf3\xcf\xf6\x8a\xd9\x14\xd29\xea\xb0\x1eKH\xde\xab\xea%\xaba\x1b=\xa4P/\xf5\x02\xd7\\x07\x00\xc4=,L\xc0,>\x01@2\xf6\x12\xde\x9c\xde[t/\xb3\x0e\x87\xa2\xe2\xc2\xe0A<\xca\xb26\xd5(\x1b\xa9\xd3\xe8\x0e\xf5\x86\x17\xceE\xdarV\xae\xb7_\xf3AR\r!I\xf7(\x06m\xaaE\xbb\xb6\xac\r*\x9b$e<\xb8\xd7\xa2\x0e\x00\xd0l\x92\xb2\xd5\x15\xcc\xae'\x00\xf4m\x08O'+\xc2y\x9f\x8d\xc9\x15\x80\xfe\x99[q\x962@CN|i\xf7\xa9!=\xd7

\xab\x19\x00\xc8\xd6\xb8\xeb\xa1\xf0\xd8l\xca\xfb]\xee\xfb]\x9fV\xe1\x07\xb7\xc9\x8b55\xe7M\xef\xb0\x04\xc0\xfd&\x89\x01<\xbe\xf9\x03\x8a\xf5\x81\x7f\xaa/2y\x87ks\xec\x1e\xc1\x00\x00\x00\x00IEND\xaeB`\x82">,

'label': <tf.Tensor: shape=(), dtype=int64, numpy=2>

}

From here, you can read the original TFRecord and decode it into a PNG using a PNG decoding library like Pillow.
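For instance, here is a minimal sketch, assuming the parsed_record produced by the code above and that Pillow is installed, that pulls the PNG bytes out of the parsed record and decodes them:

import io
from PIL import Image

# parsed_record['image'] is a scalar string tensor holding the raw PNG bytes.
png_bytes = parsed_record['image'].numpy()
img = Image.open(io.BytesIO(png_bytes))
print(img.size, img.mode)   # expect (28, 28) and a grayscale mode
img.save('mnist_sample.png')

Alternatively, TensorFlow can decode the same bytes directly with tf.io.decode_png(parsed_record['image']).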

ETL process for managing data in TensorFlow

Regardless of scale, ETL is the core pattern TensorFlow uses for training. Here we explore small-scale, single-machine model building, but the same techniques can be used for large-scale training across multiple machines with massive datasets.

The extraction phase of the ETL process involves loading the raw data from a storage location and preparing it into a form that can be transformed. The transformation phase is the manipulation of the data to make it suitable or optimized for training. For example, logic such as batching the data, image enhancement, mapping to feature columns, etc., can be counted as part of the transformation phase. The loading phase, on the other hand, involves loading the data into the neural network for training.

Let's take a look at the full code used to train the "Horses or Humans" classifier. Here I have added comments to show where the extraction, transformation, and loading phases are located:

import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_addons as tfa

# Model definition start
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(300, 300, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
# Model definition end

# Extraction phase start
data = tfds.load('horses_or_humans', split='train', as_supervised=True)
val_data = tfds.load('horses_or_humans', split='test', as_supervised=True)
# Extraction phase end

# Transformation phase start
def augmentimages(image, label):
    image = tf.cast(image, tf.float32)
    image = (image/255)
    image = tf.image.random_flip_left_right(image)
    image = tfa.image.rotate(image, 40, interpolation='NEAREST')
    return image, label

train = data.map(augmentimages)
train_batches = train.shuffle(100).batch(32)
validation_batches = val_data.batch(32)
# Transformation phase end

# Loading phase start
history = model.fit(train_batches, epochs=10, validation_data=validation_batches, validation_steps=1)
# Loading phase end

With a process like this, your data pipeline is less affected by changes in the data or the underlying schema. When you use TFDS to extract data, the same underlying structure applies whether the data is small enough to fit in memory or so large that it cannot fit on a single machine. The APIs used for transformation are also consistent, so you can use similar ones no matter what the underlying data source is. And, of course, once the data has been transformed, the process of loading it is the same whether you are training on a single CPU, a GPU, a cluster of GPUs, or even a cluster of TPUs.

However, the way the data is loaded can have a huge impact on training speed. Next, let's look at how to optimize the loading phase.

Optimize the loading phase

When training a model, it helps to look a little more closely at the extract-transform-load (ETL) process. Extraction and transformation can run on any processor, including the CPU; in fact, the code in these phases performs tasks such as downloading data, decompressing it, and processing it record by record, which is not where GPUs or TPUs shine, so this code typically runs on the CPU. During the training phase, however, a GPU or TPU can significantly improve performance, so it makes sense to use one for this phase if you can.

Therefore, in the presence of a GPU or TPU, it is ideal to split the workload between the CPU and the GPU/TPU: extraction and transformation are done on the CPU, while loading is done on the GPU/TPU.

Suppose you are working with a large dataset. Because of its volume, the data must be prepared in batches (that is, extracted and transformed batch by batch), which leads to the situation shown in Figure 4-3. While the first batch is being prepared, the GPU/TPU is idle. When that batch is ready, it is sent to the GPU/TPU for training, but now the CPU sits idle until training finishes and it can start preparing the second batch. There is a lot of idle time here, so we can see room for optimization.

            Figure 4-3. Training on the CPU/GPU

The logical solution is to process in parallel, allowing data preparation and training to occur simultaneously. This process is called pipelining; see Figure 4-4.

            Figure 4-4. Pipelining

In this case, while the CPU is preparing the first batch, the GPU/TPU has nothing to work on and is therefore idle. When the first batch is ready, the GPU/TPU can start training, and at the same time the CPU starts preparing the second batch. Of course, the time needed to train batch n-1 and the time needed to prepare batch n are not always the same. If training is faster, the GPU/TPU will be idle for a while; if it is slower, the CPU will be. Choosing the right batch size can help optimize performance here, and since GPU/TPU time tends to be more expensive, you will probably want to minimize its idle time.
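In tf.data terms, this overlap is what prefetch gives you: while the accelerator trains on the current batch, the input pipeline prepares the next one in the background. Here is a minimal sketch with a toy in-memory dataset (the model is a placeholder, not the one from this chapter):

import tensorflow as tf

# Toy dataset standing in for any real pipeline.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((1000, 10)), tf.zeros((1000,))))

# Batch on the CPU, then prefetch so preparing batch n+1 overlaps with training on batch n.
pipelined = dataset.batch(32).prefetch(tf.data.AUTOTUNE)
# model.fit(pipelined, epochs=10)   # any compiled Keras model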

You may have noticed that when we moved from simple Keras datasets (like Fashion MNIST) to their TFDS versions, we had to batch them before training. Here's why: the pipelined model lets you process a dataset of any size with a consistent ETL pattern.

Parallel ETL to improve training performance

TensorFlow gives you all the APIs you need to parallelize the extraction and transformation process. Let's take a look at what they look like using the Dogs vs. Cats dataset and its underlying TFRecord structure.

First, get the dataset with tfds.load:

train_data, info = tfds.load('cats_vs_dogs', split='train', with_info=True)

If you want to use the underlying TFRecords, you need access to the original file that was downloaded. Due to the large size of the dataset, it is split into multiple files (8 in version 4.0.0).

You can create a list of these files with tf.data.Dataset.list_files:

file_pattern = f'/root/tensorflow_datasets/cats_vs_dogs/4.0.0/cats_vs_dogs*'
files = tf.data.Dataset.list_files(file_pattern)

Once you have the files, you can interleave them into a dataset like this:

train_dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,
    num_parallel_calls=tf.data.AUTOTUNE
)

There are a couple new concepts here, so let's take a moment to explain them.

The cycle_length parameter specifies the number of input elements that are processed concurrently. Shortly you will see the mapping function that decodes the records as they are loaded from disk. Because cycle_length is set to 4, this process handles four records at a time. If you don't specify this value, it is derived automatically from the number of available CPU cores.

The num_parallel_calls parameter specifies the number of parallel calls to execute. Setting it to tf.data.AUTOTUNE, as here, makes the code more portable because the value adjusts dynamically to the available CPUs. Combined with cycle_length, it sets an upper bound on parallelism: for example, if num_parallel_calls is autotuned to 6 and cycle_length is 4, there will be six separate threads, each loading four records at a time.

Now that the extraction process has been parallelized, let's see how we can parallelize the transformation of the data. First, create the mapping function that loads the original TFRecord and converts it into usable content - for example, decoding a JPEG image into an image buffer:

def read_tfrecord(serialized_example):
    feature_description = {
        "image": tf.io.FixedLenFeature((), tf.string, ""),
        "label": tf.io.FixedLenFeature((), tf.int64, -1),
    }
    example = tf.io.parse_single_example(serialized_example, feature_description)
    image = tf.io.decode_jpeg(example['image'], channels=3)
    image = tf.cast(image, tf.float32)
    image = image / 255
    image = tf.image.resize(image, (300, 300))
    return image, example['label']

As you can see, this is a typical mapping function; nothing special has been done inside it to parallelize the work. The parallelization happens when the mapping function is called. Here is how:

import multiprocessing

cores = multiprocessing.cpu_count()

print(cores)

train_dataset = train_dataset.map(read_tfrecord, num_parallel_calls=cores)

train_dataset = train_dataset.cache()

First, if you don't want to auto-tune, you can use the multiprocessing library to get the number of CPUs. Then, when calling the mapping function, you can pass this number of CPUs as the number of parallel calls. It's that simple.

The cache method caches the dataset in memory. If you have enough RAM, this speeds things up significantly. However, trying this with the Dogs vs. Cats dataset in Colab may crash the virtual machine because the dataset does not fit completely in memory; if that happens, Colab's infrastructure may offer you a new, higher-RAM machine.
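If the dataset really is too big for RAM, one option (a sketch, not something the experiment below relies on) is to cache to a file instead of memory: pass a filename to cache, and the first epoch writes a cache file on disk that later epochs read back. The path here is only an illustration:

# Cache to local disk instead of RAM; point this at fast local storage.
train_dataset = train_dataset.cache('/tmp/cats_vs_dogs_cache')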

The loading and training steps can be parallelized as well. When shuffling and batching the data, you can also prefetch based on the available resources. The code is shown below:

train_dataset = train_dataset.shuffle(1024).batch(32)

train_dataset = train_dataset.prefetch(tf.data.AUTOTUNE)

When the training set is fully parallelized, you can train the model as before:

model.fit(train_dataset, epochs=10, verbose=1)

I experimented with this in Google Colab and found that this additional code for parallelizing the ETL process reduced the training time from 75 seconds to about 40 seconds per epoch. Such a simple change almost halved my training time!

Summary

At this point we have finished introducing Google's TensorFlow Datasets, a library that gives you access to a wide variety of datasets, from small learning datasets to the full-scale datasets used in research. You saw that they use common APIs and formats to reduce the amount of code you need to write to access data. We also discussed the ETL process, which is central to the design of TFDS; in particular, we explored parallelizing the extraction, transformation, and loading of data to improve training performance. In the next part, we will turn to one of today's hottest AI topics: natural language processing.