Chapter 3: Beyond the Basics - Feature Detection in Images
Previous article "Unveil the mystery of computer vision, the original machine is so "see the picture"!
Preface: In the previous article, we implemented and trained a neural network that lets a computer "read" images. You could say we now have one foot in the door of AI research and development. Yet even though we have stepped into this mysterious field, we are in truth still standing at its threshold. There is so much complexity here that not even the 2024 Nobel laureates honored for their foundational work on neural networks could fully untangle it, because a neural network is simply too complex for any one person to parse everything it has learned. In today's article, you only need to remember two key concepts: "convolution" and "pooling". A convolution is the means by which a neural network itself extracts key information from an image, while pooling is a human-designed method for further shrinking the convolved image, minimizing the amount of computation and easing the burden on the hardware. Once you grasp these two concepts, everything becomes much clearer.
In the previous post, we created a simple neural network that matched the input pixels of Fashion MNIST images against 10 labels, each representing a type (or category) of clothing. While that network did a good job of recognizing clothing types, it had one glaring shortcoming.
Our neural network was trained on small monochrome images, each containing a single item of clothing centered in the frame. To improve the model further, we need it to learn to detect features in an image; that is, rather than looking only at raw pixels, it should work out how to break the image down into its basic elements. Matching these elements, instead of matching raw pixels, will help us recognize what's in an image far more effectively.
Think back to the Fashion MNIST dataset from the previous post: when recognizing shoes, the neural network may have concluded it was seeing a sole simply because many dark pixels were clustered at the bottom of the image. But if the shoe is no longer centered, or doesn't fill the frame, that logic no longer applies.
One method for detecting features in an image comes from photography and image processing. If you've ever sharpened an image in a tool like Photoshop or GIMP, you were actually applying a mathematical filter that operates on the image's pixels. Another name for such a filter is a "convolution", and when you build convolutions into a neural network, you get the famous convolutional neural network (CNN).
In this post, we'll learn how to use convolutions to detect features in an image, and then take a deeper look at classifying images based on those features. We'll also explore image augmentation to extract more features, use transfer learning to reuse features learned by others, and finally take a brief look at optimizing models with the "dropout" technique.
Convolution
Simply put, a convolution is a filter of weights that combines a pixel with its surrounding pixels to produce a new pixel value. As an example, think back to the ankle boot image in Fashion MNIST and look at how the pixel values in the image change, as shown in Figure 3-1.
Figure 3-1: Ankle boot image with convolution processing
Suppose we are looking at a pixel in the middle of an image, with a value of 192 (remember that Fashion MNIST images are monochrome, with pixel values ranging from 0 to 255). The pixel to its upper left has a value of 0, the one directly above it has a value of 64, and so on.
Next, we define a 3×3 filter, as shown below, with a value in each cell of the grid. To apply it, we multiply the selected pixel and each of its surrounding pixels by the corresponding filter value, then add up the results; this sum becomes the replacement pixel value. We repeat this step for every pixel in the image.
For example, if the current pixel value is 192, after processing with the filter, the new pixel value will be:
new_val = (-1 * 0)    + (0 * 64)    + (-2 * 128)   +
          (0.5 * 48)  + (4.5 * 192) + (-1.5 * 144) +
          (1.5 * 142) + (2 * 226)   + (-3 * 168)
The result is 577, which becomes the new value for this pixel. Repeating this process for every pixel gives us the filtered image.
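If you'd like to see the same multiply-and-add procedure in code, here is a minimal NumPy sketch (the function name and the skipping of edge pixels are simplifications of my own, not part of the original example):

import numpy as np

def apply_filter(image, kernel):
    # Slide a 3x3 filter over a grayscale image: multiply each pixel's
    # 3x3 neighborhood by the filter values and sum the results.
    # Edge pixels are skipped here to keep the sketch simple.
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            neighborhood = image[y - 1:y + 2, x - 1:x + 2]
            out[y - 1, x - 1] = np.sum(neighborhood * kernel)
    return out

# The worked example above, written as arrays: the 3x3 neighborhood
# around the 192-valued pixel, and the filter values.
neighborhood = np.array([[  0,  64, 128],
                         [ 48, 192, 144],
                         [142, 226, 168]])
kernel = np.array([[-1.0, 0.0, -2.0],
                   [ 0.5, 4.5, -1.5],
                   [ 1.5, 2.0, -3.0]])
print(np.sum(neighborhood * kernel))  # 577.0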
Next, let's see what happens when we apply a filter to a more complex image: the 512×512 grayscale photo of two people climbing a staircase that ships with SciPy. After processing with a filter that has negative values on the left and positive values on the right, most of the information in the image is stripped away, leaving only the vertical lines. You can see the result in Figure 3-2.
Figure 3-2: Extracting Vertical Lines Using Filters
Similarly, if you adjust the filter slightly, you can highlight the horizontal lines in the image, as shown in Figure 3-3.
Figure 3-3: Extracting Horizontal Lines Using Filters
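Here is a sketch of how you could reproduce these two figures yourself. It assumes SciPy 1.10 or later, where the staircase image lives in scipy.datasets (older versions expose it as scipy.misc.ascent(), and fetching it may require the optional pooch package); the exact filter values are one common choice, not the only one:

import numpy as np
from scipy import datasets           # scipy.misc on older SciPy versions
from scipy.ndimage import convolve

image = datasets.ascent().astype(float)   # the 512x512 staircase photo

# Negative values on the left, positive values on the right:
# this filter responds strongly to vertical edges (Figure 3-2).
vertical_filter = np.array([[-1, 0, 1],
                            [-2, 0, 2],
                            [-1, 0, 1]])

# Transposing the same filter highlights horizontal lines instead (Figure 3-3).
horizontal_filter = vertical_filter.T

vertical_lines = convolve(image, vertical_filter)
horizontal_lines = convolve(image, horizontal_filter)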
These examples show that filters not only extract features from an image but also strip away redundant information. Crucially, a neural network can learn the filter values themselves during training, discovering the filters best suited to matching its inputs to the desired outputs.
Pooling
Pooling is the process of removing pixels from an image while retaining the semantic information of its content. It is easiest to understand through a visual example. Figure 3-4 illustrates max pooling.
Figure 3-4: Demonstrating Maximum Pooling
In this example, the boxes on the left represent the pixels of a monochrome image. We group them into 2×2 blocks, dividing the 16 pixels into four 2×2 arrays called "pools". We then take the maximum value from each pool and assemble those maxima into a new image. This reduces the pixel count by 75% (from 16 pixels to 4).
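In code, 2×2 max pooling amounts to a reshape followed by a max. Below is a small NumPy sketch; the 4×4 values are arbitrary stand-ins for illustration, not the actual numbers in Figure 3-4:

import numpy as np

def max_pool_2x2(image):
    # 2x2 max pooling: split the image into 2x2 pools and keep only
    # the largest value in each, halving both width and height.
    h, w = image.shape
    # Trim any odd edge, then reshape so each 2x2 pool gets its own axes.
    pools = image[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2)
    return pools.max(axis=(1, 3))

# An arbitrary 4x4 "image": 16 pixels in, 4 pixels out.
x = np.array([[3, 7, 1, 0],
              [9, 2, 5, 4],
              [0, 6, 8, 8],
              [1, 1, 2, 9]])
print(max_pool_2x2(x))
# [[9 5]
#  [6 9]]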
Figure 3-5 shows what happens when max pooling is applied to the ascent image from Figure 3-2, whose vertical lines were enhanced by filtering.
Figure 3-5: Ascent image after vertical filtering and maximum pooling
Note that the filtered features are not only preserved but further accentuated. In addition, the image size has been reduced from 512×512 to 256×256: half the width and half the height, so only a quarter of the original pixel count.
There are other pooling methods too, such as minimum pooling, which keeps the smallest pixel value in each pool, and average pooling, which takes the mean of the values in each pool; both are one-line changes to the sketch above, as shown below.
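These variants just swap out the reduction function applied to the same reshape trick (again, an illustrative sketch rather than a canonical implementation):

import numpy as np

def pool_2x2(image, reduce=np.max):
    # Generic 2x2 pooling: np.max gives max pooling, np.min gives
    # minimum pooling, and np.mean gives average pooling.
    h, w = image.shape
    pools = image[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2)
    return reduce(pools, axis=(1, 3))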
This post covers two important concepts in machine learning: convolution and pooling. In the next post, we will implement a convolutional neural network and explore the functionality of its layers.