Convolutional neural networks: computer vision in detail

Explaining Computer Vision in Detail

Computer vision is a field that is growing by leaps and bounds thanks to deep learning. Deep learning for computer vision helps self-driving cars identify the pedestrians and vehicles around them so that they can avoid them. It has also made face recognition more efficient and accurate: you may soon experience, or may already have experienced, unlocking a cell phone or a door lock just by showing your face. Once the phone is unlocked, I would guess it has plenty of apps for sharing pictures, where you can see photos of food, hotels, or beautiful landscapes. Some companies use deep learning in these apps to show you the most vivid, attractive, and relevant pictures. Machine learning has even given rise to new types of art. I find deep learning for computer vision exciting for the following two reasons, and I think you will too.

First, the rapid development of computer vision is making possible new kinds of applications that no one would have dared to imagine a few years ago. By learning to use these tools, you may be able to create new products and applications yourself.

Second, even if you never end up building a computer vision system yourself, the computer vision research community has been so imaginative and creative in coming up with new neural network architectures and algorithms that its ideas inspire work crossing over into other fields. As an example, when I was working on speech recognition, I often looked to computer vision for inspiration and borrowed ideas from its literature. So even if you don't produce results in computer vision, you can hopefully apply what you learn to other algorithms and architectures as well. That's all for now; let's start learning.

Here are some of the problems you will learn about in this blog. You have probably long since heard of image classification, also called image recognition: for example, given a 64×64 image, the computer is asked to recognize that it shows a cat.

Another computer vision problem is object detection. In a self-driving car project, for example, it is not enough to recognize that the picture contains other vehicles; you also need to compute their positions so that you can avoid them. So in an object detection problem, you first need to work out which objects are in the picture, such as cars and other things, and then draw boxes around them, or use some other technique, to identify where they are in the picture. Note that in this example there can be several vehicles in the same picture, each at some distance from your own car.

A more interesting example is neural style transfer. Say you have a picture and want to repaint it in a different style. You start with a content image and a style image (in this example, the image on the right is a Picasso painting), and a neural network fuses the two into a new picture: the overall outline comes from the content image on the left, the style comes from the image on the right, and the result is the generated picture below. This remarkable algorithm makes new kinds of artwork possible.

However, one challenge in applying computer vision is that the inputs can be very large. As an example, it is common to work with small 64×64 images, each of which actually contains 64×64×3 values, because each image has 3 color channels. If you do the math, that comes to 12288 values, so the feature vector \(x\) has dimension 12288. This is actually fine, because 64×64 is a really small image.
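The arithmetic above can be checked with a short sketch. It assumes NumPy is available and uses an all-zero array as a hypothetical stand-in for a real 64×64 RGB image; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for a 64x64 RGB image: height x width x 3 color channels.
image = np.zeros((64, 64, 3), dtype=np.uint8)

# Unroll every pixel value into a single column vector x.
x = image.reshape(-1, 1)

print(image.shape)  # (64, 64, 3)
print(x.shape)      # (12288, 1)
```

Unrolling the image into a column vector is one common convention for feeding pixels to a fully-connected layer; the count of 12288 is the same whichever order the values are flattened in.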

If you want to work with a larger image, such as a 1000×1000 image, which is only 1 megapixel, the dimension of the feature vector reaches 1000×1000×3, because there are 3 RGB channels, so the number is 3 million. If you are viewing this on a very small screen, you may not notice that the top image is only 64×64, while the bottom one is a much larger 1000×1000 image.

If the input contains 3 million values, the feature vector \(x\) has dimension 3 million. Suppose the first hidden layer has 1000 hidden units, with all of its weights forming the matrix \(W^{[1]}\). In a standard fully-connected network, this matrix has size 1000 × 3 million, since \(x\) now has dimension \(3m\), where \(3m\) is shorthand for 3 million. This means that \(W^{[1]}\) has 3 billion parameters, which is a very large number. With that many parameters, it is difficult to get enough data to keep the neural network from overfitting, and the computational and memory requirements of training a network with 3 billion parameters make it impractical.
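To make the parameter count concrete, here is a minimal sketch in plain Python. The helper name `dense_layer_weight_count` is hypothetical, and the memory figure assumes each weight is stored as a 4-byte (32-bit) float, which is an assumption made for illustration and not something stated above.

```python
def dense_layer_weight_count(n_inputs: int, n_hidden: int) -> int:
    """Number of entries in a weight matrix W of shape (n_hidden, n_inputs)."""
    return n_hidden * n_inputs


n_hidden = 1000                # hidden units in the first layer
small_input = 64 * 64 * 3      # 12,288 values
large_input = 1000 * 1000 * 3  # 3,000,000 values

small_params = dense_layer_weight_count(small_input, n_hidden)
large_params = dense_layer_weight_count(large_input, n_hidden)

# Assumption: 4 bytes per weight (32-bit floats); bias terms are ignored.
large_bytes = large_params * 4

print(f"64x64 image:     {small_params:,} weights")
print(f"1000x1000 image: {large_params:,} weights")
print(f"Memory for W[1]: {large_bytes / 1e9:.0f} GB")
```

Under these assumptions the first layer alone would need about 12 GB just to hold its weights, before counting gradients or optimizer state.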

But for computer vision applications, you surely don't want to be limited to small images; you want to be able to handle large images as well. To do that, you need the convolution operation, which is one of the fundamental building blocks of a convolutional neural network.