Machine Learning: Whether to Use End-to-End Deep Learning

A detailed look at whether to use end-to-end deep learning

Suppose you are building a machine learning system and have to decide whether or not to use an end-to-end approach. Let's look at some of the advantages and disadvantages of end-to-end deep learning, so that you have some guidelines for judging whether an end-to-end approach is promising for your application.

Here are some of the benefits of applying end-to-end learning. First, end-to-end learning really just lets the data speak for itself. If there is enough \((x,y)\) data, then whatever the most suitable function mapping from \(x\) to \(y\) is, a large enough neural network trained on that data will hopefully figure it out on its own. With a pure machine learning approach, a neural network that learns the mapping from \(x\) to \(y\) directly may be better able to capture whatever statistics are in the data, rather than being forced to reflect human preconceptions.

For example, in the field of speech recognition, early systems were built around the concept of phonemes, the basic units of sound, such as "cuh", "ah", and "tuh" for the word "cat". Phonemes were created by human linguists, and arguably they are something of a fantasy of phoneticians. Describing a language in terms of phonemes is reasonable, but it is not obvious that a learning algorithm should be forced to think in phonemes. If the learning algorithm is allowed to learn any representation it wants, rather than being forced to use phonemes as a representation, its overall performance may be better.

The second benefit of end-to-end deep learning is that fewer components need to be hand-designed. This can streamline the design workflow, because less time goes into hand-designing features and intermediate representations.

What about the drawbacks? First, it can require a lot of data. Learning the mapping from \(x\) to \(y\) directly may need a large number of \((x,y)\) pairs. In an earlier example, face recognition, plenty of data could be collected for each sub-task: a lot of data for locating the face in an image, and, once the face is found, a lot of data for recognizing whose face it is. For the whole end-to-end task, though, much less data may be available. Here \(x\) is the input to end-to-end learning and \(y\) is the output, so training these systems takes a lot of \((x,y)\) data in which both the input end and the output end are present. That is why it is called end-to-end learning: the mapping is learned directly from one end of the system to the other.

Another drawback is that it excludes hand-designed components that might be useful. Machine learning researchers generally look down on hand-designed components, but without a lot of data, a learning algorithm has no way to gain much insight from a very small training set. In that case, hand-designing a component can be a way to inject human knowledge directly into the algorithm, which is not always a bad thing. A learning algorithm has two main sources of knowledge: one is the data, and the other is whatever you hand-design, whether that is a component, a feature, or something else. When there is a lot of data, the hand-designed parts matter less. When there is not much data, a carefully designed system that injects a lot of human knowledge about the problem directly into the algorithm should be quite helpful.

So one of the drawbacks of end-to-end deep learning is that it excludes hand-designed components that might be useful. Well-designed components can help a great deal, but they can also really hurt the algorithm's performance. For example, an algorithm that is forced to think in phonemes might do better if it were left to find a better representation on its own. So it is a double-edged sword: it can hurt or it can help, but it tends to help more, and hand-designed components tend to help most when the training set is small.

If you are building a new machine learning system and trying to decide whether to use end-to-end deep learning, the key question is this: is there enough data to learn a function of the complexity needed to map \(x\) to \(y\)? There is no formal definition of the phrase "complexity needed", but intuitively, suppose the function to be learned from \((x,y)\) data takes an X-ray image and locates all the bones in it. Recognizing the bones in an image seems like a relatively simple problem, so the system may not need that much data to learn it. Likewise, given a picture of a person, finding the face in the picture may not be that hard, so it may not take much data, or at least enough data can be found to solve the problem. By comparison, mapping an X-ray image of a hand directly to the age of the child intuitively seems like a more complex function, and learning it with a pure end-to-end approach would require a lot of data.

To finish with a more complex example, consider driverless cars. You may know of the companies that specialize in driverless technology, and the development of this technology is quite exciting. How do you build a car that can drive itself? One thing that can be done, and this is not an end-to-end deep learning approach, is to take as input an image of what is in front of the car, or readings from radar, lidar, or other sensors. To keep the illustration simple, say you take a picture of what is in front of or around the car. Then, to drive safely, you need to detect nearby cars, you need to detect pedestrians, and you need to detect other things as well. Of course, this is a highly simplified example.

After figuring out where the other cars and pedestrians are, you need to plan your route. In other words, having seen where the other cars and pedestrians are, you need to decide how to steer the car along its path over the next few seconds. Picture an overhead view of the road with the car on it: once a particular route has been chosen, the steering wheel needs to be turned to the right angle, and the right acceleration and braking commands need to be issued. Going from sensor or image input to detected pedestrians and vehicles is something deep learning can do well. But once the locations or movements of the other vehicles and pedestrians are known, choosing the path the car will take is usually done not with deep learning but with what is called motion-planning software. If you have taken a robotics course, you will know about motion planning. Then, once the path has been decided, other algorithms take over, namely a control algorithm that makes the precise decisions about exactly how many degrees to turn the steering wheel and how much force to apply to the gas pedal or the brakes.
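The three stages described above (detection, motion planning, control) can be sketched as a chain of separate functions. This is a minimal illustration under stated assumptions, not how any real driving stack is built: every name here (`Obstacle`, `detect_obstacles`, `plan_path`, `control`, `drive_step`) is hypothetical, the "detector" is a stub that receives obstacle coordinates directly instead of running a neural network on an image, the planner only chooses among three fixed lateral offsets, and only the steering angle is produced (no acceleration or braking).

```python
import math
from dataclasses import dataclass

@dataclass
class Obstacle:
    x: float  # lateral position relative to the car, metres (positive = right)
    y: float  # distance ahead of the car, metres

def detect_obstacles(sensor_frame):
    """Stage 1 (the part deep learning handles well in practice).
    Stub: the 'frame' is already a list of (x, y) tuples."""
    return [Obstacle(x, y) for x, y in sensor_frame]

def plan_path(obstacles, candidate_offsets=(-1.0, 0.0, 1.0), horizon=20.0):
    """Stage 2 (motion planning): pick the lateral offset with the most clearance
    from obstacles within the horizon, preferring smaller offsets on ties."""
    def clearance(offset):
        gaps = [abs(o.x - offset) for o in obstacles if 0.0 < o.y <= horizon]
        return min(gaps, default=float("inf"))
    return max(candidate_offsets, key=lambda off: (clearance(off), -abs(off)))

def control(target_offset, lookahead=10.0, max_steer_deg=30.0):
    """Stage 3 (control): convert the target offset into a clamped steering angle
    in degrees by aiming at a point `lookahead` metres ahead."""
    angle = math.degrees(math.atan2(target_offset, lookahead))
    return max(-max_steer_deg, min(max_steer_deg, angle))

def drive_step(sensor_frame):
    """Run the full multi-step pipeline for one sensor frame."""
    return control(plan_path(detect_obstacles(sensor_frame)))

if __name__ == "__main__":
    print(drive_step([]))              # empty road
    print(drive_step([(-0.5, 8.0)]))   # obstacle slightly to the left, 8 m ahead
```

Each stage has its own input and output, so each can be built with whatever technique suits it: a learned detector trained on labelled images, and conventional planning and control code that needs no training data at all.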

So this example shows that if you want to use machine learning or deep learning to learn individual components, then when applying supervised learning you should choose carefully which \(x\) to \(y\) mappings to learn, depending on what kind of data can be collected for each task. In contrast, it is exciting to talk about a pure end-to-end deep learning approach, where you input an image and output the steering wheel angle directly. But given the data that can be collected today and the kinds of functions neural networks can learn today, that is actually not the most promising approach, or at least not the one that teams have found to work best. In other words, the pure end-to-end deep learning approach is currently less promising than a more complex multi-step approach like the one above, because of the limits on the data that can be collected and on our ability to train neural networks right now.

That is end-to-end deep learning. Sometimes it works really well, but it is also important to know when you should use it.