Multi-task Learning Explained
In transfer learning, the steps happen in sequence: you learn from task \(A\) and then transfer that knowledge to task \(B\). In multi-task learning, learning begins simultaneously: you try to get a single neural network to do several tasks at once, hoping that each of these tasks helps all of the others.
Let's look at an example. Suppose we are developing a self-driving vehicle. It may need to detect several kinds of objects at the same time: pedestrians, cars, stop signs, traffic lights, and various other things. In the example on the left, the image contains a stop sign and a car, but no pedestrians and no traffic lights.
If \(x^{(i)}\) is the input image, then there is no longer a single label \(y^{(i)}\); instead there are four labels. In this example, there are no pedestrians, one car, one stop sign, and no traffic lights. If you wanted to detect other objects as well, \(y^{(i)}\) would have an even higher dimension, but let's stay with 4 for now, so \(y^{(i)}\) is a 4×1 vector. Looking at the labels of the training set as a whole, we stack them horizontally as before, from \(y^{(1)}\) to \(y^{(m)}\):

\(Y = \left\lbrack y^{(1)}\ y^{(2)}\ \cdots\ y^{(m)} \right\rbrack\)
Now, though, each \(y^{(i)}\) is a 4×1 vector, a vertical column vector, so the matrix \(Y\) is now a \(4 \times m\) matrix, whereas before, when \(y\) was a single real number, \(Y\) was a \(1 \times m\) matrix.
So what you can now do is train a neural network to predict these \(y\) values. You get a neural network with input \(x\) whose output is a four-dimensional vector \(\hat y\). Note that the output layer has four nodes: the first node tries to predict whether there are any pedestrians in the image, the second predicts whether there are any cars, the third whether there are any stop signs, and the fourth whether there are any traffic lights, so \(\hat y\) is four-dimensional.
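As a concrete illustration, here is a minimal sketch of such a network in NumPy; the layer sizes, the input dimension, and the tanh hidden activation are illustrative choices, not something fixed by the text. The point is simply one shared hidden representation feeding four independent sigmoid output nodes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: input dim, shared hidden units, and the 4 tasks
# (pedestrian, car, stop sign, traffic light).
n_x, n_h, n_tasks = 64, 32, 4
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((n_tasks, n_h)) * 0.01
b2 = np.zeros((n_tasks, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    """X has shape (n_x, m); returns y_hat of shape (4, m):
    one independent probability per task for each example."""
    A1 = np.tanh(W1 @ X + b1)      # shared hidden representation
    return sigmoid(W2 @ A1 + b2)   # 4 independent sigmoid outputs

X = rng.standard_normal((n_x, 5))  # 5 dummy "images" as column vectors
print(forward(X).shape)            # (4, 5)
```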
To train this neural network, you now need to define its loss function. For an output \(\hat y\), which is a 4-dimensional vector, the average loss over the entire training set is:
\(\frac{1}{m}\sum_{i = 1}^{m}{\sum_{j = 1}^{4}{L(\hat y_{j}^{(i)},y_{j}^{(i)})}}\)
\(\sum_{j = 1}^{4}{L(\hat y_{j}^{(i)},y_{j}^{(i)})}\) sums the losses of the individual predictions, so it is a sum of four components: pedestrians, cars, stop signs, and traffic lights. The symbol \(L\) refers to the logistic loss, written like this:
\(L(\hat y_{j}^{(i)},y_{j}^{(i)}) = - y_{j}^{(i)}\log\hat y_{j}^{(i)} - (1 - y_{j}^{(i)})\log(1 - \hat y_{j}^{(i)})\)
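This cost is easy to write down directly. Below is a NumPy sketch of it: the elementwise logistic loss, summed over the four tasks and averaged over the \(m\) examples. The small `eps` guard against \(\log 0\) and the example values are my additions.

```python
import numpy as np

def multitask_cost(Y_hat, Y, eps=1e-12):
    # Elementwise logistic loss L(y_hat, y) for every task j and example i.
    losses = -(Y * np.log(Y_hat + eps)
               + (1 - Y) * np.log(1 - Y_hat + eps))
    # Sum over the 4 tasks (rows), then average over the m examples (columns).
    return losses.sum(axis=0).mean()

# One example: no pedestrian, a car, a stop sign, no traffic light.
Y = np.array([[0.0], [1.0], [1.0], [0.0]])
Y_hat = np.array([[0.1], [0.8], [0.7], [0.2]])  # hypothetical network outputs
print(multitask_cost(Y_hat, Y))
```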
The main difference between this average loss over the entire training set and the earlier cat-classification example is the sum over \(j = 1\) to 4. And the key difference from softmax regression is that softmax assigns a single label to each example.
Here, by contrast, one image can have many different labels, so it's not that every image is either a pedestrian picture, a car picture, a stop-sign picture, or a traffic-light picture. You want to know, for each image, whether it contains pedestrians, cars, stop signs, or traffic lights, and multiple objects can appear in the same image. In fact, the image above contains both a car and a stop sign, but no pedestrians and no traffic lights. So instead of giving the image a single label, you go through the different object types and check, for each type, whether that type of object appears in the image. That is what it means for one image to have multiple labels.

If you train a neural network to minimize this cost function, you are doing multi-task learning, because you are building a single neural network that looks at each image and solves four problems: it tries to tell you whether each of these four objects appears in the image. You could instead train four separate neural networks rather than one network that does four things. But if some of the early features of the neural network are shared across recognizing the different objects, then you will find that training one neural network to do four things performs better than training four completely separate neural networks, each doing one thing. That is the power of multi-task learning.
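To make the contrast with softmax concrete, here is a small sketch with made-up logits: independent sigmoids can turn several labels "on" at once, while softmax probabilities compete and sum to 1, which only suits the one-label-per-image setting.

```python
import numpy as np

# Hypothetical logits for: pedestrian, car, stop sign, traffic light.
z = np.array([-2.0, 1.5, 1.2, -1.8])

sigmoid = 1 / (1 + np.exp(-z))         # multi-label: entries are independent
softmax = np.exp(z) / np.exp(z).sum()  # single-label: entries compete

print(np.round(sigmoid, 2))  # [0.12 0.82 0.77 0.14] -> car AND stop sign
print(np.round(softmax, 2))  # sums to 1 -> effectively picks one class
```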
One more detail: so far the algorithm has been described as if every image were fully labeled. It turns out that multi-task learning also works when only some of the objects in an image are labeled. For the first training example, say the person labeling the data tells you there is a pedestrian and no car, but didn't label whether there is a stop sign or a traffic light. In the second example, there is a pedestrian and a car, but again, when the labeler looked at the image, they didn't mark whether there was a stop sign or a traffic light. Maybe some examples are fully labeled, while for others the labelers only marked whether there was a car, and the remaining entries are question marks.
Even with a dataset like this, you can train the algorithm on all four tasks at the same time, even when some images are labeled for only a subset of the objects and the rest are question marks. You train the algorithm so that the sum over \(j\) from 1 to 4 includes only the terms with a 0 or 1 label: when a label is a question mark, that term is simply omitted from the sum, so only the labeled values are summed. This makes it possible to use such a dataset.
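Here is a minimal sketch of that masked sum, assuming (my convention, not the text's) that unlabeled "question mark" entries are stored as `np.nan`; any sentinel value would work the same way.

```python
import numpy as np

def masked_multitask_cost(Y_hat, Y, eps=1e-12):
    mask = ~np.isnan(Y)                    # True where a 0/1 label exists
    Y_filled = np.where(mask, Y, 0.0)      # placeholder for unlabeled slots
    losses = -(Y_filled * np.log(Y_hat + eps)
               + (1 - Y_filled) * np.log(1 - Y_hat + eps))
    losses = np.where(mask, losses, 0.0)   # drop question-mark terms from the sum
    return losses.sum(axis=0).mean()

# Two examples; the second has no stop-sign or traffic-light label.
Y = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, np.nan],
              [0.0, np.nan]])
Y_hat = np.array([[0.9, 0.8],
                  [0.2, 0.7],
                  [0.8, 0.5],
                  [0.1, 0.5]])
print(masked_multitask_cost(Y_hat, Y))
```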
So when does multi-task learning make sense? It makes sense when three things are true.
First, when the set of tasks being trained can share low-level features. For the self-driving example, it makes sense to recognize traffic lights, cars, and pedestrians at the same time; these objects have similar features that might also help recognize stop signs, since all of these are features of the road.
Second, and this guideline is less absolute, so it doesn't always hold: in many successful cases of multi-task learning, the amount of data for each task is similar. Recall that in transfer learning you learn from task \(A\) and transfer to task \(B\); if task \(A\) has 1 million samples and task \(B\) has only 1,000, then what is learned from those 1 million samples can genuinely help augment the training of the much smaller task \(B\).

What about multi-task learning? In multi-task learning there are usually more than two tasks. There were four tasks earlier, but suppose there are 100 tasks, and you do multi-task learning to try to recognize 100 different types of objects at the same time, with roughly 1,000 samples per task. If you focus on the performance of a single task, say the 100th one, denoted \(A_{100}\), then doing that task alone leaves you only 1,000 samples to train on. But by training on the other 99 tasks, which together contribute 99,000 samples, you can add a great deal of knowledge and dramatically improve performance on task \(A_{100}\); otherwise, with a training set of only 1,000 samples, the results may be poor. And if there is symmetry, each of those other 99 tasks can likewise provide some data or knowledge that helps every one of the 100 tasks.

So the second point is not an absolute rule, but the usual pattern is this: if you focus on a single task and want a big performance gain from multi-task learning, the other tasks combined need to have much more data than that single task alone. One way to satisfy this condition is the 100-task example just given, or a setting where every task has a similar amount of data; the key point is that if a single task already has 1,000 samples, the other tasks together had better have far more than 1,000 samples, so that their knowledge can help improve that task's performance.
Finally, multi-task learning tends to make more sense when you can train a neural network big enough to do all the tasks well at the same time. The alternative to multi-task learning is to train a separate neural network for each task: instead of one network that handles pedestrian, car, stop-sign, and traffic-light detection simultaneously, you could train one network for pedestrian detection, one for car detection, one for stop-sign detection, and one for traffic-light detection. What the researcher Rich Caruana found some years ago is that the only situation in which multi-task learning hurts performance, compared with training separate networks, is when the neural network isn't big enough. If you can train a large enough neural network, then multi-task learning certainly should not, or only rarely will, hurt performance; the hope is that it improves performance and does better than training separate networks for each task.
So that is multi-task learning. In practice it is used less often than transfer learning. You see many applications of transfer learning where you need to solve a problem with very little training data, so you find a related problem with lots of data, pre-train on it, and transfer the knowledge to the new problem. Multi-task learning is rarer: you need many tasks that you want to handle at the same time and that can all be trained well simultaneously. Computer vision is perhaps an example; in object detection you more often see multi-task learning, where one neural network detecting a whole collection of objects does better than separately trained networks. But on average, transfer learning is currently used more often than multi-task learning, and both can be powerful tools.
To summarize, multi-task learning lets you train one neural network to perform many tasks, which can give higher performance than doing each task individually. But note that in practice transfer learning is used more often than multi-task learning. In many cases, if you want to solve a machine learning problem but your dataset is relatively small, transfer learning can really help: if you find a related problem with a much larger amount of data, you can train a neural network on that and then transfer it to the task where there is very little data.