Preface: We often say that a certain person only "learns by rote" and cannot solve a problem once it has been changed slightly. This is a very common phenomenon in human learning. But did you know that artificial intelligence is even more prone to this kind of rote learning? In the field of artificial intelligence, however, we have a grand-sounding term for it: "overfitting." Put bluntly, overfitting is AI's version of rote learning. In this subsection, we will talk about how to make AI less prone to rote learning. Note that I said "less," because there is no way to eliminate this problem completely; we can only minimize it.
Reducing Overfitting in Language Modeling
Overfitting occurs when the network becomes too focused on the training data. One way this manifests is that the network becomes very good at matching patterns in the "noise" of the training set, patterns that do not exist elsewhere. Since this particular noise is absent from the validation set, the better the network gets at matching it, the worse the loss on the validation set becomes. This is what produces the rising validation loss you see in Figure 6-3. In this section, we will explore several ways to help the model generalize and reduce overfitting.
Adjusting the Learning Rate
One of the biggest factors that can lead to overfitting is an optimizer learning rate that is too high, which means the network learns too quickly. Below is an example of the code used to compile the model:
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])
The optimizer here is simply declared as adam, which would call the Adam optimizer with default parameters. However, this optimizer supports multiple parameters, including the learning rate. The code can be changed to the following:
adam = tf.keras.optimizers.Adam(learning_rate=0.0001,
                                beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(loss='binary_crossentropy',
              optimizer=adam, metrics=['accuracy'])
Here, the default learning rate value (0.001) is reduced by 90% to 0.0001. The beta_1 and beta_2 values are kept at their defaults, as is amsgrad; a runnable end-to-end sketch using this optimizer follows the notes below.
- beta_1 and beta_2 must be between 0 and 1, and both are typically close to 1.
- amsgrad enables AMSGrad, an alternative implementation of the Adam optimizer, first presented in the paper "On the Convergence of Adam and Beyond" by Sashank Reddi, Satyen Kale, and Sanjiv Kumar.
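To make this concrete, below is a minimal, self-contained sketch of what the full training setup might look like with the lower learning rate. The architecture, vocabulary size, and the randomly generated stand-in data are assumptions for illustration only; in practice you would substitute your own tokenized, padded text and labels.

import numpy as np
import tensorflow as tf

# Stand-in data: in practice, use your tokenized, padded sentences
# and binary labels instead of these random arrays.
vocab_size, max_length = 10000, 100
x_train = np.random.randint(1, vocab_size, size=(1000, max_length))
y_train = np.random.randint(0, 2, size=(1000, 1)).astype('float32')
x_val = np.random.randint(1, vocab_size, size=(200, max_length))
y_val = np.random.randint(0, 2, size=(200, 1)).astype('float32')

# A small, assumed architecture for binary text classification.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# The tuned optimizer from this section: learning rate lowered to 0.0001.
adam = tf.keras.optimizers.Adam(learning_rate=0.0001,
                                beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(loss='binary_crossentropy',
              optimizer=adam, metrics=['accuracy'])

# Keep the History object so loss and accuracy can be inspected epoch by epoch.
history = model.fit(x_train, y_train, epochs=100,
                    validation_data=(x_val, y_val), verbose=2)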
This lower learning rate has a profound effect on the network. Figure 6-4 shows the network's accuracy over 100 epochs. As you can see, for roughly the first 10 epochs the lower learning rate makes it look as though the network is "not learning," but then it "breaks out" and begins to learn quickly.
Figure 6-4: Accuracy when using a lower learning rate
By looking at the loss (shown in Figure 6-5), we can see that even though accuracy does not increase over the first few epochs, the loss is decreasing. So if you watch the training process epoch by epoch, you can be confident that the network will eventually start to learn.
Figure 6-5: Loss when using a lower learning rate
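Curves like those in Figures 6-4 and 6-5 can be reproduced from the History object that model.fit() returns. Below is a minimal sketch, assuming the history variable from the training sketch above; Keras records each metric and its validation counterpart under keys such as 'accuracy'/'val_accuracy' and 'loss'/'val_loss'.

import matplotlib.pyplot as plt

# Plot a training metric and its validation counterpart, epoch by epoch.
def plot_metric(history, metric):
    plt.plot(history.history[metric], label=metric)
    plt.plot(history.history['val_' + metric], label='val_' + metric)
    plt.xlabel('Epoch')
    plt.ylabel(metric)
    plt.legend()
    plt.show()

plot_metric(history, 'accuracy')  # curves like Figure 6-4
plot_metric(history, 'loss')      # curves like Figure 6-5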
While the loss eventually begins to show an overfitting curve similar to the one in Figure 6-3, note that it occurs much later and to a much lesser degree. At epoch 30 the loss is about 0.45, whereas with the higher learning rate in Figure 6-3 it was more than twice that. Although the network takes longer to reach a high accuracy, it does so with lower loss, so you can have more confidence in the results.
With these hyperparameters, the loss on the validation set begins to increase at around epoch 60. At that point the training set reaches an accuracy of about 90% and the validation set about 81%, which suggests that the network is quite effective.
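To locate the point where the validation loss turns upward, you can inspect the History object directly rather than eyeballing the plot. This is a small sketch under the same assumption that history comes from the training run above:

import numpy as np

# The epoch with the lowest validation loss is roughly where
# overfitting begins (around epoch 60 in the run described above).
val_loss = np.array(history.history['val_loss'])
turn = int(np.argmin(val_loss))
print(f"Validation loss bottoms out at epoch {turn}")
print(f"Training accuracy there:   {history.history['accuracy'][turn]:.3f}")
print(f"Validation accuracy there: {history.history['val_accuracy'][turn]:.3f}")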
Of course, it's relatively simple to just tweak the optimizer's parameters and claim success, but there are many other techniques you can use to improve your model, and they are described in the next few sections. In those sections, I'll revert to the default Adam optimizer for illustration purposes, so that the effect of adjusting the learning rate does not overshadow the benefits offered by the other techniques.
Summary: In this section, we introduced how to mitigate a language model's "rote learning" by adjusting the learning rate. In the next few sections, we will explore how the characteristics of the training dataset, the model's architectural design, its preset dimensions, and other factors affect this "rote learning" problem.