Reference:
/devilmaycry812839668/p/
The Actor-Mimic and expert DQN training curves for 100 training epochs for each of the 8 games. A training epoch is 250,000 frames and for each training epoch we evaluate the networks with a testing epoch that lasts 125,000 frames. We report AMN and expert DQN test reward for each testing epoch and the mean and max of DQN performance. The max is calculated over all testing epochs that the DQN experienced until convergence while the mean is calculated over the last ten epochs before the DQN training was stopped.
Reinforcement learning differs from other AI methods in how performance is tested. In other AI methods, testing is performed after training, so training and testing are two separate processes. In reinforcement learning, by contrast, testing is interleaved with training. More specifically:
Suppose a reinforcement learning run consists of 100 training epochs, each of 250,000 frames. If the batch size is 100, then an epoch contains 2,500 batches, that is, 2,500 parameter updates.
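The arithmetic above can be checked with a few lines of Python. This is only a sketch of the simplified accounting used in the text, which assumes one parameter update per batch of frames; the variable names are illustrative.

```python
# Figures taken from the example in the text.
num_epochs = 100
frames_per_epoch = 250_000
batch_size = 100

# Simplifying assumption from the text: one parameter update per batch of frames.
updates_per_epoch = frames_per_epoch // batch_size
total_updates = num_epochs * updates_per_epoch

print(updates_per_epoch)  # 2500
print(total_updates)      # 250000
```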
Because testing and training are combined in reinforcement learning, we run a test after every completed training epoch. Each test lasts 125,000 frames, and the sum of the rewards collected over those 125,000 frames is taken as the test result. This sum can also be divided by 125,000 to normalize it.
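A minimal sketch of this train-then-test schedule is shown here. The functions `train_one_epoch` and `run_test_epoch` are hypothetical placeholders (the test epoch just draws random 0/1 rewards), and the frame counts are scaled down so the example runs quickly; it illustrates only the interleaving and the normalization, not a real agent.

```python
import random

def train_one_epoch(num_frames):
    """Hypothetical placeholder: a real agent would update its parameters here."""
    return None

def run_test_epoch(num_frames, rng):
    """Hypothetical placeholder: returns the sum of per-frame rewards for one test."""
    return sum(rng.choice((0.0, 1.0)) for _ in range(num_frames))

def train_with_interleaved_tests(num_epochs, train_frames, test_frames, seed=0):
    rng = random.Random(seed)
    test_results = []
    for _ in range(num_epochs):
        train_one_epoch(train_frames)                  # one training epoch
        total_reward = run_test_epoch(test_frames, rng)  # followed by one test
        test_results.append(total_reward)
    return test_results

# Scaled-down numbers; the text uses 100 epochs, 250,000 training frames
# and 125,000 test frames per epoch.
test_frames = 500
results = train_with_interleaved_tests(num_epochs=5, train_frames=1000,
                                       test_frames=test_frames)

# Optional normalization: average reward per test frame.
normalized = [r / test_frames for r in results]
```

One test result is recorded per training epoch, so a full run as described in the text would yield 100 test results.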
The key question is how to compute the max and mean test values from the test results obtained during training. The method given here takes the maximum over all testing epochs (each testing epoch's result being the sum of rewards over its 125,000 frames) as the max value. The max value is easy to obtain, but there is no single agreed way to compute the mean value. One of the main contributions here is a more objective way of computing the mean: average the results of the last 10 testing epochs of the training process (again, the sum of rewards over 125,000 frames in each testing epoch) and report that as the test mean for the whole training run.
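The two statistics can be sketched as follows. The function name `summarize_test_rewards` and the sample data are made up for illustration; the only parts taken from the text are the rules themselves (max over all testing epochs, mean over the last ten).

```python
def summarize_test_rewards(test_results, last_k=10):
    """Return (max over all testing epochs, mean over the last `last_k` epochs)."""
    if not test_results:
        raise ValueError("test_results must not be empty")
    max_reward = max(test_results)
    tail = test_results[-last_k:]  # fewer than last_k entries if the run is short
    mean_reward = sum(tail) / len(tail)
    return max_reward, mean_reward

# Made-up test results for 100 testing epochs: 1.0, 2.0, ..., 100.0.
example_results = [float(i) for i in range(1, 101)]
max_reward, mean_reward = summarize_test_rewards(example_results)
print(max_reward, mean_reward)  # 100.0 95.5
```

Using only the last ten testing epochs keeps early, poorly trained epochs from dragging the mean down, which is presumably why this window was chosen.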