
RIME: using the size of the cross-entropy loss to tell correct from corrupted preferences + pre-training the reward model with intrinsic reward


  • Article Title: RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences, ICML 2024 Spotlight, 3 6 8 (?)
  • pdf: /pdf/2402.17257
  • html: /html/2402.17257v3
  • GitHub:/CJReinforce/RIME_ICML2024

0 abstract

Preference-based Reinforcement Learning (PbRL) circumvents the need for reward engineering by harnessing human preferences as the reward signal. However, current PbRL methods excessively depend on high-quality feedback from domain experts, which results in a lack of robustness. In this paper, we present RIME, a robust PbRL algorithm for effective reward learning from noisy preferences. Our method utilizes a sample selection-based discriminator to dynamically filter out noise and ensure robust training. To counteract the cumulative error stemming from incorrect selection, we suggest a warm start for the reward model, which additionally bridges the performance gap during the transition from pre-training to online training in PbRL. Our experiments on robotic manipulation and locomotion tasks demonstrate that RIME significantly enhances the robustness of the state-of-the-art PbRL method. Code is available at /CJReinforce/RIME_ICML2024.

  • BACKGROUND AND GAP: Preference-based reinforcement learning (PbRL) circumvents the need for reward engineering by utilizing human preferences as reward signals. However, current PbRL methods are overly reliant on high-quality feedback from experts, resulting in a lack of robustness.
  • METHOD: In this paper, we introduce RIME, a robust PbRL algorithm for efficient reward learning from noisy preferences.
    • 1 Utilizes a sample-selection-based discriminator that dynamically filters out noise to ensure robust training.
    • 2 To counteract the cumulative error from incorrect selection (?), a warm start for the reward model is proposed, which also bridges the pre-training → online training performance gap in PbRL.
  • Experiments: experiments on robotic manipulation (Meta-World) and locomotion tasks (DMControl) show that RIME significantly enhances the robustness of the state-of-the-art PbRL method (meaning PEBBLE).

1 intro

  • Background: PbRL omits reward engineering, PbRL is good.
  • Gap 1: PbRL assumes that preferences come from experts and are error-free, but humans make mistakes.
  • Gap 2: learning from noisy labels, also known as robust training.
    • Song et al. (2022) classified robust training methods into four key categories: robust architecture (Cheng et al., 2020), robust regularization (Xia et al., 2020), robust loss design (Lyu & Tsang, 2019), and sample selection (Li et al., 2020; Song et al., 2021).
    • However, integrating them into PbRL is difficult, seemingly because (1) they require a large number of samples, while the amount of feedback in PbRL (on the benchmarks we usually run) is at most in the tens of thousands; and (2) there is distribution shift during RL training, which breaks the i.i.d. assumption on input data that underpins robust-training methods.
  • The paper proposes RIME (Robust preference-based reInforcement learning via warM-start dEnoising discriminator), which, according to the authors, is the first work to study noisy preference labels in PbRL (?).
  • Primary methodology:
    • 1 Use a threshold-based discriminator to find samples that look correct, forming \(\mathcal D_t\), and another threshold to find samples that look wrong, forming \(\mathcal D_f\); flip the labels in \(\mathcal D_f\) and train on \(\mathcal D_t \cup \mathcal D_f\).
    • Specifically, the threshold is on the cross-entropy loss, and there is a theorem behind it that feels intuitive; nice work.
    • 2 Initialize the reward model by pre-training it on the intrinsic reward.
    • Specifically, the intrinsic reward is normalized to (-1, 1) during pre-training, because reward models generally use tanh as the output activation, whose range is (-1, 1).

2 related work

  • PbRL.
  • learning from noisy labels:
    • Restates the discussion from the intro.
    • In the context of PbRL, Xue et al. (2023) proposed an encoder-decoder architecture to model different human preferences, but it needs roughly 100× more preferences than RIME.
  • Policy-to-Value Reincarnating RL(PVRL):
    • Reincarnate: v., to give something a new body or form.
    • PVRL refers to transferring a sub-optimal teacher policy to a value-based student RL agent (Agarwal et al., 2022).
    • Inspiration: Uchendu et al. (2023) found that a randomly initialized Q-network in PVRL causes the teacher policy to be quickly forgotten.
    • Gap: in the widely adopted PbRL pipeline, the PVRL challenge also arises in the transition from pre-training to online training, but has been neglected in previous studies. The issue of forgetting the pre-trained policy becomes more important under noisy feedback, as detailed in Section 4.2.
    • (Pre-training here refers to, e.g., PEBBLE's maximum-entropy policy pre-training.)
    • This motivates the warm start of the reward model.

3 preliminaries

  • PbRL.
  • Unsupervised Pre-training in PbRL: describes PEBBLE-style pre-training.
  • Noisy Preferences in PbRL: describes BPref's scripted teachers that mimic humans; here the error teacher is used.

4 method: RIME

4.1 Denoising discriminator for RIME

  • TL;DR: use the size of the cross-entropy loss of each \((\sigma^0, \sigma^1, p)\) triple to decide whether it is a correct or corrupted sample, and flip \(p\) for all samples judged incorrect.
  • Why use cross-entropy loss to determine correct/incorrect samples?
    • Existing studies have shown that deep neural networks first learn generalizable patterns before overfitting the noise in the data (Arpit et al., 2017; Li et al., 2020).
    • Therefore, prioritizing samples associated with smaller losses as correct samples is a well-founded way to improve robustness. (Not really understood)
  • Recall the relation between cross-entropy and KL divergence: \(\mathcal L^{\mathrm{CE}}(y,P_\psi)=H(y)+D_{\mathrm{KL}}(y\parallel P_\psi)\); for a one-hot label \(H(y)=0\), so the CE loss equals the KL divergence.
  • How to determine the threshold for cross-entropy loss?
    • Theorem 4.1: assume the cross-entropy loss on clean data is bounded by \(\rho\), i.e. \(\mathcal L^\text{CE}(x)\le\rho\); then for a corrupted sample \(x\), the KL divergence between the predicted preference \(P_\psi(x)\) and the flipped label \(\tilde y(x)=1-y(x)\) is lower-bounded: \(D_{\mathrm{KL}}(\tilde{y}(x)\parallel P_{\psi}(x))\geq-\ln\rho+\frac{\rho}{2}+O(\rho^{2})\).
    • From this, a lower-bound threshold on the KL divergence, \(\tau_\text{base}=-\ln \rho+\alpha\rho\), is used to filter out untrustworthy samples. Here \(\rho\) denotes the maximum cross-entropy loss over the trusted samples observed during the last update, and \(\alpha\in(0,0.5]\) is a tunable hyperparameter.
    • There is also the distribution-shift problem to consider. To increase the tolerance for clean samples under distribution shift, an auxiliary term \(\tau_\text{unc}=\beta_t\cdot s_\mathrm{KL}\) is introduced to characterize the filtering uncertainty, where \(\beta_t=\max(\beta_{\min},\beta_{\max}-kt)\) is a time-decaying parameter (\(\beta_{\max}=3\), \(\beta_{\min}=1\)), and \(s_\mathrm{KL}\) is the standard deviation of the KL divergences (which appear to be \(D_{\mathrm{KL}}(\tilde{y}(x)\parallel P_{\psi}(x))\)) over the batch. The intuition is that training on OOD data may cause the CE loss to fluctuate (I don't fully understand this).
  • The trusted-sample dataset: \(D_t=\{(\sigma^0,\sigma^1,\tilde{y}) \mid D_{\mathrm{KL}}(\tilde{y}\parallel P_\psi(\sigma^0,\sigma^1))<\tau_{\mathrm{lower}}\}\), where \(\tau_{\mathrm{lower}}=\tau_{\mathrm{base}}+\tau_{\mathrm{unc}}=-\ln\rho+\alpha\rho+\beta_{t}\cdot s_{\mathrm{KL}}\).
  • The untrusted-sample dataset: \(D_f=\{(\sigma^0,\sigma^1,\tilde{y}) \mid D_{\mathrm{KL}}(\tilde{y}\parallel P_\psi(\sigma^0,\sigma^1))>\tau_{\mathrm{upper}}\}\), where \(\tau_{\mathrm{upper}}\) appears to be a predefined constant, \(3\ln(10)\). The labels in \(D_f\) are then flipped, the flipped \(D_f\) is merged with \(D_t\), and the union is used to train the reward model (see the sketch after this list).
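
Where does the Theorem 4.1 bound come from? A back-of-the-envelope reading (my own sketch, not the paper's proof): for a one-hot label, \(D_{\mathrm{KL}}(\tilde y\parallel P_\psi)=-\ln P_\psi(\tilde y)\). If the model fits clean data well enough that it assigns the true label \(y\) of a corrupted sample probability \(P_\psi(y)\ge e^{-\rho}\), then the flipped label gets \(P_\psi(\tilde y)=1-P_\psi(y)\le 1-e^{-\rho}\), so \(D_{\mathrm{KL}}(\tilde y\parallel P_\psi)\ge-\ln(1-e^{-\rho})=-\ln\rho+\frac{\rho}{2}+O(\rho^2)\) after a Taylor expansion, which matches the stated bound.

And a minimal sketch of the whole filtering step as I understand it (my own reconstruction, not the official repo; the argument names, tensor shapes, and the way \(\rho\) is refreshed are assumptions):

```python
import math
import torch

def denoise_split(probs, labels, rho, alpha=0.5, beta_t=3.0, tau_upper=3 * math.log(10)):
    """Split a batch of preference triples into trusted and to-be-flipped samples.

    probs  : (N, 2) predicted preference distribution P_psi(sigma^0, sigma^1)
    labels : (N, 2) possibly-noisy preference labels y~ (rows sum to 1)
    rho    : max loss over trusted samples from the previous update (assumed finite,
             i.e. this is not the very first update)
    """
    eps = 1e-8
    # For (near-)one-hot labels H(y~) ~ 0, so the CE loss equals KL(y~ || P_psi).
    kl = torch.sum(labels * (torch.log(labels + eps) - torch.log(probs + eps)), dim=-1)

    # tau_lower = -ln(rho) + alpha * rho   (bound from Theorem 4.1)
    #             + beta_t * std(KL)       (uncertainty term against distribution shift)
    tau_lower = -math.log(rho) + alpha * rho + beta_t * kl.std().item()

    trusted = kl < tau_lower      # D_t: keep the label as-is
    to_flip = kl > tau_upper      # D_f: flip the label before training

    fixed_labels = labels.clone()
    fixed_labels[to_flip] = 1.0 - labels[to_flip]

    keep = trusted | to_flip      # train the reward model on D_t ∪ flipped D_f
    new_rho = kl[trusted].max().item() if trusted.any() else rho
    return keep, fixed_labels, new_rho
```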

4.2 Warm start for reward model

  • TL;DR: pre-train the reward model with the intrinsic reward.
  • Observation:
    • A significant performance drop is observed during the transition from pre-training to online training (see Fig. 2). Under noisy feedback, this gap is clearly observable and fatal to robustness.
    • After pre-training, PEBBLE resets the Q-network, retaining only the pre-trained policy. Since the Q-network then learns from TD errors under a reward model fit to noisy feedback, the biased Q-function leads to a poorly learned policy, which erases the gains made during pre-training.
  • The warm start of the reward model:
    • Specifically, the reward model is first trained on the intrinsic reward during the pre-training phase.
    • Since the output layer of the reward model typically uses a tanh activation (Lee et al., 2021b), the intrinsic reward is first normalized into (-1, 1), using the mean \(\hat r\) and standard deviation \(\sigma_r\) of the intrinsic rewards collected so far: \(r_{\mathrm{norm}}^{\mathrm{int}}(\mathbf{s}_t)=\mathrm{clip}\left(\frac{r^{\mathrm{int}}(\mathbf{s}_t)-\hat r}{3\sigma_r},-1+\delta,1-\delta\right)\)
    • The data for pre-training the reward model appears to be transitions \((s_t,a_t,r_{\mathrm{norm}}^{\mathrm{int}},s_{t+1})\) rather than preference segments. (There is something about a nearest-neighbor term that I didn't quite get.) A sketch of this warm-start update follows below.
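
A minimal sketch of one warm-start update, assuming a reward model with a tanh output head (my reconstruction; the `reward_model(states, actions)` call signature, the optimizer handling, and the default \(\delta\) are assumptions):

```python
import torch
import torch.nn.functional as F

def warm_start_step(reward_model, optimizer, states, actions, r_int, delta=1e-6):
    """One warm-start update: regress the reward model onto normalized intrinsic reward.

    reward_model : network with a tanh output head, so predictions lie in (-1, 1)
    r_int        : (N,) intrinsic rewards (e.g. PEBBLE's state-entropy reward)
    delta        : small margin keeping targets strictly inside tanh's range
    """
    # Normalize the intrinsic reward into (-1, 1) to match the tanh output range.
    mean, std = r_int.mean(), r_int.std()
    r_norm = torch.clamp((r_int - mean) / (3 * std), -1 + delta, 1 - delta)

    pred = reward_model(states, actions).squeeze(-1)   # \hat r(s_t, a_t)
    loss = F.mse_loss(pred, r_norm)                    # plain MSE, no preference segments

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```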

4.3 Overall Algorithm Flow

Pseudo-code was put in Appendix A. It's so civilized to put pseudo-code in Appendix A.

Key Points:

  • Pre-training with the reward model's warm start:
    • In line 5, the collected intrinsic reward is normalized.
    • In line 10, the reward model is trained with the MSE between its output and \(r_{\mathrm{norm}}^{\mathrm{int}}\), not with preference segments.
  • Denoising discriminator for mislabeled preferences:
    • In line 13, initialize ρ to positive infinity.
    • In line 19, compute the threshold \(\tau_{\mathrm{lower}}\) for identifying trusted samples.
    • In line 24, the union of the trusted samples and the flipped wrong samples is used to compute the new \(\rho\), the quantity that enters the lower bound on the KL divergence. A sketch of this threshold bookkeeping follows below.
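
A tiny sketch of how I read the \(\rho\)/threshold bookkeeping across updates (my interpretation of lines 13, 19, and 24; the decay rate `k` and the treatment of \(\rho=+\infty\) are assumptions):

```python
import math

def compute_tau_lower(rho, s_kl, t, alpha=0.5, beta_max=3.0, beta_min=1.0, k=1e-5):
    """Trust threshold tau_lower = -ln(rho) + alpha * rho + beta_t * s_KL.

    rho  : max loss over trusted samples from the previous update; initialized to
           +inf (line 13), which I read as "trust everything at first", since
           -ln(rho) + alpha * rho -> +inf as rho -> +inf.
    s_kl : standard deviation of the per-sample KL divergences
    t    : current training step, driving beta_t = max(beta_min, beta_max - k * t)
    k    : decay rate (task-dependent; the value here is a placeholder)
    """
    if math.isinf(rho):
        return math.inf
    beta_t = max(beta_min, beta_max - k * t)
    return -math.log(rho) + alpha * rho + beta_t * s_kl
```

This just factors out the threshold computation that `denoise_split` above does inline, to make the decaying schedule and the \(\rho=+\infty\) initialization explicit.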

5 experiments

  • Setting: same as PEBBLE; three DMControl tasks + three Meta-World tasks.
  • Baselines: PEBBLE, SURF, RUNE, MRN (I haven't read MRN yet).
  • The error rate (i.e., the probability of randomly picking a \((\sigma^0,\sigma^1,p)\) triple and flipping \(p\)) is between 0.1 and 0.3 (a small sketch of this noise model is at the end of this post).
  • Massive ablation:
    • In Appendix D.3, a wider variety of noisy teachers are attempted, and the table in the main text compares the average of the various types of noisy teachers.
    • Comparison with other robust-training methods: adaptive denoising training (ADT) (Wang et al., 2021), i.e., discarding a fixed fraction of the samples with the largest CE loss, which seems to work fairly well; replacing the CE loss with MAE or t-CE; and applying label smoothing (LS) to all preference labels.
    • There is also a real-human experiment, see Appendix D.4. The total feedback and the feedback per session are 100 and 10, respectively; the task is Hopper backflip (is it really that easy to learn? Is Hopper backflip a task where you can keep backflipping as long as you push the controls to the limit?). The screenshot shows OpenAI Gym rather than DMControl, though.
    • Increasing the total number of feedbacks can effectively improve performance.
  • Are the individual modules effective? When the amount of feedback is quite limited (e.g., on Walker-walk), the warm start is critical for robustness and saves on the number of queries.
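
For concreteness, a tiny sketch of the error-rate noise model described above (my own illustration of the scripted teacher, not BPref's actual implementation):

```python
import torch

def corrupt_preferences(labels, epsilon, generator=None):
    """Flip each preference label with probability epsilon (the error rate above).

    labels  : (N, 2) clean preference labels over (sigma^0, sigma^1), e.g. one-hot rows
    epsilon : error rate, 0.1-0.3 in the experiments above
    """
    flip = torch.rand(labels.shape[0], generator=generator) < epsilon
    noisy = labels.clone()
    noisy[flip] = 1.0 - labels[flip]   # (1, 0) <-> (0, 1)
    return noisy
```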