PrefPPO first (?) appeared in PEBBLE as a baseline: a reproduction of Christiano et al.'s (2017) PbRL algorithm using PPO.
For evaluation, we compare to Christiano et al. (2017), which is the current state-of-the-art approach using the same type of feedback. The primary differences in our method are (1) the introduction of unsupervised pre-training, (2) the accommodation of off-policy RL, and (3) entropy-based sampling. We re-implemented Christiano et al. (2017) using the state-of-the-art on-policy RL algorithm: PPO (Schulman et al., 2017). We use the same reward learning framework and ensemble disagreement-based sampling as they proposed. We refer to this baseline as Preference PPO.
Christiano et al. (2017): Deep reinforcement learning from human preferences. NeurIPS 2017; arXiv: /abs/1706.03741; GitHub: /mrahtz/learning-from-human-preferences (a TensorFlow implementation).
01 Reading the paper: Deep reinforcement learning from human preferences
1.1 Intro
Intro:
- A limitation of applying RL at scale is that many tasks involve complex, ill-defined, or hard-to-specify objectives. (Some examples are given.)
- Inverse RL or behavior cloning can be used when human demonstrations (expert data) are available, but for many tasks it is hard to give human demos, e.g., controlling oddly shaped robots that are very unlike humans.
- Our PbRL idea: learn a reward model from human feedback and use it as the reward function for RL. This solves the two problems above, lets non-expert users give feedback, and (according to the paper) cuts the amount of feedback needed by an order of magnitude.
- Form of human feedback: a human watches two video clips and indicates which one is better (i.e., a preference).
Related Work:
- A number of prior works doing RL from human ratings or rankings are listed. There is also some work using preferences in settings outside of RL.
- Akrour's work from 2012 and 2014 also seems to count as PbRL, but their approach labels preferences over entire trajectories, unlike ours, which only needs to compare two short video segments.
- Akrour 2012: APRIL: Active preference learning-based reinforcement learning. Joint European Conference on Machine Learning and Knowledge Discovery in Databases (i.e., ECML PKDD; I'm not familiar with this venue), 2012.
- Akrour 2014: Programming by feedback. ICML 2014.
- This paper's approach of learning a reward model from feedback seems more similar to the work of Wilson et al. (2012).
- Wilson et al. (2012): A Bayesian approach for policy learning from trajectory preference queries. NeurIPS 2012.
1.2 Method
First, the PbRL setting is introduced:
- Define a segment \(\sigma = ((s_0, a_0), \dots, (s_{k-1}, a_{k-1}))\), a trajectory segment of length k.
- Define a preference \(\sigma_0 \succ \sigma_1\), meaning that trajectory segment \(\sigma_0\) is better than \(\sigma_1\). These preference data are then used to learn the reward model.
Then the method (I noticed this paper doesn't seem to give pseudo-code):
- We maintain a reward model \(\hat r\) and a policy \(\pi\).
- Roughly, the following process is repeated (a sketch follows this list):
- Generate queries \((\sigma_0, \sigma_1)\) with the current policy \(\pi\);
- Show each query to a human to obtain preference data;
- Learn the reward model with the new preference dataset;
- Treat the reward model as the reward function, run RL, and update the policy \(\pi\).
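Since the paper gives no pseudo-code, here is a minimal Python sketch of the loop as I understand it. All names (collect_segments, select_queries, ask_human, update_reward_model, rl_update) are hypothetical placeholders passed in as callables, not the paper's API.

```python
def pbrl_loop(collect_segments, select_queries, ask_human,
              update_reward_model, rl_update, num_iterations):
    """Minimal sketch of the PbRL loop; every argument is a user-supplied callable
    and every name is a placeholder, not the paper's actual interface."""
    preference_dataset = []                          # stores (sigma_0, sigma_1, p) triples
    for _ in range(num_iterations):
        # 1) generate queries (sigma_0, sigma_1) from trajectories of the current policy
        segments = collect_segments()                # cut trajectories into length-k segments
        queries = select_queries(segments)           # e.g. disagreement-based selection (see below)

        # 2) show each query to the human to obtain preference data
        for sigma_0, sigma_1 in queries:
            p = ask_human(sigma_0, sigma_1)          # a distribution over {0, 1}, or None
            if p is not None:                        # incomparable queries are discarded
                preference_dataset.append((sigma_0, sigma_1, p))

        # 3) fit the reward model r_hat on the preference dataset (cross-entropy loss, Eq. (2))
        update_reward_model(preference_dataset)

        # 4) treat r_hat as the reward function and run RL to update the policy
        #    (A2C for Atari, TRPO for MuJoCo in the paper)
        rl_update()
    return preference_dataset
```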
① Treat the reward model as the reward function, run RL, and update the policy \(\pi\):
- The paper says \(\hat r\) may be non-stationary (because the reward model keeps being updated), so they choose policy gradient methods, which are robust to changes in the reward function.
- They used A2C (advantage actor-critic) for Atari and TRPO for MuJoCo, and adjusted TRPO's entropy bonus: the swimmer task uses 0.001 and the other MuJoCo tasks use 0.01.
- The \(\hat r\) values produced by the reward model are normalized to zero mean and unit standard deviation.
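As a tiny illustration of the normalization step, a sketch that simply standardizes a batch of predicted rewards; whether the paper uses per-batch or running statistics is my assumption.

```python
import numpy as np

def normalize_rewards(r_hat_values, eps=1e-8):
    """Standardize predicted rewards to zero mean and unit standard deviation.
    Per-batch statistics are an assumption of this sketch."""
    r = np.asarray(r_hat_values, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```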
② Generate queries \((\sigma_0, \sigma_1)\) with the current policy \(\pi\):
- The preference data take the form \((\sigma_0, \sigma_1, p)\), where p is a distribution over {0, 1}.
- If the human can state a preference, then p puts all its mass on 0 or 1. If the human thinks the two segments are equally good, then p is uniform over {0, 1}. If the human finds the query incomparable, it is not used to learn the reward model.
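For concreteness, a small sketch of how a human answer could be mapped to the target distribution p. The answer strings are made up for this sketch; only the uniform (0.5, 0.5) encoding for "equally good" and the discarding of incomparable queries come from the description above.

```python
def encode_human_answer(answer):
    """Map a (hypothetical) human answer to the target distribution p over {0, 1},
    or None for incomparable queries, which are simply discarded."""
    if answer == "left_better":       # sigma_0 preferred: all mass on 0
        return (1.0, 0.0)
    if answer == "right_better":      # sigma_1 preferred: all mass on 1
        return (0.0, 1.0)
    if answer == "equally_good":      # human cannot tell: uniform over {0, 1}
        return (0.5, 0.5)
    return None                       # "incomparable": do not use this query
```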
③ Learn the reward model with the new preference dataset:
- The Bradley-Terry model is used to connect \(\hat r\) and preferences:
\[\hat P[\sigma_0\succ \sigma_1] = \frac{\exp\sum_t\hat r(s_t^0,a_t^0)} {\exp\sum_t\hat r(s_t^0,a_t^0) + \exp\sum_t\hat r(s_t^1,a_t^1)} ~~. \tag{1} \]
- Then the cross-entropy loss is optimized:
\[L(\hat r) = -\sum_{(\sigma_0,\sigma_1,p)} \left( p(0)\log \hat P[\sigma_0\succ \sigma_1] + p(1)\log \hat P[\sigma_1\succ \sigma_0]\right) ~~. \tag{2} \]
- (This procedure has since become the classic PbRL recipe.)
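A minimal PyTorch sketch of equations (1) and (2), assuming the reward model outputs per-step rewards of shape (batch, k); the shapes and batching are my convention, not the reproduction code's.

```python
import torch

def bt_preference_loss(reward_model, sigma_0, sigma_1, p):
    """Cross-entropy loss of Eq. (2) under the Bradley-Terry model of Eq. (1).

    Assumed shapes (my convention):
      sigma_0, sigma_1: (batch, k, obs_dim + act_dim)   - the two segments
      p:                (batch, 2)                      - target distribution over {0, 1}
      reward_model(segment) -> (batch, k) per-step rewards
    """
    # sum the predicted per-step rewards over each segment
    ret_0 = reward_model(sigma_0).sum(dim=1)       # (batch,)
    ret_1 = reward_model(sigma_1).sum(dim=1)       # (batch,)

    # Eq. (1): P_hat[sigma_0 > sigma_1] is a softmax over the two summed rewards
    logits = torch.stack([ret_0, ret_1], dim=1)    # (batch, 2)
    log_probs = torch.log_softmax(logits, dim=1)   # [log P_hat[s0 > s1], log P_hat[s1 > s0]]

    # Eq. (2): cross-entropy between the target p and the predicted preference probability
    return -(p * log_probs).sum(dim=1).mean()
```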
They also add three small tricks:
- Ensemble the reward model: learn n independent reward models, normalize each one's output individually, and average them to get the value of \(\hat r\).
- Hold out a portion of the preference data as a validation set, used to tune the weight of an L2 regularization term and to add dropout, in order to avoid overfitting the reward model (?).
- Label smoothing (?): when a human labels p = 0, the target probability of p = 0 is taken to be 0.95 and that of p = 1 to be 0.05.
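A sketch of the first and third tricks under stated assumptions: the exact form of the label smoothing and the statistics used to normalize each ensemble member are my guesses.

```python
import numpy as np

def smooth_label(p, smoothing=0.1):
    """Label smoothing (?): a hard target like (1, 0) becomes (0.95, 0.05), as if the
    human answered uniformly at random 10% of the time. The exact form is assumed."""
    p = np.asarray(p, dtype=np.float64)
    return (1.0 - smoothing) * p + smoothing * 0.5

class RewardEnsemble:
    """n independently trained reward models; r_hat is the average of their
    individually normalized outputs (per-batch normalization is an assumption)."""

    def __init__(self, members):
        self.members = members       # callables: (states, actions) -> per-step rewards

    def r_hat(self, states, actions, eps=1e-8):
        normalized = []
        for member in self.members:
            r = np.asarray(member(states, actions), dtype=np.float64)
            normalized.append((r - r.mean()) / (r.std() + eps))   # normalize each member separately
        return np.mean(normalized, axis=0)                        # then average over the ensemble
```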
Query selection:
- That is, we now have many trajectories from which we cut segments, form segment pairs, and send them as queries for comparison. How should we choose which segment pairs become queries?
- Disagreement-based query selection is used here: each reward model in the ensemble gives its own \(\hat P[\sigma_0\succ \sigma_1]\), the variance of these values is computed, and the queries with the highest variance are picked.
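A sketch of disagreement-based selection as I read it; the function signatures are assumptions.

```python
import numpy as np

def select_queries_by_disagreement(candidate_pairs, ensemble_members, num_queries):
    """Pick the segment pairs the ensemble disagrees on most.

    candidate_pairs:  list of (sigma_0, sigma_1) segment pairs
    ensemble_members: callables mapping (sigma_0, sigma_1) -> P_hat[sigma_0 > sigma_1]
    (both signatures are assumptions for this sketch)
    """
    variances = []
    for sigma_0, sigma_1 in candidate_pairs:
        # each reward model gives its own preference probability for this pair
        probs = [member(sigma_0, sigma_1) for member in ensemble_members]
        variances.append(np.var(probs))
    # keep the pairs with the largest variance, i.e. the most disagreement
    top_idx = np.argsort(variances)[-num_queries:]
    return [candidate_pairs[i] for i in top_idx]
```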
1.3 Experimental results
The algorithm was implemented in TensorFlow. GitHub: /mrahtz/learning-from-human-preferences (this doesn't seem to be the official code...)
- Preferences: some are labeled by humans, others by a scripted teacher.
- Appendix B has the prompts given to human labelers, which is interesting.
- Scripted teacher: plug the task's true reward into equation (1) in place of \(\hat r\) and generate preferences from it "in reverse" (a small sketch of this is given after this list).
- The MuJoCo experiments compare: the real reward; 1400 / 700 / 350 scripted-teacher queries; and 750 human queries.
- (My understanding is that the 750 human queries yield fewer than 750 labels, since humans simply discard any query they find incomparable.)
- In the MuJoCo experiments, many tasks take 1e7 steps, which is very slow compared with PEBBLE; PEBBLE learns within 1e6 steps.
- The Atari experiments compare: the real reward; 10k / 5.6k / 3.3k scripted-teacher queries; and 5.5k human queries (that is a lot of human labels... it must have taken a long time; such an extensive experiment, solid work).
- These experiments did not show that human labels beat the scripted teacher; the paper suggests this may be because humans make mistakes and different humans use different labeling criteria.
- Some details of the experiments in Appendix A:
- Some environments end the episode early, e.g., when the agent falls over, dies, or reaches its destination (?). They argue this done signal could leak task information, so they modified the environments to avoid variable episode lengths (?): the agent receives a large penalty when it dies, or the environment is simply reset when the agent reaches its destination instead of terminating (?).
- In MuJoCo, 25% of queries are labeled before the experiment starts, on segments from a random policy; after that, preferences are labeled at a rate that decreases over time. Segment length = 15-60, depending on the task.
- In Atari, the reward model and the policy take the same input and use a convolutional network to process the images (as an aside, I've heard that a main contribution of DPO is removing the reward model from LLM RLHF; since the reward model is about the same size as the policy, removing it saves a lot of GPU memory).
- Atari actually runs A3C, with 500 queries labeled before the experiment starts and segment length = 25.
- Section 3.2 also learns novel behaviors, such as a hopper backflip in MuJoCo; according to the paper, the learned behavior has the hopper land on its feet after the backflip. This was learned from 900 human queries; Section 3.2 shows other novel behaviors as well.
- Section 3.3 presents a very thorough ablation study: segment length = 1 seems to hurt performance, the reward model does not work well without regularization, switching query selection to random does not seem to matter much, and it is better to keep labeling preferences on the newest trajectories while the policy is being updated.
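Going back to the scripted teacher above: a minimal sketch of generating a synthetic preference by plugging the true task reward into equation (1). Whether the label is sampled from that probability (as done here) or assigned deterministically to the higher-return segment is my assumption.

```python
import numpy as np

def scripted_teacher_label(true_rewards_0, true_rewards_1, rng=None):
    """Synthetic preference from the true task reward via Eq. (1), with the true
    return in place of r_hat. Sampling (rather than argmax) is an assumption."""
    rng = np.random.default_rng() if rng is None else rng
    ret_0, ret_1 = float(np.sum(true_rewards_0)), float(np.sum(true_rewards_1))
    # P[sigma_0 > sigma_1] from Eq. (1), written as a numerically stable sigmoid
    p0 = 1.0 / (1.0 + np.exp(ret_1 - ret_0))
    return (1.0, 0.0) if rng.random() < p0 else (0.0, 1.0)
```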
02 PrefPPO Implementation in PEBBLE
The PrefPPO implementation in PEBBLE directly modifies stable_baselines3's PPO; they wrote a new class called PPO_REWARD that encapsulates all interaction with the reward model inside its () function.
2.1 How the reward model is constructed
Same as PEBBLE's reward model.
If the state and action are continuous (as in the usual cheetah / walker tasks), the state and action are concatenated as the input to the reward model.
If the state is an image and the action is discrete (like the Pong environment in Christiano et al. 2017), then (following the reproduction code of Christiano 2017) ...... it seems to compute the reward directly from the state image, without concatenating a one-hot or numerical action.
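A sketch of the two input conventions just described, assuming a small MLP for the continuous case and a small CNN for the image case; the architectures and sizes are assumptions, not PEBBLE's exact network.

```python
import torch
import torch.nn as nn

class ContinuousRewardModel(nn.Module):
    """Continuous state/action: concatenate (s, a) and map to a scalar reward.
    Hidden sizes are assumptions, not PEBBLE's exact architecture."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

class ImageRewardModel(nn.Module):
    """Image state, discrete action: as noted above, the reward is computed from
    the image alone, without concatenating the action."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(1),    # scalar reward from the image features
        )

    def forward(self, obs, act=None):    # the action argument is ignored on purpose
        return self.net(obs)
```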
2.2 PPO_REWARD's ()
The general flow of PPO: collect a rollout → compute advantages etc. for the rollout → compute the loss and backprop → collect a new rollout...
While collecting the rollout, PrefPPO replaces the real task reward with \(\hat r\) and adds the rollout data to the pool of query candidates.
Before and after collecting the rollout there seems to be a call to a learn_reward() function that trains the reward model: it first collects queries and then uses them to update the reward model, basically the same as in PEBBLE.
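Finally, a rough sketch of the flow just described; all names (collect_rollout, select_queries, ask_teacher, learn_reward, ppo_update, r_hat) are hypothetical placeholders, not PEBBLE's actual PPO_REWARD API.

```python
def prefppo_train(collect_rollout, select_queries, ask_teacher,
                  learn_reward, ppo_update, reward_model, total_iterations):
    """Rough sketch of PrefPPO's training flow; every argument is a user-supplied
    callable and every name is a placeholder, not PEBBLE's real interface."""
    preference_dataset = []
    query_pool = []                                  # candidate segments for future queries
    for _ in range(total_iterations):
        # around rollout collection: pick queries from the pool, ask the (human or
        # scripted) teacher, and fit the reward model on the preference dataset
        queries = select_queries(query_pool)
        preference_dataset += ask_teacher(queries)
        learn_reward(reward_model, preference_dataset)

        # collect a rollout, but store r_hat in place of the true task reward,
        # and add the new trajectories to the query pool
        rollout = collect_rollout()
        rollout["rewards"] = reward_model.r_hat(rollout["observations"], rollout["actions"])
        query_pool.append(rollout)

        # standard PPO update on the relabeled rollout
        ppo_update(rollout)
```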