
Paper Reading & Translation: Deep Reinforcement Learning from Human Preferences

About

  • First published: 2024-09-11
  • Link to the original paper: /abs/1706.03741
  • First submitted to arXiv: 12 Jun 2017
  • Machine-translated using KIMI, Doubao, and ChatGPT, then manually polished
  • Please do not hesitate to point out any errors

Deep reinforcement learning from human preferences

Abstract

For complex reinforcement learning (RL) systems to interact effectively with real-world environments, we need to communicate complex goals to these systems. In this work, we explore defining goals in terms of (non-expert) human preferences for trajectory segment pairs. We show that this approach can efficiently solve complex RL tasks, including Atari games and simulated robot motion, without a reward function, while only needing to provide feedback on less than 1% of the agent's interactions with the environment. This significantly reduces the cost of human supervision, allowing for practical application to state-of-the-art reinforcement learning systems. To demonstrate the flexibility of our approach, we show that complex new behaviors can be successfully trained in about an hour of human participation time. These behaviors and environments are much more complex than anything previously learned from human feedback.

1 Introduction

Recent success in extending reinforcement learning (RL) to large-scale problems has been largely due to domains that have well-defined reward functions (Mnih et al., 2015, 2016; Silver et al., 2016). Unfortunately, many tasks have objectives that are complex, ill-defined, or difficult to specify explicitly. Overcoming this limitation would greatly expand the potential impact of deep reinforcement learning and may further expand the range of machine learning applications.

For example, suppose we want to use reinforcement learning to train a robot to clean a table or scramble an egg. It is not clear how to construct a suitable reward function from the robot's sensor data. We could try to design a simple reward function that roughly captures the intended behavior, but this often results in behavior that optimizes our reward function without actually matching our preferences. This difficulty underlies recent concerns about the misalignment between our values and the objectives of reinforcement learning systems (Bostrom, 2014; Russell, 2016; Amodei et al., 2016). If we could successfully communicate our actual objectives to our agents, it would be a significant step towards addressing these concerns.

If we have demonstrations of the desired task, we can extract a reward function through inverse reinforcement learning (Ng and Russell, 2000) and then use that reward function to train an agent via reinforcement learning. A more direct approach is to use imitation learning to clone the demonstrated behavior. However, these methods are not applicable to behaviors that are difficult for humans to demonstrate (e.g., controlling a robot with many degrees of freedom whose morphology is very different from a human's).

Another approach is to allow humans to provide feedback on the current behavior of the system and use that feedback to define the task. In principle, this fits the reinforcement learning paradigm, but directly using human feedback as a reward function is too costly for reinforcement learning systems that require hundreds or thousands of hours of experience. In order to be able to actually train deep reinforcement learning systems based on human feedback, we need to reduce the amount of feedback required by several orders of magnitude.

Our approach is to learn a reward function from human feedback and then optimize that reward function. This basic approach has been considered before, but we face the challenge of extending it to modern deep reinforcement learning and demonstrate the most complex behavior learned from human feedback to date.

In summary, we would like to find a solution to the sequential decision-making problem without an explicitly specified reward function, which should satisfy the following conditions:

  1. enables us to solve tasks for which we can only recognize the desired behavior, but not necessarily demonstrate it;
  2. allows non-expert users to teach agents;
  3. scales to large problems; and
  4. is economical with user feedback.

Our algorithm fits a reward function to human preferences while simultaneously training a policy to optimize the currently predicted reward function (see Figure 1). We ask humans to compare short video clips of the agent's behavior rather than to provide absolute numerical scores. We found that comparisons are easier for humans in some domains, while being equally useful for learning human preferences. Comparing short video clips is nearly as fast as comparing individual states, but we show that the resulting comparisons are significantly more helpful. Moreover, we show that collecting feedback online improves the system's performance and prevents it from exploiting weaknesses of the learned reward function.

Our experiments take place in two domains: Atari games in the Arcade Learning Environment (Bellemare et al., 2013), and robotics tasks in the physics simulator MuJoCo (Todorov et al., 2012). We show that a small amount of feedback from a non-expert human, ranging from fifteen minutes to five hours, suffices to learn most of the original reinforcement learning tasks even when the reward function is not observable. We then consider some novel behaviors in each domain, such as performing a backflip or driving with the flow of traffic. We show that our algorithm can learn these behaviors from about an hour of feedback, even when it is difficult to incentivize these behaviors by hand-designing a reward function.

1.1 Related Work

A long line of work has studied reinforcement learning from human ratings or rankings, including Akrour et al. (2011), Pilarski et al. (2011), Akrour et al. (2012), Wilson et al. (2012), Sugiyama et al. (2012), Wirth and Fürnkranz (2013), Daniel et al. (2015), El Asri et al. (2016), Wang et al. (2016), and Wirth et al. (2016). Other studies consider reinforcement learning from preferences rather than absolute reward values (Fürnkranz et al., 2012; Akrour et al., 2014), as well as optimization from human preferences in settings other than reinforcement learning (Machwe and Parmee, 2006; Secretan et al., 2008; Brochu et al., 2010; Sørensen et al., 2016).

Our algorithm follows the same basic approach as Akrour et al. (2012) and Akrour et al. (2014). They consider continuous domains with four degrees of freedom and small discrete domains, where they can assume that the reward is linear in the expectations of hand-coded features. We instead consider physics tasks with dozens of degrees of freedom and Atari tasks with no hand-engineered features; the complexity of our environments forces us to use different reinforcement learning algorithms and reward models, and to cope with different algorithmic tradeoffs. One notable difference is that Akrour et al. (2012) and Akrour et al. (2014) elicit preferences over whole trajectories rather than short segments. So although we gather about two orders of magnitude more comparisons, our experiments require less than one order of magnitude more human time. Other differences lie mainly in adapting our training procedure to nonlinear reward models and modern deep reinforcement learning, for example by using asynchronous training and ensembling.

Our approach to feedback elicitation is closely related to that of Wilson et al. (2012). However, Wilson et al. (2012) assume that the reward function is the distance to some unknown "target" policy (which is itself a linear function of hand-coded features). They fit this reward function using Bayesian inference, and rather than performing reinforcement learning they generate trajectories from the maximum a posteriori (MAP) estimate of the target policy. Their experiments involve "synthetic" human feedback drawn from their Bayesian model, whereas we run experiments with feedback gathered from non-expert users. It is not clear whether the approach of Wilson et al. (2012) can be extended to complex tasks or handle real human feedback.

MacGlashan et al. (2017), Pilarski et al. (2011), Knox and Stone (2009), and Knox (2012) performed experiments involving reinforcement learning from actual human feedback, although their algorithmic approaches are less similar to ours. In MacGlashan et al. (2017) and Pilarski et al. (2011), learning only occurs during episodes in which the human trainer provides feedback. This appears infeasible in domains like Atari games, where learning a high-quality policy requires thousands of hours of experience, and would be prohibitively expensive even for the simplest tasks we consider. TAMER (Knox, 2012; Knox and Stone, 2013) also learns a reward function, but considers simpler settings in which the desired policy can be learned relatively quickly.

Our work can also be seen as a specific instance of the cooperative inverse reinforcement learning framework (Hadfield-Menell et al., 2016). This framework considers a two-player game in which humans and robots interact in an environment with the goal of maximizing the human reward function. In our setting, humans can only interact with this game by expressing their preferences.

Compared to all previous work, our key contribution is to extend human feedback to deep reinforcement learning and learn more complex behaviors. This is in line with recent trends in extending reward learning methods to large deep learning systems, such as inverse reinforcement learning (Finn et al., 2016), imitation learning (Ho and Ermon, 2016; Stadie et al., 2017), semi-supervised skill generalization (Finn et al., 2017), and bootstrapping reinforcement learning from demonstrations (Silver et al. 2016; Hester et al. 2017).

2 Preliminaries and Methods

2.1 Setting and Goal

We consider an agent interacting with an environment over a sequence of steps: at each time \(t\), the agent receives an observation \(o_t \in \mathcal{O}\) from the environment and then sends an action \(a_t \in \mathcal{A}\) back to the environment.

In traditional reinforcement learning, the environment would also supply a reward \(r_t \in \mathbb{R}\), and the agent's goal would be to maximize the discounted sum of rewards. Instead of assuming that the environment produces a reward signal, we assume that a human overseer can express preferences between trajectory segments. A trajectory segment is a sequence of observations and actions, \(\sigma=\left(\left(o_0, a_0\right),\left(o_1, a_1\right), \ldots,\left(o_{k-1}, a_{k-1}\right)\right) \in(\mathcal{O} \times \mathcal{A})^k\). We write \(\sigma^1 \succ \sigma^2\) to indicate that the human preferred trajectory segment \(\sigma^1\) to trajectory segment \(\sigma^2\). Informally, the goal of the agent is to produce trajectories that are preferred by the human, while making as few queries to the human as possible.

More precisely, we will evaluate the behavior of our algorithm in two ways:

Quantitative: We say that the preference relation \(\succ\) is generated by a reward function[1] \(r: \mathcal{O} \times \mathcal{A} \rightarrow \mathbb{R}\) if

\[\left(\left(o_0^1, a_0^1\right), \ldots,\left(o_{k-1}^1, a_{k-1}^1\right)\right) \succ\left(\left(o_0^2, a_0^2\right), \ldots,\left(o_{k-1}^2, a_{k-1}^2\right)\right) \]

whenever

\[r\left(o_0^1, a_0^1\right)+\cdots+r\left(o_{k-1}^1, a_{k-1}^1\right)>r\left(o_0^2, a_0^2\right)+\cdots+r\left(o_{k-1}^2, a_{k-1}^2\right) \]

If the human's preferences are generated by a reward function \(r\), then our agent ought to receive a high total reward according to \(r\). So if we know the reward function \(r\), we can evaluate the agent quantitatively. Ideally, the agent will achieve reward nearly as high as if it had been using reinforcement learning to optimize \(r\) directly.

Qualitative: Sometimes we have no reward function by which we can quantitatively evaluate behavior (this is exactly the situation where our approach is useful in practice). In these cases, all we can do is qualitatively evaluate how well the agent satisfies the human's preferences. In this paper, we will start from a goal expressed in natural language, ask a human to evaluate the agent's behavior based on how well it fulfills that goal, and then present videos of the agent attempting to fulfill that goal.

Our model based on trajectory segment comparisons is very similar to the trajectory preference queries used in Wilson et al. (2012), except that we do not assume that the system can be reset to an arbitrary state[2], and so our segments generally begin from different states. This complicates the interpretation of the human comparisons, but we show that our algorithm overcomes this difficulty even when the human raters have no understanding of our algorithm.

2.2 Our Method

At each point in time, our method maintains a policy \(\pi: \mathcal{O} \rightarrow \mathcal{A}\) and a reward function estimate \(\hat{r}: \mathcal{O} \times \mathcal{A} \rightarrow \mathbb{R}\), each parameterized by a deep neural network.

These networks are updated through three processes:

  1. The policy \(\pi\) interacts with the environment to produce a set of trajectories \(\left\{\tau^1, \ldots, \tau^i\right\}\). The parameters of \(\pi\) are updated by a traditional reinforcement learning algorithm in order to maximize the sum of the predicted rewards \(r_t=\hat{r}\left(o_t, a_t\right)\).
  2. Pairs of segments \(\left(\sigma^1, \sigma^2\right)\) are selected from the trajectories \(\left\{\tau^1, \ldots, \tau^i\right\}\) produced in step 1 and sent to a human for comparison.
  3. The parameters of the mapping \(\hat{r}\) are optimized via supervised learning to fit the comparisons collected from the human so far (a rough sketch of the whole loop follows this list).
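To make the loop concrete, here is a minimal synchronous sketch in Python. Every helper name here (`run_policy`, `update_policy`, `select_segment_pairs`, `ask_human`, `update_reward_model`) is a hypothetical placeholder rather than the authors' implementation, and in the paper the three processes run asynchronously rather than in a single loop.

```python
# Minimal synchronous sketch of the three-process loop in Section 2.2.
# All helper functions are hypothetical placeholders, not the paper's code.

def train(env, policy, reward_model, preference_db, num_iterations):
    for _ in range(num_iterations):
        # 1. The policy interacts with the environment; its RL update uses
        #    the predicted rewards r_hat(o_t, a_t) instead of true rewards.
        trajectories = run_policy(env, policy)              # list of [(o_t, a_t), ...]
        update_policy(policy, trajectories, reward_model)   # e.g. an A2C or TRPO step

        # 2. Pick pairs of short segments and ask the human which one is better.
        for sigma1, sigma2 in select_segment_pairs(trajectories):
            mu = ask_human(sigma1, sigma2)                  # distribution over {1, 2}, or None
            if mu is not None:                              # "incomparable" answers are dropped
                preference_db.append((sigma1, sigma2, mu))

        # 3. Fit r_hat to all comparisons collected so far (supervised learning).
        update_reward_model(reward_model, preference_db)
```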

2.2.1 Optimizing the Policy

After using \(\hat{r}\) to compute rewards, we are left with a traditional reinforcement learning problem. We can solve this problem with any reinforcement learning algorithm appropriate for the domain. One subtlety is that the reward function \(\hat{r}\) may be non-stationary, which leads us to prefer methods that are robust to changes in the reward function. This leads us to focus on policy gradient methods, which have been applied successfully to such problems (Ho and Ermon, 2016).

In this paper, we use advantage actor-critic (A2C; Mnih et al., 2016) to play Atari games, and trust region policy optimization (TRPO; Schulman et al., 2015) to perform simulated robotics tasks. In each case, we used parameter settings that have been found to work well for traditional reinforcement learning tasks. The only hyperparameter we adjusted was the entropy bonus for TRPO, because TRPO relies on the trust region to ensure adequate exploration, which can lead to inadequate exploration if the reward function is changing.

We normalize the rewards produced by \(\hat{r}\) to have zero mean and constant standard deviation. This is a typical preprocessing step, and it is particularly appropriate here because the position (offset) of the rewards is underdetermined by our learning problem.
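As a small illustration of this preprocessing step, the following sketch (assuming the predicted rewards for a batch are collected in a NumPy array; the target standard deviation is an arbitrary placeholder) shifts and rescales them:

```python
import numpy as np

def normalize_rewards(predicted_rewards, target_std=1.0, eps=1e-8):
    """Shift predicted rewards to zero mean and rescale to a fixed standard deviation."""
    r = np.asarray(predicted_rewards, dtype=np.float64)
    centered = r - r.mean()
    return target_std * centered / (centered.std() + eps)
```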

2.2.2 Preference Elicitation

We present the human overseer with a visualization of two trajectory segments, in the form of short video clips. In all of our experiments, these clips are between 1 and 2 seconds long.

The human then indicates which segment they prefer, that the two segments are equally good, or that they are unable to compare the two segments.

The human's judgments are recorded in a database \(\mathcal{D}\) as triples \(\left(\sigma^1, \sigma^2, \mu\right)\), where \(\sigma^1\) and \(\sigma^2\) are the two segments and \(\mu\) is a distribution over \(\{1,2\}\) indicating which segment the user preferred. If the human selects one segment as preferable, then \(\mu\) puts all of its mass on that choice. If the human marks the segments as equally preferable, then \(\mu\) is uniform. Finally, if the human marks the segments as incomparable, then the comparison is not included in the database.
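For concreteness, a database entry could be represented as follows; this is a hypothetical sketch, and the types and field names are assumptions rather than the paper's code:

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

Step = Tuple[Any, Any]        # an (observation, action) pair
Segment = List[Step]          # a trajectory segment sigma

@dataclass
class Comparison:
    sigma1: Segment
    sigma2: Segment
    mu: Tuple[float, float]   # distribution over {1, 2}

# Human prefers segment 1:             Comparison(s1, s2, mu=(1.0, 0.0))
# Segments judged equally preferable:  Comparison(s1, s2, mu=(0.5, 0.5))
# "Incomparable" judgments are simply not stored in the database.
```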

2.2.3 Fitting the Reward Function

We can interpret the reward function estimate \(\hat{r}\) as a preference predictor if we view \(\hat{r}\) as a latent factor explaining the human's judgments and assume that the probability of the human preferring a segment \(\sigma^i\) depends exponentially on the value of the latent reward summed over the length of the segment:[3]

\[\hat{P}\left[\sigma^1 \succ \sigma^2\right]=\frac{\exp \sum \hat{r}\left(o_t^1, a_t^1\right)}{\exp \sum \hat{r}\left(o_t^1, a_t^1\right)+\exp \sum \hat{r}\left(o_t^2, a_t^2\right)} \tag{1} \]

We choose \(\hat{r}\) to minimize the cross-entropy loss between these predictions and the actual human labels:

\[\operatorname{loss}(\hat{r})=-\sum_{\left(\sigma^1, \sigma^2, \mu\right) \in \mathcal{D}} \mu(1) \log \hat{P}\left[\sigma^1 \succ \sigma^2\right]+\mu(2) \log \hat{P}\left[\sigma^2 \succ \sigma^1\right] \]

This follows the Bradley-Terry model for estimating scoring functions from pairwise preferences (Bradley and Terry, 1952) and is a specialization of the Luce-Shepard choice rule (Luce, 2005; Shepard, 1957) to preferences over trajectory segments. It can be understood as equating rewards with a preference ranking scale, analogous to the famous Elo ranking system developed for chess (Elo, 1978). Just as the difference in Elo ratings of two chess players estimates the probability of one defeating the other in a game, the difference in predicted rewards of two trajectory segments estimates the probability that one is chosen over the other by the human.
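As a sketch of how Equation 1 and this cross-entropy loss could be computed, assume a PyTorch reward model whose call `reward_model(o, a)` returns a scalar predicted reward; this interface is an assumption for illustration, not the paper's code:

```python
import torch

def preference_loss(reward_model, comparisons):
    """Cross-entropy loss over (sigma1, sigma2, mu) triples, following Equation 1."""
    losses = []
    for sigma1, sigma2, mu in comparisons:
        # Sum the predicted rewards over each segment.
        sum1 = torch.stack([reward_model(o, a) for o, a in sigma1]).sum()
        sum2 = torch.stack([reward_model(o, a) for o, a in sigma2]).sum()

        # log P-hat[sigma1 > sigma2] and log P-hat[sigma2 > sigma1] via a softmax
        # over the two summed rewards (Equation 1).
        log_p = torch.log_softmax(torch.stack([sum1, sum2]), dim=0)

        mu_t = torch.as_tensor(mu, dtype=log_p.dtype)
        losses.append(-(mu_t * log_p).sum())
    return torch.stack(losses).mean()
```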

Our actual algorithm makes some modifications to this basic approach, which early experiments have found helpful and are analyzed in Section 3.3:

  • We fit an ensemble of predictors, each trained on \(|\mathcal{D}|\) triples sampled from \(\mathcal{D}\) with replacement. The estimate \(\hat{r}\) is defined by independently normalizing each of these predictors and then averaging the results.
  • A fraction of \(1/e\) of the data is held out to serve as a validation set for each predictor. We use \(\ell_2\) regularization and adjust the regularization coefficient to keep the validation loss between 1.1 and 1.5 times the training loss. In some domains we also apply dropout for regularization.
  • Rather than applying a softmax directly as in Equation 1, we assume there is a 10% chance that the human responds uniformly at random. Conceptually this adjustment is needed because human raters have a constant probability of making an error, which does not decay to 0 as the difference in reward becomes extreme (see the sketch after this list).
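A sketch of that last adjustment, mixing the softmax prediction of Equation 1 with a uniform random response (the 10% error rate comes from the bullet above; the function signature is an assumption):

```python
import torch

def adjusted_preference_prob(sum1, sum2, error_rate=0.1):
    """P-hat[sigma1 > sigma2], assuming the human answers uniformly at random with
    probability `error_rate` and otherwise follows the softmax of Equation 1."""
    p_softmax = torch.softmax(torch.stack([sum1, sum2]), dim=0)[0]
    return (1.0 - error_rate) * p_softmax + error_rate * 0.5
```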

2.2.4 Selecting Queries

We decide how to query preferences based on an approximation to the uncertainty of the reward function estimator, similar in spirit to Daniel et al. (2014): we sample a large number of pairs of trajectory segments of length \(k\), use each reward predictor in our ensemble to predict which segment will be preferred in each pair, and then select the pairs for which the predictions have the highest variance across ensemble members. This is a crude approximation, and the ablation experiments in Section 3 show that in some tasks it actually hurts performance. Ideally, we would like to query based on the expected value of information of the query (Akrour et al., 2012; Krueger et al., 2016), but we leave it to future work to explore this direction further.
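The variance-based heuristic described above could look roughly like the following sketch, where `predict_pref(sigma1, sigma2)` is an assumed ensemble-member method returning the predicted probability that the first segment is preferred:

```python
import numpy as np

def select_queries(candidate_pairs, ensemble, num_queries):
    """Pick the segment pairs whose preference predictions disagree most across the ensemble."""
    variances = []
    for sigma1, sigma2 in candidate_pairs:
        preds = [member.predict_pref(sigma1, sigma2) for member in ensemble]
        variances.append(np.var(preds))
    # Send the human the pairs with the highest prediction variance.
    top = np.argsort(variances)[-num_queries:]
    return [candidate_pairs[i] for i in top]
```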


  1. Here we assume that the reward is a function of the observation and action. In our experiments on Atari environments, we instead assume the reward is a function of the preceding 4 observations. In a general partially observable environment, we could instead consider reward functions that depend on the whole sequence of observations, and model this reward function with a recurrent neural network.↩︎

  2. Wilson et al. (2012) also assume the ability to sample reasonable initial states. But we work with high-dimensional state spaces where random states may not be reachable and the intended policy inhabits a low-dimensional manifold.↩︎

  3. Equation 1 does not use a discounting factor, which can be interpreted as modeling the human's indifference to the timing of events in the trajectory segments. Using an explicit discounting factor or inferring the human's discounting function would also be reasonable choices.↩︎