- Paper title: Episodic Novelty Through Temporal Distance.
- ICLR 2025 poster; review scores 8, 8, 6, 5.
- arxiv:/abs/2501.15418
- pdf:/pdf/2501.15418
- html:/html/2501.15418
- open review:/forum?id=I7DeajDEx7
01 Main content of the paper
- 1.1 What problem does the paper focus on, and what task does it aim to solve?
- 1.2 How do previous methods work, and what are their problems?
- 1.3 Motivation: what gap does the paper hope to close?
- 1.4 What is the main method, and how does the algorithm work?
- 📌 How to learn temporal distance
- 📌 Constructing an intrinsic reward using temporal distance
- 📌 Specific algorithm
- 1.5 What are the experimental results?
- 02 Structure of the paper
01 Main content of the paper
Paper summary:
- This paper studies exploration under sparse rewards, especially in Contextual Markov Decision Processes (CMDPs, e.g., randomly generated maps): how can an agent explore efficiently in each newly sampled task environment?
- The paper proposes ETD (Episodic Novelty Through Temporal Distance). The core innovation is to use temporal distance as the measure of state novelty: the temporal distance is estimated with contrastive learning and then turned into an intrinsic reward that drives exploration.
1.1 What problem does the paper focus on, and what task does it aim to solve?
- Focus: inefficient exploration under sparse rewards in CMDPs (Contextual Markov Decision Processes, e.g., navigation on random maps, or a robot completing tasks in randomized scenes).
- Challenge: traditional methods rely on global experience, but in a CMDP the environment changes every episode (e.g., a new map), so historical experience cannot be reused directly (see 1.2 below).
1.2 How do previous methods work, and what are their problems?
- Count-based methods (e.g., recording visit counts): fail in continuous / large state spaces, where almost every state is "unique" and counts cannot distinguish novelty.
- Similarity-based methods (e.g., Euclidean distance): cannot capture the dynamic relationship between states (two points in a maze may look close, yet reaching one from the other requires a long detour; see the sketch below).
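To make this failure mode concrete, here is a tiny toy sketch (my own illustration, not from the paper): in a small grid world with a wall, two cells with Euclidean distance 2 need 6 steps to actually reach each other.

```python
from collections import deque

# Toy 5x5 grid: '#' = wall, '.' = free. A=(3,1) and B=(3,3) have Euclidean
# distance 2, but a wall forces a detour through the top corridor.
GRID = [
    "#####",
    "#...#",
    "#.#.#",
    "#.#.#",
    "#####",
]

def bfs_steps(grid, start, goal):
    """Shortest number of moves between two free cells (a stand-in for temporal distance)."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (r, c), d = queue.popleft()
        if (r, c) == goal:
            return d
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if grid[nxt[0]][nxt[1]] == "." and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")

A, B = (3, 1), (3, 3)
euclidean = ((A[0] - B[0]) ** 2 + (A[1] - B[1]) ** 2) ** 0.5
print(f"Euclidean distance: {euclidean:.1f}, steps needed: {bfs_steps(GRID, A, B)}")  # 2.0 vs 6
```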
1.3 Motivation: what gap does the paper hope to close?
- Existing methods do not model the dynamic relationship between states (e.g., "a long detour" and "a direct path" can look identical under Euclidean distance). [To be checked]
- The paper proposes temporal distance (roughly, the average number of steps required to get from state A to state B) as a more fundamental similarity measure that generalizes across environments.
1.4 What is the main method, and how does the algorithm work?
📌 How to learn temporal distance
First, define several concepts:
- The probability of reaching state y at time step k starting from x: \(p^\pi(s_k=y\mid s_0=x)\).
- The discounted probability of transitioning from state x to y: \(p^\pi_\gamma(s_f=y\mid s_0=x)=(1-\gamma)\sum_{k=0}^\infty \gamma^k\,p^\pi(s_k=y\mid s_0=x)\). This value is ≤ 1. (Personal understanding: \(s_f\) denotes some state reached after \(s_0\).)
- A quasimetric satisfies non-negativity, identity (d(x,x)=0), and the triangle inequality, but does not require symmetry (d(x,y) may differ from d(y,x)).
Then, following Myers et al. (2024), the temporal (successor) distance from x to y is defined as
\(d^\pi_\text{SD}(x,y)=\log\dfrac{p^\pi_\gamma(s_f=y\mid s_0=y)}{p^\pi_\gamma(s_f=y\mid s_0=x)}\).
The harder it is to reach y from x, the smaller the denominator and the larger \(d^\pi_\text{SD}(x,y)\).
According to Myers et al. (2024), this definition is a quasimetric even when the MDP is stochastic.
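As a quick sanity check (my own worked example, assuming a deterministic rollout in which y is visited only at step k when starting from x, and only at step 0 when starting from y):

```latex
\begin{align*}
p^\pi_\gamma(s_f=y\mid s_0=x) &= (1-\gamma)\,\gamma^{k}, \\
p^\pi_\gamma(s_f=y\mid s_0=y) &= (1-\gamma)\,\gamma^{0} = 1-\gamma, \\
d^\pi_\text{SD}(x,y) &= \log\frac{1-\gamma}{(1-\gamma)\,\gamma^{k}} = k\,\log\tfrac{1}{\gamma}.
\end{align*}
```

So under this assumption the temporal distance grows linearly with the number of environment steps k separating the two states.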
Then, following Ma & Collins (2018) and Poole et al. (2019), \(d^\pi_\text{SD}(x,y)\) is learned via contrastive learning.
Define an energy function \(f(x,y)\) that is expected to assign a larger value to pairs of states that are easy to reach from one another and a smaller value to pairs that are hard to reach.
It is trained with the InfoNCE loss (written here in its standard form):
\(\mathcal{L}=-\mathbb{E}\Big[\log\dfrac{e^{f(x,y)}}{\sum_{j=1}^{N}e^{f(x,y_j)}}\Big],\)
where the positive pairs are sampled as \((x,y)\sim p^\pi_\gamma(s_f=y\mid s_0=x)\,p_s(x)\), \(p_s(x)\) is the marginal distribution of x, and the negatives \(y_j\) are drawn from the marginal.
According to Ma & Collins (2018) and Poole et al. (2019), \(d^\pi_\text{SD}(x,y)\) can be recovered as the unique solution from the optimal energy function \(f(x,y)\).
According to Myers et al. (2024), if \(f(x,y)\) is decomposed into the difference between a potential \(c(y)\) and a quasimetric \(d(x,y)\), i.e., \(f(x,y)=c(y)-d(x,y)\), then at the optimum \(d(x,y)=d^\pi_\text{SD}(x,y)\), so training directly yields the temporal distance.
In ETD, \(c(y)\) is an MLP, and \(d(x,y)\) is implemented with an asymmetric MRN.
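A minimal PyTorch-style sketch of this setup (my own illustration, not the authors' code; the network sizes and the simplified asymmetric distance below are assumptions standing in for the paper's MLP potential and MRN quasimetric):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PotentialC(nn.Module):
    """Potential c(y): a small MLP over state features (stand-in for the paper's MLP)."""
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, y):
        return self.net(y).squeeze(-1)              # (B,)

class AsymmetricDistance(nn.Module):
    """d(x, y) >= 0 with d(x, x) = 0, asymmetric by construction.
    A simplified stand-in for the MRN quasimetric used in ETD."""
    def __init__(self, state_dim, hidden=256, embed=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, embed))

    def forward(self, x, y):
        ex, ey = self.enc(x), self.enc(y)           # (B, embed)
        # sum of positive parts of the component-wise gap: non-negative, zero iff ex >= ey,
        # satisfies the triangle inequality, and is asymmetric in general
        return F.relu(ey - ex).sum(-1)              # (B,)

def infonce_loss(c_net, d_net, x, y):
    """x, y: (B, state_dim); (x_i, y_i) are positive pairs, (x_i, y_j) with j != i are negatives.
    Energy f(x, y) = c(y) - d(x, y); InfoNCE pushes positives up and negatives down."""
    B = x.shape[0]
    x_rep = x.unsqueeze(1).expand(B, B, -1).reshape(B * B, -1)
    y_rep = y.unsqueeze(0).expand(B, B, -1).reshape(B * B, -1)
    f = (c_net(y_rep) - d_net(x_rep, y_rep)).view(B, B)   # f[i, j] = f(x_i, y_j)
    labels = torch.arange(B)
    return F.cross_entropy(f, labels)               # softmax over candidate y for each x
```

After training, only \(d(x,y)\) is kept and used as the temporal-distance estimate when computing the bonus.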
📌 Constructing an intrinsic reward using temporal distance
Define the intrinsic reward of ETD as \(b_\text{ETD}(s_t)=\min_{0\le k<t}d(s_k,s_t)\): the bonus is the temporal distance from the nearest previously visited state \(s_k\) of this episode to the current state \(s_t\). The intention is to reward reaching states that are hard to reach; because ETD maximizes the minimum temporal distance, the bonus is large only when \(s_t\) is temporally far from every state visited earlier in the episode, not just from some of them.
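A sketch of how this bonus could be computed within an episode (my own pseudocode; `d_net` stands for the learned temporal-distance network from the sketch above):

```python
import torch

@torch.no_grad()
def etd_bonus(d_net, episode_states, s_t):
    """b_ETD(s_t) = min over earlier states s_k of d(s_k, s_t).
    episode_states: (t, state_dim) tensor holding s_0 .. s_{t-1};
    s_t: (state_dim,) tensor for the current state."""
    if episode_states.shape[0] == 0:
        return 0.0                                  # no history yet at t = 0
    s_t_batch = s_t.unsqueeze(0).expand_as(episode_states)
    dists = d_net(episode_states, s_t_batch)        # d(s_k, s_t) for every k < t
    return dists.min().item()
```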
📌 Specific algorithm
- Sample a context c to instantiate a CMDP episode;
- While the episode is not over:
  - Sample an action a given s, obtaining s';
  - Compute the ETD intrinsic reward: \(b_{t+1}=\min_{k\in[0,t]}d(s_k,s_{t+1})\);
  - Compute the total reward: \(r_{t+1} = r^e_{t+1} + \beta b_{t+1}\);
- Save the episode into the replay buffer;
- Sample a batch from the replay buffer; concretely, \(x = s_t,\ y = s_{t+j},\ j \sim \text{Geom}(1-\gamma)\), i.e., the offset j follows a geometric distribution (see the sketch below);
- Update the temporal-distance network by minimizing the contrastive loss;
- Update the PPO policy with the newly collected trajectory.
(Personal note: the state should obviously contain the information of the current context, such as the maze layout. Otherwise, suppose (0,0) and (0,2) were very close in a previous episode, but in this episode there is a wall between them; if the state only contains the coordinates and not the wall, it is impossible to learn that "(0,0) and (0,2) are far apart in this episode".)
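A sketch of the pair-sampling step for the contrastive loss (my own illustration; clipping offsets at the episode boundary is an assumption about a detail the note does not specify):

```python
import numpy as np

def sample_contrastive_pairs(trajectory, gamma=0.99, batch_size=256, rng=None):
    """Sample (x, y) = (s_t, s_{t+j}) pairs with the offset j ~ Geom(1 - gamma).
    trajectory: (T, state_dim) array of states from one stored episode."""
    rng = rng or np.random.default_rng()
    T = trajectory.shape[0]
    t = rng.integers(0, T - 1, size=batch_size)        # anchor indices for s_t
    # numpy's geometric gives j >= 1 with P(j) = gamma^(j-1) * (1 - gamma)
    j = rng.geometric(p=1.0 - gamma, size=batch_size)
    t_future = np.minimum(t + j, T - 1)                # clip at the episode end (assumption)
    return trajectory[t], trajectory[t_future]         # x batch, y batch
```

These pairs are then fed into the InfoNCE loss above to update \(c(y)\) and \(d(x,y)\).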
1.5 What are the experimental results?
- Environments: MiniGrid mazes, the pixel-based game Crafter, 3D navigation in MiniWorld, etc.
Advantages:
- In MiniGrid, ETD outperforms NovelD.
- In complex mazes with noise, ETD converges about 2x faster than existing methods (NovelD, E3B, etc.).
- In high-dimensional pixel-input environments such as Crafter, ETD improves the exploration success rate by 15-20%.
- ETD is robust to state noise (e.g., random perturbations) and remains effective where traditional methods (e.g., counting) fail completely.
02 Structure of the paper
The main storyline of the paper:
- Exploration is challenging under sparse rewards. In Contextual MDPs, both existing count-based and similarity-based bonuses have problems; ETD addresses them.
- ETD does so by using temporal distance to compute similarity and the intrinsic reward; the temporal distance itself is learned with contrastive learning.
1 Intro analysis:
- Paragraph 1: RL exploration is difficult under sparse rewards; existing methods are effective in (singleton) MDPs, but the real world is usually a CMDP.
- Paragraph 2: To address CMDP exploration, existing methods introduce episodic bonuses, which fall into two categories: count-based and similarity-based. Count-based methods perform poorly in large state spaces, while similarity-based methods compute similarities that are too crude to capture state novelty.
- Paragraph 3: The paper introduces ETD, which encourages the agent to visit states that are temporally far from the current episode's history.
- The key innovation is temporal distance, which measures the expected number of steps to transition between two states and serves as a robust similarity measure.
- Unlike existing methods, temporal distance is insensitive to the state representation, which helps avoid noisy-TV issues and makes it suitable for pixel-based environments.
2 Background: briefly introduces CMDPs and the intrinsic-reward mechanism.
3 Limitations of Current Episodic Bonuses:
- A dedicated section that analyzes the gap left by existing methods.
- Paragraph 1: computing novelty from episodic counts fails in large or noisy state spaces, because every state looks novel.
- Paragraph 2: methods that compute novelty from state similarity, such as NGU and E3B, may address this; in particular, some estimate how easily one state transitions to another, such as EC and DEIR. However, the state similarity these methods compute is often not accurate enough, as shown in Figure 2.
4 Method:
- 4.1 How to learn the temporal distance;
- 4.2 How to compute the intrinsic reward from the temporal distance;
- 4.3 The concrete algorithm.
5 Experiments:
- 5.1 presents the MiniGrid setting and results; the experiments cover 8 tasks.
- 5.2 presents MiniGrid with noise, which gets all the baselines stuck.
- 5.3 is the ablation study: ETD's temporal distance vs. Euclidean distance, different ways of turning the distance into the intrinsic reward, and the symmetric vs. asymmetric design of the temporal distance.
- 5.4 presents the pixel-based Crafter and MiniWorld Maze environments with results (compared against a smaller set of baselines).