- Article title: Skill Reinforcement Learning and Planning for Open-World Minecraft Tasks
- Original publication date: 2023.03
- arxiv:/abs/2303.16563
- GitHub:/PKU-RL/Plan4MC
- website:/view/plan4mc
- It was first presented at the NeurIPS 2023 FMDM Workshop and was later rejected from ICLR 2024.
- 01 main idea
- 02 How to determine the list of skills to learn
- 03 How to get a low-level skill policy
- 04 How to plan high-level according to goals
- misc
01 main idea
- High-level planning + low-level execution based on RL.
- First, an LLM generates the basic skills, such as finding an item or crafting an item. For each skill, the LLM specifies its inputs (e.g. the materials needed to craft the item and what must already be in the inventory) and its output (the new item obtained after crafting).
- Low-level skills are learned with RL: this is ordinary RL training, with one policy trained per skill.
- High-level planning: given a target, the skill inputs and outputs generated by the LLM define a directed acyclic graph (DAG) that captures the dependencies between skills. The planner searches this graph for the shortest path from the start to the goal and executes the skills along that path in order.
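The graph-then-search idea can be illustrated with a minimal sketch. The skill table below is hand-written and hypothetical (it is not taken from the paper's skill file), quantities are ignored, and a plain DFS post-order stands in for the paper's search procedure; it only shows how prerequisites turn into an execution order.

```python
# Hypothetical skill table: skill name -> items that must be obtained first.
# Quantities are ignored, so each skill appears at most once in the plan.
SKILLS = {
    "log_nearby": [],                      # Finding-skill, no prerequisites
    "log": ["log_nearby"],                 # Manipulation-skill
    "planks": ["log"],                     # Crafting-skills below
    "stick": ["planks"],
    "crafting_table": ["planks"],
    "crafting_table_nearby": ["crafting_table"],
    "wooden_pickaxe": ["planks", "stick", "crafting_table_nearby"],
}


def plan(goal, skills):
    """Return skills ordered so that every prerequisite precedes its user."""
    order, visited = [], set()

    def visit(item):
        if item in visited:
            return
        visited.add(item)
        for dep in skills[item]:   # descend into prerequisites first
            visit(dep)
        order.append(item)         # post-order: emitted after its prerequisites

    visit(goal)
    return order


print(plan("wooden_pickaxe", SKILLS))
```

A real plan would also have to repeat skills to reach the required item counts, which this sketch leaves out.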
02 How to determine the list of skills to learn
- Three types of fine-grained basic skills are proposed:
- Finding-skills: find an item;
- Manipulation-skills: use a tool or item on something nearby;
- Crafting-skills: craft an item.
- ChatGPT is used to generate the skill information (see Appendix E for the prompt). It generated all 55 skills with 6 errors, which the authors corrected manually.
Specific prompt (it first gives a few existing skills to explain the format, then asks the LLM to generate the information for other skills):
I am playing the game Minecraft. I define some basic skills, like attack something, collect something and place something nearby. I list the skills in a special format.
As an example:
furnace_nearby:
  consume:
    'furnace': 1
  require:
  equip: ['furnace']
  obtain:
    'furnace_nearby': 1
To understand this skill line by line: the skill is to get a furnace_nearby . 'consume' means things will be consumed or killed. In this skill, furnace * 1 will be consumed. 'require' means things are needed but will not be consumed. In this skill, nothing else is required. We should equip furnace to the first slot. If you do not have to equip anything, write 'equip: []'. Finally, we will obtain furnace_nearby * 1.
Another example:
cobblestone:
  consume:
    'cobblestone_nearby': 1
  require:
    'wooden_pickaxe': 1
  equip: ['wooden_pickaxe']
  obtain:
    'cobblestone': 1
To understand: to mine a cobblestone, we will consume a nearby cobblestone. A wooden_pickaxe is required and should be equipped, but will not be consumed.
Now you understand the rule of this format. Please help me generate the following skills: crafting_table_nearby, wool, beef, diamond.
Skill format ("consume" lists items that are used up by the skill; "require" lists items that must be present to execute the skill but are not used up):
# Manipulation-skills
crafting_table_nearby:
  consume:
    'crafting_table': 1
  require:
  equip: ['crafting_table']
  obtain:
    'crafting_table_nearby': 1
wool:
  consume:
    'sheep_nearby': 1
  require:
    'shears': 1
  equip: ['shears']
  obtain:
    'wool': 1
# Crafting-skills
bed:
  consume:
    'planks': 3
    'wool': 3
  require:
    'crafting_table_nearby': 1
  equip: []
  obtain:
    'bed': 1
furnace:
  consume:
    'cobblestone': 8
  require:
    'crafting_table_nearby': 1
  equip: []
  obtain:
    'furnace': 1
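To make the consume/require/obtain semantics concrete, here is a small sketch that treats a skill entry as a precondition check plus an inventory update. The two entries are copied from the listing above; the function names `can_execute` and `apply_skill` are hypothetical, and treating `*_nearby` entries as ordinary inventory counts is an assumption made for illustration.

```python
from collections import Counter

# Two entries from the skill listing, written as Python dicts.
SKILLS = {
    "wool": {
        "consume": {"sheep_nearby": 1},
        "require": {"shears": 1},
        "equip": ["shears"],
        "obtain": {"wool": 1},
    },
    "bed": {
        "consume": {"planks": 3, "wool": 3},
        "require": {"crafting_table_nearby": 1},
        "equip": [],
        "obtain": {"bed": 1},
    },
}


def can_execute(skill, inventory):
    """True if every consumed and required item is present in sufficient count."""
    needed = Counter(skill["consume"]) + Counter(skill["require"])
    return all(inventory.get(item, 0) >= n for item, n in needed.items())


def apply_skill(skill, inventory):
    """Return the inventory after a successful execution (input is not modified)."""
    if not can_execute(skill, inventory):
        raise ValueError("preconditions not met")
    result = Counter(inventory)
    result.subtract(skill["consume"])   # consumed items are removed
    result.update(skill["obtain"])      # obtained items are added
    return +result                      # unary plus drops zero counts


inventory = Counter({"planks": 3, "wool": 3, "crafting_table_nearby": 1})
print(apply_skill(SKILLS["bed"], inventory))
```

Required items such as `crafting_table_nearby` stay in the inventory after the call, while the planks and wool are removed, which is exactly the distinction the format encodes.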
03 How to get a low-level skill policy
- policy: trained with RL in the MineDojo simulator.
  - observation: RGB image + auxiliary information (compass, location, biome, etc.);
  - action: the paper gives no detailed description; presumably walking/running, crouching, digging in different directions, turning, etc. (?)
- Finding-skills are hard to train because a random policy spins in place, the Minecraft map is large, and rewards are sparse:
  - Within 500 steps, a random policy travels only about 5 blocks on plains.
  - Trees are rare on plains and usually more than 20 blocks from the player, so direct training cannot learn a skill like "get wood".
- Solution:
  - Finding-skills are trained with a hierarchical policy. The high-level policy outputs a target point and the low-level policy steers the agent to that point; the low-level policy is trained first, then the high-level policy. (This sounds reasonable and seems trainable; the low-level policy could also be trained with HER.)
- Manipulation-skills and Crafting-skills also suffer from sparse rewards, so MineCLIP (from prior work) is used to generate intrinsic rewards. If raw materials are needed, the agent either walks to them first using Finding-skills, or the required objects are spawned directly next to the agent.
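The two-level structure of the Finding-skill can be sketched on a toy 2D grid. Everything here is an assumption for illustration: the grid world, the hand-written `low_level_step` controller, and the random `high_level_target` stand in for the trained policies, and `see` is a made-up detection radius.

```python
import random


def low_level_step(pos, target):
    """Stand-in for the goal-reaching low-level policy: one block toward target."""
    x, y = pos
    tx, ty = target
    if x != tx:
        x += 1 if tx > x else -1
    elif y != ty:
        y += 1 if ty > y else -1
    return (x, y)


def high_level_target(pos, rng, radius=10):
    """Stand-in for the high-level policy: here it just samples a nearby point."""
    return (pos[0] + rng.randint(-radius, radius),
            pos[1] + rng.randint(-radius, radius))


def find(item_pos, start=(0, 0), budget=500, see=3, seed=0):
    """Alternate target selection and goal reaching until the item is in view."""
    rng = random.Random(seed)
    pos, steps = start, 0
    for _ in range(budget):                 # bound on high-level decisions
        if steps >= budget:
            break
        target = high_level_target(pos, rng)
        while pos != target and steps < budget:
            pos = low_level_step(pos, target)
            steps += 1
            dist = abs(pos[0] - item_pos[0]) + abs(pos[1] - item_pos[1])
            if dist <= see:                 # item is close enough to count as found
                return True, steps
    return False, steps


found, steps = find(item_pos=(25, 0))
print(found, steps)
```

The point of the split is that the low-level controller always makes progress toward some point, so exploration happens over target points rather than over raw actions.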
04 How to plan high-level according to goals
- Skill planning: build a skill graph, then run DFS on it (Algorithm 1 of Appendix C) to find the shortest path from the start to the goal in the skill graph.
- Because a low-level policy may fail to execute its skill, skill planning and skill execution alternate until the end of the episode (Algorithm 3 of Appendix C). When a low-level policy fails, the high-level planner can plan a different path.
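The alternation between planning and execution can be sketched as a loop that replans from the current inventory before every skill attempt. This is a simplified illustration, not the paper's Algorithm 3: the skill table is hypothetical, the inventory is a set with no item counts or consumption, and a biased coin flip (`success_prob`) stands in for running a learned skill policy.

```python
import random

# Hypothetical skill table: skill -> items that must already be held.
SKILLS = {
    "log": [],
    "planks": ["log"],
    "stick": ["planks"],
    "crafting_table_nearby": ["planks"],
    "wooden_pickaxe": ["planks", "stick", "crafting_table_nearby"],
}


def next_skill(goal, have, skills):
    """DFS from the goal: return the first skill whose prerequisites are all held."""
    if goal in have:
        return None
    for dep in skills[goal]:
        if dep not in have:
            return next_skill(dep, have, skills)
    return goal


def run_episode(goal, skills, success_prob=0.7, max_steps=50, seed=0):
    """Replan before every attempt; a failed skill is simply planned again."""
    rng = random.Random(seed)
    have, log = set(), []
    for _ in range(max_steps):
        skill = next_skill(goal, have, skills)   # plan from the current state
        if skill is None:
            return True, log                     # goal already obtained
        ok = rng.random() < success_prob         # stand-in for the RL skill policy
        log.append((skill, ok))
        if ok:
            have.add(skill)
    return goal in have, log


done, log = run_episode("wooden_pickaxe", SKILLS)
print(done, len(log))
```

Because the plan is recomputed from the actual inventory each time, a failed attempt never leaves the agent following a stale plan.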
misc
- The results seem only average, with a success rate below 50% on the tasks. Although this is far beyond all baselines, it is probably still not as good as humans.