LeGO-MM: Learning Navigation for Goal-Oriented Mobile Manipulation via Hierarchical Policy Distillation
The training curves for learning Task 1 and Task 2 using SSD in the map-based MM system setup are shown in Fig. 7.
The training and evaluation curves for generalizing Task 1 to Task 2 using SPD are shown in Fig. 8.
Figure 7: Training curves of SSD in the map-based MM system setup. (a) corresponds to the scene shown in Fig. 1. (b) corresponds to the scene shown in Fig. 5. (c) corresponds to the scene shown in Fig. 6.
Figure 8: (a) Training curves for generalizing Task 1 to Task 2 using SPD in the \textit{map-based MM system} setup. (b) Evaluation was performed at intervals of 500 episodes during training.
Pick:
The agent is spawned randomly in the house, at least 3 m from the object of interest and without the object in hand. The skill is considered successful if the agent navigates to the object, picks it up by calling the grip action within 0.15 m of the object, and rearranges its arm to within 0.15 m of the resting position. The horizon length of this skill is 700 steps. The reward function for this skill is represented as:
Here, $\mathbb{I}_{pick}$ indicates that the pick skill has successfully picked up the object, and $\mathbb{I}_{success}$ indicates that the agent has both picked up the object and rearranged its arm to the resting position. $\Delta^o_{arm}$ and $\Delta^r_{arm}$ denote the Euclidean distances of the robot arm to the object and to its resting position, respectively; the term $2 \Delta^r_{arm} \mathbb{I}_{holding}$ is applied only while the object is held, with $\mathbb{I}_{holding}$ indicating that the object is currently in hand.
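Because the reward expression itself is not reproduced above, the following is a minimal per-step sketch of how such a dense-plus-sparse pick reward could be computed, using the indicator and distance terms defined in the text; the sparse bonus magnitudes are illustrative assumptions, not values from the paper.
\begin{verbatim}
def pick_reward(arm_to_obj, arm_to_rest, holding,
                picked_this_step, success_this_step):
    """Illustrative per-step reward for the Pick skill.

    arm_to_obj        : distance (m) from the arm to the object (Delta^o_arm)
    arm_to_rest       : distance (m) from the arm to its resting pose (Delta^r_arm)
    holding           : True while the object is grasped (I_holding)
    picked_this_step  : True on the step the grip action succeeds (I_pick)
    success_this_step : True once the object is held and the arm is back
                        within 0.15 m of the resting pose (I_success)
    """
    reward = 0.0
    # Dense shaping: approach the object before grasping, then return
    # the arm to rest while holding (weight 2, as in the text).
    if not holding:
        reward -= arm_to_obj            # Delta^o_arm term
    else:
        reward -= 2.0 * arm_to_rest     # 2 * Delta^r_arm * I_holding term
    # Sparse bonuses (magnitudes are assumptions).
    if picked_this_step:
        reward += 2.0                   # I_pick
    if success_this_step:
        reward += 10.0                  # I_success
    return reward
\end{verbatim}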
Place:
This skill involves the agent being spawned randomly in the house, at least 3 m from the goal location, with the object already in hand. The agent must navigate to the target receptacle, place the object within 0.15 m of the goal location, and rearrange its arm to its resting position. The horizon length of this skill is 700 steps.
Here, $\mathbb{I}_{success}$ and $\mathbb{I}_{place}$ represent sparse rewards for successful skill completion and for placing the object, respectively.
Open-Cabinet:
The robot is spawned randomly in the house and must navigate to the cabinet, grasp the drawer handle by calling the grasp action within 0.15 m of the drawer handle marker, open the drawer to a joint position of 0.45, and then rearrange its arm to its resting position. The horizon length of this skill is 600 steps. The reward structure for this skill is given by:
Here, $\mathbb{I}_{success}$ is an indicator for successful opening followed by arm rearrangement, $\mathbb{I}_{open}$ is an indicator for the drawer being successfully opened, and $\mathbb{I}_{grasp}$ is an indicator for the drawer handle being successfully grasped. $\Delta_{arm}^m$ denotes the Euclidean distance of the robot arm to the drawer handle marker.
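For concreteness, the success and termination conditions described above (handle grasped within 0.15 m, drawer opened to a joint position of 0.45, arm rearranged to rest) can be checked roughly as in the sketch below; the state fields and the 0.15 m resting tolerance are assumptions for illustration.
\begin{verbatim}
def open_cabinet_status(drawer_joint_pos, handle_grasped, arm_to_rest,
                        step_count, target_joint=0.45, rest_tol=0.15,
                        horizon=600):
    """Illustrative success/termination check for the Open-Cabinet skill."""
    opened = handle_grasped and drawer_joint_pos >= target_joint   # I_open
    success = opened and arm_to_rest <= rest_tol                   # I_success
    done = success or step_count >= horizon                        # 600-step horizon
    return success, done
\end{verbatim}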
Open Fridge: The robot is spawned randomly in the house and must navigate to the fridge, grasp the fridge handle marker by calling the grasp action within 0.15 m of the fridge door handle, open the fridge door to a joint position of 1.22, and rearrange its arm to its resting position. The horizon length of this skill is 600 steps. The per-time-step reward for this is modeled as:
Here, $\mathbb{I}_{success}$ and $\mathbb{I}_{open}$ are defined analogously to the corresponding indicators in the Open-Cabinet skill.
Pick from Fridge: This skill is similar in structure to the Pick skill, except that the data distribution involves picking up an object from an open refrigerator.
Navigation: This sub-skill involves randomly spawning the robot and sampling a 6-DoF target pose for the End-Effector (EE) in an obstacle-free scene. The robot base learns to navigate in coordination with the motion of the 6-DoF robotic arm so as to deliver the EE to the target pose. We use an Inverse Kinematics (IK) solver to compute the joint motion of the robotic arm and set the maximum allowed number of solution failures to 20. The per-time-step reward for this is modeled as:
Here,
and
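As an illustration of how the IK failure budget mentioned above could be enforced, the sketch below terminates the episode once the solver has failed more than 20 times; the solver interface (ik_solver.solve returning a joint solution and a success flag) is a hypothetical placeholder.
\begin{verbatim}
class IKFailureBudget:
    """Illustrative tracker for the IK failure limit of the Navigation sub-skill."""

    def __init__(self, max_failures=20):
        self.max_failures = max_failures
        self.failures = 0

    def step(self, ik_solver, ee_target_pose, current_joints):
        # Hypothetical interface: returns (joint_solution, success_flag).
        joints, ok = ik_solver.solve(ee_target_pose, seed=current_joints)
        if not ok:
            self.failures += 1
            joints = current_joints      # hold the last valid configuration
        terminate = self.failures > self.max_failures
        return joints, terminate
\end{verbatim}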
Obstacle Avoidance: Building on the Navigation sub-skill, this skill requires the robot to avoid static obstacles of various shapes. As shown in Fig. 1, the start/goal positions of the robot and all static obstacles are randomly initialized at the beginning of each episode. The per-time-step reward for this is modeled as:
Here,
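Since the reward terms are not reproduced here, the following is only an assumed sketch of a typical per-step obstacle penalty (a hard collision penalty plus a proximity penalty inside a safety margin); the margin and weights are illustrative, not the paper's.
\begin{verbatim}
def obstacle_penalty(dist_to_nearest_obstacle, in_collision,
                     safety_margin=0.3, w_near=0.5, w_collision=5.0):
    """Illustrative penalty added to the navigation reward for obstacle avoidance."""
    if in_collision:
        return -w_collision                                   # hard collision penalty
    if dist_to_nearest_obstacle < safety_margin:
        # Penalty grows linearly as the robot enters the safety margin.
        return -w_near * (safety_margin - dist_to_nearest_obstacle)
    return 0.0
\end{verbatim}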
Pick: As shown in Fig. 2 (a), building on the Navigation and Obstacle Avoidance sub-skills, this skill requires the robot base to cooperate with the motion of the robotic arm to deliver the EE to the picking position. The per-time-step reward for this is modeled as:
Here, we additionally incentivize smooth actions with an acceleration regularization term.
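A minimal sketch of such a regularization term, assuming the penalty is applied to the finite-difference acceleration of the commanded velocities; the weight and time step are illustrative assumptions.
\begin{verbatim}
import numpy as np

def acceleration_regularization(prev_action, action, dt=0.1, weight=0.01):
    """Illustrative smoothness penalty: squared change in commanded
    velocity between consecutive steps (a finite-difference acceleration)."""
    accel = (np.asarray(action, dtype=float)
             - np.asarray(prev_action, dtype=float)) / dt
    return -weight * float(np.sum(accel ** 2))
\end{verbatim}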
Place: Similar to Pick, this skill requires the robot base to cooperate with the movement of the robotic arm to deliver the EE to the placing position, as shown in Fig. 2 (b). The reward function is the same as that of the Pick skill.
Open: This skill first requires the robot to navigate to the vicinity of the cabinet while avoiding static obstacles, as shown in Fig. 3 (a). Then, the robot is asked to open the door of the cabinet, as shown in Fig. 3 (b). The reward function is the same as that of the Pick and Place skills.
Task 1:
This task is essentially a combination of Pick and Place, involving the navigation, obstacle avoidance, pick, and place sub-skills. The reward function is the same as that of the Pick and Place skills. The trajectories of mobile manipulation for completing Task 1 are shown in Fig. 4. This task is characterized by mobile manipulation in a confined space, so the robot must have good maneuverability for collision avoidance. The curves for training the MM robotic system to complete Task 1 with SSD are shown in Fig. 7 (b).
Figure 1: Illustrations of the obstacle avoidance sub-skill in the Map-based MM System setup. The start/goal positions of the robot and all static obstacles are randomly initialized at the beginning of each episode.
Task 2:
Unlike Task 1, this task requires the robot to move across different rooms to complete a pick-and-place task. In addition, the robot must open a door before going from one room to another. Therefore, this task involves the pick, place, navigation, obstacle avoidance, and open sub-skills, as shown in Fig. 5. This task is characterized by long-horizon, multi-skill mobile manipulation. The robot needs to move efficiently through the rooms and open the door with few IK failures. Therefore, we further augment the per-time-step reward as follows:
Here, we extend the agent’s action space by an additional action
To incentivize fast motions whenever possible, we add the following reward:
where
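Because the exact term is not reproduced here, the sketch below shows one assumed form of such a speed incentive: reward the base speed whenever the robot is collision-free, clipped so the bonus stays bounded; the weight and speed cap are illustrative.
\begin{verbatim}
import numpy as np

def speed_bonus(base_velocity, collision_free, weight=0.05, v_max=1.0):
    """Illustrative fast-motion incentive for Task 2.

    base_velocity  : planar velocity of the mobile base (m/s), e.g. [vx, vy]
    collision_free : True if the robot is not near any obstacle this step
    """
    if not collision_free:
        return 0.0                        # do not reward speed near obstacles
    speed = float(np.linalg.norm(base_velocity))
    return weight * min(speed, v_max)     # clipped bonus
\end{verbatim}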
The curves for training the MM robotic system to complete Task 2 with SSD are shown in Fig. 7 (c). Demonstrations of the robot traversing a narrow area to pick an object and carefully opening the door are shown in Fig. 6.
Figure 2: Illustrations of the pick and place sub-skills in the Map-based MM System setup. (a) The robot is required to deliver the EE to the picking position. (b) The robot is required to deliver the EE to the placing position.
Figure 3: Illustrations of the open sub-skill in the Map-based MM System setup. (a) The robot is required to navigate to the vicinity of the cabinet while avoiding static obstacles. (b) The robot is asked to open the door of the cabinet.
Figure 4: The trajectories of mobile manipulation for completing Task 1. This task is characterized by mobile manipulation in a confined space. The robot should have good maneuverability for collision avoidance.
Figure 5: Task 2 requires the robot to move across different rooms to complete a pick-and-place task. In addition, the robot is asked to open a door before going from one room to another. This task is characterized by long-horizon and multi-skill mobile manipulation.
Figure 6: Demonstrations of the robot traversing a narrow area to pick an object and carefully opening the door in Task 2.
In the SAC algorithm, the policy evaluation step aims to iteratively compute the soft Q-value of a policy by repeatedly applying the soft Bellman backup operator.
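For reference, the standard soft Bellman backup operator used for SAC policy evaluation (the notation may differ slightly from the paper's Eq. (2)) is
\[
\mathcal{T}^{\pi} Q(s_t, a_t) \triangleq r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\!\left[ V(s_{t+1}) \right],
\qquad
V(s_t) = \mathbb{E}_{a_t \sim \pi}\!\left[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \right],
\]
where $\alpha$ is the entropy temperature; repeatedly applying $\mathcal{T}^{\pi}$ converges to the soft Q-value $Q^{\pi}$ of the policy $\pi$.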
We can then derive an update rule for the residual Q-function \cite{li2023residual} from the above modified soft Bellman backup operator based on Eq. (2) and (3) in the paper. Based on the residual Q-function
The derivation from Eq. (6a) to Eq. (6b) uses Eq. (2), $\hat{Q}_t = Q_{R^+,t} + \omega Q^*$, and Eq. (3):
Here
To obtain Eq. (6c), note that