Embodied Multimodal Representations via Implicit Preference Learning

Jianxiong Li* 1, Jinliang Zheng* 1 2, Yinan Zheng* 1,
Liyuan Mao3, Xiao Hu1, Sijie Cheng1, Haoyi Niu1, Jihao Liu4 2,
Yu Liu2, Jingjing Liu1, Ya-Qin Zhang1, Xianyuan Zhan 1 5,

1AIR, Tsinghua University 2SenseTime Research 3Shanghai Jiaotong University
4CUHK MMLab 5Shanghai AI Lab

*Equal contribution
†Project lead
✉Corresponding author
Exciting News! Our paper has been accepted at ICML-2024!


Multimodal pretraining has emerged as an effective strategy for three central goals of representation learning in autonomous robots: 1) extracting both local and global task-progression information; 2) enforcing temporal consistency of visual representations; 3) capturing trajectory-level language grounding. Most existing methods pursue these goals with separate objectives, which often reach sub-optimal solutions. In this paper, we propose a unified objective that simultaneously extracts meaningful task-progression information from image sequences and seamlessly aligns it with language instructions. We observe an implicit preference: a visual trajectory inherently aligns better with its corresponding language instruction than with mismatched pairs. Exploiting this, the popular Bradley-Terry model can be transformed into representation learning through proper reward reparameterizations. The resulting framework, DecisionNCE, mirrors an InfoNCE-style objective but is distinctively tailored for decision-making tasks. It elegantly extracts both local and global task-progression features, enforces temporal consistency through implicit time contrastive learning, and ensures trajectory-level instruction grounding via multimodal joint encoding. Evaluations on both simulated and real robots demonstrate that DecisionNCE effectively facilitates diverse downstream policy-learning tasks, offering a versatile solution for unified representation and reward learning.
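The implicit-preference idea above can be made concrete with a minimal NumPy sketch: score each visual trajectory segment (represented here simply as the displacement between its start and end frame features) against every language embedding in the batch, and treat the matched pair as the preferred one under an InfoNCE-style cross-entropy. Function names, the displacement feature, and the temperature value are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def logsumexp(x, axis=None, keepdims=False):
    """Numerically stable log-sum-exp."""
    m = np.max(x, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))
    return out if keepdims else np.squeeze(out, axis=axis)

def decision_nce_style_loss(frame_start, frame_end, lang_emb, temperature=0.1):
    """InfoNCE-style loss over implicit preferences (illustrative sketch).

    frame_start, frame_end: (B, D) visual features of a segment's start/end frames.
    lang_emb: (B, D) language-instruction features; row i matches row i.
    """
    # Task-progression feature: displacement in representation space.
    progress = frame_end - frame_start                       # (B, D)
    # Similarity of every progression vector with every instruction.
    logits = progress @ lang_emb.T / temperature             # (B, B)
    # Matched pairs sit on the diagonal; mismatched pairs act as negatives,
    # encoding the preference "trajectory i fits instruction i best".
    log_probs = logits - logsumexp(logits, axis=1, keepdims=True)
    return -np.mean(np.diag(log_probs))
```

When the progression direction agrees with the matched instruction embedding, the diagonal logits dominate and the loss is small; shuffling the instructions raises it.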

Downstream Control Tasks Results

The DecisionNCE encoders are pretrained on the large-scale human video dataset EpicKitchen. We freeze the pretrained vision-language encoders and feed their output representations into a 256-256 MLP to train language-conditioned behavior cloning (LCBC) policies.
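The downstream setup described above can be sketched as follows: the frozen encoders produce observation and instruction features, and only a small 256-256 MLP head is trained to map their concatenation to an action. This is a minimal NumPy sketch; the feature dimensions, the 7-dimensional action space, and the initialization scheme are assumptions for illustration, not the released training code.

```python
import numpy as np

def init_params(in_dim, hidden=256, action_dim=7, seed=0):
    """Weights for a 256-256 MLP policy head (dimensions are illustrative)."""
    rng = np.random.default_rng(seed)
    dims = [in_dim, hidden, hidden, action_dim]
    return [(rng.normal(scale=0.02, size=(d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def mlp_policy(obs_repr, lang_repr, params):
    """Forward pass of the LCBC policy head.

    obs_repr and lang_repr come from the frozen pretrained encoders;
    only this small head is trained, via behavior cloning on demonstrations.
    """
    x = np.concatenate([obs_repr, lang_repr], axis=-1)
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)   # ReLU hidden layers
    W, b = params[-1]
    return x @ W + b                     # predicted continuous action
```

Freezing the encoders keeps downstream training cheap: the only learnable parameters are the two 256-unit hidden layers and the output projection.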

Results on Real Robots


Figure 1: Real robot LCBC experimental results. Success rate is averaged over 10 episodes and 3 seeds.

Tasks: red cup on silver pan, red cup on red plate, duck on green plate, duck in pot, move pot, fold cloth, flip the red cup upright, open the microwave, close the microwave.

Results on Simulation

We also evaluate on the FrankaKitchen benchmark. We train LCBC policies on 5 tasks in the FrankaKitchen environment using 1/3/5 demonstrations per task. DecisionNCE achieves the highest success rate across all data quantities, demonstrating its effectiveness in extracting valuable information from out-of-domain data.


Figure 2: Simulation LCBC results. Max success rate averaged over 25 evaluation episodes and 3 seeds.



@inproceedings{li2024decisionnce,
          title={DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning},
          author={Li, Jianxiong and Zheng, Jinliang and Zheng, Yinan and Mao, Liyuan and Hu, Xiao and Cheng, Sijie and Niu, Haoyi and Liu, Jihao and Liu, Yu and Liu, Jingjing and others},
          booktitle={Forty-first International Conference on Machine Learning},
          year={2024}
}