Multimodal pretraining has emerged as an effective strategy for pursuing three key goals of representation learning in autonomous robots: 1) extracting both local and global task progression information; 2) enforcing temporal consistency of visual representations; and 3) capturing trajectory-level language grounding. Most existing methods pursue these goals with separate objectives, which often converge to sub-optimal solutions. In this paper, we propose a universal unified objective that simultaneously extracts meaningful task progression information from image sequences and seamlessly aligns it with language instructions. We observe that implicit preferences exist in the data: a visual trajectory inherently aligns better with its corresponding language instruction than with mismatched pairs. Leveraging this, we show that the popular Bradley-Terry model can be transformed into a representation learning objective through proper reward reparameterization. The resulting framework, DecisionNCE, mirrors an InfoNCE-style objective but is distinctively tailored for decision-making tasks. It elegantly extracts both local and global task progression features, enforces temporal consistency through implicit time contrastive learning, and ensures trajectory-level instruction grounding via multimodal joint encoding. Evaluations on both simulated and real robots demonstrate that DecisionNCE effectively facilitates diverse downstream policy learning tasks, offering a versatile solution for unified representation and reward learning.
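To make the implicit-preference idea concrete, the snippet below is a minimal sketch of an InfoNCE-style objective over batched (trajectory segment, instruction) pairs, where the "reward" of a pair is the similarity between the visual progression of the segment and the instruction embedding, and matched pairs are preferred over mismatched ones. The function name `decision_nce_style_loss`, the endpoint-difference progression term, and the temperature value are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def decision_nce_style_loss(img_start, img_end, text_emb, temperature=0.07):
    """InfoNCE-style objective over implicit preferences (illustrative sketch).

    img_start, img_end: [B, D] visual embeddings of a segment's first/last frames
    text_emb:           [B, D] language embeddings of the paired instructions
    Matched (diagonal) pairs are treated as preferred over all mismatched pairs
    in the batch.
    """
    # Task progression captured as the difference between segment endpoints
    progression = F.normalize(img_end - img_start, dim=-1)   # [B, D]
    text_emb = F.normalize(text_emb, dim=-1)                 # [B, D]

    # Pairwise "rewards": logits[i, j] = similarity(progression_i, text_j)
    logits = progression @ text_emb.T / temperature          # [B, B]

    # Symmetric cross-entropy: matched pairs beat mismatched ones
    labels = torch.arange(logits.size(0), device=logits.device)
    loss = 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
    return loss
```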
The DecisionNCE encoders are pretrained on the large-scale human video dataset EpicKitchen. We freeze the pretrained vision-language encoders and feed their output representations into a two-layer MLP (256 hidden units per layer) to train language-conditioned behavior cloning (LCBC) policies.
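The sketch below illustrates one way such a policy could look: frozen vision and language encoders feeding a small trainable 256-256 MLP head. The class name `LCBCPolicy`, the concatenation of image and language features, and the continuous-action regression output are assumptions for illustration; the exact fusion used in our experiments may differ.

```python
import torch
import torch.nn as nn

class LCBCPolicy(nn.Module):
    """Language-conditioned BC head on top of frozen pretrained encoders.

    Only the two-hidden-layer (256-256) MLP head is trained; the pretrained
    vision and language encoders stay frozen.
    """
    def __init__(self, vision_encoder, text_encoder, feat_dim, action_dim):
        super().__init__()
        self.vision_encoder = vision_encoder.eval()
        self.text_encoder = text_encoder.eval()
        for p in self.vision_encoder.parameters():
            p.requires_grad_(False)
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)

        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image, instruction_tokens):
        with torch.no_grad():
            img_feat = self.vision_encoder(image)                # [B, feat_dim]
            lang_feat = self.text_encoder(instruction_tokens)    # [B, feat_dim]
        return self.head(torch.cat([img_feat, lang_feat], dim=-1))
```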
Figure 1: Real-robot LCBC results. Success rates are averaged over 10 episodes and 3 seeds.
We also evaluate on the FrankaKitchen benchmark, training LCBC policies on 5 tasks with 1/3/5 demonstrations per task. DecisionNCE achieves the highest success rates across all dataset sizes, demonstrating its effectiveness in extracting valuable information from out-of-domain data.
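For completeness, the following is a minimal sketch of how such a policy head might be fit on the few available demonstrations with plain behavior cloning; the `train_lcbc` helper, the dataloader format, and the MSE action loss are illustrative assumptions rather than our exact training recipe.

```python
import torch
import torch.nn.functional as F

def train_lcbc(policy, dataloader, epochs=100, lr=1e-3):
    """Plain behavior cloning on (image, instruction, action) tuples (sketch).

    Only parameters with requires_grad=True (the MLP head) are optimized;
    the frozen encoders inside `policy` receive no gradients.
    """
    optim = torch.optim.Adam(
        (p for p in policy.parameters() if p.requires_grad), lr=lr)
    for _ in range(epochs):
        for image, instruction_tokens, action in dataloader:
            pred = policy(image, instruction_tokens)
            loss = F.mse_loss(pred, action)   # regress expert actions
            optim.zero_grad()
            loss.backward()
            optim.step()
    return policy
```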
Figure 2: Simulation LCBC results. Max success rate averaged over 25 evaluation episodes and 3 seeds.
@inproceedings{lidecisionnce,
  title={DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning},
  author={Li, Jianxiong and Zheng, Jinliang and Zheng, Yinan and Mao, Liyuan and Hu, Xiao and Cheng, Sijie and Niu, Haoyi and Liu, Jihao and Liu, Yu and Liu, Jingjing and others},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024}
}