Universal Actions for Enhanced Embodied Foundation Models

Jinliang Zheng1,2,5,*, Jianxiong Li1,*, Dongxiu Liu4,1,*, Yinan Zheng1, Zhihao Wang3,1, Zhonghong Ou1, Yu Liu2, Jingjing Liu1, Ya-Qin Zhang1, Xianyuan Zhan1,5,†

1AIR, Tsinghua University
2SenseTime Research
3Peking University
4Beijing University of Posts and Telecommunications
5Shanghai AI Lab
*Equal contribution, †Corresponding authors


Training on diverse, internet-scale data is a key factor in the success of recent large foundation models. Yet the same recipe has proven difficult to apply to embodied agents. Although many crowd-sourced embodied datasets are available, their action spaces are highly heterogeneous because different robots have distinct physical embodiments and control interfaces, which poses substantial challenges for developing embodied foundation models on cross-embodiment data. In this paper, we introduce UniAct, a new embodied foundation modeling framework operating in the Universal Action Space. Our learned universal actions capture generic behaviors shared across diverse robots by exploiting their common structural features, and by eliminating this notorious heterogeneity they enable better cross-domain data utilization and cross-embodiment generalization. Moreover, universal actions can be efficiently translated back into heterogeneous actionable commands by simply adding embodiment-specific details, making fast adaptation to new robots simple and straightforward. Our 0.5B-parameter instantiation of UniAct outperforms SOTA embodied foundation models that are 14x larger in extensive evaluations on real-world and simulated robots, demonstrating exceptional cross-embodiment control and adaptation capability and highlighting the crucial benefit of adopting universal actions.

Play with Code!

Although universal actions represent highly abstract embodiment-agnostic intentions, we can observe fascinating consistencies in how these actions manifest across different embodiments. Try it yourself!

WidowX Robot in real world

Franka Robot in simulation

AIRBOT in real world (few-shot adaptation)

Generalized Cross-Embodiment Policy in Universal Action Space

Explore the potential of universal actions in more embodiments!

Demo tasks: put the eggplant into the pot; flip the pot; lift AAA battery; put the toy man in the sink; put the corn in the pot; put the eggplant in the pot.


Methods

Our approach consists of the following key components:


Universal Action Space

A shared, embodiment-agnostic action representation that unifies robot control across different embodiments.
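Concretely, the framework first maps observations and instructions to an embodiment-agnostic universal action, which an embodiment-specific head then translates into actionable commands. Below is a minimal PyTorch sketch of that interface; the discrete codebook design, the class names (UniversalActionCodebook, EmbodimentHead), and all dimensions are illustrative assumptions, not the released UniAct implementation.

```python
# Illustrative sketch only: a shared universal action space plus per-embodiment
# decoder heads. Names and sizes are assumptions, not the official UniAct code.
import torch
import torch.nn as nn


class UniversalActionCodebook(nn.Module):
    """Discrete, embodiment-agnostic action space shared by every robot."""

    def __init__(self, num_codes: int = 256, dim: int = 128):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # The backbone scores the codes; pick the most likely universal action.
        idx = logits.argmax(dim=-1)
        return self.codes(idx)


class EmbodimentHead(nn.Module):
    """Adds embodiment-specific details, turning a universal action into
    low-level commands (e.g. 7-DoF end-effector deltas for one robot)."""

    def __init__(self, dim: int = 128, action_dim: int = 7):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, universal_action: torch.Tensor) -> torch.Tensor:
        return self.decoder(universal_action)


# Usage: one codebook shared across robots, one small head per embodiment.
codebook = UniversalActionCodebook()
heads = {"widowx": EmbodimentHead(action_dim=7), "airbot": EmbodimentHead(action_dim=6)}
logits = torch.randn(1, 256)             # stand-in for the backbone's code scores
universal = codebook(logits)             # embodiment-agnostic intent
widowx_cmd = heads["widowx"](universal)  # embodiment-specific command
```

Because every robot shares the same codebook, heterogeneous datasets supervise a common policy, while only the small heads differ per embodiment.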

Experimental Results

Comprehensive evaluation across different platforms and tasks

Results on WidowX (Real World)

| Model | Visual Generalization | Motion Generalization | Physical Generalization | Semantic Generalization | Language Grounding | Average |
|---|---|---|---|---|---|---|
| Octo | 24.0 | 30.0 | 10.0 | 19.4 | 48.3 | 28.9 |
| OpenVLA-7B | 66.0 | 35.0 | 60.0 | 48.1 | 88.3 | 65.1 |
| UniAct-0.5B (Ours) | 69.0 | 66.0 | 70.0 | 38.8 | 73.3 | 63.3 |

Results on Franka (Simulation)

| Model | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | LIBERO-90 | Average |
|---|---|---|---|---|---|---|
| Octo | 17.5 | 12.0 | 36.5 | 20.0 | 32.5 | 27.7 |
| OpenVLA-7B | 43.0 | 66.5 | 54.5 | 13.0 | 44.1 | 44.1 |
| UniAct-0.5B (Ours) | 64.5 | 77.5 | 68.0 | 46.5 | 60.0 | 61.3 |

Results on AIRBOT with different control interfaces (Fast Adaptation)

| Model | Relative EEF Position (easy / hard) | Relative Joint Position (easy / hard) | Absolute EEF Position (easy / hard) | Absolute Joint Position (easy / hard) |
|---|---|---|---|---|
| Octo | 5.0 / 12.5 | 22.5 / 17.5 | 2.5 / 0.0 | 0.0 / 0.0 |
| OpenVLA-7B | 32.5 / 22.5 | 17.5 / 2.5 | 32.5 / 15.0 | 2.5 / 7.5 |
| UniAct-0.5B (Ours) | 57.5 / 40.0 | 47.5 / 45.0 | 52.5 / 30.0 | 70.0 / 30.0 |
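Fast adaptation here means attaching a new head for an unseen control interface while the shared universal-action components stay frozen. The sketch below shows one way this could look in PyTorch on a handful of demonstrations; the function and variable names (adapt_to_new_embodiment, backbone, codebook, new_head, demos) are hypothetical, and the training loop is an assumption rather than the authors' code.

```python
# Hedged sketch: few-shot adaptation to a new control interface by training
# only a fresh embodiment head. All names here are illustrative assumptions.
import torch
import torch.nn.functional as F


def adapt_to_new_embodiment(backbone, codebook, new_head, demos, epochs: int = 10):
    # Freeze the modules that produce universal actions; only the head learns.
    for module in (backbone, codebook):
        for p in module.parameters():
            p.requires_grad = False

    optim = torch.optim.AdamW(new_head.parameters(), lr=1e-4)
    for _ in range(epochs):
        for obs, target_action in demos:             # a few teleoperated demos
            with torch.no_grad():
                universal = codebook(backbone(obs))  # embodiment-agnostic intent
            loss = F.mse_loss(new_head(universal), target_action)
            optim.zero_grad()
            loss.backward()
            optim.step()
    return new_head
```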