Universal Actions for Enhanced Embodied Foundation Models
Jinliang Zheng1,2,5,*, Jianxiong Li1,*, Dongxiu Liu4,1,*, Yinan Zheng1, Zhihao Wang3,1, Zhonghong Ou1, Yu Liu2, Jingjing Liu1, Ya-Qin Zhang1, Xianyuan Zhan1,5,†
1AIR, Tsinghua University
2SenseTime Research
3Peking University
4Beijing University of Posts and Telecommunications
5Shanghai AI Lab
*Equal contribution. †Corresponding authors.
Training on diverse, internet-scale data is a key factor in the success of recent large foundation models, yet applying the same recipe to building embodied agents has proven difficult. Despite the availability of many crowd-sourced embodied datasets, their action spaces are often highly heterogeneous, since different robots have distinct physical embodiments and control interfaces, which poses substantial challenges for developing embodied foundation models on cross-embodiment data. In this paper, we introduce UniAct, a new embodied foundation modeling framework that operates in a Universal Action Space. The learned universal actions capture generic behaviors shared across diverse robots by exploiting their common structural features, and by eliminating this notorious heterogeneity they enable better cross-domain data utilization and cross-embodiment generalization. Moreover, universal actions can be efficiently translated back into heterogeneous actionable commands by simply adding embodiment-specific details, making fast adaptation to new robots simple and straightforward. Our 0.5B-parameter instantiation of UniAct outperforms 14× larger state-of-the-art embodied foundation models in extensive evaluations on various real-world and simulated robots, demonstrating exceptional cross-embodiment control and adaptation capability and highlighting the crucial benefit of adopting universal actions.
Play with Code!
Although universal actions represent highly abstract, embodiment-agnostic intentions, we can observe striking consistency in how these actions manifest across different embodiments. Try it yourself!
[Demo videos: WidowX robot in the real world · Franka robot in simulation · AIRBOT in the real world (few-shot adaptation)]
Generalized Cross-Embodiment Policy in Universal Action Space
Explore the potential of universal actions on even more embodiments!
[Demo videos: put the eggplant into the pot · flip the pot · lift AAA battery · put the toy man in the sink · put the corn in the pot · put the eggplant in the pot]
Methods
Our approach consists of the following key components:

Universal Action Space

A shared action representation that unifies robot control across different embodiments: universal actions capture the generic atomic behaviors common to diverse robots, and lightweight embodiment-specific translators convert them back into low-level, actionable commands.
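To make the idea concrete, here is a minimal sketch, assuming the universal action space is a learned discrete codebook and each embodiment gets a small decoder head conditioned on its own proprioception. The class names, sizes, and MLP design below are our illustrative assumptions, not UniAct's actual implementation.

```python
# Minimal sketch (assumptions: discrete codebook of universal actions +
# per-embodiment decoder heads; all names and sizes are illustrative).
import torch
import torch.nn as nn

class UniversalActionSpace(nn.Module):
    """Shared codebook: each row encodes one embodiment-agnostic behavior."""
    def __init__(self, num_codes: int = 256, dim: int = 128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # Pick the most likely universal action and return its embedding.
        idx = logits.argmax(dim=-1)   # (B,)
        return self.codebook(idx)     # (B, dim)

class EmbodimentHead(nn.Module):
    """Adds embodiment-specific details (here: proprioception) to turn a
    universal action into one robot's low-level command."""
    def __init__(self, dim: int, proprio_dim: int, action_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim + proprio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, universal: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([universal, proprio], dim=-1))

# One shared universal action, two different robots' commands.
space = UniversalActionSpace()
widowx_head = EmbodimentHead(dim=128, proprio_dim=7, action_dim=7)
franka_head = EmbodimentHead(dim=128, proprio_dim=9, action_dim=8)
logits = torch.randn(1, 256)                    # in practice, from a VLM backbone
z = space(logits)                               # embodiment-agnostic action
cmd_widowx = widowx_head(z, torch.zeros(1, 7))  # WidowX-specific command
cmd_franka = franka_head(z, torch.zeros(1, 9))  # Franka-specific command
```

Because the selected code index is shared, the same universal action can be decoded by any head, which is what makes the cross-embodiment demos above directly comparable.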
Experimental Results
Comprehensive evaluation across different platforms and tasks
Results on WidowX (Real World)

All numbers are success rates (%).

| Model | Visual Generalization | Motion Generalization | Physical Generalization | Semantic Generalization | Language Grounding | Average |
|---|---|---|---|---|---|---|
| Octo | 24.0 | 30.0 | 10.0 | 19.4 | 48.3 | 28.9 |
| OpenVLA-7B | 66.0 | 35.0 | 60.0 | 48.1 | 88.3 | 65.1 |
| UniAct-0.5B (Ours) | 69.0 | 66.0 | 70.0 | 38.8 | 73.3 | 63.3 |
Results on Franka (Simulation)

All numbers are success rates (%).

| Model | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | LIBERO-90 | Average |
|---|---|---|---|---|---|---|
| Octo | 17.5 | 12.0 | 36.5 | 20.0 | 32.5 | 27.7 |
| OpenVLA-7B | 43.0 | 66.5 | 54.5 | 13.0 | 44.1 | 44.1 |
| UniAct-0.5B (Ours) | 64.5 | 77.5 | 68.0 | 46.5 | 60.0 | 61.3 |
Results on AIRBOT with Different Control Interfaces (Fast Adaptation)

Success rates (%) are reported as easy / hard task.

| Model | Relative EEF Position (easy / hard) | Relative Joint Position (easy / hard) | Absolute EEF Position (easy / hard) | Absolute Joint Position (easy / hard) |
|---|---|---|---|---|
| Octo | 5.0 / 12.5 | 22.5 / 17.5 | 2.5 / 0.0 | 0.0 / 0.0 |
| OpenVLA-7B | 32.5 / 22.5 | 17.5 / 2.5 | 32.5 / 15.0 | 2.5 / 7.5 |
| UniAct-0.5B (Ours) | 57.5 / 40.0 | 47.5 / 45.0 | 52.5 / 30.0 | 70.0 / 30.0 |
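The fast-adaptation setting above follows the recipe described in the abstract: a new robot (or a new control interface on the same robot) is handled by only adding embodiment-specific details on top of the shared model. Under the same assumptions as the Methods sketch (frozen universal-action backbone, one fresh head per control interface), a hypothetical few-shot adaptation loop might look like the following; all names are illustrative, not UniAct's code.

```python
# Hypothetical few-shot adaptation: freeze the pretrained universal-action
# backbone and fit only a small new head for the target control interface
# (e.g. absolute joint positions). Illustrative sketch, not UniAct's code.
import torch
import torch.nn as nn

def adapt_to_new_embodiment(backbone: nn.Module, new_head: nn.Module,
                            demos, epochs: int = 10) -> nn.Module:
    """demos: iterable of (obs, instruction, proprio, expert_action) tuples."""
    for p in backbone.parameters():              # universal knowledge stays fixed
        p.requires_grad_(False)
    opt = torch.optim.AdamW(new_head.parameters(), lr=1e-4)
    for _ in range(epochs):
        for obs, instruction, proprio, expert_action in demos:
            universal = backbone(obs, instruction)   # embodiment-agnostic action
            pred = new_head(universal, proprio)      # interface-specific command
            loss = nn.functional.mse_loss(pred, expert_action)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return new_head
```

One plausible reading of the table is that the four control interfaces differ only in their heads, so switching from relative EEF deltas to absolute joint positions never requires retraining the shared model.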