Universal Actions for Enhanced Embodied Foundation Models
Jinliang Zheng1,2,5,*, Jianxiong Li1,*, Dongxiu Liu4,1,*, Yinan Zheng1, Zhihao Wang3,1, Zhonghong Ou1, Yu Liu2, Jingjing Liu1, Ya-Qin Zhang1, Xianyuan Zhan1,5,†
1AIR, Tsinghua University
2SenseTime Research
3Peking University
4Beijing University of Posts and Telecommunications
5Shanghai AI Lab
*Equal contribution. †Corresponding authors.
Training on diverse, internet-scale data is a key factor in the success of recent large foundation models, yet applying the same recipe to building embodied agents has proven difficult. Despite the availability of many crowd-sourced embodied datasets, their action spaces are often highly heterogeneous, since different robots have distinct physical embodiments and control interfaces, which poses substantial challenges for developing embodied foundation models on cross-embodiment data. In this paper, we introduce UniAct, a new embodied foundation modeling framework that operates in a Universal Action Space. The learned universal actions capture generic behaviors shared across diverse robots by exploiting their common structural features, and by eliminating this notorious heterogeneity they enable better cross-domain data utilization and cross-embodiment generalization. Moreover, universal actions can be efficiently translated back into heterogeneous actionable commands by simply adding embodiment-specific details, making fast adaptation to new robots simple and straightforward. Our 0.5B-parameter instantiation of UniAct outperforms 14× larger state-of-the-art embodied foundation models in extensive evaluations on various real-world and simulated robots, demonstrating exceptional cross-embodiment control and adaptation capability and highlighting the crucial benefit of adopting universal actions.
Play with Code!
Although universal actions represent highly abstract, embodiment-agnostic intentions, we can observe striking consistency in how these actions manifest across different embodiments. Try it yourself!
[Demo videos: WidowX robot in the real world · Franka robot in simulation · AIRBOT in the real world (few-shot adaptation)]
Generalized Cross-Embodiment Policy in Universal Action Space
Explore the potential of universal actions on even more embodiments!
[Demo videos: put the eggplant into the pot · flip the pot · lift AAA battery · put the toy man in the sink · put the corn in the pot · put the eggplant in the pot]
Methods
Our approach consists of the following key components:

Universal Action Space

A shared action representation that unifies robot control across different embodiments: universal actions capture the generic atomic behaviors common to diverse robots, and lightweight embodiment-specific translators convert them back into low-level, actionable commands.
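To make the idea concrete, here is a minimal sketch, assuming the universal action space is a learned discrete codebook and each embodiment gets a small decoder head conditioned on its own proprioception. The class names, sizes, and MLP design below are our illustrative assumptions, not UniAct's actual implementation.

```python
# Minimal sketch (assumptions: discrete codebook of universal actions +
# per-embodiment decoder heads; all names and sizes are illustrative).
import torch
import torch.nn as nn

class UniversalActionSpace(nn.Module):
    """Shared codebook: each row encodes one embodiment-agnostic behavior."""
    def __init__(self, num_codes: int = 256, dim: int = 128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # Pick the most likely universal action and return its embedding.
        idx = logits.argmax(dim=-1)   # (B,)
        return self.codebook(idx)     # (B, dim)

class EmbodimentHead(nn.Module):
    """Adds embodiment-specific details (here: proprioception) to turn a
    universal action into one robot's low-level command."""
    def __init__(self, dim: int, proprio_dim: int, action_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim + proprio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, universal: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([universal, proprio], dim=-1))

# One shared universal action, two different robots' commands.
space = UniversalActionSpace()
widowx_head = EmbodimentHead(dim=128, proprio_dim=7, action_dim=7)
franka_head = EmbodimentHead(dim=128, proprio_dim=9, action_dim=8)
logits = torch.randn(1, 256)                    # in practice, from a VLM backbone
z = space(logits)                               # embodiment-agnostic action
cmd_widowx = widowx_head(z, torch.zeros(1, 7))  # WidowX-specific command
cmd_franka = franka_head(z, torch.zeros(1, 9))  # Franka-specific command
```

Because the selected code index is shared, the same universal action can be decoded by any head, which is what makes the cross-embodiment demos above directly comparable.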
Experimental Results
Comprehensive evaluation across different platforms and tasks
Results on WidowX (Real World)

All numbers are success rates (%).

| Model | Visual Generalization | Motion Generalization | Physical Generalization | Semantic Generalization | Language Grounding | Average |
|---|---|---|---|---|---|---|
| Octo | 24.0 | 30.0 | 10.0 | 19.4 | 48.3 | 28.9 |
| OpenVLA-7B | 66.0 | 35.0 | 60.0 | 48.1 | 88.3 | 65.1 |
| UniAct-0.5B (Ours) | 69.0 | 66.0 | 70.0 | 38.8 | 73.3 | 63.3 |
Results on Franka (Simulation)

All numbers are success rates (%).

| Model | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | LIBERO-90 | Average |
|---|---|---|---|---|---|---|
| Octo | 17.5 | 12.0 | 36.5 | 20.0 | 32.5 | 27.7 |
| OpenVLA-7B | 43.0 | 66.5 | 54.5 | 13.0 | 44.1 | 44.1 |
| UniAct-0.5B (Ours) | 64.5 | 77.5 | 68.0 | 46.5 | 60.0 | 61.3 |
Results on AIRBOT with Different Control Interfaces (Fast Adaptation)

Success rates (%) are reported as easy / hard task.

| Model | Relative EEF Position (easy / hard) | Relative Joint Position (easy / hard) | Absolute EEF Position (easy / hard) | Absolute Joint Position (easy / hard) |
|---|---|---|---|---|
| Octo | 5.0 / 12.5 | 22.5 / 17.5 | 2.5 / 0.0 | 0.0 / 0.0 |
| OpenVLA-7B | 32.5 / 22.5 | 17.5 / 2.5 | 32.5 / 15.0 | 2.5 / 7.5 |
| UniAct-0.5B (Ours) | 57.5 / 40.0 | 47.5 / 45.0 | 52.5 / 30.0 | 70.0 / 30.0 |
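The fast-adaptation setting above follows the recipe described in the abstract: a new robot (or a new control interface on the same robot) is handled by only adding embodiment-specific details on top of the shared model. Under the same assumptions as the Methods sketch (frozen universal-action backbone, one fresh head per control interface), a hypothetical few-shot adaptation loop might look like the following; all names are illustrative, not UniAct's code.

```python
# Hypothetical few-shot adaptation: freeze the pretrained universal-action
# backbone and fit only a small new head for the target control interface
# (e.g. absolute joint positions). Illustrative sketch, not UniAct's code.
import torch
import torch.nn as nn

def adapt_to_new_embodiment(backbone: nn.Module, new_head: nn.Module,
                            demos, epochs: int = 10) -> nn.Module:
    """demos: iterable of (obs, instruction, proprio, expert_action) tuples."""
    for p in backbone.parameters():              # universal knowledge stays fixed
        p.requires_grad_(False)
    opt = torch.optim.AdamW(new_head.parameters(), lr=1e-4)
    for _ in range(epochs):
        for obs, instruction, proprio, expert_action in demos:
            universal = backbone(obs, instruction)   # embodiment-agnostic action
            pred = new_head(universal, proprio)      # interface-specific command
            loss = nn.functional.mse_loss(pred, expert_action)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return new_head
```

One plausible reading of the table is that the four control interfaces differ only in their heads, so switching from relative EEF deltas to absolute joint positions never requires retraining the shared model.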