Abstract
Underwater environments pose challenges such as complex hydrodynamics, limited visibility, and constrained communication. To overcome the lack of large-scale data for multi-task autonomy, we present USIM, a simulation-based Vision-Language-Action (VLA) dataset comprising 561K frames (about 15.6 hours) from 1,852 trajectories of BlueROV2 interactions across 20 tasks in 9 scenarios. Building on USIM, we propose U0, a multimodal VLA model that fuses binocular vision and sensor data through a convolution-attention perception module. U0 achieves an 80% success rate across tasks including inspection, navigation, and tracking, and in challenging mobile manipulation tasks it reduces the distance to the target by 21.2% compared with baseline methods. Together, USIM and U0 establish a foundation for scalable underwater datasets and general-purpose robotic intelligence.
Method

Overall Framework. Diverse underwater scenarios and a BlueROV2 robot equipped with a manipulator and gripper are first constructed in the Stonefish simulator. Data collection and control are implemented via ROS, yielding the USIM dataset of 561K frames (approximately 15.6 hours) covering 20 tasks. Built on USIM, U0 adopts a dual-system architecture that incorporates multimodal sensor fusion and convolution-attention-based perception-focus enhancement, producing both target-perception and robot-action outputs.
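
As a rough illustration of this design, the sketch below shows how binocular images and low-dimensional sensor readings could be fused through a convolution-attention perception module that feeds both a target-perception head and an action head. This is a minimal PyTorch sketch under our own assumptions; the module names (ConvAttnPerception, U0PolicySketch), feature dimensions, and output heads are illustrative and do not reflect the released U0 implementation.

# Minimal sketch (not the authors' implementation) of a convolution-attention
# perception module that fuses binocular images with low-dimensional sensor
# data and emits both a target-perception output and a robot-action output.
# All module names, dimensions, and heads below are illustrative assumptions.
import torch
import torch.nn as nn


class ConvAttnPerception(nn.Module):
    """CNN features from both camera views, refined by self-attention."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Shared convolutional backbone applied to each camera view.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Self-attention over the spatial tokens of both views, intended to
        # concentrate the representation on task-relevant regions.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, left: torch.Tensor, right: torch.Tensor) -> torch.Tensor:
        tokens = []
        for img in (left, right):                        # (B, 3, H, W)
            f = self.backbone(img)                       # (B, C, h, w)
            tokens.append(f.flatten(2).transpose(1, 2))  # (B, h*w, C)
        x = torch.cat(tokens, dim=1)                     # tokens from both views
        attn_out, _ = self.attn(x, x, x)
        x = self.norm(x + attn_out)
        return x.mean(dim=1)                             # pooled visual feature (B, C)


class U0PolicySketch(nn.Module):
    """Fuses visual and sensor features; outputs a target estimate and an action."""

    def __init__(self, sensor_dim: int = 16, feat_dim: int = 128,
                 action_dim: int = 8, target_dim: int = 3):
        super().__init__()
        self.perception = ConvAttnPerception(feat_dim)
        self.sensor_mlp = nn.Sequential(nn.Linear(sensor_dim, feat_dim), nn.ReLU())
        self.fusion = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        self.target_head = nn.Linear(feat_dim, target_dim)  # e.g. relative target position
        self.action_head = nn.Linear(feat_dim, action_dim)  # e.g. thruster + gripper commands

    def forward(self, left, right, sensors):
        v = self.perception(left, right)
        s = self.sensor_mlp(sensors)
        h = self.fusion(torch.cat([v, s], dim=-1))
        return self.target_head(h), self.action_head(h)


if __name__ == "__main__":
    model = U0PolicySketch()
    left = torch.randn(2, 3, 128, 128)
    right = torch.randn(2, 3, 128, 128)
    sensors = torch.randn(2, 16)
    target, action = model(left, right, sensors)
    print(target.shape, action.shape)  # torch.Size([2, 3]) torch.Size([2, 8])

The dual-head output mirrors the paper's stated goal of producing both target perception and robot actions from a shared fused representation; how the two heads are supervised and balanced is not specified here.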
Underwater Scenarios
BibTeX
@misc{gu2025usimu0visionlanguageactiondataset,
      title={USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots},
      author={Junwen Gu and Zhiheng Wu and Pengxuan Si and Shuang Qiu and Yukai Feng and Luoyang Sun and Laien Luo and Lianyi Yu and Jian Wang and Zhengxing Wu},
      year={2025},
      eprint={2510.07869},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2510.07869},
}
Note
The dataset, code, and model weights will be made publicly available after the paper’s formal publication.