(a) Data and Phase Decomposition: Task decomposition with aligned visual observations and language annotations.
(b) Input-level Adaptation via Masking: Phase-aware masking lets the model selectively attend to the tokens relevant to the current phase during attention computation, without modifying the input structure (see the sketch after this list).
(c) End-to-End Training: The model is trained end-to-end on the decomposed data with phase-aware masking.
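To make the masking idea concrete, here is a minimal sketch, assuming a PyTorch-style attention stack, of how a phase-aware attention mask could be built and injected into standard attention. The function name build_phase_mask, the phase-ID layout, and the choice of which tokens stay always visible (e.g. language tokens) are illustrative assumptions, not the paper's implementation.

# Minimal sketch (assumed PyTorch, hypothetical names) of phase-aware masking:
# tokens from phases other than the current one are hidden from attention,
# while the input token sequence itself is left unchanged.
import torch

def build_phase_mask(phase_ids: torch.Tensor, current_phase: int,
                     always_visible: torch.Tensor) -> torch.Tensor:
    """phase_ids: (seq_len,) phase index of each input token.
    always_visible: (seq_len,) bool, e.g. language / proprioception tokens.
    Returns a (seq_len, seq_len) boolean mask where True = attention allowed."""
    visible = (phase_ids == current_phase) | always_visible   # (seq_len,)
    # Every query token may attend only to the visible key tokens.
    return visible.unsqueeze(0).expand(phase_ids.numel(), -1)

# Usage: pass the mask to standard scaled dot-product attention.
seq_len, dim = 8, 16
q = k = v = torch.randn(1, 1, seq_len, dim)              # (batch, heads, seq, dim)
phase_ids = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2])       # hypothetical phase layout
always_visible = torch.tensor([True] * 2 + [False] * 6)  # e.g. language tokens
mask = build_phase_mask(phase_ids, current_phase=1, always_visible=always_visible)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)

Because the adaptation lives entirely in the attention mask, the same tokenized inputs can be reused across phases; only the mask changes as the task progresses.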
We design two tasks, Sorting and Cleaning, to validate the performance of Long-VLA on long-horizon tasks.
Long-VLA consistently outperforms both the Base Policy and the current state-of-the-art method, pi0.
In addition, it remains robust under unseen lighting conditions and in environments with visual distractors.
Below, we showcase several demos.
@misc{fan2025longvlaunleashinglonghorizoncapability,
  title={Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation},
  author={Yiguo Fan and Pengxiang Ding and Shuanghao Bai and Xinyang Tong and Yuyang Zhu and Hongchao Lu and Fengqi Dai and Wei Zhao and Yang Liu and Siteng Huang and Zhaoxin Fan and Badong Chen and Donglin Wang},
  year={2025},
  eprint={2508.19958},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2508.19958},
}