(a) Data and Phase Decomposition: Task decomposition with aligned visual observations and language annotations.
(b) Input-level Adaptation via Masking: Phase-aware masking lets the model selectively attend to the tokens relevant to the current phase during attention computation, without modifying the input structure (see the sketch after this list).
(c) End-to-End Training: The model is trained end-to-end on the decomposed data with phase-aware masking.
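To make the masking idea concrete, here is a minimal sketch, assuming a PyTorch-style attention stack, of how a phase-aware attention mask could be built and injected into standard attention. The function name build_phase_mask, the phase-ID layout, and the choice of which tokens stay always visible (e.g. language tokens) are illustrative assumptions, not the paper's implementation.

# Minimal sketch (assumed PyTorch, hypothetical names) of phase-aware masking:
# tokens from phases other than the current one are hidden from attention,
# while the input token sequence itself is left unchanged.
import torch

def build_phase_mask(phase_ids: torch.Tensor, current_phase: int,
                     always_visible: torch.Tensor) -> torch.Tensor:
    """phase_ids: (seq_len,) phase index of each input token.
    always_visible: (seq_len,) bool, e.g. language / proprioception tokens.
    Returns a (seq_len, seq_len) boolean mask where True = attention allowed."""
    visible = (phase_ids == current_phase) | always_visible   # (seq_len,)
    # Every query token may attend only to the visible key tokens.
    return visible.unsqueeze(0).expand(phase_ids.numel(), -1)

# Usage: pass the mask to standard scaled dot-product attention.
seq_len, dim = 8, 16
q = k = v = torch.randn(1, 1, seq_len, dim)              # (batch, heads, seq, dim)
phase_ids = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2])       # hypothetical phase layout
always_visible = torch.tensor([True] * 2 + [False] * 6)  # e.g. language tokens
mask = build_phase_mask(phase_ids, current_phase=1, always_visible=always_visible)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)

Because the adaptation lives entirely in the attention mask, the same tokenized inputs can be reused across phases; only the mask changes as the task progresses.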
We design two tasks, Sorting and Cleaning, to validate the performance of Long-VLA on long-horizon tasks.
Long-VLA consistently outperforms both the Base Policy and the current state-of-the-art method, pi0.
In addition, it remains robust under unseen lighting conditions and in environments with visual distractors.
Below, we showcase several demos.
@misc{fan2025longvlaunleashinglonghorizoncapability,
  title={Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation},
  author={Yiguo Fan and Pengxiang Ding and Shuanghao Bai and Xinyang Tong and Yuyang Zhu and Hongchao Lu and Fengqi Dai and Wei Zhao and Yang Liu and Siteng Huang and Zhaoxin Fan and Badong Chen and Donglin Wang},
  year={2025},
  eprint={2508.19958},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2508.19958},
}