Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation

1MiLAB, Westlake University

2Zhejiang University

3Xi'an Jiaotong University

4University of Electronic Science and Technology of China

5Beijing University of Aeronautics and Astronautics

(*Equal Contribution, Project Leader, Equal Corresponding Author)

CoRL 2025

Overview


In this work, we introduce Long-VLA, the first end-to-end VLA model specifically designed for long-horizon robotic tasks. Our approach features a novel phase-aware input masking strategy that adaptively segments each subtask into moving and interaction phases, enabling the model to focus on phase-relevant sensory cues and enhancing subtask compatibility. This unified strategy preserves the scalability and data efficiency of VLA training, and our architecture-agnostic module can be seamlessly integrated into existing VLA models. We further propose the L-CALVIN benchmark to systematically evaluate long-horizon manipulation. Extensive experiments on both simulated and real-world tasks demonstrate that Long-VLA significantly outperforms prior state-of-the-art methods, establishing a new baseline for long-horizon robotic control.
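The phase segmentation described above can be illustrated with a toy heuristic. The sketch below labels each timestep of a subtask as "moving" (approaching the target) or "interaction" (near or in contact with it) based on a gripper-to-target distance signal; the distance threshold and the signal itself are assumptions for illustration, not Long-VLA's actual segmentation procedure.

```python
import numpy as np

def split_phases(gripper_to_target_dist, contact_thresh=0.05):
    """Hypothetical heuristic: label each timestep as a 'moving' phase
    (far from the target) or an 'interaction' phase (near/in contact),
    based on a distance signal. Long-VLA's real segmentation may differ."""
    dist = np.asarray(gripper_to_target_dist)
    return np.where(dist > contact_thresh, "moving", "interaction")

# A toy subtask trajectory: the gripper approaches, then manipulates.
traj = [0.40, 0.25, 0.12, 0.04, 0.02, 0.03]
print(list(split_phases(traj)))
# ['moving', 'moving', 'moving', 'interaction', 'interaction', 'interaction']
```

Once each subtask is split this way, every timestep carries a phase label that the masking module can condition on.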

Long-VLA Framework

(a) Data and Phase Decomposition: Each task is decomposed into subtasks and phases, with visual observations aligned to language annotations.
(b) Input-level Adaptation via Masking: Phase-aware masking lets the model selectively attend to phase-relevant tokens during attention computation, without modifying the input structure.
(c) End-to-End Training: The model is trained end to end on the decomposed data with phase-aware masking applied.
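Step (b) can be sketched as standard scaled dot-product attention in which tokens irrelevant to the current phase receive a score of negative infinity, so they get zero attention weight while the input sequence itself is left untouched. The token layout (e.g. which tokens count as phase-relevant) is a made-up example, not the paper's actual tokenization.

```python
import numpy as np

def masked_attention(q, k, v, keep):
    """Scaled dot-product attention where tokens with keep=False get
    -inf logits and therefore zero attention weight. This mirrors
    input-level masking applied inside attention rather than by
    editing the input sequence."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (Tq, Tk) logits
    scores = np.where(keep[None, :], scores, -np.inf)  # hide masked tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax over kept tokens
    return w @ v

rng = np.random.default_rng(0)
T, D = 6, 4
q, k, v = rng.normal(size=(T, D)), rng.normal(size=(T, D)), rng.normal(size=(T, D))

# Hypothetical "moving"-phase mask: attend only to the first four tokens
# (say, global-view tokens), ignoring the last two (say, wrist-camera tokens).
moving_keep = np.array([True, True, True, True, False, False])
out = masked_attention(q, k, v, moving_keep)
print(out.shape)  # (6, 4)
```

Because the mask only alters attention logits, the same model weights and input format serve every phase; only the `keep` vector changes between moving and interaction phases.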

Experiments

We design two tasks, Sorting and Cleaning, to validate Long-VLA's performance on long-horizon tasks.
Long-VLA consistently outperforms both the Base Policy and the current SOTA method, π0.
It also remains robust under unseen lighting conditions and visually distracting environments. Below, we showcase several demos.

Cleaning up the kitchen

Random Location 1

Random Location 2

Unseen Lighting

Visual Distraction

Sorting the cubes in the order 'CORL'

Random Location 1

Random Location 2

Unseen Lighting

Visual Distraction

BibTeX