(a) Data and Phase Decomposition: Task decomposition with aligned visual observations and language annotations.
(b) Input-level Adaptation via Masking: Phase-aware masking lets the model attend only to tokens relevant to the current phase during attention computation, without modifying the input structure.
(c) End-to-End Training: End-to-end training on the decomposed data with phase-aware masking.
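Conceptually, phase-aware masking amounts to building a boolean attention mask from per-token phase labels, so that queries can only attend to keys belonging to the currently relevant phases. The sketch below is illustrative only: the function name `phase_aware_mask`, the integer phase labels, and the masking granularity are assumptions, not the actual Long-VLA implementation.

```python
import numpy as np

def phase_aware_mask(phase_ids, relevant_phases):
    """Build a (T, T) boolean attention mask over T input tokens.

    phase_ids       : (T,) array of per-token phase labels (assumed scheme).
    relevant_phases : set of phase ids relevant to the current sub-task.

    Tokens from irrelevant phases are excluded as attention keys;
    the token sequence itself is left unchanged.
    """
    keep = np.isin(phase_ids, list(relevant_phases))  # (T,) True = attendable
    # Broadcast to (T, T): query i may attend key j only if keep[j] is True.
    return np.broadcast_to(keep, (len(phase_ids), len(phase_ids)))

# Example: 6 tokens spanning two phases; attend only to phase 1.
mask = phase_aware_mask(np.array([0, 0, 1, 1, 1, 0]), {1})
print(mask.shape)  # (6, 6)
```

In an attention layer, such a mask would typically be applied by setting the scores of masked positions to a large negative value before the softmax, which removes irrelevant tokens from the computation without changing the input tokens themselves.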
We design two tasks, Sorting and Cleaning, to validate the performance of Long-VLA on long-horizon tasks.
Long-VLA consistently outperforms both the Base Policy and the current state-of-the-art method, pi0.
In addition, Long-VLA remains robust under unseen lighting conditions and in environments with visual distractors.
Below, we showcase several demos.