Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking

Yunpeng Qi^{3, 4}*, Jiazhao Zhang^{1, 2}*,
Minghan Li², Shaoan Wang¹, Kui Wu⁵, Hanjing Ye⁶,
Hong Zhang⁶, Zhibo Chen³, Fangwei Zhong⁷, Zhizheng Zhang^{2, 4}^†, He Wang^{1, 2, 4}^†

¹Peking University ²GalBot ³USTC ⁴BAAI
⁵Beihang University ⁶SUSTech ⁷Beijing Normal University

*Equal Contribution †Equal Advising

TrackVLA++ teaser image

TrackVLA++ is a novel Vision-Language-Action model that incorporates spatial reasoning and target identification memory, enabling superior performance in both long-horizon and highly crowded tracking scenarios.

Summary Video

Abstract

Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision-Language-Action (VLA) model that enhances embodied visual tracking with two key modules: a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target's relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1% and 12% respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.

TrackVLA++ Pipeline

Given a video stream and a language instruction, TrackVLA++ predicts a tracking trajectory by utilizing Polar-CoT reasoning to infer the target's position and continuously updating the Target Identification Memory with CoT-based predictions for long-horizon tracking.

Long Horizon Tracking

TrackVLA++ achieves sustained, long-horizon tracking in complex urban environments, maintaining continuity for over 30 minutes along multi-kilometer trajectories. It remains robust to heavy distractors, multiple barriers, and pronounced illumination changes, demonstrating stable performance in diverse and dynamic settings.

Unstructured Scene Tracking:

TrackVLA++ maintains target identity across complex, unstructured environments, including narrow crosswalks, barrier-dense areas, and wooded scenes; it handles large heading changes, traverses tight corridors into buildings, and continues through constrained spaces such as elevators.

Crowded Scene Tracking

TrackVLA++ exhibits robust performance in crowded scenes with severe occlusions and target disappearance, demonstrating strong spatiotemporal reasoning, consistent re-identification, and long-term memory. It is robust to dynamic distractors common in urban walkways—such as pedestrian flows, crosswalk traffic, and heterogeneous motions—sustaining stable tracking on sidewalks, crosswalks, and related urban settings.

Recognition Ability

TrackVLA++ leverages multi-camera inputs to recognize targets under off-axis views (i.e., non-frontal, oblique perspectives) and reorient toward the target, demonstrating strong spatial understanding and cross-view reasoning capability.

Cross Domain Generalization

TrackVLA++ exhibits cross-domain generalization, enabling robust tracking across diverse scene styles, viewpoints, and camera parameters, while also generalizing beyond human subjects to a broad spectrum of objects, all without additional adaptation.