DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Wenyao Zhang,
Hongsi Liu27*,
Zekun Qi34*,
Yunan Wang12*,
Xinqiang Yu4,
Jiazhao Zhang45,
Runpei Dong6,
Jiawei He4,
He Wang45,
Zhizheng Zhang4,
Li Yi3,
Wenjun Zeng2,
Xin Jin2†
* equal contribution † corresponding authors
1Shanghai Jiao Tong University
2Eastern Institute of Technology
3Tsinghua University
4Galbot
5Peking University
6UIUC
7University of Science and Technology of China
Abstract
Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods rely on image-based forecasting, which is challenging because full frames carry redundant information and lack comprehensive, critical world knowledge such as dynamic, spatial, and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world-knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces dynamic-region-guided world-knowledge prediction, integrated with spatial and semantic cues, which provides compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world: first forming abstract multimodal reasoning chains, then acting. To mitigate interference among the dynamic, spatial, and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments in both real-world and simulated environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks and an average length of 4.44 on the CALVIN ABC-D benchmark.
Comparison with previous VLA paradigms
(a) Vanilla VLA directly maps visual observations and language instructions to actions. (b) Models that leverage separate image/video generation or copilot models to generate future frames or trajectories, which subsequently guide an action head. (c) VLA variants that explicitly predict a subgoal image as an intermediate visual reasoning step prior to action generation. (d) Our proposed DreamVLA, which explicitly predicts dynamic regions, depth maps, and semantic knowledge (DINOv2 and SAM features), significantly enhancing the model's action reasoning and generalization.
Pipeline
Given the current robot state, observation, and language instruction, DreamVLA encodes the multimodal inputs via frozen text and visual encoders and a tunable state encoder. These tokens, together with a learnable set of <dream> queries, are processed by a large language model to produce a world embedding. Three lightweight decoders then project the corresponding parts of this embedding into dynamic regions, monocular depth, and high-level semantics. A separate <action> query draws a latent action embedding, which conditions a diffusion transformer that refines Gaussian noise into an n-step action sequence. The dashed box highlights prediction heads that are used only during training; inference skips these heads and operates directly on the world embedding.
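As a rough illustration of the data flow above, the sketch below wires toy stand-ins for each stage in numpy. All shapes, names, and the random-projection "encoders" and "LLM" are hypothetical placeholders, not the actual DreamVLA implementation; it only shows how shared tokens, <dream> sub-queries, and the <action> query travel through one backbone and are sliced back apart.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # hidden size (toy value, an assumption)

def encode(tokens_in, d_in):
    """Stand-in for a frozen encoder: a fixed random projection to width D."""
    W = rng.standard_normal((d_in, D)) / np.sqrt(d_in)
    return tokens_in @ W

# Hypothetical inputs: 16 image patches, 8 text tokens, 1 robot-state vector.
img = rng.standard_normal((16, 48))
txt = rng.standard_normal((8, 24))
state = rng.standard_normal((1, 7))
shared = np.concatenate([encode(img, 48), encode(txt, 24), encode(state, 7)])

# Learnable <dream> sub-queries (dynamic / depth / semantic) and <action> query.
dream = rng.standard_normal((3, D))
action_q = rng.standard_normal((1, D))

def llm(tokens):
    """Stand-in for the LLM backbone: a single token-wise mixing layer."""
    return np.tanh(tokens @ rng.standard_normal((D, D)) / np.sqrt(D))

out = llm(np.concatenate([shared, dream, action_q]))
world_emb = out[-4:-1]   # three world-knowledge embeddings -> lightweight decoders
action_emb = out[-1:]    # latent action embedding -> diffusion action head
```

At training time the three rows of `world_emb` would feed the dynamic-region, depth, and semantic decoders; at inference only `action_emb` is consumed by the diffusion transformer.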
Dynamic Regions
Visualization of dynamic regions over time. We show the static camera (left) and wrist-mounted camera (right) observations alongside the corresponding dynamic masks generated by our method at multiple time steps. The masks highlight dynamic regions by leveraging optical flow trajectories extracted via CoTracker. Compared to the original observations, our method effectively suppresses irrelevant background and focuses on interaction-relevant areas (e.g., moving objects and end-effector), enabling more structured and efficient action reasoning.
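A minimal sketch of how such dynamic masks can be derived from point trajectories, assuming tracks in the `(T, N, 2)` format that point trackers like CoTracker produce. The thresholding rule and rasterization here are illustrative choices, not the paper's exact procedure: a point counts as dynamic if its total path length over the clip exceeds a pixel threshold.

```python
import numpy as np

def dynamic_mask(tracks, hw, thresh=2.0):
    """Rasterize a binary dynamic-region mask from point trajectories.

    tracks: (T, N, 2) array of (x, y) point positions over T frames.
    hw:     (height, width) of the output mask.
    thresh: minimum total displacement (pixels) for a point to count as dynamic.
    """
    h, w = hw
    # Total path length travelled by each tracked point across the clip.
    step = np.diff(tracks, axis=0)                        # (T-1, N, 2)
    path_len = np.linalg.norm(step, axis=-1).sum(axis=0)  # (N,)
    mask = np.zeros((h, w), dtype=bool)
    for i in np.nonzero(path_len > thresh)[0]:
        x, y = tracks[-1, i].round().astype(int)  # last observed position
        if 0 <= x < w and 0 <= y < h:
            mask[y, x] = True
    return mask
```

Static background points fall below the threshold and are suppressed, so only interaction-relevant pixels (moving objects, end-effector) survive in the mask.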
Block-wise Structured Attention
To preserve clear modality boundaries, the <dream> query is decomposed into three sub-queries (dynamic, depth, and semantics). If these sub-queries could freely attend to one another, high-frequency flow details would contaminate depth reasoning, and semantic cues might bleed into motion features, producing noisy mixed representations. We therefore mask their mutual attention: each sub-query attends only to the shared visual, language, and state tokens, while direct links among the three are disabled, keeping their latent features disentangled and free of cross-talk.
CALVIN ABC-D Experiments
We report the average success rate computed over 1000 rollouts for each task, and the average number of tasks completed when solving 5 instructions consecutively (Avg. Len.). DreamVLA shows significant superiority over the baselines. The best results are bolded.
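For clarity on the Avg. Len. metric: each rollout is a chain of 5 instructions, and a chain contributes the number of tasks completed before its first failure. A minimal sketch of that computation (the input format here is an assumption):

```python
def average_length(rollouts):
    """CALVIN-style 'Avg. Len.': mean number of consecutively completed
    instructions per 5-task chain.

    rollouts: list of per-chain outcome lists, e.g. [True, True, False, ...];
    a chain stops contributing at its first failure.
    """
    lengths = []
    for chain in rollouts:
        n = 0
        for ok in chain:
            if not ok:
                break
            n += 1
        lengths.append(n)
    return sum(lengths) / len(lengths)
```

So a perfect model scores 5.0, and DreamVLA's reported 4.44 means that on average about four and a half of the five chained instructions succeed.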
CALVIN Demo
Each line below lists one five-instruction chain from the CALVIN ABC-D demo rollouts, executed consecutively.
lift_pink_block_slider->place_in_slider->turn_on_led->close_drawer->rotate_blue_block_left
turn_off_led->move_slider_left->rotate_blue_block_right->lift_red_block_table->stack_block
turn_off_led->lift_red_block_slider->place_in_slider->move_slider_right->push_blue_block_left
rotate_pink_block_right->move_slider_right->turn_on_led->close_drawer->lift_pink_block_table
rotate_pink_block_left->turn_on_led->push_into_drawer->lift_pink_block_drawer->place_in_drawer
rotate_blue_block_right->lift_pink_block_table->place_in_slider->move_slider_left->open_drawer
Real-world Demo
The real-world demonstration showcases the practical application of our model in a physical environment. We collect post-training data using SoFar, a modular robot manipulation method.
Place the banana into the basket.
Place the chili into the basket.
Place the chili into the white basket.
BibTeX
@article{dreamvla25,
author = {Wenyao Zhang and
Hongsi Liu and
Zekun Qi and
Yunan Wang and
Xinqiang Yu and
Jiazhao Zhang and
Runpei Dong and
Jiawei He and
He Wang and
Zhizheng Zhang and
Li Yi and
Wenjun Zeng and
Xin Jin},
title = {DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge},
journal = {CoRR},
volume = {abs/2507.04447},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2507.04447},
doi = {10.48550/ARXIV.2507.04447},
eprinttype = {arXiv},
eprint = {2507.04447}
}