DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Wenyao Zhang,
Hongsi Liu27*,
Zekun Qi34*,
Yunan Wang12*,
Xinqiang Yu4,
Jiazhao Zhang45,
Runpei Dong6,
Jiawei He4,
He Wang45,
Zhizheng Zhang4,
Li Yi3,
Wenjun Zeng2,
Xin Jin2†
* equal contribution † corresponding authors
1Shanghai Jiao Tong University
2Eastern Institute of Technology
3Tsinghua University
4Galbot
5Peking University
6UIUC
7University of Science and Technology of China
Abstract
Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods rely on image-based forecasting, which is challenging because full frames carry redundant information and lack comprehensive, critical world knowledge such as dynamic, spatial, and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world-knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces dynamic-region-guided world-knowledge prediction, integrated with spatial and semantic cues, which provides compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world: first forming abstract multimodal reasoning chains, then acting. To mitigate interference among the dynamic, spatial, and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments in both real-world and simulated environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks and an average length of 4.44 on the CALVIN ABC-D benchmark.
Comparison with previous VLA paradigms
(a) Vanilla VLA directly maps visual observations and language instructions to actions. (b) Models that leverage separate image/video generation or copilot models to generate future frames or trajectories, which subsequently guide an action head. (c) VLA variants that explicitly predict a subgoal image as an intermediate visual reasoning step prior to action generation. (d) Our proposed DreamVLA, which explicitly predicts dynamic regions, depth maps, and semantic knowledge (DINOv2 and SAM features), significantly enhancing the model's action reasoning and generalization.
Pipeline
Given the current robot state, observation, and language instruction, DreamVLA encodes the multimodal inputs via frozen text and visual encoders and a tunable state encoder. These tokens, together with a learnable set of <dream> queries, are processed by a large language model to produce a world embedding. Three lightweight decoders then project the corresponding parts of this embedding into dynamic regions, monocular depth, and high-level semantics. A separate <action> query draws a latent action embedding, which conditions a diffusion transformer that refines Gaussian noise into an n-step action sequence. The dashed box highlights prediction heads that are used only during training; inference skips these heads and operates directly on the world embedding.
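As a rough illustration of the data flow above, the sketch below wires toy stand-ins for each stage in numpy. All shapes, names, and the random-projection "encoders" and "LLM" are hypothetical placeholders, not the actual DreamVLA implementation; it only shows how shared tokens, <dream> sub-queries, and the <action> query travel through one backbone and are sliced back apart.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # hidden size (toy value, an assumption)

def encode(tokens_in, d_in):
    """Stand-in for a frozen encoder: a fixed random projection to width D."""
    W = rng.standard_normal((d_in, D)) / np.sqrt(d_in)
    return tokens_in @ W

# Hypothetical inputs: 16 image patches, 8 text tokens, 1 robot-state vector.
img = rng.standard_normal((16, 48))
txt = rng.standard_normal((8, 24))
state = rng.standard_normal((1, 7))
shared = np.concatenate([encode(img, 48), encode(txt, 24), encode(state, 7)])

# Learnable <dream> sub-queries (dynamic / depth / semantic) and <action> query.
dream = rng.standard_normal((3, D))
action_q = rng.standard_normal((1, D))

def llm(tokens):
    """Stand-in for the LLM backbone: a single token-wise mixing layer."""
    return np.tanh(tokens @ rng.standard_normal((D, D)) / np.sqrt(D))

out = llm(np.concatenate([shared, dream, action_q]))
world_emb = out[-4:-1]   # three world-knowledge embeddings -> lightweight decoders
action_emb = out[-1:]    # latent action embedding -> diffusion action head
```

At training time the three rows of `world_emb` would feed the dynamic-region, depth, and semantic decoders; at inference only `action_emb` is consumed by the diffusion transformer.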
Dynamic Regions
Visualization of dynamic regions over time. We show the static camera (left) and wrist-mounted camera (right) observations alongside the corresponding dynamic masks generated by our method at multiple time steps. The masks highlight dynamic regions by leveraging optical flow trajectories extracted via CoTracker. Compared to the original observations, our method effectively suppresses irrelevant background and focuses on interaction-relevant areas (e.g., moving objects and end-effector), enabling more structured and efficient action reasoning.
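A minimal sketch of how such dynamic masks can be derived from point trajectories, assuming tracks in the `(T, N, 2)` format that point trackers like CoTracker produce. The thresholding rule and rasterization here are illustrative choices, not the paper's exact procedure: a point counts as dynamic if its total path length over the clip exceeds a pixel threshold.

```python
import numpy as np

def dynamic_mask(tracks, hw, thresh=2.0):
    """Rasterize a binary dynamic-region mask from point trajectories.

    tracks: (T, N, 2) array of (x, y) point positions over T frames.
    hw:     (height, width) of the output mask.
    thresh: minimum total displacement (pixels) for a point to count as dynamic.
    """
    h, w = hw
    # Total path length travelled by each tracked point across the clip.
    step = np.diff(tracks, axis=0)                        # (T-1, N, 2)
    path_len = np.linalg.norm(step, axis=-1).sum(axis=0)  # (N,)
    mask = np.zeros((h, w), dtype=bool)
    for i in np.nonzero(path_len > thresh)[0]:
        x, y = tracks[-1, i].round().astype(int)  # last observed position
        if 0 <= x < w and 0 <= y < h:
            mask[y, x] = True
    return mask
```

Static background points fall below the threshold and are suppressed, so only interaction-relevant pixels (moving objects, end-effector) survive in the mask.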
Block-wise Structured Attention
To preserve clear modality boundaries, the <dream> query is decomposed into three sub-queries (dynamic, depth, and semantics). If these sub-queries could freely attend to one another, high-frequency flow details would contaminate depth reasoning, and semantic cues might bleed into motion features, producing noisy mixed representations. We therefore mask their mutual attention: each sub-query attends only to the shared visual, language, and state tokens, while direct links among the three are disabled, keeping their latent features disentangled and free of cross-talk.
CALVIN ABC-D Experiments
We report the average success rate computed over 1000 rollouts for each task, and the average number of tasks completed when solving 5 instructions consecutively (Avg. Len.). DreamVLA shows significant superiority over the baselines. The best results are bolded.
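For clarity on the Avg. Len. metric: each rollout is a chain of 5 instructions, and a chain contributes the number of tasks completed before its first failure. A minimal sketch of that computation (the input format here is an assumption):

```python
def average_length(rollouts):
    """CALVIN-style 'Avg. Len.': mean number of consecutively completed
    instructions per 5-task chain.

    rollouts: list of per-chain outcome lists, e.g. [True, True, False, ...];
    a chain stops contributing at its first failure.
    """
    lengths = []
    for chain in rollouts:
        n = 0
        for ok in chain:
            if not ok:
                break
            n += 1
        lengths.append(n)
    return sum(lengths) / len(lengths)
```

So a perfect model scores 5.0, and DreamVLA's reported 4.44 means that on average about four and a half of the five chained instructions succeed.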
CALVIN Demo
Each line below lists one five-instruction chain from the CALVIN ABC-D demo rollouts, executed consecutively.
lift_pink_block_slider->place_in_slider->turn_on_led->close_drawer->rotate_blue_block_left
turn_off_led->move_slider_left->rotate_blue_block_right->lift_red_block_table->stack_block
turn_off_led->lift_red_block_slider->place_in_slider->move_slider_right->push_blue_block_left
rotate_pink_block_right->move_slider_right->turn_on_led->close_drawer->lift_pink_block_table
rotate_pink_block_left->turn_on_led->push_into_drawer->lift_pink_block_drawer->place_in_drawer
rotate_blue_block_right->lift_pink_block_table->place_in_slider->move_slider_left->open_drawer
Real-world Demo
The real-world demonstration showcases the practical application of our model in a physical environment. We collect post-training data using SoFar, a modular robot manipulation method.
Place the banana into the basket.
Place the chili into the basket.
Place the chili into the white basket.
BibTeX
@article{dreamvla25,
author = {Wenyao Zhang and
Hongsi Liu and
Zekun Qi and
Yunan Wang and
Xinqiang Yu and
Jiazhao Zhang and
Runpei Dong and
Jiawei He and
He Wang and
Zhizheng Zhang and
Li Yi and
Wenjun Zeng and
Xin Jin},
title = {DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge},
journal = {CoRR},
volume = {abs/2507.04447},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2507.04447},
doi = {10.48550/ARXIV.2507.04447},
eprinttype = {arXiv},
eprint = {2507.04447}
}