awesome-pure-vla

This is a pure VLA survey.

Awesome Pure Vision-Language-Action (VLA) Models

Paper

Contents

Autoregression-Based VLA

2025 WorldVLA: Towards Autoregressive Action World Model
[Paper] [Code]

2025 UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent.
[Paper]

2025 Universal Actions for Enhanced Embodied Foundation Models
[Paper]

2025 Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration.
[Paper]

2025 NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks.
[Paper] [Code]

2025 OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning
[Paper] [Code]

2025 VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting
[Paper]

2025 UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
[Paper]

2025 Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions
[Paper]

2025 UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation.
[Paper]

2025 Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization
[Paper]

2025 Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing.
[Paper]

2025 Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models.
[Paper]

2025 DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping.
[Paper]

2025 HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation
[Paper]

2025 CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models.
[Paper]

2025 Gemini Robotics: Bringing AI into the Physical World.
[Paper]

2025 CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs
[Paper] [Code]

2025 $\pi_{0.5}$: A Vision-Language-Action Model with Open-World Generalization.
[Paper] [Code]

2025 InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning
[Paper] [Code]

2025 OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model.
[Paper] [Code]

2025 GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data.
[Paper]

2025 RaceVLA: VLA-based Racing Drone Navigation with Human-like Behaviour.
[Paper]

2025 VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation
[Paper]

2025 PointVLA: Injecting the 3D World into Vision-Language-Action Models.
[Paper]

2025 Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions.
[Paper]

2025 MoManipVLA: Transferring Vision-Language-Action Models for General Mobile Manipulation.
[Paper]

2025 FAST: Efficient Action Tokenization for Vision-Language-Action Models.
[Paper]

2025 SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model.
[Paper]

2025 VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation.
[Paper] [Code]

2025 Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding.
[Paper]

2025 MoLe-VLA: Dynamic Layer-skipping Vision-Language-Action Model via Mixture-of-Layers for Efficient Robot Manipulation.
[Paper] [Code]

2025 Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding.
[Paper]

2025 BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
[Paper] [Code]

2025 GR-MG: Leveraging Partially-Annotated Data Via Multi-Modal Goal-Conditioned Policy
[Paper] [Code]

2025 LoHoVLA: A Vision-Language-Action Model for Long-Horizon Embodied Tasks
[Paper]

2025 OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction.
[Paper]

2025 ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model.
[Paper]

2025 From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment
[Paper]

2025 VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation.
[Paper] [Code]

2025 Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents.
[Paper]

2025 CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games.
[Paper]

2025 JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse.
[Paper]

2025 CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation
[Paper] [Code]

2025 Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success.
[Paper]

2025 VaViM and VaVAM: Autonomous Driving through Video Generative Modeling.
[Paper]

2025 SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment.
[Paper]

2025 ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation.
[Paper]

2025 Pre-training Auto-regressive Robotic Models with 4D Representations
[Paper]

2025 TLA: Tactile-Language-Action Model for Contact-Rich Manipulation.
[Paper]

2025 OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation
[Paper]

2025 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration
[Paper] [Code]

2025 Fast ECoT: Efficient Embodied Chain-of-Thought via Thoughts Reuse
[Paper]

2024 OpenVLA: An Open-Source Vision-Language-Action Model.
[Paper]

2024 Octo: An Open-Source Generalist Robot Policy.
[Paper]

2024 Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments.
[Paper]

2024 RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation.
[Paper]

2024 LLaRA: Supercharging Robot Learning Data for Vision-Language Policy.
[Paper] [Code]

2024 Robotic Control via Embodied Chain-of-Thought Reasoning.
[Paper]

2024 Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs.
[Paper]

2024 Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
[Paper]

2024 GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation.
[Paper]

2024 Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations.
[Paper]

2024 RoboNurse-VLA: Robotic Scrub Nurse System based on Vision-Language-Action Model.
[Paper]

2024 Moto: Latent Motion Token as the Bridging Language for Robot Manipulation.
[Paper]

2024 TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies.
[Paper]

2024 Actra: Optimized Transformer Architecture for Vision-Language-Action Models in Robot Learning.
[Paper]

2024 DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution.
[Paper] [Code]

2024 Language Reasoning in Vision-Language-Action Model for Robotic Grasping.
[Paper]

2024 RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation.
[Paper]

2024 OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving.
[Paper]

2024 Language Models as Zero-Shot Trajectory Generators.
[Paper]

2024 Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks.
[Paper]

2024 Latent Action Pretraining from Videos.
[Paper] [Code]

2024 RVT-2: Learning Precise Manipulation from Few Demonstrations.
[Paper]

2024 In-Context Imitation Learning via Next-Token Prediction
[Paper]

2024 BAKU: An Efficient Transformer for Multi-Task Policy Learning.
[Paper]

2024 QUAR-VLA: Vision-Language-Action Model for Quadruped Robots.
[Paper]

2024 HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers.
[Paper]

2024 MissionGPT: Mission Planner for Mobile Robot based on Robotics Transformer Model.
[Paper]

2023 PaLM-E: An Embodied Multimodal Language Model.
[Paper]

2023 RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.
[Paper]

2023 An Embodied Generalist Agent in 3D World.
[Paper]

2023 Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model.
[Paper]

2023 Vision-Language Foundation Models as Effective Robot Imitators.
[Paper]

2023 Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation.
[Paper]

2023 Compositional Foundation Models for Hierarchical Planning.
[Paper] [Code]

2023 Prompt a Robot to Walk with Large Language Models.
[Paper]

2023 Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models.
[Paper]

2023 RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking.
[Paper]

2023 Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning.
[Paper]

2023 Open-World Object Manipulation using Pre-trained Vision-Language Models.
[Paper]

2023 Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.
[Paper]

2023 Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
[Paper]

2022 RT-1: Robotics Transformer for Real-World Control at Scale.
[Paper]

2022 A Generalist Agent.
[Paper]

2022 Inner Monologue: Embodied Reasoning through Planning with Language Models.
[Paper]

2022 LATTE: LAnguage Trajectory TransformEr.
[Paper]

2022 VIMA: General Robot Manipulation with Multimodal Prompts.
[Paper]

2022 Instruction-Following Agents with Multimodal Transformer.
[Paper]

2022 Interactive Language: Talking to Robots in Real Time.
[Paper]

Diffusion-Based VLA

2025 RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation.
[Paper]

2025 Time-Unified Diffusion Policy with Action Discrimination for Robotic Manipulation
[Paper]

2025 CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion
[Paper]

2025 Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
[Paper]

2025 Task Reconstruction and Extrapolation for π_0 using Text Latent
[Paper]

2025 Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
[Paper] [Code]

2025 ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation
[Paper]

2025 SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
[Paper]

2025 VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers
[Paper] [Code]

2025 DexVLG: Dexterous Vision-Language-Grasp Model at Scale
[Paper] [Code]

2025 AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation
[Paper] [Code]

2025 DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control.
[Paper] [Code]

2025 MinD: Unified Visual Imagination and Control via Hierarchical World Models
[Paper] [Code]

2025 Hume: Introducing System-2 Thinking in Visual-Language-Action Model
[Paper]

2025 TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control
[Paper] [Code]

2025 DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
[Paper] [Code]

2025 GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation
[Paper]

2025 DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
[Paper] [Code]

2025 DreamGen: Unlocking Generalization in Robot Learning through Video World Models
[Paper] [Code]

2025 EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation
[Paper] [Code]

2025 VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation
[Paper] [Code]

2025 TrackVLA: Embodied Visual Tracking in the Wild.
[Paper]

2025 FP3: A 3D Foundation Policy for Robotic Manipulation.
[Paper]

2025 GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.
[Paper]

2025 ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration.
[Paper]

2025 SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models
[Paper]

2025 A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation
[Paper] [Code]

2025 Pixel Motion as Universal Representation for Robot Control
[Paper] [Code]

2025 Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding
[Paper] [Code]

2024 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations.
[Paper]

2024 Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals.
[Paper]

2024 CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation.
[Paper]

2024 PERIA: Perceive, Reason, Imagine, Act via Holistic Language and Vision Planning for Manipulation.
[Paper]

2024 Improving Vision-Language-Action Models via Chain-of-Affordance.
[Paper]

2024 $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control.
[Paper]

2024 Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation
[Paper]

2024 Diffusion-VLA: Generalizable and Interpretable Robot Foundation Model via Self-Generated Reasoning.
[Paper]

2024 TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation.
[Paper] [Code]

2024 Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust.
[Paper] [Code]

2024 Diffusion Transformer Policy
[Paper]

2023 Learning Universal Policies via Text-Guided Video Generation.
[Paper]

2023 Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.
[Paper]

2023 Learning to Act from Actionless Videos through Dense Correspondences.
[Paper]

2023 Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models.
[Paper]

2023 NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration
[Paper]

2022 SE(3)-DiffusionFields: Learning smooth cost functions for joint grasp and motion optimization through diffusion.
[Paper]

2022 StructDiffusion: Language-Guided Creation of Physically-Valid Structures using Unseen Objects
[Paper]

Reinforcement-Based Fine-Tuning Models

2025 SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning.
[Paper]

2025 Improving Vision-Language-Action Model with Online Reinforcement Learning.
[Paper]

2025 ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning.
[Paper]

2025 ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy.
[Paper] [Code]

2025 MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models.
[Paper]

2025 Online RL with Simple Reward Enables Training VLA Models with Only One Trajectory
[Paper]

2025 LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction
[Paper]

2025 AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
[Paper] [Code]

2025 Refined Policy Distillation: From VLA Generalists to RL Experts
[Paper]

2025 RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models
[Paper] [Code]

2025 VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
[Paper] [Code]

2025 AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving
[Paper]

2025 ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations
[Paper] [Code]

2025 Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
[Paper]

2025 ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
[Paper] [Code]

2025 IRL-VLA: Training a Vision-Language-Action Policy via Reward World Model.
[Paper]

2024 Vision-Language Models Provide Promptable Representations for Reinforcement Learning.
[Paper]

2024 Adaptive Language-Guided Abstraction from Contrastive Explanations.
[Paper]

2024 GRAPE: Generalizing Robot Policy via Preference Alignment
[Paper] [Code]

2024 RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning
[Paper]

2024 ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics.
[Paper]

2024 NaVILA: Legged Robot Vision-Language-Action Model for Navigation.
[Paper] [Code]

2023 LIV: Language-Image Representations and Rewards for Robotic Control.
[Paper]

2022 VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training.
[Paper]

2017 Proximal Policy Optimization Algorithms
[Paper]

Other Advanced VLA

2025 HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model.
[Paper]

2025 RationalVLA: A Rational Vision-Language-Action Model with Dual System
[Paper]

2025 OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation
[Paper] [Code]

2025 EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
[Paper]

2025 ACTLLM: Action Consistency Tuned Large Language Model
[Paper]

2025 BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models
[Paper] [Code]

2025 GeoManip: Geometric Constraints as General Interfaces for Robot Manipulation
[Paper] [Code]

2025 TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models
[Paper]

2025 MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
[Paper]

2025 CubeRobot: Grounding Language in Rubik's Cube Manipulation via Vision-Language Model
[Paper]

2025 ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow
[Paper] [Code]

2025 Training Strategies for Efficient Embodied Reasoning
[Paper]

2025 CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models
[Paper]

2025 RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
[Paper]

2025 CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding
[Paper]

2025 SAFE: Multitask Failure Detection for Vision-Language-Action Models.
[Paper] [Code]

2025 DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation.
[Paper]

2025 CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation
[Paper] [Code]

2025 cVLA: Towards Efficient Camera-Space VLAs
[Paper]

2025 Real-Time Action Chunking with Large Models
[Paper]

2025 Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning.
[Paper] [Code]

2025 Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better
[Paper]

2025 Probing a Vision-Language-Action Model for Symbolic States and Integration into a Cognitive Architecture.
[Paper]

2025 VLA Model-Expert Collaboration for Bi-directional Manipulation Learning
[Paper] [Code]

2025 An Atomic Skill Library Construction Method for Data-Efficient Embodied Manipulation
[Paper]

2025 RoboGround: Robotic Manipulation with Grounded Vision-Language Priors
[Paper] [Code]

2024 Yell At Your Robot: Improving On-the-Fly from Language Corrections.
[Paper]

2024 3D-VLA: A 3D Vision-Language-Action Generative World Model.
[Paper]

2024 ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
[Paper]

2024 RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics
[Paper]

2024 Grounding Multimodal Large Language Models in Actions
[Paper]

2024 CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving
[Paper]

2024 Helix: A Vision-Language-Action Model for Generalist Humanoid Control
[Paper]

2024 AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
[Paper]

2024 Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics
[Paper]

2024 DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
[Paper]

2024 Effective Tuning Strategies for Generalist Robot Manipulation Policies
[Paper]

2024 EdgeVLA: Efficient Vision-Language-Action Models.
[Paper]

2024 A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM
[Paper]

2024 RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation.
[Paper]

2024 ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models
[Paper]

2024 Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation
[Paper]

2024 RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation
[Paper]

2024 ShowUI: One Vision-Language-Action Model for GUI Visual Agent
[Paper]

2024 A Survey on Robotics with Foundation Models: toward Embodied AI.
[Paper]

2024 General Flow as Foundation Affordance for Scalable Robot Learning.
[Paper] [Code]

2024 Manipulation Facing Threats: Evaluating Physical Vulnerabilities in End-to-End Vision Language Action Models.
[Paper]

2024 Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V
[Paper]

2023 Affordances from Human Videos as a Versatile Representation for Robotics
[Paper]

2023 GenAug: Retargeting behaviors to unseen situations via Generative Augmentation
[Paper]

2023 Scaling Robot Learning with Semantically Imagined Experience
[Paper]

2023 Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?
[Paper]

2023 VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models.
[Paper]

2023 Physically Grounded Vision-Language Models for Robotic Manipulation
[Paper]

2022 R3M: A Universal Visual Representation for Robot Manipulation
[Paper]

2022 CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning
[Paper]

2022 Instruction-driven history-aware policies for robotic manipulations
[Paper]

2021 CLIPort: What and Where Pathways for Robotic Manipulation.
[Paper]

Real-World Datasets and Benchmarks

2024 RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot
[Paper] [Code]

2024 DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
[Paper] [Code]

2024 AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
[Paper]

2024 Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
[Paper] [Code]

2024 Open X-Embodiment: Robotic Learning Datasets and RT-X Models
[Paper] [Code]

2024 LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents
[Paper] [Code]

2024 Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
[Paper]

2024 RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
[Paper]

2024 Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks
[Paper]

2024 $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control.
[Paper]

2024 NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking
[Paper]

2024 OpenVLA: An Open-Source Vision-Language-Action Model.
[Paper]

2023 Open-World Object Manipulation using Pre-Trained Vision-Language Models
[Paper]

2023 RoboHive: A Unified Framework for Robot Learning
[Paper] [Code]

2023 On Bringing Robots Home
[Paper]

2023 LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
[Paper]

2023 RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation
[Paper] [Code]

2022 BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning
[Paper]

2022 RT-1: Robotics Transformer for Real-World Control at Scale
[Paper] [Code]

2022 VIMA: General Robot Manipulation with Multimodal Prompts
[Paper] [Code]

2022 PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark
[Paper]

2022 CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
[Paper] [Code]

2021 MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale
[Paper]

2012 Are we ready for autonomous driving? The KITTI vision benchmark suite
[Paper] [Code]

2021 One Million Scenes for Autonomous Driving: ONCE Dataset
[Paper]

2021 Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets
[Paper]

2020 Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning
[Paper] [Code]

2020 ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
[Paper] [Code]

2019 RoboNet: Large-Scale Multi-Robot Learning
[Paper]

2019 RLBench: The Robot Learning Benchmark & Learning Environment
[Paper]

2019 The ApolloScape Open Dataset for Autonomous Driving and Its Application
[Paper]

2019 Argoverse: 3D Tracking and Forecasting with Rich Maps
[Paper]

2019 Scalability in Perception for Autonomous Driving: Waymo Open Dataset
[Paper]

2019 nuScenes: A multimodal dataset for autonomous driving
[Paper]

2018 Multiple Interactions Made Easy (MIME): Large Scale Demonstrations Data for Imitation
[Paper]

2018 Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
[Paper]

2018 ROBOTURK: A Crowdsourcing Platform for Robotic Skill Learning through Imitation
[Paper]

2018 BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning
[Paper]

2016 The Cityscapes Dataset for Semantic Urban Scene Understanding
[Paper]

Simulation Datasets and Benchmarks

2024 LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents
[Paper] [Code]

2024 Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving.
[Paper]

2023 RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation
[Paper] [Code]

2023 UniSim: A Neural Closed-Loop Sensor Simulator
[Paper]

2022 VIMA: General Robot Manipulation with Multimodal Prompts
[Paper] [Code]

2022 CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
[Paper] [Code]

2020 Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning
[Paper] [Code]

2020 Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning
[Paper] [Code]

2019 RLBench: The Robot Learning Benchmark & Learning Environment
[Paper]

2019 Interactive Gibson Benchmark: A Benchmark for Interactive Navigation in Cluttered Environments
[Paper]

2018 VirtualHome: Simulating Household Activities via Programs
[Paper] [Code]

2018 ROBOTURK: A Crowdsourcing Platform for Robotic Skill Learning through Imitation
[Paper]

2023 LMDrive: Closed-Loop End-to-End Driving with Large Language Models.

Simulators

2025 AgiBot World Colosseo: A Large-Scale Manipulation Platform for Scalable and Intelligent Embodied Systems
[Paper] [Code]

2024 Genesis: A Universal and Generative Physics Engine for Robotics and Beyond
[Paper] [Code]

2024 ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI
[Paper]

2022 Close the Optical Sensing Domain Gap by Physics-Grounded Active Stereo Sensor Simulation
[Paper] [Code]

2022 Bullet Physics SDK.
[Code]

2021 Habitat 2.0: Training Home Assistants to Rearrange Their Habitat
[Paper]

2021 iGibson 1.0: A Simulation Environment for Interactive Tasks in Large Realistic Scenes
[Paper]

2021 iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks
[Paper]

2021 Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning
[Paper] [Code]

2020 SAPIEN: A Simulated Part-Based Interactive Environment
[Paper]

2020 ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
[Paper] [Code]

2020 ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation
[Paper]

2020 LGSVL Simulator: A High Fidelity Simulator for Autonomous Driving.
[Paper]

2019 Habitat: A Platform for Embodied AI Research
[Paper]

2018 VirtualHome: Simulating Household Activities via Programs
[Paper] [Code]

2018 DeepMind Control Suite
[Paper]

2018 Gibson Env: Real-World Perception for Embodied Agents
[Paper] [Code]

2017 AI2-THOR: An Interactive 3D Environment for Visual AI
[Paper] [Code]

2017 CARLA: An Open Urban Driving Simulator.
[Paper]

2012 MuJoCo: A physics engine for model-based control
[Paper]

Robot Hardware

Franka Robotics.
[Paper]

A1 Robot Arm.
[Paper]

LeRobot: State-of-the-art Machine Learning for Real-World Robotics in PyTorch.
[Code]

🎈Citation

If you find this pure VLA survey helpful, please cite us:

@misc{zhang2025purevisionlanguageaction,
      title={Pure Vision Language Action (VLA) Models: A Comprehensive Survey}, 
      author={Dapeng Zhang and Jing Sun and Chenghui Hu and Xiaoyan Wu and Zhenlong Yuan and Rui Zhou and Fei Shen and Qingguo Zhou},
      year={2025},
      eprint={2509.19012},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2509.19012}, 
}