awesome-pure-vla
Awesome Pure Vision-Language-Action (VLA) Model
A curated list of papers, datasets, benchmarks, simulators, and robot hardware for pure Vision-Language-Action (VLA) models.
Contents
- Autoregression-Based VLA
- Diffusion-Based VLA
- Reinforcement-Based Fine-Tuned VLA
- Other Advanced VLA
- Real-World Datasets and Benchmarks
- Simulation Datasets and Benchmarks
- Simulators
- Robot Hardware
Autoregression-Based VLA
2025
WorldVLA: Towards Autoregressive Action World Model
[Paper]
[Code]
2025
UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent.
[Paper]
2025
Universal Actions for Enhanced Embodied Foundation Models
[Paper]
2025
Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration.
[Paper]
2025
NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks.
[Paper]
[Code]
2025
OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning
[Paper]
[Code]
2025
VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting
[Paper]
2025
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
[Paper]
2025
Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions
[Paper]
2025
UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation.
[Paper]
2025
Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization
[Paper]
2025
Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing.
[Paper]
2025
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models.
[Paper]
2025
DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping.
[Paper]
2025
HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation
[Paper]
2025
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models.
[Paper]
2025
Gemini Robotics: Bringing AI into the Physical World.
[Paper]
2025
CognitiveDrone: A VLA model and evaluation benchmark for real-time cognitive task solving and reasoning in UAVs
[Paper]
[Code]
2025
$\pi_{0.5}$: A Vision-Language-Action Model with Open-World Generalization.
[Paper]
[Code]
2025
InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning
[Paper]
[Code]
2025
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model.
[Paper]
[Code]
2025
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data.
[Paper]
2025
RaceVLA: VLA-based Racing Drone Navigation with Human-like Behaviour.
[Paper]
2025
VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation
[Paper]
2025
PointVLA: Injecting the 3D World into Vision-Language-Action Models.
[Paper]
2025
Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions.
[Paper]
2025
MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation.
[Paper]
2025
FAST: Efficient Action Tokenization for Vision-Language-Action Models.
[Paper]
2025
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model.
[Paper]
2025
VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation.
[Paper]
[Code]
2025
Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding.
[Paper]
2025
MoLe-VLA: Dynamic Layer-skipping Vision-Language-Action Model via Mixture-of-Layers for Efficient Robot Manipulation.
[Paper]
[Code]
2025
Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding.
[Paper]
2025
BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
[Paper]
[Code]
2025
GR-MG: Leveraging Partially-Annotated Data Via Multi-Modal Goal-Conditioned Policy
[Paper]
[Code]
2025
LoHoVLA: A Vision-Language-Action Model for Long-Horizon Embodied Tasks
[Paper]
2025
OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction.
[Paper]
2025
ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model.
[Paper]
2025
From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment
[Paper]
2025
VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation.
[Paper]
[Code]
2025
Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents.
[Paper]
2025
VLA Model-Expert Collaboration for Bi-directional Manipulation Learning.
[Paper]
2025
CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games.
[Paper]
2025
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse.
[Paper]
2025
CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation
[Paper]
[Code]
2025
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success.
[Paper]
2025
VaViM and VaVAM: Autonomous Driving through Video Generative Modeling.
[Paper]
2025
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment.
[Paper]
2025
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation.
[Paper]
2025
Pre-training Auto-regressive Robotic Models with 4D Representations
[Paper]
2025
TLA: Tactile-Language-Action Model for Contact-Rich Manipulation.
[Paper]
2025
OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation
[Paper]
2025
4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration
[Paper]
[Code]
2025
Fast ECoT: Efficient Embodied Chain-of-Thought via Thoughts Reuse
[Paper]
2024
OpenVLA: An Open-Source Vision-Language-Action Model.
[Paper]
2024
Octo: An Open-Source Generalist Robot Policy.
[Paper]
2024
Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments.
[Paper]
2024
RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation.
[Paper]
2024
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy.
[Paper]
[Code]
2024
Robotic Control via Embodied Chain-of-Thought Reasoning.
[Paper]
2024
Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs.
[Paper]
2024
Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
[Paper]
2024
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation.
[Paper]
2024
Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations.
[Paper]
2024
RoboNurse-VLA: Robotic Scrub Nurse System based on Vision-Language-Action Model.
[Paper]
2024
Moto: Latent Motion Token as the Bridging Language for Robot Manipulation.
[Paper]
2024
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies.
[Paper]
2024
Actra: Optimized Transformer Architecture for Vision-Language-Action Models in Robot Learning.
[Paper]
2024
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution.
[Paper]
[Code]
2024
Language Reasoning in Vision-Language-Action Model for Robotic Grasping.
[Paper]
2024
RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation.
[Paper]
2024
OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving.
[Paper]
2024
Language Models as Zero-Shot Trajectory Generators.
[Paper]
2024
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks.
[Paper]
2024
Latent Action Pretraining from Videos.
[Paper]
[Code]
2024
RVT-2: Learning Precise Manipulation from Few Demonstrations.
[Paper]
2024
In-Context Imitation Learning via Next-Token Prediction
[Paper]
2024
BAKU: An Efficient Transformer for Multi-Task Policy Learning.
[Paper]
2024
QUAR-VLA: Vision-Language-Action Model for Quadruped Robots.
[Paper]
2024
HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers.
[Paper]
2024
MissionGPT: Mission Planner for Mobile Robot based on Robotics Transformer Model.
[Paper]
2023
PaLM-E: An Embodied Multimodal Language Model.
[Paper]
2023
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.
[Paper]
2023
An Embodied Generalist Agent in 3D World.
[Paper]
2023
Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model.
[Paper]
2023
Vision-Language Foundation Models as Effective Robot Imitators.
[Paper]
2023
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation.
[Paper]
2023
Compositional Foundation Models for Hierarchical Planning.
[Paper]
[Code]
2023
Prompt a Robot to Walk with Large Language Models.
[Paper]
2023
Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models.
[Paper]
2023
RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking.
[Paper]
2023
Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning.
[Paper]
2023
Open-World Object Manipulation using Pre-trained Vision-Language Models.
[Paper]
2023
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.
[Paper]
2023
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
[Paper]
2022
RT-1: Robotics Transformer for Real-World Control at Scale.
[Paper]
2022
A Generalist Agent.
[Paper]
2022
Inner Monologue: Embodied Reasoning through Planning with Language Models.
[Paper]
2022
LATTE: LAnguage Trajectory TransformEr.
[Paper]
2022
VIMA: General Robot Manipulation with Multimodal Prompts.
[Paper]
2022
Instruction-Following Agents with Multimodal Transformer.
[Paper]
2022
Interactive Language: Talking to Robots in Real Time.
[Paper]
Diffusion-Based VLA
2025
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation.
[Paper]
2025
Time-Unified Diffusion Policy with Action Discrimination for Robotic Manipulation
[Paper]
2025
CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion
[Paper]
2025
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
[Paper]
2025
Task Reconstruction and Extrapolation for $\pi_0$ using Text Latent
[Paper]
2025
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
[Paper]
[Code]
2025
ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation
[Paper]
2025
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
[Paper]
2025
VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers
[Paper]
[Code]
2025
DexVLG: Dexterous Vision-Language-Grasp Model at Scale
[Paper]
[Code]
2025
AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation
[Paper]
[Code]
2025
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control.
[Paper]
[Code]
2025
MinD: Unified Visual Imagination and Control via Hierarchical World Models
[Paper]
[Code]
2025
Hume: Introducing System-2 Thinking in Visual-Language-Action Model
[Paper]
2025
TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control
[Paper]
[Code]
2025
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
[Paper]
[Code]
2025
GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation
[Paper]
2025
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
[Paper]
[Code]
2025
DreamGen: Unlocking Generalization in Robot Learning through Video World Models
[Paper]
[Code]
2025
EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation
[Paper]
[Code]
2025
VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation
[Paper]
[Code]
2025
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success.
[Paper]
2025
TrackVLA: Embodied Visual Tracking in the Wild.
[Paper]
2025
FP3: A 3D Foundation Policy for Robotic Manipulation.
[Paper]
2025
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.
[Paper]
2025
ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration.
[Paper]
2025
SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models
[Paper]
2025
A0: An affordance-aware hierarchical model for general robotic manipulation
[Paper]
[Code]
2025
Pixel Motion as Universal Representation for Robot Control
[Paper]
[Code]
2025
Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding
[Paper]
[Code]
2024
3D Diffuser Actor: Policy Diffusion with 3D Scene Representations.
[Paper]
2024
Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals.
[Paper]
2024
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation.
[Paper]
2024
PERIA: Perceive, Reason, Imagine, Act via Holistic Language and Vision Planning for Manipulation.
[Paper]
2024
Improving Vision-Language-Action Models via Chain-of-Affordance.
[Paper]
2024
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control.
[Paper]
2024
Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation
[Paper]
2024
Diffusion-VLA: Generalizable and Interpretable Robot Foundation Model via Self-Generated Reasoning.
[Paper]
2024
TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation.
[Paper]
[Code]
2024
Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust.
[Paper]
[Code]
2024
Diffusion Transformer Policy
[Paper]
2023
Learning Universal Policies via Text-Guided Video Generation.
[Paper]
2023
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.
[Paper]
2023
Learning to Act from Actionless Videos through Dense Correspondences.
[Paper]
2023
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models.
[Paper]
2023
NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration
[Paper]
2022
SE(3)-DiffusionFields: Learning smooth cost functions for joint grasp and motion optimization through diffusion.
[Paper]
2022
StructDiffusion: Language-Guided Creation of Physically-Valid Structures using Unseen Objects
[Paper]
Reinforcement-Based Fine-Tuned VLA
2025
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning.
[Paper]
2025
Improving Vision-Language-Action Model with Online Reinforcement Learning.
[Paper]
2025
ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning.
[Paper]
2025
ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy.
[Paper]
[Code]
2025
MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models.
[Paper]
2025
Online RL with Simple Reward Enables Training VLA Models with Only One Trajectory
[Paper]
2025
LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction
[Paper]
2025
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
[Paper]
[Code]
2025
Refined Policy Distillation: From VLA Generalists to RL Experts
[Paper]
2025
RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models
[Paper]
[Code]
2025
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
[Paper]
[Code]
2025
AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving
[Paper]
2025
ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations
[Paper]
[Code]
2025
Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
[Paper]
2025
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
[Paper]
[Code]
2025
IRL-VLA: Training a Vision-Language-Action Policy via Reward World Model.
[Paper]
2024
Vision-Language Models Provide Promptable Representations for Reinforcement Learning.
[Paper]
2024
Adaptive Language-Guided Abstraction from Contrastive Explanations.
[Paper]
2024
GRAPE: Generalizing Robot Policy via Preference Alignment
[Paper]
[Code]
2024
RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning
[Paper]
2024
ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics.
[Paper]
2024
NaVILA: Legged Robot Vision-Language-Action Model for Navigation.
[Paper]
[Code]
2023
LIV: Language-Image Representations and Rewards for Robotic Control.
[Paper]
2022
VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training.
[Paper]
2017
Proximal Policy Optimization Algorithms
[Paper]
Other Advanced VLA
2025
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model.
[Paper]
2025
RationalVLA: A Rational Vision-Language-Action Model with Dual System
[Paper]
2025
OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation
[Paper]
[Code]
2025
EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
[Paper]
2025
ACTLLM: Action Consistency Tuned Large Language Model
[Paper]
2025
Time-Unified Diffusion Policy with Action Discrimination for Robotic Manipulation
[Paper]
2025
BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models
[Paper]
[Code]
2025
GeoManip: Geometric Constraints as General Interfaces for Robot Manipulation
[Paper]
[Code]
2025
TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models
[Paper]
2025
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
[Paper]
2025
LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction
[Paper]
2025
MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation.
[Paper]
2025
CubeRobot: Grounding Language in Rubik's Cube Manipulation via Vision-Language Model
[Paper]
2025
ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow
[Paper]
[Code]
2025
Training Strategies for Efficient Embodied Reasoning
[Paper]
2025
CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models
[Paper]
2025
RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
[Paper]
2025
CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding
[Paper]
2025
SAFE: Multitask Failure Detection for Vision-Language-Action Models.
[Paper]
[Code]
2025
DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation.
[Paper]
2025
CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation
[Paper]
[Code]
2025
cVLA: Towards Efficient Camera-Space VLAs
[Paper]
2025
Real-Time Action Chunking with Large Models
[Paper]
2025
Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning.
[Paper]
[Code]
2025
Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better
[Paper]
2025
Probing a Vision-Language-Action Model for Symbolic States and Integration into a Cognitive Architecture.
[Paper]
2025
VLA Model-Expert Collaboration for Bi-directional Manipulation Learning
[Paper]
[Code]
2025
An Atomic Skill Library Construction Method for Data-Efficient Embodied Manipulation
[Paper]
2025
RoboGround: Robotic Manipulation with Grounded Vision-Language Priors
[Paper]
[Code]
2024
Yell At Your Robot: Improving On-the-Fly from Language Corrections.
[Paper]
2024
3D-VLA: A 3D Vision-Language-Action Generative World Model.
[Paper]
2024
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
[Paper]
2024
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics
[Paper]
2024
Grounding Multimodal Large Language Models in Actions
[Paper]
2024
CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving
[Paper]
2024
Helix: A vision-language-action model for generalist humanoid control
[Paper]
2024
AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
[Paper]
2024
Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics
[Paper]
2024
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
[Paper]
2024
Effective Tuning Strategies for Generalist Robot Manipulation Policies
[Paper]
2024
EdgeVLA: Efficient Vision-Language-Action Models.
[Paper]
2024
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
[Paper]
2024
A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM
[Paper]
2024
RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation.
[Paper]
2024
ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models
[Paper]
2024
Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotic Manipulation
[Paper]
2024
RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation
[Paper]
2024
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
[Paper]
2024
A Survey on Robotics with Foundation Models: toward Embodied AI.
[Paper]
2024
General Flow as Foundation Affordance for Scalable Robot Learning.
[Paper]
[Code]
2024
Manipulation Facing Threats: Evaluating Physical Vulnerabilities in End-to-End Vision Language Action Models.
[Paper]
2024
Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V
[Paper]
2023
Affordances from Human Videos as a Versatile Representation for Robotics
[Paper]
2023
GenAug: Retargeting behaviors to unseen situations via Generative Augmentation
[Paper]
2023
Scaling Robot Learning with Semantically Imagined Experience
[Paper]
2023
Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?
[Paper]
2023
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models.
[Paper]
2023
Physically Grounded Vision-Language Models for Robotic Manipulation
[Paper]
2022
R3M: A Universal Visual Representation for Robot Manipulation
[Paper]
2022
CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning
[Paper]
2022
Instruction-driven history-aware policies for robotic manipulations
[Paper]
2021
CLIPort: What and Where Pathways for Robotic Manipulation.
[Paper]
Real-World Datasets and Benchmarks
2024
RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot
[Paper]
[Code]
2024
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
[Paper]
[Code]
2024
AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
[Paper]
2024
Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
[Paper]
[Code]
2024
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
[Paper]
[Code]
2024
LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents
[Paper]
[Code]
2024
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
[Paper]
2024
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
[Paper]
2024
Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks
[Paper]
2024
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control.
[Paper]
2024
NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking
[Paper]
2024
OpenVLA: An Open-Source Vision-Language-Action Model.
[Paper]
2023
Open-World Object Manipulation using Pre-Trained Vision-Language Models
[Paper]
2023
RoboHive: A Unified Framework for Robot Learning
[Paper]
[Code]
2023
On Bringing Robots Home
[Paper]
2023
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
[Paper]
2023
RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation
[Paper]
[Code]
2022
BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning
[Paper]
2022
RT-1: Robotics Transformer for Real-World Control at Scale
[Paper]
[Code]
2022
VIMA: General Robot Manipulation with Multimodal Prompts
[Paper]
[Code]
2022
PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark
[Paper]
2022
CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
[Paper]
[Code]
2021
MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale
[Paper]
2021
Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite
[Paper]
[Code]
2021
One Million Scenes for Autonomous Driving: ONCE Dataset
[Paper]
2021
Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets
[Paper]
2020
Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning
[Paper]
[Code]
2020
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
[Paper]
[Code]
2019
RoboNet: Large-Scale Multi-Robot Learning
[Paper]
2019
RLBench: The Robot Learning Benchmark & Learning Environment
[Paper]
2019
The ApolloScape Open Dataset for Autonomous Driving and Its Application
[Paper]
2019
Argoverse: 3D Tracking and Forecasting with Rich Maps
[Paper]
2019
Scalability in Perception for Autonomous Driving: Waymo Open Dataset
[Paper]
2019
nuScenes: A multimodal dataset for autonomous driving
[Paper]
2018
Multiple Interactions Made Easy (MIME): Large Scale Demonstrations Data for Imitation
[Paper]
2018
Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
[Paper]
2018
ROBOTURK: A Crowdsourcing Platform for Robotic Skill Learning through Imitation
[Paper]
2018
BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning
[Paper]
2016
The Cityscapes Dataset for Semantic Urban Scene Understanding
[Paper]
Simulation Datasets and Benchmarks
2024
LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents
[Paper]
[Code]
2024
Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving.
[Paper]
2023
RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation
[Paper]
[Code]
2023
UniSim: A Neural Closed-Loop Sensor Simulator
[Paper]
2022
VIMA: General Robot Manipulation with Multimodal Prompts
[Paper]
[Code]
2022
CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
[Paper]
[Code]
2020
Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning
[Paper]
[Code]
2020
Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning
[Paper]
[Code]
2019
RLBench: The Robot Learning Benchmark & Learning Environment
[Paper]
2019
Interactive Gibson Benchmark: A Benchmark for Interactive Navigation in Cluttered Environments
[Paper]
2018
VirtualHome: Simulating Household Activities via Programs
[Paper]
[Code]
2018
ROBOTURK: A Crowdsourcing Platform for Robotic Skill Learning through Imitation
[Paper]
2023
LMDrive: Closed-Loop End-to-End Driving with Large Language Models.
[Paper]
Simulators
2025
AgiBot World Colosseo: A Large-Scale Manipulation Platform for Scalable and Intelligent Embodied Systems
[Paper]
[Code]
2024
Genesis: A Universal and Generative Physics Engine for Robotics and Beyond
[Paper]
[Code]
2024
ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI
[Paper]
2022
Close the Optical Sensing Domain Gap by Physics-Grounded Active Stereo Sensor Simulation
[Paper]
[Code]
2022
Bullet Physics SDK.
[Code]
2021
Habitat 2.0: Training Home Assistants to Rearrange Their Habitat
[Paper]
2021
iGibson 1.0: A Simulation Environment for Interactive Tasks in Large Realistic Scenes
[Paper]
2021
iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks
[Paper]
2021
Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning
[Paper]
[Code]
2020
SAPIEN: A Simulated Part-Based Interactive Environment
[Paper]
2020
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
[Paper]
[Code]
2020
ThreeDWorld: A platform for interactive multi-modal physical simulation
[Paper]
2020
LGSVL Simulator: A High Fidelity Simulator for Autonomous Driving.
[Paper]
2019
Habitat: A Platform for Embodied AI Research
[Paper]
2018
VirtualHome: Simulating Household Activities via Programs
[Paper]
[Code]
2018
DeepMind Control Suite
[Paper]
2018
Gibson Env: Real-World Perception for Embodied Agents
[Paper]
[Code]
2017
AI2-THOR: An Interactive 3D Environment for Visual AI
[Paper]
[Code]
2017
CARLA: An Open Urban Driving Simulator.
[Paper]
2012
MuJoCo: A physics engine for model-based control
[Paper]
Robot Hardware
unknown
Franka Robotics.
[Paper]
unknown
A1 Robot Arm.
[Paper]
unknown
LeRobot: State-of-the-art Machine Learning for Real-World Robotics in PyTorch.
[Code]
🎈Citation
If you find this survey helpful, please cite us:
@misc{zhang2025purevisionlanguageaction,
  title={Pure Vision Language Action (VLA) Models: A Comprehensive Survey},
  author={Dapeng Zhang and Jing Sun and Chenghui Hu and Xiaoyan Wu and Zhenlong Yuan and Rui Zhou and Fei Shen and Qingguo Zhou},
  year={2025},
  eprint={2509.19012},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2509.19012},
}