ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow
Abstract
Enabling robots to acquire complex manipulation skills remains a significant challenge, bottlenecked primarily by the prohibitive cost of collecting large-scale robot demonstration data. Humans, by contrast, learn efficiently by watching others interact with their environment. To bridge this gap, we introduce semantic action flow as a core intermediate representation that captures the essential spatio-temporal manipulator-object interactions while remaining invariant to superficial visual differences. We present ViSA-Flow, a framework that learns this representation in a self-supervised manner from large-scale unlabeled video data. First, a generative model is pre-trained on semantic action flows automatically extracted from large-scale human-object interaction videos, learning a robust prior over manipulation structure. Second, this prior is efficiently adapted to a target robot by fine-tuning on a small set of robot demonstrations processed through the same semantic abstraction pipeline. Through extensive experiments on the CALVIN benchmark and real-world tasks, we demonstrate that ViSA-Flow achieves state-of-the-art performance, particularly in low-data regimes, outperforming prior methods by effectively transferring knowledge from human video observation to robotic execution.
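The two-stage pipeline described above can be sketched in miniature. Everything below is illustrative, not the authors' actual API: `extract_semantic_action_flow` and `FlowPriorModel` are hypothetical stand-ins showing only the control flow (shared abstraction, pre-train on human videos, fine-tune on few robot demos).

```python
# Hypothetical sketch of the ViSA-Flow two-stage pipeline.
# All names and the toy "model" are illustrative placeholders, not the paper's code.

def extract_semantic_action_flow(frames):
    """Placeholder for the semantic abstraction pipeline: maps raw video
    frames to a semantic action flow (manipulator-object interaction
    features). Here each frame is reduced to its mean feature value."""
    return [sum(f) / len(f) for f in frames]

class FlowPriorModel:
    """Toy stand-in for the generative prior over semantic action flows;
    a running mean replaces actual generative training."""
    def __init__(self):
        self.mean = 0.0
        self.steps = 0

    def fit_step(self, flow):
        for x in flow:
            self.steps += 1
            self.mean += (x - self.mean) / self.steps  # incremental mean update

# Stage 1: pre-train on flows extracted from large-scale human videos.
human_videos = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]
model = FlowPriorModel()
for video in human_videos:
    model.fit_step(extract_semantic_action_flow(video))

# Stage 2: fine-tune on a small set of robot demonstrations processed
# through the SAME abstraction pipeline, so the prior transfers.
robot_demos = [[[0.2, 0.2], [0.4, 0.4]]]
for demo in robot_demos:
    model.fit_step(extract_semantic_action_flow(demo))
```

The key design point the sketch mirrors is that both data sources pass through one shared abstraction function, which is what makes the human-video prior reusable on the robot.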
Simulation Videos (2x Speed)
- Open the drawer
- Turn off the lightbulb
- Rotate the red block to the right
- Lift the red block from the table
- Place the red block into the drawer
- Open the drawer
- Move the slider to the right
- Lift the red block from the slider
- Place the red block into the drawer
- Turn off the lightbulb
Real-world Videos (1x Speed)
MoveContainer
PickEggplant
Real-world Long Horizon (1x Speed)
MoveContainer → PickEggplant
Real-world Flow Representation Comparison
- MoveContainer: decoded ẑt vs. ground truth zt
- PickEggplant: decoded ẑt vs. ground truth zt