General Flow as Foundation Affordance for Scalable Robot Learning

General Flow

We propose General Flow as a Foundation Affordance. Its properties and applications are analyzed to reveal its great power. We design a scale-aware algorithm for general flow prediction and achieve stable zero-shot cross-embodiment skill transfer in the real world. These findings highlight the transformative potential of general flow in spearheading scalable general robot learning.



Zero-Shot Real World Execution


For all 18 tasks, we provide 4 human videos example and 3 demos of robot trials.
We keep gripper position, grasp manner, initial state, scene setting and policy behaviour as diverse as possible.
Please check out our paper for detailed method and deployment setting.

Video Demo for

with "

" action.

Video Examples from Human Datasets




Zero-Shot General Flow Prediction

We provide visualiaztion for general flow prediction during zero-shot execution.
25 trajectories are selected for clearity.
General Flow prediction is robustness to embodiment-transfer and segmentation error to some extend.

Visualization for

with "

" action.

Description of First Image

Scene Reference

Description of Second Image

General Flow Prediction




More Visualization from Human Videos

We provide more visualiaztion of prediction on human videos. 25 or 50 trajectories are selected for clearity.

Visualization




Emergent Properties of General Flow

When trained on the general flow prediction task at scale, our model acquires multiple emergent properties that are unfeasible in a small-scale imitation learning setting.

  • Semantic Richness & Controllability [(a)(b)]: The semantics of the flow can be easily altered by switching language instructions, showcasing the model's flexibility.

  • Robustness to Label Noise [(c)(d)]: Despite severe noise in labels, such as significant deviation in 'open Safe' or almost static labels in 'pickup Toy Car', our model consistently predicts the correct trend. In the figure, red indicates the label and green represents the prediction.

  • Spatial Commonsense Acquisition [(e)(f)]: For instance, in task 'putdown Mug', the model adjusts its prediction scale to accurately reflect the spatial relationships of objects, ensuring both ends are on the table and adapting to longer distances with a larger scale.

Visualization


Failure Case