CIRCLE: Capture in Rich Contextual Environments
We introduce CIRCLE, a dataset of 3D whole-body reaching motion and corresponding first-person RGB-D videos within a fully furnished virtual home.
1Stanford University 2Meta AI
Synthesizing 3D human motion in a contextual, ecological environment is important for simulating realistic activities people perform in the real world. However, conventional optics-based motion capture systems are not suited for simultaneously capturing human movements and complex scenes. The lack of rich contextual 3D human motion datasets presents a roadblock to creating high-quality generative human motion models. We propose a novel motion acquisition system in which the actor perceives and operates in a highly contextual virtual world while being motion captured in the real world. Our system enables rapid collection of high-quality human motion in highly diverse scenes, without the concern of occlusion or the need for physical scene construction in the real world. We present CIRCLE, a dataset containing 10 hours of full-body reaching motion from five subjects across nine scenes, paired with egocentric information about the environment represented in various forms, such as RGB-D videos. We use this dataset to train a model that generates human motion conditioned on scene information. Leveraging our dataset, the model learns to use egocentric scene information to achieve nontrivial reaching tasks in the context of complex 3D scenes.

Example poses from CIRCLE captured from real human motion in a virtual environment.
After post-processing, each sequence in CIRCLE includes: the SMPL-X parameters of a body model fit to the mocap data using MoSh; the VR headset trajectory synchronized to the mocap skeleton; egocentric RGB-D video (rendered with Habitat and Blender); task-specific data, such as initial and goal conditions; and the scene in which the data were collected.
Our system enables collection of high-quality human motion in highly diverse scenes, without the concern of occlusion or the need for physical scene construction. For this dataset we used the Oculus Quest 2, though our system is hardware-agnostic.
Real World
Virtual World
Egocentric View
CIRCLE sequences range in difficulty, from easy sequences with no obstacles to sequences that require considerable full-body adjustment. The videos below show the live-capture skeleton and scene rendered within the headset while the subject completes tasks.
Easy
Medium
Hard

Given a start pose, goal position, and 3D scene (a), we first initialize the input poses using constant local joint rotation and linearly interpolated root translation (b). We extract scene features for each time step using BPS or PointNet++ from the initialized poses, a fixed point set sampled from a cylinder, and scene point clouds (c). We then concatenate scene features and initialized poses and feed them to a transformer-based model (d) to generate final poses (e).
Model Input
We initialize the input to our model with a constant pose whose wrist position is linearly interpolated between the start and goal. The model's output is the edited motion.
Input
Output
Ground Truth
Model Results
We compare our approach against two baselines, GOAL and NO-SCENE. GOAL is an MLP architecture that, given a start and goal pose, autoregressively predicts the next pose in the sequence. NO-SCENE uses the same architecture as ours but without scene encoding.
Our Model (BPS)
No Scene
GOAL