AMEGO: Active Memory from long EGOcentric videos

1 Politecnico di Torino 2 FAIR, Meta 3 University of Bristol
ECCV 2024



Sample question templates ([VQ] denotes a visual query, i.e. an image crop of the queried object):

- What did I use with the left hand after [VQ] at time 00:10?
- What is the correct sequence of objects I have interacted with?
- Where did I leave [VQ]?
- What did I use with [VQ]?
- Where did I use [VQ]?
- When did I use [VQ]?
- When did I visit [VQ]?

Temporal questions come with multiple-choice answers given as sets of time ranges, e.g. [00:05-00:15] and [00:20-00:24] versus [00:10-00:20] and [00:22-00:28].

AMEGO captures key locations and object interactions in a structured representation. In each frame on top, the outer border colour indicates a specific location in AMEGO, while object colours denote specific instances.
Hover over the coloured bars for each location and click through the questions below to explore AMEGO in action!

Abstract

Egocentric videos provide a unique perspective into individuals' daily experiences, yet their unstructured nature presents challenges for perception. In this paper, we introduce AMEGO, a novel approach aimed at enhancing the comprehension of very-long egocentric videos. Inspired by humans' ability to memorise information from a single viewing, our method focuses on constructing self-contained representations from the egocentric video, capturing key locations and object interactions. This representation is semantic-free and facilitates multiple queries without the need to reprocess the entire visual content. Additionally, to evaluate our understanding of very-long egocentric videos, we introduce the new Active Memories Benchmark (AMB), composed of more than 20K highly challenging visual queries from EPIC-KITCHENS. These queries cover different levels of video reasoning (sequencing, concurrency and temporal grounding) to assess detailed video understanding capabilities. We showcase improved performance of AMEGO on AMB, surpassing other video QA baselines by a substantial margin.

AMEGO representation

We propose AMEGO, a structured representation of long egocentric videos. AMEGO breaks the video into Hand-Object Interaction (HOI) tracklets and location segments, forming a semantic-free memory of the video. AMEGO is built in an online fashion, eliminating the need to reprocess past frames.
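The memory described above can be sketched as a simple data structure: tracklets and location segments are appended online as the video plays, and queries are answered from the stored entries alone. This is a minimal illustration, not the paper's implementation; all class and field names here are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class HOITracklet:
    """One hand-object interaction; ids are visual instances, not semantic labels."""
    object_id: int     # instance id assigned by visual matching (hypothetical field)
    hand: str          # "left" or "right"
    start: float       # interaction start time (seconds)
    end: float         # interaction end time (seconds)
    location_id: int   # location segment active during the interaction

@dataclass
class AMEGOMemory:
    """Semantic-free memory built online: entries are appended once,
    so past frames never need to be reprocessed."""
    tracklets: list = field(default_factory=list)

    def add_tracklet(self, t: HOITracklet) -> None:
        self.tracklets.append(t)

    def when_did_i_use(self, object_id: int) -> list:
        """Answer a 'When did I use [VQ]?'-style query from memory alone."""
        return [(t.start, t.end) for t in self.tracklets if t.object_id == object_id]

    def where_did_i_leave(self, object_id: int):
        """Location of the most recent interaction with the object, if any."""
        uses = [t for t in self.tracklets if t.object_id == object_id]
        return max(uses, key=lambda t: t.end).location_id if uses else None
```

Because each query only scans stored tracklets, multiple questions can be answered without touching the raw video again, which is the point of the self-contained representation.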

AMEGO representation on P01_03.

AMEGO representation on P01_104.

AMEGO representation on P02_07.

AMEGO representation on P15_02.

AMEGO representation on P28_109.

Interacting with AMEGO

Right Hand - interactions performed with the right hand
Left Hand - interactions performed with the left hand
Location - the location where the interaction takes place
Click on an object to see the current interaction and all other interactions with the same object.

Active Memory Benchmark (AMB)

Explore a sample of our 20.5K-query semantic-free VQA benchmark, organised into 8 question templates.
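A benchmark entry pairs a question template with a visual query and multiple-choice options. The sketch below shows one plausible record layout and a scoring helper; the field names and file name are illustrative assumptions, not the benchmark's actual schema.

```python
# A hypothetical AMB-style multiple-choice record (illustrative fields only).
sample_query = {
    "template": "When did I use [VQ]?",
    "visual_query": "query_crop.jpg",  # image crop of the queried object instance
    "options": {
        "A": [("00:05", "00:15")],
        "B": [("00:05", "00:15"), ("00:20", "00:24")],
        "C": [("00:10", "00:20"), ("00:22", "00:28")],
    },
    "answer": "B",
}

def accuracy(predictions, queries):
    """Fraction of multiple-choice queries answered correctly."""
    correct = sum(p == q["answer"] for p, q in zip(predictions, queries))
    return correct / len(queries)
```

Because the query is an image crop rather than an object name, a model is evaluated on instance-level visual memory, not on semantic labels.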

Querying AMB with AMEGO

We visually demonstrate how AMB is answered using the AMEGO representation through sample clips [scroll for different questions].

Querying AMB with AMEGO: Q1

Querying AMB with AMEGO: Q2

Querying AMB with AMEGO: Q3

Querying AMB with AMEGO: Q4

Querying AMB with AMEGO: Q5

Querying AMB with AMEGO: Q6

Querying AMB with AMEGO: Q7

Querying AMB with AMEGO: Q8

BibTeX

@inproceedings{goletto2024amego,
  title={AMEGO: Active Memory from long EGOcentric videos},
  author={Goletto, Gabriele and Nagarajan, Tushar and Averta, Giuseppe and Damen, Dima},
  booktitle={European Conference on Computer Vision},
  year={2024}
}