Rohit Girdhar
I am a Research Scientist at Meta Superintelligence Labs. My current research focuses on understanding and generating multimodal data. I obtained a PhD from Carnegie Mellon University (here’s a link to my dissertation), where I worked on learning from and understanding videos. I was previously part of the Facebook AI Research (FAIR) group at Meta, and have spent time at DeepMind, Adobe and Facebook as an intern. See here for a formal bio.
News
-
Jun 2025
EgoVis 2023-24 Distinguished Paper Award for HierVL.
-
Oct 2024
Mark Zuckerberg announced our work on MovieGen, the new state-of-the-art media generation and editing system, outperforming Sora, Emu Video and more! Covered in NY Times, FT, Forbes, WIRED, Bloomberg, TechCrunch, etc.
-
Jul 2024
Mark Zuckerberg announced Llama 3.1, along with our state-of-the-art video recognition capabilities!
-
Jun 2024
Invited panelist for the AI for Content Creation (AI4CC) workshop at CVPR 2024 (along with Cynthia Lu and Robin Rombach).
-
Jun 2024
LaViLa and Ego4D among the winners of the EgoVis 2022-23 Distinguished Paper Awards!
-
Apr 2024
/animate functionality based on Emu Video is publicly released! Try it out to animate images generated using /imagine on meta.ai!
-
Apr 2024
Presented Emu Video at RunwayML’s inaugural Research and Art (RNA) event.
-
Feb 2024
Invited judge for the MIT Filmmaking Hackathon 2024.
-
Nov 2023
Mark Zuckerberg announced our state-of-the-art video generation work, Emu Video! Also see coverage by TechCrunch, TheVerge, VentureBeat, Reuters, and others!
-
Jun 2023
Giving a talk at the HVU Workshop and presenting 5 papers at CVPR 2023!
-
May 2023
Mark Zuckerberg announced our multimodal embedding work, ImageBind! Also see coverage by TheVerge, Engadget, SiliconANGLE, Maginative and others!
-
Jun 2022
Presenting 3 papers at CVPR 2022, including Omnivore, a single model that obtains state-of-the-art results across 3 different modalities: images, videos and single-view 3D!
-
Oct 2021
We announced Ego4D, the largest egocentric video dataset to date! See this video for a quick intro, and see coverage from TechCrunch, TheVerge, Axios, Fast Company, and others!
Education
-
PhD in Robotics, 2019
Carnegie Mellon University, Pittsburgh PA
-
MS in Robotics, 2016
Carnegie Mellon University, Pittsburgh PA
-
B. Tech. in Computer Science, 2014
IIIT Hyderabad, India
Experience
-
Meta · Research Scientist
New York · 2019 – Present
-
DeepMind · Research Scientist Intern
London · Summer 2018
-
Facebook · Research Scientist Intern
Menlo Park · Summer 2017
-
Adobe · Research Scientist Intern
San Francisco · Summer 2016
-
Facebook · Software Engineering Intern
Menlo Park · Summer 2013
Highlights
Videos powered by MovieGen and Emu Video!
Projects and Publications
Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra
November 2023 · In ECCV, 2024
A simple and effective approach to high-quality video generation by learning to animate high quality images.
The effectiveness of MAE pre-pretraining for billion-scale pretraining
Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra
March 2023 · In ICCV, 2023
Scaling up MAE pre-pretraining, followed by weakly supervised pretraining, leads to strong representations.
HierVL: Learning Hierarchical Video-Language Embeddings
Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
January 2023 · In CVPR, 2023 (Highlighted Presentation)
Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as are available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves state-of-the-art results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and fine-tuned settings.
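Below is a minimal sketch of the two-level contrastive idea the abstract describes, assuming pre-computed clip/narration and video/summary embeddings and a standard symmetric InfoNCE loss; all names, shapes, the temperature value, and the loss weighting are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn.functional as F

def info_nce(visual, text, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired (visual, text) embeddings.
    visual = F.normalize(visual, dim=-1)
    text = F.normalize(text, dim=-1)
    logits = visual @ text.t() / temperature               # [B, B] similarities
    targets = torch.arange(len(visual), device=visual.device)
    return 0.5 * (F.cross_entropy(logits, targets)         # visual -> text
                  + F.cross_entropy(logits.t(), targets))  # text -> visual

def hierarchical_loss(clip_emb, narration_emb, video_emb, summary_emb, w_video=1.0):
    # Clip level: align seconds-long clips with their step-by-step narrations
    # ("what is happening").
    clip_loss = info_nce(clip_emb, narration_emb)
    # Video level: align a whole-video representation with its summary text
    # ("why it is happening"). w_video is a hypothetical weighting knob.
    video_loss = info_nce(video_emb, summary_emb)
    return clip_loss + w_video * video_loss

# Illustrative usage with random features: 16 clips and 4 long videos per batch.
clip_emb, narr_emb = torch.randn(16, 256), torch.randn(16, 256)
video_emb, summ_emb = torch.randn(4, 256), torch.randn(4, 256)
print(hierarchical_loss(clip_emb, narr_emb, video_emb, summ_emb).item())

In a real setup the video-level embedding would be an aggregate (e.g., an average) of clip features over the long video; the random tensors here merely stand in for those.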