Licheng Yu - Facebook AI

The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

Llama team

[Blog]

(Led the text+image reinforcement learning stage of the 17Bx128 and 17Bx16 models)

Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction

CVPR 2025

Shiyu Zhao, Zhenting Wang, Felix Juefei-Xu, Xide Xia, Miao Liu, Xiaofang Wang, Mingfu Liang, Ning Zhang, Dimitris N. Metaxas, Licheng Yu

[Paper]

Apollo: An Exploration of Video Understanding in Large Multimodal Models

CVPR 2025

Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia

[Paper]

Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

CVPR 2025

Zeyi Huang, Yuyang Ji, Xiaofang Wang, Nikhil Mehta, Tong Xiao, Donghyun Lee, Sigmund Vanvalkenburgh, Shengxin Zha, Bolin Lai, Licheng Yu, Ning Zhang, Yong Jae Lee, Miao Liu

[Paper]

ROICtrl: Boosting Instance Control for Visual Generation

CVPR 2025

Yuchao Gu, Yipin Zhou, Yunfan Ye, Yixin Nie, Licheng Yu, Pingchuan Ma, Kevin Qinghong Lin, Mike Zheng Shou

The Llama 3 Herd of Models

arXiv:2407.21783v2

Llama team

(Led Llama3.2 Multimodal 11B/90B Pre-training + 11B Post-training)

Animated Stickers: Bringing Stickers to Life with Video Diffusion

arXiv:2402.06088

David Yan, Winnie Zhang, Luxin Zhang, Anmol Kalia, Dingkang Wang, Ankit Ramchandani, Miao Liu, Albert Pumarola, Edgar Schoenfeld, Elliot Blanchard, Krishna Narni, Yaqiao Luo, Lawrence Chen, Guan Pang, Ali Thabet, Peter Vajda, Amy Bearman, Licheng Yu

[Paper]

AVID: Any-Length Video Inpainting with Diffusion Model

CVPR 2024

Zhixing Zhang, Bichen Wu, Xiaoyan Wang, Yaqiao Luo, Luxin Zhang, Yinan Zhao, Peter Vajda, Dimitris Metaxas, Licheng Yu

SceneTextGen: Layout-Agnostic Scene Text Image Synthesis with Integrated Character-Level Diffusion and Contextual Consistency

CVPR 2024

Qilong Zhangli, Praveen Krishnan, Ankit Ramchandani, Xiaoliang Dai, Licheng Yu, Di Liu, Jindong Jiang, Dimitris N. Metaxas, Guan Pang

[Paper]

FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

CVPR 2024

Feng Liang, Bichen Wu, Jialiang Wang, Licheng Yu, Kunpeng Li, Yinan Zhao, Ishan Misra, Jia-Bin Huang, Peizhao Zhang, Peter Vajda, Diana Marculescu

Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis

CVPR 2024

Bichen Wu, Ching-Yao Chuang, Xiaoyan Wang, Yichen Jia, Kapil Krishnakumar, Tong Xiao, Feng Liang, Licheng Yu, Peter Vajda

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

CVPR 2024

Yuchao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, Kevin Tang

Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression

arXiv:2311.10794

Animesh Sinha, Bo Sun, Anmol Kalia, Arantxa Casanova, Elliot Blanchard, David Yan, Winnie Zhang, Tony Nelli, Jiahui Chen, Hardik Shah, Licheng Yu, Mitesh Kumar Singh, Ankit Ramchandani, Maziar Sanjabi, Sonal Gupta, Amy Bearman, Dhruv Mahajan

[Paper]

CiT: Curation in Training for Effective Vision-Language Data

ICCV 2023

Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer

[Paper][Code]

Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

CVPR 2023

Tsu-Jui Fu, Licheng Yu, Ning Zhang, Cheng-Yang Fu, Jong-Chyi Su, William Yang Wang, Sean Bell

Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations

CVPR 2023

Yiwu Zhong, Licheng Yu, Yang Bai, Shangwen Li, Xueting Yan, Yin Li

[Paper][Code]

FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

CVPR 2023

Xiao Han, Xiatian Zhu, Licheng Yu, Li Zhang, Yi-Zhe Song, Tao Xiang

[Paper][Code] (Oral)

Learning and Verification of Task Structure in Instructional Videos

arXiv:2303.13519

Medhini Narasimhan, Licheng Yu, Sean Bell, Ning Zhang, Trevor Darrell

AMELI: Enhancing Multimodal Entity Linking with Fine-Grained Attributes

arXiv:2305.14725

Barry Menglong Yao, Yu Chen, Qifan Wang, Sijia Wang, Minqian Liu, Zhiyang Xu, Licheng Yu, Lifu Huang

[Paper]

RoPAWS: Robust Semi-supervised Representation Learning from Uncurated Data

ICLR 2023

Sangwoo Mo, Jong-Chyi Su, Kevin Chih-Yao Ma, Mido Assran, Ishan Misra, Licheng Yu, Sean Bell

[Paper]

Que2Engage: Embedding-based Retrieval for Relevant and Engaging Products at Facebook Marketplace

WWW 2023

Yunzhong He, Yuxin Tian, Mengjiao Wang, Feier Chen, Licheng Yu, Maolong Tang, Congcong Chen, Ning Zhang, Bin Kuang, Arul Prakash

[Paper]

FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning

EMNLP 2022

Suvir Mirchandani, Licheng Yu, Mengjiao Wang, Animesh Sinha, Wenwen Jiang, Tao Xiang, Ning Zhang

[Paper]

FashionViL: Fashion-Focused Vision-and-Language Representation Learning

ECCV 2022

Xiao Han, Licheng Yu, Xiatian Zhu, Li Zhang, Yi-Zhe Song, Tao Xiang

[Paper][Code]

Generic Event Boundary Captioning: A Benchmark for Status Changes Understanding

ECCV 2022

Yuxuan Wang, Difei Gao, Licheng Yu, Weixian Lei, Matt Feiszli, Mike Zheng Shou

[Paper]

CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval

KDD 2022

Licheng Yu, Jun Chen, Animesh Sinha, Mengjiao Wang, Yu Chen, Tamara L. Berg, Ning Zhang

Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment

CVPR 2022

Mingyang Zhou*, Licheng Yu*, Amanpreet Singh, Mengjiao Wang, Yu Zhou, Ning Zhang
(*First 2 authors contributed equally.)

[Paper][Code] (Oral)

LOOPITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval

arXiv:2203.05465v1

Jie Lei, Xinlei Chen, Ning Zhang, Mengjiao Wang, Mohit Bansal, Tamara L. Berg, Licheng Yu

[Paper]

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

NeurIPS 2021

Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, Tamara L. Berg, Mohit Bansal, Jingjing Liu, Lijuan Wang, Zicheng Liu

Connecting What to Say With Where to Look by Modeling Human Attention Traces

CVPR 2021

Zihang Meng, Licheng Yu, Ning Zhang, Tamara L. Berg, Babak Damavandi, Vikas Singh, Amy Bearman

[Paper][Code]

What is More Likely to Happen Next? Video-and-Language Future Event Prediction

EMNLP 2020

Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

[Paper][Code]

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

EMNLP 2020

Linjie Li*, Yen-Chun Chen*, Yu Cheng, Zhe Gan, Licheng Yu, Jingjing Liu
(*First 2 authors contributed equally.)

Rank 1 on TVR Leaderboard
Rank 1 on TVC Leaderboard

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

ECCV 2020

Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu

[Paper] (Spotlight)

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

ECCV 2020

Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

UNITER: Learning UNiversal Image-Text Representations

ECCV 2020

Yen-Chun Chen*, Linjie Li*, Licheng Yu*, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu
(*First 3 authors contributed equally.)

Achieving SOTA on 13 Vision+Language Datasets/Tasks, and
Rank 1 on VCR Leaderboard
Rank 1 on NLVR2 Leaderboard

TVQA+: Spatio-Temporal Grounding for Video Question Answering

ACL 2020

Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal

BachGAN: High-Resolution Image Synthesis from Salient Object Layout

CVPR 2020

Yandong Li, Yu Cheng, Zhe Gan, Licheng Yu, Liqiang Wang, Jingjing Liu

[Paper][Code]

VIOLIN: A Large-Scale Dataset for Video-and-Language Inference

CVPR 2020

Jingzhou Liu, Wenhu Chen, Yu Cheng, Zhe Gan, Licheng Yu, Yiming Yang, Jingjing Liu

Multi-Target Embodied Question Answering

CVPR 2019

Licheng Yu, Xinlei Chen, Georgia Gkioxari, Mohit Bansal, Tamara L. Berg, Dhruv Batra

[Paper] [Video]

Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout

NAACL 2019

Hao Tan, Licheng Yu, Mohit Bansal

[Paper] [Code]

TVQA: Localized Compositional Video Question Answering

EMNLP 2018

Jie Lei, Licheng Yu, Mohit Bansal, Tamara L. Berg

[Paper] [Project] [Explore] (Oral)

MAttNet: Modular Attention Network for Referring Expression Comprehension

CVPR 2018

Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, Tamara L. Berg

From Image to Language and Back Again

Journal of Natural Language Engineering (JNLE), 2018

Anya Belz, Tamara L. Berg, Licheng Yu

[Paper]

Physics-Inspired Garment Recovery from a Single-View Image

ACM Transactions on Graphics, 2018

Shan Yang, Tanya Amert, Zherong Pan, Ke Wang, Licheng Yu, Tamara L. Berg, Ming C. Lin

A Unified Framework for Manifold Landmarking

IEEE Transactions on Signal Processing, 2018

Hongteng Xu, Licheng Yu, Mark Davenport, Hongyuan Zha

[Paper]

Hierarchically-Attentive RNN for Album Summarization and Storytelling

EMNLP 2017

Licheng Yu, Mohit Bansal, Tamara L. Berg

A Joint Speaker-Listener-Reinforcer Model for Referring Expressions

CVPR 2017

Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg

[Paper] [Code] [Project] [Talk]

(Spotlight presentation 8%)

Modeling Context in Referring Expressions

ECCV 2016

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, Tamara L. Berg

[Paper] [Dataset]

[Talk] (Spotlight presentation 4.7%)

Visual Madlibs: Fill-in-the-blank Image Description and Question Answering

ICCV 2015

Licheng Yu, Eunbyung Park, Alexander C. Berg, Tamara L. Berg

Dictionary Learning with Mutually Reinforcing Group-Graph Structures

AAAI 2015

Licheng Yu*, Hongteng Xu*, Hongyuan Zha, Yi Xu
(* denotes equal contribution)

[Paper]

Vector Sparse Representation of Color Image Using Quaternion Matrix Analysis

IEEE Transactions on Image Processing, TIP 2015

Yi Xu, Licheng Yu, Hongteng Xu, Truong Nguyen, Hao Zhang

[Paper][Code]

Quaternion-based Sparse Representation of Color Image

IEEE International Conference on Multimedia and Expo, ICME 2013

Licheng Yu, Yi Xu, Hongteng Xu, Hao Zhang

[Paper][Supplementary File] (Oral presentation)

Single Image Super-resolution via Phase Congruency Analysis

IEEE Visual Communications and Image Processing, VCIP 2013

Licheng Yu, Yi Xu, Bo Zhang

[Paper] (Oral presentation)

Self-Example Based Super-resolution with Fractal-based Gradient Enhancement

IEEE International Conference on Multimedia and Expo, ICME workshop 2013

Licheng Yu, Yi Xu, Hongteng Xu

[Paper]

Robust Single Image Super-resolution based on Gradient Enhancement

APSIPA Annual Summit and Conference, APSIPA 2012

Licheng Yu, Yi Xu, Hongteng Xu, Xiaokang Yang