Yutong Bai

Research Vision

I aim to build intelligent systems from first principles: systems that do not merely fit patterns or follow instructions, but that gradually develop structure, abstraction, and behavior through learning itself.

I'm interested in how intelligence emerges, not from handcrafted pipelines or task-specific heuristics, but from exposure to behaviorally rich, understructured environments, where models must learn what to attend to, how to reason, and how to improve. This requires designing learning systems that are not narrowly optimized for a goal, but that can self-organize and grow increasingly competent through interaction, experience, and computation.

I see scale as a tool, but not as the whole solution. Larger models open up more capacity, but what fills that capacity—and how it forms—is just as important. My research explores how we can use scale to amplify the right signals: not just data quantity, but the structural richness of behavior, and the dynamics of learning itself.

To that end, I focus on:

Understanding what makes behavior intelligent, especially when it's easy for humans but hard for machines;
Designing systems that learn internal structure from raw behavioral input, without task scaffolds or dense supervision;
Creating conditions where models discover abstraction and reasoning, not because they are explicitly told to, but because learning leads them there.

I believe intelligence is not something we can fully define or supervise in advance—it must emerge over time, shaped by data, computation, and inductive processes inside the model. My work is an attempt to understand and enable that emergence.

Publications

	Whole-Body Conditioned Egocentric Video Prediction Yutong Bai, Danny Tran, Amir Bar*, Yann LeCun†, Trevor Darrell†, Jitendra Malik† NeurIPS, 2025 paper / project page
	Sequential Modeling Enables Scalable Learning for Large Vision Models Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A. Efros CVPR, 2024 paper / project page / code / model
	Transformers Discover Molecular Structure Without Graph Priors Tobias Kreiman, Yutong Bai, Fadi Atieh, Elizabeth Weaver, Eric Qu, Aditi S. Krishnapriyan Arxiv, 2025 paper / project page / code & model
	The Serial Scaling Hypothesis Yuxi Liu, Konpat Preechakul, Kananart Kuwaranancharoen, Yutong Bai ICLR, 2026 paper
	TARDIS STRIDE: A Spatio-Temporal Road Image Dataset and World Model for Autonomy Héctor Carrión, Yutong Bai, Víctor A. Hernández Castro*, Kishan Panaganti, Ayush Zenith, Matthew Trang, Tony Zhang, Pietro Perona, Jitendra Malik Arxiv, 2025 paper / project page / data / code / model
	Point-Level Region Contrast for Object Detection Pre-Training Yutong Bai, Xinlei Chen, Alexander Kirillov, Alan Yuille, Alexander C. Berg CVPR, 2022 (Nominated for CVPR Best Paper - Top 0.4%) paper / code / video / poster
	Evaluating Multiview Object Consistency in Humans and Image Models Tyler Bonnen, Stephanie Fu, Yutong Bai, Thomas O'Connell, Yoni Friedman, Nancy Kanwisher, Josh Tenenbaum, Alexei Efros NeurIPS, 2024 paper / project page / code / data
	Intriguing Properties of Text-guided Diffusion Models Qihao Liu, Adam Kortylewski, Yutong Bai, Song Bai, Alan Yuille ICLR, 2024 paper / project page
	Analyzing The Language of Visual Tokens David M Chan, Rodolfo Corona, Joonyong Park, Cheol Jun Cho, Yutong Bai, Trevor Darrell Arxiv, 2024 paper
	KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models Eunice Yiu, Maan Qraitem, Anisa Noor Majhi, Charlie Wong, Yutong Bai, Shiry Ginosar, Alison Gopnik, Kate Saenko ICLR, 2025 paper / project page / code
	"I Know It When I See It": Mood Spaces for Connecting and Expressing Visual Concepts Huzheng Yang, Katherine Xu, Michael D. Grossberg, Yutong Bai, Jianbo Shi Arxiv, 2025 paper / project page / demo
	AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time Junyu Zhang, Runpei Dong, Han Wang, Xuying Ning, Haoran Geng, Peihao Li, Xialin He, Yutong Bai, Jitendra Malik, Saurabh Gupta, Huan Zhang EMNLP, 2025 paper / project page / code
	Masked Autoencoders Enable Efficient Knowledge Distillers Yutong Bai, Zeyu Wang, Junfei Xiao, Chen Wei, Huiyu Wang, Alan L Yuille, Yuyin Zhou, Cihang Xie CVPR, 2023 paper / code / model
	Are Transformers More Robust than CNNs? Yutong Bai, Jieru Mei, Alan Yuille, Cihang Xie NeurIPS, 2021 paper / code / model
	Can Temporal Information Help with Contrastive Self-Supervised Learning? Yutong Bai, Haoqi Fan, Ishan Misra, Ganesh Venkatesh, Yongyi Lu, Yuyin Zhou, Qihang Yu, Vikas Chandra, Alan Yuille Arxiv, 2020 paper
	C2FNAS: Coarse-to-Fine Neural Architecture Search for 3D Medical Image Segmentation Qihang Yu, Dong Yang, Holger Roth, Yutong Bai, Yixiao Zhang, Alan Yuille, Daguang Xu CVPR, 2020 paper
	Semantic Part Detection via Matching: Learning to Generalize to Novel Viewpoints from Limited Training Data Yutong Bai, Qing Liu, Lingxi Xie, Weichao Qiu, Yan Zheng, Alan Yuille ICCV, 2019 paper / code
	CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions Runtao Liu, Chenxi Liu, Yutong Bai, Alan L Yuille CVPR, 2019 paper / project page
	CoKe: Contrastive Learning for Robust Keypoint Detection Yutong Bai, Angtian Wang, Adam Kortylewski, Alan Yuille WACV, 2023 paper
	Delving Into Masked Autoencoders for Multi-Label Thorax Disease Classification Junfei Xiao, Yutong Bai, Alan Yuille, Zongwei Zhou WACV, 2023 paper / code
	REOrdering Patches Improves Vision Models Declan Kutscher, David M Chan, Yutong Bai, Trevor Darrell, Ritwik Gupta Arxiv, 2025 paper / project page / code
	AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, Xiangyu Yue Arxiv, 2024 paper / project page / code / data
	Finding Visual Task Vectors Alberto Hojel, Yutong Bai, Trevor Darrell, Amir Globerson, Amir Bar ECCV, 2024 paper / code / model
	Mask Guided Matting via Progressive Refinement Network Qihang Yu, Jianming Zhang, He Zhang, Yilin Wang, Zhe Lin, Ning Xu, Yutong Bai, Alan Yuille CVPR, 2021 paper / code
	Glance-and-Gaze Vision Transformer Qihang Yu, Yingda Xia, Yutong Bai, Yongyi Lu, Alan L. Yuille, Wei Shen NeurIPS, 2021 paper
	Can CNNs Be More Robust Than Transformers? Zeyu Wang, Yutong Bai, Yuyin Zhou, Cihang Xie ICLR, 2023 paper / code
	LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning Dantong Niu, Yuvan Sharma, Giscard Biamby, Jerome Quenum, Yutong Bai, Baifeng Shi, Trevor Darrell, Roei Herzig CoRL, 2024 paper / project page / code / data
	Making Your First Choice: To Address Cold Start Problem in Medical Active Learning Liangyu Chen, Yutong Bai, Siyu Huang, Yongyi Lu, Bihan Wen, Alan Yuille, Zongwei Zhou PMLR, 2022 paper / code
	Focalizing regions of biomarker relevance facilitates biomarker prediction on histopathological images Jiefeng Gan, Hanchen Wang, Hui Yu, Zitong He, Wenjuan Zhang, Ke Ma, Lianghui Zhu, Yutong Bai, Zongwei Zhou, Alan Yuille, Xiang Bai, Mingwei Wang, Dehua Yang, Yanyan Chen, Guoan Chen, Joan Lasenby, Chao Cheng, Jia Wu, Jianjun Zhang, Xinggang Wang, Yaobing Chen, Guoping Wang, Tian Xia iScience, 2023 paper
	Vector Quantized Feature Fields for Fast 3D Semantic Lifting George Tang, Aditya Agarwal, Weiqiao Han, Trevor Darrell, Yutong Bai Arxiv, 2025 paper
	Fast AdvProp Jieru Mei, Yucheng Han, Yutong Bai, Yixiao Zhang, Yingwei Li, Xianhang Li, Alan Yuille, Cihang Xie ICLR, 2022 paper / code / model