Zhiding's Homepage

I am a principal research scientist and research lead at the Learning and Perception Research Group, NVIDIA Research. Before joining NVIDIA in 2018, I obtained a Ph.D. in ECE from Carnegie Mellon University in 2017 and an M.Phil. in ECE from The Hong Kong University of Science and Technology in 2012. I graduated with a bachelor's degree from the Union Class of Electrical Engineering (FENG Bingquan Pilot Class), South China University of Technology, in 2008.

I am interested in building general autonomy and intelligence across both virtual and physical domains. My recent focus lies in Vision Transformers, LLMs, multimodal LLMs, and vision-language-action (VLA) models, with applications spanning open-world understanding, reasoning, AV/robot perception-planning, and agentic systems. I have led or contributed to numerous flagship research efforts and products at NVIDIA, including SegFormer (Most Influential NeurIPS Paper, Demo), VoxFormer, FB-BEV/FB-OCC (CVPR23 3D Occ Pred Challenge winner, video), Hydra-MDP (CVPR24 E2E Driving Challenge winner, video), the Eagle VLM project, Nemotron, Llama-Nemotron-VL, and GR00T N1/GR00T N1.5 (NVIDIA’s foundation models for humanoid robots). I also participated in designing NVIDIA’s next-generation end-to-end autonomous driving system. My work is characterized by state-of-the-art performance, scalable architectures, and data-centric strategies for real-world generalization.

Please refer to Google Scholar for the list of my latest publications.

NVIDIA (Santa Clara, CA)
Principal Research Scientist & Research Lead
I conduct research in multimodal learning and intelligent data strategies. I lead the Eagle VLM project, which develops a family of frontier vision-language models with public training/data recipes and state-of-the-art performance that matches or outperforms existing top-tier VLMs. Our work has laid the core VLM foundation and data strategy behind several flagship NVIDIA products and projects, including Llama-Nemotron-VL, Nemo Retriever Multimodal Embedding, GR00T N1, and GR00T N1.5.


2018.01 - Present

Mitsubishi Electric Research Laboratories (Cambridge, MA)
Research Intern, Computer Vision Group
Proposed a state-of-the-art deep learning framework for semantic edge detection.

2016.07 - 2016.10

Microsoft Research (Redmond, WA)
Research Intern, Multimedia, Interaction, and Communication (MIC) Group
Worked on deep-learning-based facial expression recognition. The work was integrated into Azure Cognitive Services (Media Coverage).

2015.06 - 2015.08

Adobe Research (San Jose, CA)
Research Intern, Computer Vision Group
Worked on voice-based photo editing.

2013.06 - 2013.08

Carnegie Mellon University

Ph.D. in Electrical and Computer Engineering

2012 - 2017

The Hong Kong University of Science and Technology

M.Phil. in Electronic and Computer Engineering

2009 - 2012

South China University of Technology

B.Eng. in Information Engineering (Talented Student Program)

2005 - 2008