Qing Li
Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation
Pengxiang Li, Zechen Hu, Zirui Shang, Jingrong Wu, Yang Liu, Hui Liu, Zhi Gao, Chenrui Shi, Bofei Zhang, Zihao Zhang, Xiaochuan Shi, Zedong YU, Yuwei Wu✉, Xinxiao Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, and Qing Li✉
arXiv preprint arXiv:2509.23866, 2025
TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials
Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, and Qing Li✉
arXiv preprint arXiv:2504.12679, 2025
Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning
Pengxiang Li*, Zhi Gao*, Bofei Zhang, Yapeng Mi, Xiaojian Ma, Chenrui Shi, Tao Yuan, Yuwei Wu✉, Yunde Jia, Song-Chun Zhu, and Qing Li✉
Neural Information Processing Systems (NeurIPS), 2025
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes
Neural Information Processing Systems: Datasets and Benchmarks (NeurIPS D&B), 2025
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation (Highlight)
Ziyu Zhu, Xilin Wang, Yixuan Li, Zhuofan Zhang, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Wei Liang, Qian Yu, Zhidong Deng✉, Siyuan Huang✉, and Qing Li✉
International Conference on Computer Vision (ICCV), 2025
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding (Highlight)
International Conference on Computer Vision (ICCV), 2025
Falcon: Fast Visuomotor Policies via Partial Denoising
Haojun Chen, Minghao Liu, Chengdong Ma, Xiaojian Ma, Zailin Ma, Huimin Wu, Yuanpei Chen, Yifan Zhong, Mingzhi Wang, Qing Li✉, and Yaodong Yang✉
International Conference on Machine Learning (ICML), 2025
Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage (Spotlight)
International Conference on Learning Representations (ICLR), 2025
MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge
International Conference on Learning Representations (ICLR), 2025
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models
Neural Information Processing Systems: Datasets and Benchmarks (NeurIPS D&B), 2024
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
Haozhe Zhao*, Xiaojian Ma*, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li✉, and Baobao Chang✉
Neural Information Processing Systems: Datasets and Benchmarks (NeurIPS D&B), 2024
Task-oriented Sequential Grounding in 3D Scenes
arXiv preprint arXiv:2408.04034, 2024
End-to-End Neuro-Symbolic Reinforcement Learning with Textual Explanations (Spotlight, top 3.5%)
International Conference on Machine Learning (ICML), 2024
Unifying 3D Vision-Language Understanding Via Promptable Queries
European Conference on Computer Vision (ECCV), 2024
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
European Conference on Computer Vision (ECCV), 2024
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
European Conference on Computer Vision (ECCV), 2024
CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
An Embodied Generalist Agent in 3D World
International Conference on Machine Learning (ICML), 2024
Neural-Symbolic Recursive Machine for Systematic Generalization
International Conference on Learning Representations (ICLR), 2024
Bongard-OpenWorld: Few-Shot Reasoning for Free-Form Visual Concepts in the Real World
International Conference on Learning Representations (ICLR), 2024
Learning Non-Markovian Decision-Making from State-Only Sequences
Neural Information Processing Systems (NeurIPS), 2023
A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics (Notable-top-25%)
International Conference on Learning Representations (ICLR), 2023
3D-VisTA: Pre-Trained Transformer for 3D Vision and Text Alignment
International Conference on Computer Vision (ICCV), 2023
SQA3D: Situated Question Answering in 3D Scenes
International Conference on Learning Representations (ICLR), 2023
SMART: A Situation Model for Algebra Story Problems via Attributed Grammar
AAAI Conference on Artificial Intelligence (AAAI), 2021
Learning by Fixing: Solving Math Word Problems with Weak Supervision
AAAI Conference on Artificial Intelligence (AAAI), 2021
YouRefIt: Embodied Reference Understanding with Language and Gesture (Oral)
International Conference on Computer Vision (ICCV), 2021
VLGrammar: Grounded Grammar Induction of Vision and Language
International Conference on Computer Vision (ICCV), 2021
A Competence-Aware Curriculum for Visual Concepts Learning Via Question Answering (Oral)
European Conference on Computer Vision (ECCV), 2020
Closed Loop Neural-Symbolic Learning Via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning (Best Paper in ICML Workshop)
International Conference on Machine Learning (ICML), 2020
Why Does a Visual Question Have Different Answers?
Nilavra Bhattacharya, Qing Li, and Danna Gurari
International Conference on Computer Vision (ICCV), 2019
VizWiz-Priv: A Dataset for Recognizing the Presence and Purpose of Private Visual Information in Images Taken by Blind People
Danna Gurari, Qing Li, Chi Lin, Yinan Zhao, Anhong Guo, Abigale Stangl, and Jeffrey P Bigham
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019
Tell-and-Answer: Towards Explainable Visual Question Answering Using Attributes and Captions (Oral)
Annual Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018
VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions
European Conference on Computer Vision (ECCV), 2018
VizWiz Grand Challenge: Answering Visual Questions from Blind People (Spotlight)
Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018
Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation (Best Paper Finalist)
International Conference on Multimedia Retrieval (ICMR), 2016