Qing Li

  • 2025dart.jpg

    Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation

    Pengxiang Li, Zechen Hu, Zirui Shang, Jingrong Wu, Yang Liu, Hui Liu, Zhi Gao, Chenrui Shi, Bofei Zhang, Zihao Zhang, Xiaochuan Shi, Zedong YU, Yuwei Wu, Xinxiao Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, and Qing Li

    arXiv preprint arXiv:2509.23866, 2025

  • 2025tongui.jpg

    TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials

    Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, and Qing Li

    arXiv preprint arXiv:2504.12679, 2025

  • 2025sport.jpg

    Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning

    Pengxiang Li*, Zhi Gao*, Bofei Zhang, Yapeng Mi, Xiaojian Ma, Chenrui Shi, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, and Qing Li

    Neural Information Processing Systems (NeurIPS), 2025

  • 2025anywhere3d.jpg

    From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

    Neural Information Processing Systems: Datasets and Benchmarks (NeurIPS D&B), 2025

  • 2025mtu.gif

    Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation (Highlight)

    Ziyu Zhu, Xilin Wang, Yixuan Li, Zhuofan Zhang, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Wei Liang, Qian Yu, Zhidong Deng, Siyuan Huang, and Qing Li

    International Conference on Computer Vision (ICCV), 2025

  • 2025eva.png

    Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding (Highlight)

    International Conference on Computer Vision (ICCV), 2025

  • Falcon: Fast Visuomotor Policies via Partial Denoising

    Haojun Chen, Minghao Liu, Chengdong Ma, Xiaojian Ma, Zailin Ma, Huimin Wu, Yuanpei Chen, Yifan Zhong, Mingzhi Wang, Qing Li, and Yaodong Yang

    International Conference on Machine Learning (ICML), 2025

  • 2025beacon3d.png

    Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis

    The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  • 2025mat.png

    Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage (Spotlight)

    International Conference on Learning Representations (ICLR), 2025

  • 2025mmke.png

    MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge

    International Conference on Learning Representations (ICLR), 2025

  • 2024fire.png

    FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models

    Neural Information Processing Systems: Datasets and Benchmarks (NeurIPS D&B), 2024

  • 2024ultraedit.png

    UltraEdit: Instruction-based Fine-Grained Image Editing at Scale

    Haozhe Zhao*, Xiaojian Ma*, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang

    Neural Information Processing Systems: Datasets and Benchmarks (NeurIPS D&B), 2024

  • 2024sg3d.png

    Task-oriented Sequential Grounding in 3D Scenes

    arXiv preprint arXiv:2408.04034, 2024

  • luo2024insight.png

    End-to-End Neuro-Symbolic Reinforcement Learning with Textual Explanations (Spotlight, top 3.5%)

    International Conference on Machine Learning (ICML), 2024

  • zhu2024unifying.png

    Unifying 3D Vision-Language Understanding Via Promptable Queries

    European Conference on Computer Vision (ECCV), 2024

  • fan2024videoagent.png

    VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

    European Conference on Computer Vision (ECCV), 2024

  • jia2024sceneverse.png

    SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

    European Conference on Computer Vision (ECCV), 2024

  • gao2024clova.png

    CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update

    The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  • huang2024embodied.png

    An Embodied Generalist Agent in 3D World

    International Conference on Machine Learning (ICML), 2024

  • li2024nsr.png

    Neural-Symbolic Recursive Machine for Systematic Generalization

    International Conference on Learning Representations (ICLR), 2024

  • wu2024bongard.png

    Bongard-OpenWorld: Few-Shot Reasoning for Free-Form Visual Concepts in the Real World

    International Conference on Learning Representations (ICLR), 2024

  • qin2023learning.png

    Learning Non-Markovian Decision-Making from State-Only Sequences

    Neural Information Processing Systems (NeurIPS), 2023

  • li2023hint.png

    A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics (Notable top 25%)

    International Conference on Learning Representations (ICLR), 2023

  • zhu2023vista.png

    3D-VisTA: Pre-Trained Transformer for 3D Vision and Text Alignment

    International Conference on Computer Vision (ICCV), 2023

  • ma2023sqa3d.png

    SQA3D: Situated Question Answering in 3D Scenes

    International Conference on Learning Representations (ICLR), 2023

  • hong2021smart.png

    SMART: A Situation Model for Algebra Story Problems via Attributed Grammar

    AAAI Conference on Artificial Intelligence (AAAI), 2021

  • hong2021learning.png

    Learning by Fixing: Solving Math Word Problems with Weak Supervision

    AAAI Conference on Artificial Intelligence (AAAI), 2021

  • chen2021yourefit.png

    YouRefIt: Embodied Reference Understanding with Language and Gesture (Oral)

    International Conference on Computer Vision (ICCV), 2021

  • hong2021vlgrammar.png

    VLGrammar: Grounded Grammar Induction of Vision and Language

    International Conference on Computer Vision (ICCV), 2021

  • li2020competence.png

    A Competence-Aware Curriculum for Visual Concepts Learning Via Question Answering (Oral)

    European Conference on Computer Vision (ECCV), 2020

  • li2020ngs.png

    Closed Loop Neural-Symbolic Learning Via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning (Best Paper in ICML Workshop)

    International Conference on Machine Learning (ICML), 2020

  • bhattacharya2019visual.png

    Why Does a Visual Question Have Different Answers?

    Nilavra Bhattacharya, Qing Li, and Danna Gurari

    International Conference on Computer Vision (ICCV), 2019

  • gurari2019vizwizpriv.png

    VizWiz-Priv: A Dataset for Recognizing the Presence and Purpose of Private Visual Information in Images Taken by Blind People

    Danna Gurari, Qing Li, Chi Lin, Yinan Zhao, Anhong Guo, Abigale Stangl, and Jeffrey P Bigham

    The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  • li2018tell.png

    Tell-and-Answer: Towards Explainable Visual Question Answering Using Attributes and Captions (Oral)

    Annual Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018

  • li2018vqa.png

    VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions

    European Conference on Computer Vision (ECCV), 2018

  • gurari2018vizwiz.png

    VizWiz Grand Challenge: Answering Visual Questions from Blind People (Spotlight)

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham

    The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018

  • li2016action.png

    Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation (Best Paper Finalist)

    International Conference on Multimedia Retrieval, 2016