🎬 VideoExplorer: Thinking with Video for Long-Form Understanding
🎬 Demo
example_sports.mov
👉 Introduction
VideoExplorer is a novel framework for long-video understanding that moves beyond single-pass reasoning. Inspired by the "thinking with video" principle, it performs faithful, efficient, and interpretable reasoning by dynamically exploring video content.
🎉 News
2025.10.16 - We released VideoExplorer, the newest version of VideoDeepResearch! It is smaller and cheaper, yet just as effective at long-video understanding. See our updated paper for details. ✨
2025.06.10 - We released the first version of VideoDeepResearch. 🎬
🚀 Overview
Long-video understanding is challenging. Existing methods often sacrifice detail by downsampling or rely on task-agnostic representations, limiting their perception.
VideoExplorer solves this by intertwining planning, temporal grounding, and scalable perception into a coherent, iterative loop:
- Formulates a sub-question.
- Locates the relevant moments.
- Performs task-oriented, fine-grained perception.
- Repeats until the final answer is reached.
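The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: `explore`, `Step`, `Trajectory`, and the three callables (`plan`, `ground`, `perceive`) are hypothetical names, and the toy stubs stand in for the actual planner, temporal grounder, and perception model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec) of a located moment


@dataclass
class Step:
    sub_question: str
    span: Span
    observation: str


@dataclass
class Trajectory:
    question: str
    steps: List[Step] = field(default_factory=list)
    answer: Optional[str] = None


def explore(
    question: str,
    plan: Callable[[Trajectory], Tuple[Optional[str], Optional[str]]],
    ground: Callable[[str], Span],
    perceive: Callable[[str, Span], str],
    max_steps: int = 8,
) -> Trajectory:
    """Run the plan -> ground -> perceive loop until the planner answers."""
    traj = Trajectory(question)
    for _ in range(max_steps):
        # The planner returns either a new sub-question or a final answer.
        sub_q, answer = plan(traj)
        if answer is not None:
            traj.answer = answer
            break
        span = ground(sub_q)                 # locate the relevant moment
        observation = perceive(sub_q, span)  # fine-grained look at that moment
        traj.steps.append(Step(sub_q, span, observation))
    return traj


# Toy stubs (hypothetical) so the sketch runs end to end.
def toy_plan(traj: Trajectory) -> Tuple[Optional[str], Optional[str]]:
    if traj.steps:  # one observation is enough for this toy question
        return None, traj.steps[-1].observation
    return "Who scores the first goal?", None


def toy_ground(sub_q: str) -> Span:
    return (12.0, 18.5)


def toy_perceive(sub_q: str, span: Span) -> str:
    return "the player in red"


result = explore("Who scored first?", toy_plan, toy_ground, toy_perceive)
```

Because every step is stored in `Trajectory.steps`, the reasoning path (sub-question, time span, observation) remains inspectable after the answer is produced.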
💡 Key Features
- Iterative Reasoning: Dynamically explores video content instead of relying on a static context.
- Task-Oriented Perception: Focuses computational resources on relevant moments, enabling scalable analysis.
- Interpretable Trajectories: Each step of the reasoning process is transparent and traceable.
🏛️ Framework & Training
To overcome the lack of long-video understanding (LVU) training data, we constructed a high-quality dataset using difficulty-adaptive sampling. Our training pipeline consists of two stages:
- Supervised Trajectory Initialization
- Trajectory-level Preference Optimization
This two-stage approach encourages adaptive temporal grounding and iterative information integration guided by downstream rewards.
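The exact trajectory-level objective is described in the paper rather than in this README. As a rough sketch, assuming it follows the standard DPO loss with each response replaced by a whole trajectory, the loss for one preference pair could be computed as below. The function names and the default `beta` are illustrative assumptions.

```python
import math
from typing import Sequence


def trajectory_logprob(step_logprobs: Sequence[float]) -> float:
    """Log-probability of a trajectory as the sum of its per-step log-probs."""
    return sum(step_logprobs)


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # assumed value, not taken from the repository
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of trajectories.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(m)) == softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference model agree, the margin is zero and the loss equals `log(2)`; the loss falls as the policy raises the preferred trajectory's likelihood relative to the rejected one.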
📈 Results
Extensive evaluations on popular long-video benchmarks show that VideoExplorer achieves significant performance advantages over existing baselines, demonstrating its robustness, adaptability, and efficiency.
🚀 Quick Start
1. Clone & Install
# Clone repository
git clone https://github.com/yhy-2000/VideoDeepResearch.git
cd VideoDeepResearch
# Install dependencies
pip install -r requirements.txt
Project Layout:
VideoDeepResearch/
├── requirements.txt # Python dependencies
├── eval/ # Code for evaluating benchmarks
├── train/ # Code for supervised finetuning (SFT) and trajectory-based direct preference optimization (TDPO)
├── asset/ # Assets used in the demo
├── data/
│ ├── videos/ # Raw video files
│ ├── clips/ # Generated video clips
│ └── dense_frames/ # Extracted key frames
└── README.md # This documentation
Launch Demo
Evaluation on Benchmarks
Training
Our training dataset is available at https://huggingface.co/datasets/avery00/VideoExplorer-Dataset/tree/main. To set up:
- Place dpo_marathon.json in train/LLaMA-Factory-dpo/data.
- Place the remaining two files in train/LLaMA-Factory-sft/data.
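If you prefer to script this step, the sketch below routes the downloaded files to the two directories. It assumes the download folder contains only the three dataset files; `place_dataset_files` is a hypothetical helper, not part of the repository.

```python
import shutil
from pathlib import Path
from typing import Dict

DPO_FILE = "dpo_marathon.json"


def place_dataset_files(download_dir: Path, repo_root: Path) -> Dict[str, Path]:
    """Copy dpo_marathon.json to the DPO data dir and all other files to the SFT data dir."""
    dpo_data = repo_root / "train" / "LLaMA-Factory-dpo" / "data"
    sft_data = repo_root / "train" / "LLaMA-Factory-sft" / "data"
    placed: Dict[str, Path] = {}
    for src in sorted(download_dir.iterdir()):
        if not src.is_file():
            continue
        dest_dir = dpo_data if src.name == DPO_FILE else sft_data
        dest_dir.mkdir(parents=True, exist_ok=True)  # no-op in a normal checkout
        placed[src.name] = Path(shutil.copy2(src, dest_dir))
    return placed
```

For example, `place_dataset_files(Path("downloads"), Path("."))` run from the repository root returns a mapping from each file name to its new location.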
Environment Setting
mv train/LLaMA-Factory-sft train/LLaMA-Factory-main
cd train/LLaMA-Factory-main
pip install -e ".[torch,metrics]" --no-build-isolation
# return to the repository root before renaming the directory back
cd ../..
mv train/LLaMA-Factory-main train/LLaMA-Factory-sft
Supervised Finetuning
# load the right code (run from the repository root)
mv train/LLaMA-Factory-sft train/LLaMA-Factory-main
cd train
# finetune the planner
bash sft_planner.sh
# finetune the temporal grounder
bash sft_temporal_grounding_agent.sh
cd ..
mv train/LLaMA-Factory-main train/LLaMA-Factory-sft
Trajectory-based Direct Preference Optimization
# load the right code
mv train/LLaMA-Factory-dpo train/LLaMA-Factory-main
# Trajectory-based DPO
bash train/dpo_planner.sh
mv train/LLaMA-Factory-main train/LLaMA-Factory-dpo
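The commands above repeatedly rename one LLaMA-Factory variant to train/LLaMA-Factory-main and rename it back afterwards. If a training script fails midway, the rename-back step is easy to forget. The sketch below wraps the swap in a context manager so the directory is restored even on failure; `active_codebase` and `run_dpo` are hypothetical helpers, not part of the repository.

```python
import contextlib
import subprocess
from pathlib import Path
from typing import Iterator


@contextlib.contextmanager
def active_codebase(variant_dir: Path, active_dir: Path) -> Iterator[Path]:
    """Temporarily rename variant_dir to active_dir, restoring it on exit."""
    if active_dir.exists():
        raise FileExistsError(f"{active_dir} already exists; restore it first")
    variant_dir.rename(active_dir)
    try:
        yield active_dir
    finally:
        # Runs on success and on error, so the variant is always put back.
        active_dir.rename(variant_dir)


def run_dpo(repo_root: Path) -> None:
    """Mirror the DPO commands above; call from a checkout of the repository."""
    train = repo_root / "train"
    with active_codebase(train / "LLaMA-Factory-dpo", train / "LLaMA-Factory-main"):
        subprocess.run(["bash", "train/dpo_planner.sh"], cwd=repo_root, check=True)
```

The same wrapper applies to the SFT stage by passing train/LLaMA-Factory-sft as the variant directory.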
📬 Contact
Encounter issues or have questions? Reach out to:
H.Y. Yuan Email: hyyuan@ruc.edu.cn
📄 Citation
If you find this work helpful, please cite our paper:
@misc{yuan2025thinkvideosagenticlongvideo,
  title={Think With Videos For Agentic Long-Video Understanding},
  author={Huaying Yuan and Zheng Liu and Junjie Zhou and Hongjin Qian and Yan Shu and Nicu Sebe and Ji-Rong Wen and Zhicheng Dou},
  year={2025},
  eprint={2506.10821},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.10821},
}