Lighthouse is a user-friendly library for reproducible video moment retrieval and highlight detection (MR-HD). It supports eight models, five feature types (video and audio), and six datasets for reproducible MR-HD, MR, and HD. In addition, we provide an inference API and a Gradio demo so that developers can easily use state-of-the-art MR-HD approaches. Furthermore, Lighthouse supports audio moment retrieval (AMR), the task of identifying relevant moments in an audio input given a text query.
News
- [2026/01/18] Our work "CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries" has been accepted at ICASSP 2026.
- [2025/11/20] Version 1.2 has been released together with our work "CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries". This update adds support for CASTELLA, a new AMR dataset.
- [2025/06/04] Version 1.1 has been released. It includes API changes, an AMR Gradio demo, and Hugging Face wrappers for the audio moment retrieval models and the Clotho-Moment dataset.
- [2024/12/24] Our work "Language-based audio moment retrieval" has been accepted at ICASSP 2025.
- [2024/10/22] Version 1.0 has been released.
- [2024/10/6] Our paper has been accepted at EMNLP 2024, System Demonstrations track.
- [2024/09/25] Our work "Language-based audio moment retrieval" has been released. Lighthouse supports AMR.
- [2024/08/22] Our demo paper is available on arXiv. Any comments are welcome: Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection.
Installation
Install ffmpeg first. If you are an Ubuntu user, run:
apt install ffmpeg
Then install PyTorch, torchvision, and torchaudio according to your GPU environment. Note that the inference API also works in CPU-only environments. We tested the code with Python 3.9 and CUDA 11.8:
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
Finally, run the following to install the dependencies:
pip install 'git+https://github.com/line/lighthouse.git'
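To check that the installation succeeded, the following minimal snippet (just an import sanity check, not one of the official examples) should run without errors:

```python
# Sanity check: the package and one of the documented predictor classes
# should import cleanly after installation.
import torch
from lighthouse.models import CGDETRPredictor

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("lighthouse import OK:", CGDETRPredictor.__name__)
```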
Inference API (Available for both CPU/GPU mode)
Lighthouse supports the following inference API:
```python
import torch
from lighthouse.models import CGDETRPredictor

# use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"

# slowfast_path is necessary if you use clip_slowfast features
query = 'A man is speaking in front of the camera'
model = CGDETRPredictor('/path/to/weight.ckpt', device=device,
                        feature_name='clip_slowfast', slowfast_path='SLOWFAST_8x8_R50.pkl')

# encode video features
video = model.encode_video('api_example/RoripwjYFp8_60.0_210.0.mp4')

# moment retrieval & highlight detection
prediction = model.predict(query, video)
print(prediction)
"""
pred_relevant_windows: [[start, end, score], ...,]
pred_saliency_scores: [score, ...]

{'query': 'A man is speaking in front of the camera',
 'pred_relevant_windows': [[117.1296, 149.4698, 0.9993], [-0.1683, 5.4323, 0.9631], [13.3151, 23.42, 0.8129], ...],
 'pred_saliency_scores': [-10.868017196655273, -12.097496032714844, -12.483806610107422, ...]}
"""
```
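The returned dictionary can be post-processed directly. Continuing the example above, the sketch below (not part of the library API; the 0.9 threshold and the second query are arbitrary illustrations) keeps only high-confidence moments and reuses the encoded video for another query:

```python
# Keep only moments whose confidence score exceeds a threshold
# (0.9 is an arbitrary example value).
threshold = 0.9
confident_moments = [
    (start, end, score)
    for start, end, score in prediction['pred_relevant_windows']
    if score >= threshold
]
for start, end, score in confident_moments:
    print(f"moment {start:.1f}s - {end:.1f}s (score={score:.3f})")

# The encoded video can be reused for further queries without
# calling encode_video again.
another_query = 'A person is walking on the street'  # arbitrary example query
another_prediction = model.predict(another_query, video)
```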
Lighthouse also supports the AMR inference API:
```python
import torch
from lighthouse.models import QDDETRPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"

model = QDDETRPredictor('/path/to/weight.ckpt', device=device, feature_name='clap')

audio = model.encode_audio('api_example/1a-ODBWMUAE.wav')

query = 'Water cascades down from a waterfall.'
prediction = model.predict(query, audio)
print(prediction)
```
Run python api_example/demo.py (MR-HD) or python api_example/amr_demo.py (AMR) to reproduce the results. The pre-trained weights are downloaded automatically.
If you want to use other models, download their pre-trained weights.
When using clip_slowfast features, you also need to download the SlowFast pre-trained weights.
When using clip_slowfast_pann features, download the PANNs weights in addition to the SlowFast weights.
Limitation: The maximum video duration is 150s due to the current benchmark datasets.
For CPU users, set feature_name='clip' because extracting CLIP+Slowfast or CLIP+Slowfast+PANNs features is very slow without a GPU.
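A minimal CPU-only setup, assuming you have a checkpoint trained with CLIP features (the checkpoint path below is a placeholder), might look like this:

```python
from lighthouse.models import CGDETRPredictor

# CPU-only setup: CLIP features do not require the SlowFast or PANNs weights.
# '/path/to/clip_weight.ckpt' is a placeholder for a CLIP-feature checkpoint.
model = CGDETRPredictor('/path/to/clip_weight.ckpt', device='cpu', feature_name='clip')

video = model.encode_video('api_example/RoripwjYFp8_60.0_210.0.mp4')
prediction = model.predict('A man is speaking in front of the camera', video)
print(prediction['pred_relevant_windows'][:3])
```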
Gradio demo
Run python gradio_demo/demo.py. Upload a video, enter a text query, and click the blue button. For the AMR demo, run python gradio_demo/amr_demo.py.
Supported models, datasets, and features
Models
Moment retrieval & highlight detection
- Moment-DETR (Lei et al. NeurIPS21)
- QD-DETR (Moon et al. CVPR23)
- EaTR (Jang et al. ICCV23)
- CG-DETR (Moon et al. arXiv24)
- UVCOM (Xiao et al. CVPR24)
- TR-DETR (Sun et al. AAAI24)
- TaskWeave (Jin et al. CVPR24)
- R2-Tuning (Liu et al. ECCV24)
Datasets
Moment retrieval & highlight detection
- QVHighlights (Lei et al. NeurIPS21)
- QVHighlights w/ Audio Features (Lei et al. NeurIPS21)
- QVHighlights ASR Pretraining (Lei et al. NeurIPS21)
Moment retrieval
- ActivityNet Captions (Krishna et al. ICCV17)
- Charades-STA (Gao et al. ICCV17)
- TaCoS (Regneri et al. TACL13)
Highlight detection
- TVSum (Song et al. CVPR15)
- YouTube Highlight (Sun et al. ECCV14)
Audio moment retrieval
- Clotho-Moment (Munakata et al. ICASSP25)
- CASTELLA (ICASSP26)
Features
- ResNet+GloVe
- CLIP
- CLIP+Slowfast
- CLIP+Slowfast+PANNs (Audio) for QVHighlights
- I3D+CLIP (Text) for TVSum
- CLAP (Audio) for audio moment retrieval
Reproduce the experiments
Pre-trained weights
Pre-trained weights can be downloaded from here. Download and unzip them in the top-level lighthouse/ directory (see the directory layout below). AMR models trained on CASTELLA and Clotho-Moment are available here.
Datasets
Due to copyright issues, we distribute only the feature files.
Download them and place them under the ./features directory.
To extract features from videos, we use HERO_Video_Feature_Extractor.
For AMR, download features from here.
The directory structure should look like this:
lighthouse/
├── api_example
├── configs
├── data
├── features # Download the features and place them here
│ ├── ActivityNet
│ │ ├── clip
│ │ ├── clip_text
│ │ ├── resnet
│ │ └── slowfast
│ ├── Charades
│ │ ├── clip
│ │ ├── clip_text
│ │ ├── resnet
│ │ └── slowfast
│ ├── QVHighlight
│ │ ├── clip
│ │ ├── clip_text
│ │ ├── pann
│ │ ├── resnet
│ │ └── slowfast
│ ├── tacos
│ │ ├── clip
│ │ ├── clip_text
│ │ ├── resnet
│ │ └── slowfast
│ ├── tvsum
│ │ ├── clip
│ │ ├── clip_text
│ │ ├── i3d
│ │ ├── resnet
│ │ └── slowfast
│ ├── youtube_highlight
│ │ ├── clip
│ │ ├── clip_text
│ │ └── slowfast
│ └── clotho-moments
│ ├── clap
│ └── clap_text
├── gradio_demo
├── images
├── lighthouse
├── results # The pre-trained weights are saved in this directory
└── training
Training and evaluation
Training
The training command is:
python training/train.py --model MODEL --dataset DATASET --feature FEATURE [--resume RESUME] [--domain DOMAIN]
| Option | Valid values |
|---|---|
| Model | moment_detr, qd_detr, eatr, cg_detr, uvcom, tr_detr, taskweave_mr2hd, taskweave_hd2mr |
| Feature | resnet_glove, clip, clip_slowfast, clip_slowfast_pann, i3d_clip, clap |
| Dataset | qvhighlight, qvhighlight_pretrain, activitynet, charades, tacos, tvsum, youtube_highlight, clotho-moment |
(Example 1) Moment DETR w/ CLIP+Slowfast on QVHighlights:
python training/train.py --model moment_detr --dataset qvhighlight --feature clip_slowfast
(Example 2) Moment DETR w/ CLIP+Slowfast+PANNs (Audio) on QVHighlights:
python training/train.py --model moment_detr --dataset qvhighlight --feature clip_slowfast_pann
(Pre-training & fine-tuning, QVHighlights only) Lighthouse supports pre-training. First run:
python training/train.py --model moment_detr --dataset qvhighlight_pretrain --feature clip_slowfast
Then fine-tune the model with the --resume option:
python training/train.py --model moment_detr --dataset qvhighlight --feature clip_slowfast --resume results/moment_detr/qvhighlight_pretrain/clip_slowfast/best.ckpt
(TVSum and YouTube Highlight) To train models on these two datasets, you need to specify the --domain option:
python training/train.py --model moment_detr --dataset tvsum --feature clip_slowfast --domain BK
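If you want to train on all domains, a small driver script such as the sketch below can loop over them (the domain codes listed here are the ten standard TVSum categories; this script is an illustration, not part of the repository):

```python
import subprocess

# The ten standard TVSum domain codes. For YouTube Highlight,
# replace this list with its own domain names.
TVSUM_DOMAINS = ['BK', 'BT', 'DS', 'FM', 'GA', 'MS', 'PK', 'PR', 'VT', 'VU']

for domain in TVSUM_DOMAINS:
    subprocess.run([
        'python', 'training/train.py',
        '--model', 'moment_detr',
        '--dataset', 'tvsum',
        '--feature', 'clip_slowfast',
        '--domain', domain,
    ], check=True)
```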
Evaluation
The evaluation command is:
python training/evaluate.py --model MODEL --dataset DATASET --feature FEATURE --split {val,test} --model_path MODEL_PATH --eval_path EVAL_PATH [--domain DOMAIN]
(Example 1) Evaluating Moment DETR w/ CLIP+Slowfast on the QVHighlights val set:
python training/evaluate.py --model moment_detr --dataset qvhighlight --feature clip_slowfast --split val --model_path results/moment_detr/qvhighlight/clip_slowfast/best.ckpt --eval_path data/qvhighlight/highlight_val_release.jsonl
To generate submission files for the QVHighlights test set, change --split to test (QVHighlights only):
python training/evaluate.py --model moment_detr --dataset qvhighlight --feature clip_slowfast --split test --model_path results/moment_detr/qvhighlight/clip_slowfast/best.ckpt --eval_path data/qvhighlight/highlight_test_release.jsonl
Then zip val_submission.jsonl and test_submission.jsonl, and submit the archive to CodaLab (QVHighlights only):
zip -r submission.zip val_submission.jsonl test_submission.jsonl
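Equivalently, the archive can be created with a short Python snippet (file names as in the zip command above):

```python
import zipfile

# Bundle the validation and test prediction files into submission.zip.
with zipfile.ZipFile('submission.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.write('val_submission.jsonl')
    zf.write('test_submission.jsonl')
```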
HuggingFace Wrapper
We provide wrappers for Hugging Face.
Models and datasets can be loaded easily via AutoModel and huggingface_hub.
The following models and datasets are available through the wrappers (a usage sketch follows the lists below).
Models
Datasets
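As a rough sketch of how the Hub can be combined with the inference API, the snippet below downloads a checkpoint with huggingface_hub and loads it with a predictor; the repository ID and file name are hypothetical placeholders, so check the actual names on our Hugging Face page:

```python
from huggingface_hub import hf_hub_download
from lighthouse.models import CGDETRPredictor

# NOTE: the repo_id and filename below are hypothetical placeholders;
# replace them with the actual repository ID and checkpoint name on the Hub.
ckpt_path = hf_hub_download(
    repo_id='your-org/lighthouse-cg-detr-qvhighlight',
    filename='best.ckpt',
)
model = CGDETRPredictor(ckpt_path, device='cpu', feature_name='clip')
```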
Citation
Lighthouse
```bibtex
@InProceedings{taichi2024emnlp,
  author    = {Taichi Nishimura and Shota Nakada and Hokuto Munakata and Tatsuya Komatsu},
  title     = {Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection},
  booktitle = {Proceedings of The 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  year      = {2024},
}
```
Audio moment retrieval
```bibtex
@InProceedings{hokuto2025icassp,
  author    = {Hokuto Munakata and Taichi Nishimura and Shota Nakada and Tatsuya Komatsu},
  title     = {Language-based Audio Moment Retrieval},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing},
  year      = {2025},
}
```
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
LICENSE
Apache License 2.0
Contact
Taichi Nishimura (taichitary@gmail.com)

