Lighthouse is a user-friendly library for reproducible video moment retrieval and highlight detection (MR-HD). It supports eight models, five feature types (video and audio), and six datasets for reproducible MR-HD, MR, and HD. In addition, we provide an inference API and a Gradio demo so that developers can easily use state-of-the-art MR-HD approaches. Furthermore, Lighthouse supports audio moment retrieval (AMR), the task of identifying relevant moments in an audio input given a text query.
News
- [2026/01/18] Our work "CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries" has been accepted at ICASSP 2026.
- [2025/11/20] Version 1.2 has been released together with our work "CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries". This update adds support for CASTELLA, a new AMR dataset.
- [2025/06/04] Version 1.1 has been released. It includes API changes, an AMR Gradio demo, and Hugging Face wrappers for the audio moment retrieval models and the Clotho-Moment dataset.
- [2024/12/24] Our work "Language-based audio moment retrieval" has been accepted at ICASSP 2025.
- [2024/10/22] Version 1.0 has been released.
- [2024/10/6] Our paper has been accepted at EMNLP 2024, System Demonstrations track.
- [2024/09/25] Our work "Language-based audio moment retrieval" has been released. Lighthouse supports AMR.
- [2024/08/22] Our demo paper is available on arXiv. Any comments are welcome: Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection.
Installation
Install ffmpeg first. If you are an Ubuntu user, run:
apt install ffmpeg
Then install PyTorch, torchvision, and torchaudio according to your GPU environment. Note that the inference API also works in CPU-only environments. We tested the code with Python 3.9 and CUDA 11.8:
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
Finally, run the following to install the dependencies:
pip install 'git+https://github.com/line/lighthouse.git'
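To check that the installation succeeded, the following minimal snippet (just an import sanity check, not one of the official examples) should run without errors:

```python
# Sanity check: the package and one of the documented predictor classes
# should import cleanly after installation.
import torch
from lighthouse.models import CGDETRPredictor

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("lighthouse import OK:", CGDETRPredictor.__name__)
```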
Inference API (Available for both CPU/GPU mode)
Lighthouse supports the following inference API:
```python
import torch
from lighthouse.models import CGDETRPredictor

# use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"

# slowfast_path is necessary if you use clip_slowfast features
query = 'A man is speaking in front of the camera'
model = CGDETRPredictor('/path/to/weight.ckpt', device=device,
                        feature_name='clip_slowfast', slowfast_path='SLOWFAST_8x8_R50.pkl')

# encode video features
video = model.encode_video('api_example/RoripwjYFp8_60.0_210.0.mp4')

# moment retrieval & highlight detection
prediction = model.predict(query, video)
print(prediction)
"""
pred_relevant_windows: [[start, end, score], ...,]
pred_saliency_scores: [score, ...]

{'query': 'A man is speaking in front of the camera',
 'pred_relevant_windows': [[117.1296, 149.4698, 0.9993], [-0.1683, 5.4323, 0.9631], [13.3151, 23.42, 0.8129], ...],
 'pred_saliency_scores': [-10.868017196655273, -12.097496032714844, -12.483806610107422, ...]}
"""
```
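The returned dictionary can be post-processed directly. Continuing the example above, the sketch below (not part of the library API; the 0.9 threshold and the second query are arbitrary illustrations) keeps only high-confidence moments and reuses the encoded video for another query:

```python
# Keep only moments whose confidence score exceeds a threshold
# (0.9 is an arbitrary example value).
threshold = 0.9
confident_moments = [
    (start, end, score)
    for start, end, score in prediction['pred_relevant_windows']
    if score >= threshold
]
for start, end, score in confident_moments:
    print(f"moment {start:.1f}s - {end:.1f}s (score={score:.3f})")

# The encoded video can be reused for further queries without
# calling encode_video again.
another_query = 'A person is walking on the street'  # arbitrary example query
another_prediction = model.predict(another_query, video)
```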
Lighthouse also supports the AMR inference API:
```python
import torch
from lighthouse.models import QDDETRPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"

model = QDDETRPredictor('/path/to/weight.ckpt', device=device, feature_name='clap')

audio = model.encode_audio('api_example/1a-ODBWMUAE.wav')

query = 'Water cascades down from a waterfall.'
prediction = model.predict(query, audio)
print(prediction)
```
Run python api_example/demo.py (MR-HD) or python api_example/amr_demo.py (AMR) to reproduce the results. The pre-trained weights are downloaded automatically.
If you want to use other models, download their pre-trained weights.
When using clip_slowfast features, you also need to download the SlowFast pre-trained weights.
When using clip_slowfast_pann features, download the PANNs weights in addition to the SlowFast weights.
Limitation: The maximum video duration is 150s due to the current benchmark datasets.
For CPU users, set feature_name='clip' because extracting CLIP+Slowfast or CLIP+Slowfast+PANNs features is very slow without a GPU.
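A minimal CPU-only setup, assuming you have a checkpoint trained with CLIP features (the checkpoint path below is a placeholder), might look like this:

```python
from lighthouse.models import CGDETRPredictor

# CPU-only setup: CLIP features do not require the SlowFast or PANNs weights.
# '/path/to/clip_weight.ckpt' is a placeholder for a CLIP-feature checkpoint.
model = CGDETRPredictor('/path/to/clip_weight.ckpt', device='cpu', feature_name='clip')

video = model.encode_video('api_example/RoripwjYFp8_60.0_210.0.mp4')
prediction = model.predict('A man is speaking in front of the camera', video)
print(prediction['pred_relevant_windows'][:3])
```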
Gradio demo
Run python gradio_demo/demo.py. Upload a video, enter a text query, and click the blue button. For the AMR demo, run python gradio_demo/amr_demo.py.
Supported models, datasets, and features
Models
Moment retrieval & highlight detection
- Moment-DETR (Lei et al. NeurIPS21)
- QD-DETR (Moon et al. CVPR23)
- EaTR (Jang et al. ICCV23)
- CG-DETR (Moon et al. arXiv24)
- UVCOM (Xiao et al. CVPR24)
- TR-DETR (Sun et al. AAAI24)
- TaskWeave (Jin et al. CVPR24)
- R2-Tuning (Liu et al. ECCV24)
Datasets
Moment retrieval & highlight detection
- QVHighlights (Lei et al. NeurIPS21)
- QVHighlights w/ Audio Features (Lei et al. NeurIPS21)
- QVHighlights ASR Pretraining (Lei et al. NeurIPS21)
Moment retrieval
- ActivityNet Captions (Krishna et al. ICCV17)
- Charades-STA (Gao et al. ICCV17)
- TaCoS (Regneri et al. TACL13)
Highlight detection
- TVSum (Song et al. CVPR15)
- YouTube Highlight (Sun et al. ECCV14)
Audio moment retrieval
- Clotho-Moment (Munakata et al. ICASSP25)
- CASTELLA (ICASSP26)
Features
- ResNet+GloVe
- CLIP
- CLIP+Slowfast
- CLIP+Slowfast+PANNs (Audio) for QVHighlights
- I3D+CLIP (Text) for TVSum
- CLAP (Audio) for audio moment retrieval
Reproduce the experiments
Pre-trained weights
Pre-trained weights can be downloaded from here. Download and unzip them in the top-level lighthouse/ directory (see the directory layout below). AMR models trained on CASTELLA and Clotho-Moment are available here.
Datasets
Due to copyright issues, we distribute only the feature files.
Download them and place them under the ./features directory.
To extract features from videos, we use HERO_Video_Feature_Extractor.
For AMR, download features from here.
The directory structure should look like this:
lighthouse/
├── api_example
├── configs
├── data
├── features # Download the features and place them here
│ ├── ActivityNet
│ │ ├── clip
│ │ ├── clip_text
│ │ ├── resnet
│ │ └── slowfast
│ ├── Charades
│ │ ├── clip
│ │ ├── clip_text
│ │ ├── resnet
│ │ └── slowfast
│ ├── QVHighlight
│ │ ├── clip
│ │ ├── clip_text
│ │ ├── pann
│ │ ├── resnet
│ │ └── slowfast
│ ├── tacos
│ │ ├── clip
│ │ ├── clip_text
│ │ ├── resnet
│ │ └── slowfast
│ ├── tvsum
│ │ ├── clip
│ │ ├── clip_text
│ │ ├── i3d
│ │ ├── resnet
│ │ └── slowfast
│ ├── youtube_highlight
│ │ ├── clip
│ │ ├── clip_text
│ │ └── slowfast
│ └── clotho-moments
│ ├── clap
│ └── clap_text
├── gradio_demo
├── images
├── lighthouse
├── results # The pre-trained weights are saved in this directory
└── training
Training and evaluation
Training
The training command is:
python training/train.py --model MODEL --dataset DATASET --feature FEATURE [--resume RESUME] [--domain DOMAIN]
| Option | Valid values |
|---|---|
| Model | moment_detr, qd_detr, eatr, cg_detr, uvcom, tr_detr, taskweave_mr2hd, taskweave_hd2mr |
| Feature | resnet_glove, clip, clip_slowfast, clip_slowfast_pann, i3d_clip, clap |
| Dataset | qvhighlight, qvhighlight_pretrain, activitynet, charades, tacos, tvsum, youtube_highlight, clotho-moment |
(Example 1) Moment DETR w/ CLIP+Slowfast on QVHighlights:
python training/train.py --model moment_detr --dataset qvhighlight --feature clip_slowfast
(Example 2) Moment DETR w/ CLIP+Slowfast+PANNs (Audio) on QVHighlights:
python training/train.py --model moment_detr --dataset qvhighlight --feature clip_slowfast_pann
(Pre-training & fine-tuning, QVHighlights only) Lighthouse supports pre-training. First run:
python training/train.py --model moment_detr --dataset qvhighlight_pretrain --feature clip_slowfast
Then fine-tune the model with the --resume option:
python training/train.py --model moment_detr --dataset qvhighlight --feature clip_slowfast --resume results/moment_detr/qvhighlight_pretrain/clip_slowfast/best.ckpt
(TVSum and YouTube Highlight) To train models on these two datasets, you need to specify the --domain option:
python training/train.py --model moment_detr --dataset tvsum --feature clip_slowfast --domain BK
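If you want to train on all domains, a small driver script such as the sketch below can loop over them (the domain codes listed here are the ten standard TVSum categories; this script is an illustration, not part of the repository):

```python
import subprocess

# The ten standard TVSum domain codes. For YouTube Highlight,
# replace this list with its own domain names.
TVSUM_DOMAINS = ['BK', 'BT', 'DS', 'FM', 'GA', 'MS', 'PK', 'PR', 'VT', 'VU']

for domain in TVSUM_DOMAINS:
    subprocess.run([
        'python', 'training/train.py',
        '--model', 'moment_detr',
        '--dataset', 'tvsum',
        '--feature', 'clip_slowfast',
        '--domain', domain,
    ], check=True)
```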
Evaluation
The evaluation command is:
python training/evaluate.py --model MODEL --dataset DATASET --feature FEATURE --split {val,test} --model_path MODEL_PATH --eval_path EVAL_PATH [--domain DOMAIN]
(Example 1) Evaluating Moment DETR w/ CLIP+Slowfast on the QVHighlights val set:
python training/evaluate.py --model moment_detr --dataset qvhighlight --feature clip_slowfast --split val --model_path results/moment_detr/qvhighlight/clip_slowfast/best.ckpt --eval_path data/qvhighlight/highlight_val_release.jsonl
To generate submission files for the QVHighlights test set, change --split to test (QVHighlights only):
python training/evaluate.py --model moment_detr --dataset qvhighlight --feature clip_slowfast --split test --model_path results/moment_detr/qvhighlight/clip_slowfast/best.ckpt --eval_path data/qvhighlight/highlight_test_release.jsonl
Then zip val_submission.jsonl and test_submission.jsonl, and submit the archive to CodaLab (QVHighlights only):
zip -r submission.zip val_submission.jsonl test_submission.jsonl
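Equivalently, the archive can be created with a short Python snippet (file names as in the zip command above):

```python
import zipfile

# Bundle the validation and test prediction files into submission.zip.
with zipfile.ZipFile('submission.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.write('val_submission.jsonl')
    zf.write('test_submission.jsonl')
```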
HuggingFace Wrapper
We provide wrappers for Hugging Face.
Models and datasets can be loaded easily via AutoModel and huggingface_hub.
The following models and datasets are available through the wrappers (a usage sketch follows the lists below).
Models
Datasets
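As a rough sketch of how the Hub can be combined with the inference API, the snippet below downloads a checkpoint with huggingface_hub and loads it with a predictor; the repository ID and file name are hypothetical placeholders, so check the actual names on our Hugging Face page:

```python
from huggingface_hub import hf_hub_download
from lighthouse.models import CGDETRPredictor

# NOTE: the repo_id and filename below are hypothetical placeholders;
# replace them with the actual repository ID and checkpoint name on the Hub.
ckpt_path = hf_hub_download(
    repo_id='your-org/lighthouse-cg-detr-qvhighlight',
    filename='best.ckpt',
)
model = CGDETRPredictor(ckpt_path, device='cpu', feature_name='clip')
```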
Citation
Lighthouse
```bibtex
@InProceedings{taichi2024emnlp,
  author    = {Taichi Nishimura and Shota Nakada and Hokuto Munakata and Tatsuya Komatsu},
  title     = {Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection},
  booktitle = {Proceedings of The 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  year      = {2024},
}
```
Audio moment retrieval
```bibtex
@InProceedings{hokuto2025icassp,
  author    = {Hokuto Munakata and Taichi Nishimura and Shota Nakada and Tatsuya Komatsu},
  title     = {Language-based Audio Moment Retrieval},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing},
  year      = {2025},
}
```
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
LICENSE
Apache License 2.0
Contact
Taichi Nishimura (taichitary@gmail.com)

