We're hiring! If you like what we're building here, come join us at LMNT.
DiffWave is a fast, high-quality neural vocoder and waveform synthesizer. It starts with Gaussian noise and converts it into speech via iterative refinement. The speech can be controlled by providing a conditioning signal (e.g. log-scaled Mel spectrogram). The model and architecture details are described in DiffWave: A Versatile Diffusion Model for Audio Synthesis.
What's new (2021-11-09)
- unconditional waveform synthesis (thanks to Andrechang!)
What's new (2021-04-01)
- fast sampling algorithm based on v3 of the DiffWave paper
What's new (2020-10-14)
- new pretrained model trained for 1M steps
- updated audio samples with output from new model
Status (2021-11-09)
- fast inference procedure
- stable training
- high-quality synthesis
- mixed-precision training
- multi-GPU training
- command-line inference
- programmatic inference API
- PyPI package
- audio samples
- pretrained models
- unconditional waveform synthesis
Big thanks to Zhifeng Kong (lead author of DiffWave) for pointers and bug fixes.
Audio samples
Pretrained models
22.05 kHz pretrained model (31 MB, SHA256: d415d2117bb0bba3999afabdd67ed11d9e43400af26193a451d112e2560821a8)
This pre-trained model is able to synthesize speech with a real-time factor of 0.87 (smaller is faster).
Pre-trained model details
- trained on 4x 1080Ti
- default parameters
- single precision floating point (FP32)
- trained on LJSpeech dataset excluding LJ001* and LJ002*
- trained for 1000578 steps (1273 epochs)
Install
Install using pip:
pip install diffwave
or from GitHub:
git clone https://github.com/lmnt-com/diffwave.git
cd diffwave
pip install .
Training
Before you start training, you'll need to prepare a training dataset. The dataset can have any directory structure as long as the contained .wav files are 16-bit mono (e.g. LJSpeech, VCTK). By default, this implementation assumes a sample rate of 22.05 kHz. If you need to change this value, edit params.py.
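As a quick sanity check before preprocessing, you can verify that each file matches the expected format. This is a hypothetical helper (not part of diffwave) using only the standard library:

```python
import wave

def is_valid_training_wav(path, expected_rate=22050):
    """Return True if `path` is a 16-bit mono WAV at the expected sample rate."""
    with wave.open(path, 'rb') as f:
        return (f.getnchannels() == 1        # mono
                and f.getsampwidth() == 2    # 2 bytes per sample = 16-bit
                and f.getframerate() == expected_rate)
```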
python -m diffwave.preprocess /path/to/dir/containing/wavs
python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs
# in another shell to monitor training progress:
tensorboard --logdir /path/to/model/dir --bind_all
Optional global label conditioning
You can optionally provide a per-example global label vector from a .npy file.
Each label file is loaded alongside its .wav sample and projected into every residual block.
Enable it during training:
python -m diffwave /path/to/model/dir /path/to/wavs \
--global_conditioning \
--global_conditioning_suffix .label.npy
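For illustration, a label file can be written next to its clip like this. This is a minimal sketch: the helper name and the 16-dimensional vector in the usage note are made up, but the suffix matches `--global_conditioning_suffix` above.

```python
import numpy as np

def write_label(wav_path, vector, suffix='.label.npy'):
    """Save a per-clip global label vector next to its .wav file."""
    out_path = wav_path + suffix  # e.g. audio.wav -> audio.wav.label.npy
    np.save(out_path, np.asarray(vector, dtype=np.float32))
    return out_path
```

For example, `write_label('audio.wav', [0.1] * 16)` produces `audio.wav.label.npy` holding a float32 vector of dimension 16.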
Global-conditioning-only training (no spectrogram conditioning):
python -m diffwave /path/to/model/dir /path/to/wavs \
--global_conditioning_only \
--global_conditioning_suffix .label.npy
Equivalent explicit flags:
python -m diffwave /path/to/model/dir /path/to/wavs \
--unconditional \
--global_conditioning
File matching rules for each audio.wav:
- Default (same directory as audio): audio.wav.label.npy or audio.label.npy
- With --global_conditioning_dir /path/to/labels: relative-path and basename fallbacks are both supported.
Label dimension:
- inferred automatically from the first matched .npy file
- or set explicitly with --global_condition_dim <int>
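The default matching rules above can be sketched as follows (a simplified illustration, not the actual loader code):

```python
import os

def find_label(wav_path, suffix='.label.npy'):
    """Resolve the label file for audio.wav:
    try audio.wav.label.npy first, then audio.label.npy."""
    candidates = [wav_path + suffix,                          # audio.wav.label.npy
                  os.path.splitext(wav_path)[0] + suffix]     # audio.label.npy
    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate
    return None
```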
You should expect to hear intelligible (but noisy) speech by ~8k steps (~1.5h on a 2080 Ti).
Optional Dev-set evaluation during training
You can optionally specify a dev set for evaluation during training. Evaluation will run every eval_interval_steps training steps (default: once per train epoch) and will compute the average loss across dev_max_eval_batches batches from the dev set (default: all batches).
python -m diffwave /path/to/model/dir /path/to/train/wavs \
--dev_data_dirs /path/to/dev/wavs \
--eval_interval_steps 1000 \
--dev_max_eval_batches 100
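The batch cap behaves like the following sketch (a hypothetical helper; `loss_iter` stands in for the stream of per-batch dev losses):

```python
def average_dev_loss(loss_iter, max_batches=None):
    """Average loss over at most `max_batches` dev batches (None = all batches)."""
    total, count = 0.0, 0
    for i, loss in enumerate(loss_iter):
        if max_batches is not None and i >= max_batches:
            break  # stop once the batch budget is exhausted
        total += loss
        count += 1
    return total / max(count, 1)
```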
Multi-GPU training
By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count(). You can restrict which GPUs are used by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
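For example, using the standard CUDA_VISIBLE_DEVICES environment variable to train on the first two GPUs only:

```shell
# Restrict training to GPUs 0 and 1:
CUDA_VISIBLE_DEVICES=0,1 python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs
```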
Inference API
Basic usage:
from diffwave.inference import predict as diffwave_predict

model_dir = '/path/to/model/dir'
spectrogram = ...  # get your hands on a spectrogram in [N,C,W] format
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)

# audio is a GPU tensor in [N,T] format.
Inference CLI
python -m diffwave.inference --fast /path/to/model /path/to/spectrogram -o output.wav
Inference from Global Feature Vectors or Arbitrary Points
For global-conditioning models, you can run inference either from a precomputed 1-D feature vector or by providing arbitrary source/receiver points.
From a feature .npy file:
python -m diffwave.infer_points /path/to/model_or_weights.pt \
--feature_npy /path/to/pair_feature.npy \
--output generated.wav \
--fast
From arbitrary source/receiver points (feature vector built automatically):
python -m diffwave.infer_points /path/to/model_or_weights.pt \
--source_point 0.10 1.25 1.60 \
--receiver_point 3.40 2.10 1.45 \
--metadata_path /path/to/exported_data/all_metadata.json \
--mesh_path /path/to/empty_room1.stl \
--save_feature_npy /path/to/new_pair_feature.npy \
--output generated.wav \
--fast
If your model also expects local spectrogram conditioning, pass --spectrogram_path /path/to/input.spec.npy.
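A feature vector saved with --save_feature_npy can be inspected with numpy. The sketch below is purely illustrative (the actual feature construction uses the room metadata and mesh and is not reproduced here); it only shows the general shape of a source/receiver point-pair feature:

```python
import numpy as np

def toy_pair_feature(source_xyz, receiver_xyz):
    """Purely illustrative: concatenate source and receiver coordinates into a
    single 1-D float32 vector. The real pipeline derives its features from the
    exported metadata and the room mesh, which this sketch does not model."""
    return np.concatenate([np.asarray(source_xyz, dtype=np.float32),
                           np.asarray(receiver_xyz, dtype=np.float32)])
```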
Batch Inference from Label Directory + Metric Report
If you have a directory of global feature label .npy files, you can run inference for all labels and compare generated WAVs against GT WAVs in one step.
Use the helper script:
bash scripts/run_infer_and_compare.sh \
  /path/to/model_or_weights.pt \
  /path/to/labels_dir \
  /path/to/gt_wavs_dir \
  /path/to/generated_wavs_dir
Defaults:
- inference uses --fast
- labels are scanned recursively (**/*.npy)
- generated WAVs keep the same relative directory and basename as labels
- metric pairing is by relative path (--match-mode relative)
Optional flags:
bash scripts/run_infer_and_compare.sh \
  /path/to/model_or_weights.pt \
  /path/to/labels_dir \
  /path/to/gt_wavs_dir \
  /path/to/generated_wavs_dir \
  --no-fast \
  --match-mode basename \
  --resample-to 22050 \
  --csv-out /path/to/report.csv
You can also run the comparison directly:
python -m diffwave.compare_wavs /path/to/generated_wavs /path/to/gt_wavs --recursive
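The relative-path pairing used by --match-mode relative can be approximated by this sketch (a hypothetical helper, not the actual implementation):

```python
import os

def pair_by_relative_path(gen_dir, gt_dir):
    """Pair generated and ground-truth WAVs that share the same relative path."""
    pairs = []
    for root, _, files in os.walk(gen_dir):
        for name in files:
            if not name.endswith('.wav'):
                continue
            rel = os.path.relpath(os.path.join(root, name), gen_dir)
            gt_path = os.path.join(gt_dir, rel)
            if os.path.exists(gt_path):  # skip files with no GT counterpart
                pairs.append((os.path.join(gen_dir, rel), gt_path))
    return pairs
```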
References
- DiffWave: A Versatile Diffusion Model for Audio Synthesis
- Denoising Diffusion Probabilistic Models
- Code for Denoising Diffusion Probabilistic Models