We're hiring! If you like what we're building here, come join us at LMNT.
DiffWave is a fast, high-quality neural vocoder and waveform synthesizer. It starts with Gaussian noise and converts it into speech via iterative refinement. The speech can be controlled by providing a conditioning signal (e.g. log-scaled Mel spectrogram). The model and architecture details are described in DiffWave: A Versatile Diffusion Model for Audio Synthesis.
What's new (2021-11-09)
- unconditional waveform synthesis (thanks to Andrechang!)
What's new (2021-04-01)
- fast sampling algorithm based on v3 of the DiffWave paper
What's new (2020-10-14)
- new pretrained model trained for 1M steps
- updated audio samples with output from new model
Status (2021-11-09)
- fast inference procedure
- stable training
- high-quality synthesis
- mixed-precision training
- multi-GPU training
- command-line inference
- programmatic inference API
- PyPI package
- audio samples
- pretrained models
- unconditional waveform synthesis
Big thanks to Zhifeng Kong (lead author of DiffWave) for pointers and bug fixes.
Audio samples
Pretrained models
22.05 kHz pretrained model (31 MB, SHA256: d415d2117bb0bba3999afabdd67ed11d9e43400af26193a451d112e2560821a8)
This pre-trained model is able to synthesize speech with a real-time factor of 0.87 (smaller is faster).
Pre-trained model details
- trained on 4x 1080Ti
- default parameters
- single precision floating point (FP32)
- trained on LJSpeech dataset excluding LJ001* and LJ002*
- trained for 1000578 steps (1273 epochs)
Install
Install using pip:
pip install diffwave
or from GitHub:
git clone https://github.com/lmnt-com/diffwave.git
cd diffwave
pip install .
Training
Before you start training, you'll need to prepare a training dataset. The dataset can have any directory structure as long as the contained .wav files are 16-bit mono (e.g. LJSpeech, VCTK). By default, this implementation assumes a sample rate of 22.05 kHz. If you need to change this value, edit params.py.
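As a quick sanity check before preprocessing, you can verify that each file matches the expected format. This is a hypothetical helper (not part of diffwave) using only the standard library:

```python
import wave

def is_valid_training_wav(path, expected_rate=22050):
    """Return True if `path` is a 16-bit mono WAV at the expected sample rate."""
    with wave.open(path, 'rb') as f:
        return (f.getnchannels() == 1        # mono
                and f.getsampwidth() == 2    # 2 bytes per sample = 16-bit
                and f.getframerate() == expected_rate)
```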
python -m diffwave.preprocess /path/to/dir/containing/wavs
python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs
# in another shell to monitor training progress:
tensorboard --logdir /path/to/model/dir --bind_all
Optional global label conditioning
You can optionally provide a per-example global label vector from a .npy file.
Each label file is loaded alongside its .wav sample and projected into every residual block.
Enable it during training:
python -m diffwave /path/to/model/dir /path/to/wavs \
--global_conditioning \
--global_conditioning_suffix .label.npy
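For illustration, a label file can be written next to its clip like this. This is a minimal sketch: the helper name and the 16-dimensional vector in the usage note are made up, but the suffix matches `--global_conditioning_suffix` above.

```python
import numpy as np

def write_label(wav_path, vector, suffix='.label.npy'):
    """Save a per-clip global label vector next to its .wav file."""
    out_path = wav_path + suffix  # e.g. audio.wav -> audio.wav.label.npy
    np.save(out_path, np.asarray(vector, dtype=np.float32))
    return out_path
```

For example, `write_label('audio.wav', [0.1] * 16)` produces `audio.wav.label.npy` holding a float32 vector of dimension 16.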
Global-conditioning-only training (no spectrogram conditioning):
python -m diffwave /path/to/model/dir /path/to/wavs \
--global_conditioning_only \
--global_conditioning_suffix .label.npy
Equivalent explicit flags:
python -m diffwave /path/to/model/dir /path/to/wavs \
--unconditional \
--global_conditioning
File matching rules for each audio.wav:
- Default (same directory as audio): audio.wav.label.npy or audio.label.npy
- With --global_conditioning_dir /path/to/labels: relative-path and basename fallbacks are both supported.
Label dimension:
- inferred automatically from the first matched .npy file
- or set explicitly with --global_condition_dim <int>
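The default matching rules above can be sketched as follows (a simplified illustration, not the actual loader code):

```python
import os

def find_label(wav_path, suffix='.label.npy'):
    """Resolve the label file for audio.wav:
    try audio.wav.label.npy first, then audio.label.npy."""
    candidates = [wav_path + suffix,                          # audio.wav.label.npy
                  os.path.splitext(wav_path)[0] + suffix]     # audio.label.npy
    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate
    return None
```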
You should expect to hear intelligible (but noisy) speech by ~8k steps (~1.5h on a 2080 Ti).
Optional Dev-set evaluation during training
You can optionally specify a dev set for evaluation during training. Evaluation will run every eval_interval_steps training steps (default: once per train epoch) and will compute the average loss across dev_max_eval_batches batches from the dev set (default: all batches).
python -m diffwave /path/to/model/dir /path/to/train/wavs \
--dev_data_dirs /path/to/dev/wavs \
--eval_interval_steps 1000 \
--dev_max_eval_batches 100
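The batch cap behaves like the following sketch (a hypothetical helper; `loss_iter` stands in for the stream of per-batch dev losses):

```python
def average_dev_loss(loss_iter, max_batches=None):
    """Average loss over at most `max_batches` dev batches (None = all batches)."""
    total, count = 0.0, 0
    for i, loss in enumerate(loss_iter):
        if max_batches is not None and i >= max_batches:
            break  # stop once the batch budget is exhausted
        total += loss
        count += 1
    return total / max(count, 1)
```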
Multi-GPU training
By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count(). You can restrict which GPUs are used by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
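For example, using the standard CUDA_VISIBLE_DEVICES environment variable to train on the first two GPUs only:

```shell
# Restrict training to GPUs 0 and 1:
CUDA_VISIBLE_DEVICES=0,1 python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs
```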
Inference API
Basic usage:
from diffwave.inference import predict as diffwave_predict

model_dir = '/path/to/model/dir'
spectrogram = ...  # get your hands on a spectrogram in [N,C,W] format
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)

# audio is a GPU tensor in [N,T] format.
Inference CLI
python -m diffwave.inference --fast /path/to/model /path/to/spectrogram -o output.wav
Inference from Global Feature Vectors or Arbitrary Points
For global-conditioning models, you can run inference either from a precomputed 1-D feature vector or by providing arbitrary source/receiver points.
From a feature .npy file:
python -m diffwave.infer_points /path/to/model_or_weights.pt \
--feature_npy /path/to/pair_feature.npy \
--output generated.wav \
--fast
From arbitrary source/receiver points (feature vector built automatically):
python -m diffwave.infer_points /path/to/model_or_weights.pt \
--source_point 0.10 1.25 1.60 \
--receiver_point 3.40 2.10 1.45 \
--metadata_path /path/to/exported_data/all_metadata.json \
--mesh_path /path/to/empty_room1.stl \
--save_feature_npy /path/to/new_pair_feature.npy \
--output generated.wav \
--fast
If your model also expects local spectrogram conditioning, pass --spectrogram_path /path/to/input.spec.npy.
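A feature vector saved with --save_feature_npy can be inspected with numpy. The sketch below is purely illustrative (the actual feature construction uses the room metadata and mesh and is not reproduced here); it only shows the general shape of a source/receiver point-pair feature:

```python
import numpy as np

def toy_pair_feature(source_xyz, receiver_xyz):
    """Purely illustrative: concatenate source and receiver coordinates into a
    single 1-D float32 vector. The real pipeline derives its features from the
    exported metadata and the room mesh, which this sketch does not model."""
    return np.concatenate([np.asarray(source_xyz, dtype=np.float32),
                           np.asarray(receiver_xyz, dtype=np.float32)])
```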
Batch Inference from Label Directory + Metric Report
If you have a directory of global feature label .npy files, you can run inference for all labels and compare generated WAVs against GT WAVs in one step.
Use the helper script:
bash scripts/run_infer_and_compare.sh \
  /path/to/model_or_weights.pt \
  /path/to/labels_dir \
  /path/to/gt_wavs_dir \
  /path/to/generated_wavs_dir
Defaults:
- inference uses --fast
- labels are scanned recursively (**/*.npy)
- generated WAVs keep the same relative directory and basename as labels
- metric pairing is by relative path (--match-mode relative)
Optional flags:
bash scripts/run_infer_and_compare.sh \
  /path/to/model_or_weights.pt \
  /path/to/labels_dir \
  /path/to/gt_wavs_dir \
  /path/to/generated_wavs_dir \
  --no-fast \
  --match-mode basename \
  --resample-to 22050 \
  --csv-out /path/to/report.csv
You can also run the comparison directly:
python -m diffwave.compare_wavs /path/to/generated_wavs /path/to/gt_wavs --recursive
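The relative-path pairing used by --match-mode relative can be approximated by this sketch (a hypothetical helper, not the actual implementation):

```python
import os

def pair_by_relative_path(gen_dir, gt_dir):
    """Pair generated and ground-truth WAVs that share the same relative path."""
    pairs = []
    for root, _, files in os.walk(gen_dir):
        for name in files:
            if not name.endswith('.wav'):
                continue
            rel = os.path.relpath(os.path.join(root, name), gen_dir)
            gt_path = os.path.join(gt_dir, rel)
            if os.path.exists(gt_path):  # skip files with no GT counterpart
                pairs.append((os.path.join(gen_dir, rel), gt_path))
    return pairs
```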
References
- DiffWave: A Versatile Diffusion Model for Audio Synthesis
- Denoising Diffusion Probabilistic Models
- Code for Denoising Diffusion Probabilistic Models