Improvements to the training UX by n-poulsen · Pull Request #2775 · DeepLabCut/DeepLabCut
Many minor improvements to the user experience when training models. Overview of changes (see below for description of each change):
- Patch pycocotools printing during bounding box evaluation (so we don't get annoying prints)
- Same mAP scale for pose and object detection metrics (0 to 100, not 0 to 1)
- Typing fixes
- Logging to learning_stats.csv (log detector and pose model stats to different files)
- Non-zero starting epoch (bug fix)
- Detector training - don't show evaluation loss
- Improved look of printed metrics during training
- Automatically save the model snapshot with the best test error (#2663)
- Add option to select weights to resume training from in the pytorch_config.yaml file
- Fix: running PAF predictors on MPS
- Fix: could not load a saved snapshot due to mismatched state_dict keys (#2749)
Fixes
Patch pycocotools printing during bounding box evaluation
Evaluating object detection performance with pycocotools led to useless/confusing lines being printed:
creating index...
index created!
Loading and preparing results...
Converting ndarray to lists...
(2198, 7)
...
The print functions inside of pycocotools have been patched so these lines are no longer printed.
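The PR patches the print calls inside pycocotools itself. As a simpler illustration of the same goal (not the actual implementation), the noisy calls can be wrapped so their stdout is discarded; `chatty_eval` below is a stand-in for the pycocotools evaluation, not real library code:

```python
import contextlib
import io


def quiet(fn, *args, **kwargs):
    """Call fn while discarding anything it prints to stdout."""
    with contextlib.redirect_stdout(io.StringIO()):
        return fn(*args, **kwargs)


def chatty_eval():
    # Stand-in for a pycocotools call that prints progress messages.
    print("creating index...")
    print("index created!")
    return {"mAP": 0.51}


result = quiet(chatty_eval)  # nothing is printed; the return value is kept
```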
Same mAP scale for pose and object detection metrics
Object detection mAP was reported between 0 and 1 (the pycocotools default), while pose mAP was reported between 0 and 100. Both bounding box mAP and pose mAP are now reported between 0 and 100.
Typing fixes
When calling train_network, both snapshot_path and detector_path can now be given as strings, Path objects, or None.
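The accepted types can be illustrated with a small normalization helper; `normalize_snapshot_path` below is hypothetical and not part of the DeepLabCut API:

```python
from pathlib import Path
from typing import Optional, Union


def normalize_snapshot_path(snapshot_path: Union[str, Path, None]) -> Optional[Path]:
    """Accept a string, a Path, or None and return a Path (or None)."""
    if snapshot_path is None:
        return None
    return Path(snapshot_path)
```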
Logging to learning_stats.csv
Both detector and pose model stats were logged to learning_stats.csv, so one would overwrite the other. Pose model stats are now logged to learning_stats.csv, and detector stats to learning_stats_detector.csv.
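The split can be sketched as a file-name lookup; this helper is hypothetical and the actual selection logic in DeepLabCut may differ:

```python
def log_file_name(model_type: str) -> str:
    """Pick the CSV file a model's training stats are written to (hypothetical sketch)."""
    if model_type == "detector":
        return "learning_stats_detector.csv"
    return "learning_stats.csv"
```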
Non-zero starting epoch
When continuing to train a model, if the epochs argument was larger than the starting epoch (the number of epochs for which the given weights were already trained), the model was only trained for epochs - starting_epoch additional epochs. So in the example below, the 2nd call to train_network would only train for 5 extra epochs. This was fixed so the model is always trained for the number of epochs passed as an argument (in the example below, the model is now trained for 10 extra epochs, and the last snapshots output are snapshot-015.pt and snapshot-detector-015.pt).
```python
import deeplabcut
from deeplabcut.pose_estimation_pytorch import DLCLoader

deeplabcut.train_network("dlc-project/config.yaml", shuffle=0, epochs=5, detector_epochs=5)

loader = DLCLoader("dlc-project/config.yaml", shuffle=0)
deeplabcut.train_network(
    config="dlc-project/config.yaml",
    shuffle=0,
    epochs=10,
    detector_epochs=10,
    snapshot_path=loader.model_folder / "snapshot-005.pt",
    detector_path=loader.model_folder / "snapshot-detector-005.pt",
)
```
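The epoch accounting behind the fix can be sketched as a small helper (hypothetical, not DeepLabCut code): the old behavior trained until the total epoch count reached `epochs`, while the new behavior always trains for `epochs` additional epochs on top of the starting epoch.

```python
def epochs_to_run(starting_epoch: int, epochs: int, fixed: bool = True) -> int:
    """Number of additional epochs that actually get trained."""
    if fixed:
        # New behavior: always train for `epochs` additional epochs.
        return epochs
    # Old behavior: train until the total epoch count reaches `epochs`.
    return max(epochs - starting_epoch, 0)
```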
Detector training - evaluation loss
Torchvision object detection models cannot return both losses and predictions: it's one or the other. When evaluating during training, the predictions are used to compute mAP/mAR metrics, so no validation loss is available. To avoid any confusion, the validation loss (which would be NaN) is no longer printed.
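The new behavior can be illustrated by a formatter that drops a NaN loss from the epoch summary line; this is a hypothetical sketch, not the actual logging code:

```python
import math


def format_losses(train_loss: float, valid_loss: float) -> str:
    """Format the per-epoch loss summary, omitting a validation loss that is NaN."""
    parts = [f"train loss {train_loss:.5f}"]
    if not math.isnan(valid_loss):
        parts.append(f"valid loss {valid_loss:.5f}")
    return ", ".join(parts)
```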
Printing metrics during training
Some visual improvements were made when logging metrics to the console during training.
When training detectors:
# Previous
...
Training for epoch 20 done, starting evaluation
creating index...
index created!
...
Accumulating evaluation results...
DONE (t=0.00s).
Epoch 20 performance:
metrics/test.mAP@50:95:0.134
metrics/test.mAP@50:0.514
metrics/test.mAP@75:0.042
metrics/test.mAR@50:95:0.300
metrics/test.mAR@50:0.833
metrics/test.mAR@75:0.167
Epoch 20/200 (lr=0.0001), train loss 4.69560
...
# ** New **
...
Epoch 20/200 (lr=0.0001), train loss 4.69560
Model performance:
metrics/test.mAP@50:95: 13.37
metrics/test.mAP@50: 51.37
metrics/test.mAP@75: 4.21
metrics/test.mAR@50:95: 30.00
metrics/test.mAR@50: 83.33
metrics/test.mAR@75: 16.67
...
When training pose estimation models:
# Previous
Epoch 193 performance:
metrics/test.rmse: 5.50
metrics/test.rmse_pcutoff:5.27
metrics/test.mAP: 100.000
metrics/test.mAR: 100.000
metrics/test.rmse_detections:23.134
metrics/test.rmse_detections_pcutoff:14.039
Epoch 193/200 (lr=0.0001), train loss 0.00039, valid loss 0.00027
# ** New **
Epoch 193/200 (lr=0.0001), train loss 0.00039, valid loss 0.00027
Model performance:
metrics/test.rmse: 5.50
metrics/test.rmse_pcutoff: 5.27
metrics/test.mAP: 100.00
metrics/test.mAR: 100.00
metrics/test.rmse_detections: 23.13
metrics/test.rmse_detections_pcutoff: 14.04
Saving the best snapshot
Addresses #2663 to save the best snapshot during training. The best snapshot will be saved as snapshot-best-XYZ.pt, where XYZ is the number of epochs for which it was trained.
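A minimal sketch of best-snapshot tracking, assuming a higher score is better and three-digit epoch padding (both assumptions; this is not the actual DeepLabCut implementation):

```python
class BestSnapshotTracker:
    """Track the best validation score seen so far during training (sketch)."""

    def __init__(self):
        self.best_score = float("-inf")
        self.best_name = None

    def update(self, epoch: int, score: float) -> bool:
        """Return True if this epoch's score is a new best; record the file name."""
        if score > self.best_score:
            self.best_score = score
            # Assumed naming scheme, mirroring snapshot-best-XYZ.pt from the PR.
            self.best_name = f"snapshot-best-{epoch:03}.pt"
            return True
        return False
```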
Resuming Training from a Given Snapshot
Adds an option to the pytorch_config.yaml to resume training from an existing snapshot, with:
```yaml
...
detector:
  # weights from which to resume training the detector
  resume_training_from: /Users/john/.../train/snapshot-detector-250.pt
...
# weights from which to resume training the pose model
resume_training_from: /Users/john/.../train/snapshot-100.pt
```
Fix PAF predictor running on MPS
The PAF predictor would fail when running on MPS (macOS GPU), as torch.round(...) is not yet implemented for that backend. A workaround was to run scripts with PYTORCH_ENABLE_MPS_FALLBACK=1 set as an environment variable. The operation now runs with numpy, so this workaround is no longer needed.
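The idea can be sketched with NumPy, assuming the tensor's values are converted to a NumPy array first (the real code operates on torch tensors, which would be moved over via something like tensor.cpu().numpy()):

```python
import numpy as np


def round_on_any_device(values):
    """Round values with NumPy on the CPU, avoiding the missing MPS rounding kernel."""
    # Sketch of the approach only: in practice a torch tensor would be
    # converted to a NumPy array before this call and back afterwards.
    return np.round(np.asarray(values))
```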
Multi-GPU training: fix saving the state dict
Addresses issue #2749.
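A common cause of mismatched state_dict keys after multi-GPU training is the "module." prefix that torch.nn.DataParallel / DistributedDataParallel wrappers add to parameter names. A generic sketch of stripping that prefix (the PR's actual fix may differ):

```python
def strip_ddp_prefix(state_dict: dict) -> dict:
    """Remove the 'module.' prefix that DataParallel/DDP wrappers add to keys."""
    prefix = "module."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }
```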