Evaluation error with PAF heads: `ValueError: matrix contains invalid numeric entries`

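Hello,
while using DeepLabCut with the PyTorch engine, I ran into a problem with the model dlcrnet_stride16_ms5. Training itself runs for the first 24 epochs, but the first evaluation (triggered at epoch 25, matching eval_interval: 25) fails with `ValueError: matrix contains invalid numeric entries`. The error is raised by `linear_sum_assignment` in `rmse_match_prediction_to_gt`, right after a `RuntimeWarning: Mean of empty slice` from `np.nanmean`. Reducing the learning rate or changing the training batch size does not resolve it.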
Training with configuration:
data:
  colormode: RGB
  inference:
    normalize_images: True
  train:
    affine:
      p: 0.5
      rotation: 30
      scaling: [1.0, 1.0]
      translation: 0
    collate:
      type: ResizeFromDataSizeCollate
      min_scale: 0.4
      max_scale: 1.0
      min_short_side: 128
      max_short_side: 1152
      multiple_of: 32
      to_square: False
    covering: False
    gaussian_noise: 12.75
    hist_eq: False
    motion_blur: False
    normalize_images: True
device: auto
metadata:
  project_path: /mnt/Data16Tb/Data/boyang/pose/MMVISV3-BRIAN-2024-06-19
  pose_config_path: /mnt/Data16Tb/Data/boyang/pose/MMVISV3-BRIAN-2024-06-19/dlc-models-pytorch/iteration-3/MMVISV3Jun19-trainset70shuffle0/train/pose_cfg.yaml
  bodyparts: ['Front', 'Right', 'Middle', 'Left', 'FL1', 'BL1', 'FR1', 'BR1', 'BL2', 'BR2', 'FL2', 'FR2', 'Body1', 'Body2', 'Body3']
  unique_bodyparts: []
  individuals: ['MARMOSET_1', 'MARMOSET_2']
  with_identity: True
method: bu
model:
  backbone:
    type: DLCRNet
    model_name: resnet50
    pretrained: True
    output_stride: 16
  backbone_output_channels: 2304
  pose_model:
    stride: 8
  heads:
    bodypart:
      type: DLCRNetHead
      predictor:
        type: PartAffinityFieldPredictor
        num_animals: 2
        num_multibodyparts: 15
        num_uniquebodyparts: 0
        nms_radius: 5
        sigma: 1.0
        locref_stdev: 7.2801
        min_affinity: 0.05
        graph: [[0, 1], [0, 3], [2, 3], [1, 2], [0, 2], [2, 12], [12, 13], [13, 14], [7, 14], [7, 9], [5, 8], [5, 14], [6, 12], [6, 11], [4, 12], [4, 10]]
        edges_to_keep: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
      target_generator:
        type: SequentialGenerator
        generators: [{'type': 'HeatmapPlateauGenerator', 'num_heatmaps': 15, 'pos_dist_thresh': 17, 'heatmap_mode': 'KEYPOINT', 'generate_locref': True, 'locref_std': 7.2801}, {'type': 'PartAffinityFieldGenerator', 'graph': [[0, 1], [0, 3], [2, 3], [1, 2], [0, 2], [2, 12], [12, 13], [13, 14], [7, 14], [7, 9], [5, 8], [5, 14], [6, 12], [6, 11], [4, 12], [4, 10]], 'width': 20}]
      criterion:
        heatmap:
          type: WeightedBCECriterion
          weight: 1.0
        locref:
          type: WeightedHuberCriterion
          weight: 0.05
        paf:
          type: WeightedHuberCriterion
          weight: 0.1
      heatmap_config:
        channels: [2304, 15]
        kernel_size: [3]
        strides: [2]
      locref_config:
        channels: [2304, 30]
        kernel_size: [3]
        strides: [2]
      paf_config:
        channels: [2304, 32]
        kernel_size: [3]
        strides: [2]
      num_stages: 5
    identity:
      type: HeatmapHead
      predictor:
        type: HeatmapPredictor
        location_refinement: False
      target_generator:
        type: HeatmapPlateauGenerator
        num_heatmaps: 2
        pos_dist_thresh: 17
        heatmap_mode: INDIVIDUAL
        generate_locref: False
      criterion:
        heatmap:
          type: WeightedBCECriterion
          weight: 1.0
      heatmap_config:
        channels: [2304, 2]
        kernel_size: [3]
        strides: [2]
net_type: dlcrnet_stride16_ms5
runner:
  type: PoseTrainingRunner
  gpus: None
  key_metric: test.mAP
  key_metric_asc: True
  eval_interval: 25
  optimizer:
    type: AdamW
    params:
      lr: 0.0001
  scheduler:
    type: LRListScheduler
    params:
      lr_list: [[1e-05], [1e-06]]
      milestones: [90, 120]
  snapshots:
    max_snapshots: 5
    save_epochs: 50
    save_optimizer_state: False
train_settings:
  batch_size: 16
  dataloader_workers: 0
  dataloader_pin_memory: True
  display_iters: 1000
  epochs: 200
  seed: 42
Loading pretrained weights from Hugging Face hub (timm/resnet50.a1_in1k)
[timm/resnet50.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
Data Transforms:
  Training:   Compose([
  Affine(always_apply=False, p=0.5, interpolation=1, mask_interpolation=0, cval=0, mode=0, scale={'x': (1.0, 1.0), 'y': (1.0, 1.0)}, translate_percent=None, translate_px={'x': (0, 0), 'y': (0, 0)}, rotate=(-30, 30), fit_output=False, shear={'x': (0.0, 0.0), 'y': (0.0, 0.0)}, cval_mask=0, keep_ratio=True, rotate_method='largest_box'),
  GaussNoise(always_apply=False, p=0.5, var_limit=(0, 162.5625), per_channel=True, mean=0),
  Normalize(always_apply=False, p=1.0, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], max_pixel_value=255.0),
], p=1.0, bbox_params={'format': 'coco', 'label_fields': ['bbox_labels'], 'min_area': 0.0, 'min_visibility': 0.0, 'min_width': 0.0, 'min_height': 0.0, 'check_each_transform': True}, keypoint_params={'format': 'xy', 'label_fields': ['class_labels'], 'remove_invisible': False, 'angle_in_degrees': True, 'check_each_transform': True}, additional_targets={}, is_check_shapes=True)
  Validation: Compose([
  Normalize(always_apply=False, p=1.0, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], max_pixel_value=255.0),
], p=1.0, bbox_params={'format': 'coco', 'label_fields': ['bbox_labels'], 'min_area': 0.0, 'min_visibility': 0.0, 'min_width': 0.0, 'min_height': 0.0, 'check_each_transform': True}, keypoint_params={'format': 'xy', 'label_fields': ['class_labels'], 'remove_invisible': False, 'angle_in_degrees': True, 'check_each_transform': True}, additional_targets={}, is_check_shapes=True)
Using custom collate function: {'type': 'ResizeFromDataSizeCollate', 'min_scale': 0.4, 'max_scale': 1.0, 'min_short_side': 128, 'max_short_side': 1152, 'multiple_of': 32, 'to_square': False}
Using 102 images and 44 for testing

Starting pose model training...
--------------------------------------------------
/usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/torch/nn/modules/conv.py:456: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
  return F.conv2d(input, weight, bias, self.stride,
/usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Epoch 1/200 (lr=0.0001), train loss 0.35200
Epoch 2/200 (lr=0.0001), train loss 0.16082
Epoch 3/200 (lr=0.0001), train loss 0.13178
Epoch 4/200 (lr=0.0001), train loss 0.08032
Epoch 5/200 (lr=0.0001), train loss 0.04940
Epoch 6/200 (lr=0.0001), train loss 0.04160
Epoch 7/200 (lr=0.0001), train loss 0.04610
Epoch 8/200 (lr=0.0001), train loss 0.05107
Epoch 9/200 (lr=0.0001), train loss 0.03826
Epoch 10/200 (lr=0.0001), train loss 0.03448
Epoch 11/200 (lr=0.0001), train loss 0.02770
Epoch 12/200 (lr=0.0001), train loss 0.02214
Epoch 13/200 (lr=0.0001), train loss 0.02729
Epoch 14/200 (lr=0.0001), train loss 0.03104
Epoch 15/200 (lr=0.0001), train loss 0.02087
Epoch 16/200 (lr=0.0001), train loss 0.03396
Epoch 17/200 (lr=0.0001), train loss 0.02404
Epoch 18/200 (lr=0.0001), train loss 0.02167
Epoch 19/200 (lr=0.0001), train loss 0.02466
Epoch 20/200 (lr=0.0001), train loss 0.02104
Epoch 21/200 (lr=0.0001), train loss 0.02200
Epoch 22/200 (lr=0.0001), train loss 0.02281
Epoch 23/200 (lr=0.0001), train loss 0.02097
Epoch 24/200 (lr=0.0001), train loss 0.01914
Training for epoch 25 done, starting evaluation
/usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/post_processing/match_predictions_to_gt.py:74: RuntimeWarning: Mean of empty slice
  distance_matrix[i, j] = np.nanmean(d)
{
	"name": "ValueError",
	"message": "matrix contains invalid numeric entries",
	"stack": "---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 deeplabcut.train_network(config_path, shuffle=0)

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/compat.py:245, in train_network(config, shuffle, trainingsetindex, max_snapshots_to_keep, displayiters, saveiters, maxiters, allow_growth, gputouse, autotune, keepdeconvweights, modelprefix, superanimal_name, superanimal_transfer_learning, engine, **torch_kwargs)
    242     if \"display_iters\" not in torch_kwargs:
    243         torch_kwargs[\"display_iters\"] = displayiters
--> 245     return train_network(
    246         config,
    247         shuffle=shuffle,
    248         trainingsetindex=trainingsetindex,
    249         modelprefix=modelprefix,
    250         max_snapshots_to_keep=max_snapshots_to_keep,
    251         **torch_kwargs,
    252     )
    254 raise NotImplementedError(f\"This function is not implemented for {engine}\")

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/apis/train.py:336, in train_network(config, shuffle, trainingsetindex, modelprefix, device, snapshot_path, detector_path, batch_size, epochs, save_epochs, detector_batch_size, detector_epochs, detector_save_epochs, display_iters, max_snapshots_to_keep, pose_threshold, **kwargs)
    323     detector_run_config[\"train_settings\"][\"weight_init\"] = loader.model_cfg[
    324         \"train_settings\"
    325     ].get(\"weight_init\")
    326     train(
    327         loader=loader,
    328         run_config=detector_run_config,
   (...)
    333         max_snapshots_to_keep=max_snapshots_to_keep,
    334     )
--> 336 train(
    337     loader=loader,
    338     run_config=loader.model_cfg,
    339     task=pose_task,
    340     device=device,
    341     logger_config=loader.model_cfg.get(\"logger\"),
    342     snapshot_path=snapshot_path,
    343     max_snapshots_to_keep=max_snapshots_to_keep,
    344 )
    346 destroy_file_logging()

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/apis/train.py:189, in train(loader, run_config, task, device, gpus, logger_config, snapshot_path, transform, inference_transform, max_snapshots_to_keep)
    186 else:
    187     logging.info(\"\
Starting pose model training...\
\" + (50 * \"-\"))
--> 189 runner.fit(
    190     train_dataloader,
    191     valid_dataloader,
    192     epochs=run_config[\"train_settings\"][\"epochs\"],
    193     display_iters=run_config[\"train_settings\"][\"display_iters\"],
    194 )

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:181, in TrainingRunner.fit(self, train_loader, valid_loader, epochs, display_iters)
    179 with torch.no_grad():
    180     logging.info(f\"Training for epoch {e} done, starting evaluation\")
--> 181     valid_loss = self._epoch(
    182         valid_loader, mode=\"eval\", display_iters=display_iters
    183     )
    184     if self._print_valid_loss:
    185         msg += f\", valid loss {float(valid_loss):.5f}\"

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:236, in TrainingRunner._epoch(self, loader, mode, display_iters)
    234 perf_metrics = None
    235 if mode == \"eval\":
--> 236     perf_metrics = self._compute_epoch_metrics()
    237     self._metadata[\"metrics\"] = perf_metrics
    238     self._epoch_predictions = {}

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:365, in PoseTrainingRunner._compute_epoch_metrics(self)
    358 \"\"\"Computes the metrics using the data accumulated during an epoch
    359 Returns:
    360     A dictionary containing the different losses for the step
    361 \"\"\"
    362 num_animals = max(
    363     [len(kpts) for kpts in self._epoch_ground_truth[\"bodyparts\"].values()]
    364 )
--> 365 poses = pair_predicted_individuals_with_gt(
    366     self._epoch_predictions[\"bodyparts\"], self._epoch_ground_truth[\"bodyparts\"]
    367 )
    369 # pad predictions if there are any missing (needed for top-down models)
    370 gt, pred = {}, {}

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/metrics/scoring.py:391, in pair_predicted_individuals_with_gt(predictions, ground_truth)
    389 matched_poses = {}
    390 for image, pose in predictions.items():
--> 391     match_individuals = rmse_match_prediction_to_gt(pose, ground_truth[image])
    392     matched_poses[image] = pose[match_individuals]
    394 return matched_poses

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/post_processing/match_predictions_to_gt.py:76, in rmse_match_prediction_to_gt(pred_kpts, gt_kpts)
     73         d = (gt_idv[mask, :2] - pred_idv[mask, :2]) ** 2
     74         distance_matrix[i, j] = np.nanmean(d)
---> 76 _, col_ind = linear_sum_assignment(distance_matrix)  # len == len(valid_gt_indices)
     78 gt_idx_to_pred_idx = {
     79     valid_gt_indices[valid_gt_index]: valid_pred_indices[valid_pred_index]
     80     for valid_gt_index, valid_pred_index in enumerate(col_ind)
     81 }
     82 matched_pred = {valid_pred_indices[i] for i in col_ind}

ValueError: matrix contains invalid numeric entries"
}
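
For reference, the same warning and error can be reproduced outside DeepLabCut. The following is only a minimal sketch under my own assumptions: the arrays, the way `mask` is built, and the helper `assign_with_finite_costs` are hypothetical and not taken from the DeepLabCut code; only the two lines mirroring `match_predictions_to_gt.py` (lines 73-74 in the traceback) come from the output above. It shows that when a ground-truth individual and a predicted individual share no valid keypoint, `np.nanmean` returns NaN, and `linear_sum_assignment` then rejects the cost matrix.

```python
import warnings

import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical keypoints (x, y) for one ground-truth and one predicted individual.
# No keypoint is valid in both arrays, so the mask below selects nothing.
gt_idv = np.array([[10.0, 20.0], [np.nan, np.nan]])
pred_idv = np.array([[np.nan, np.nan], [30.0, 40.0]])

# Assumption: the mask keeps keypoints that are defined in both arrays.
mask = ~np.isnan(gt_idv).any(axis=1) & ~np.isnan(pred_idv).any(axis=1)

distance_matrix = np.zeros((1, 1))
with warnings.catch_warnings():
    # Silence "RuntimeWarning: Mean of empty slice", the warning seen in the log.
    warnings.simplefilter("ignore", category=RuntimeWarning)
    d = (gt_idv[mask, :2] - pred_idv[mask, :2]) ** 2
    distance_matrix[0, 0] = np.nanmean(d)  # mean over an empty slice is NaN

# The NaN entry makes the assignment fail with the ValueError from the traceback.
try:
    linear_sum_assignment(distance_matrix)
    error_message = None
except ValueError as err:
    error_message = str(err)


def assign_with_finite_costs(cost, fill=1e6):
    """Hypothetical workaround: replace non-finite costs with a large penalty."""
    finite_cost = np.where(np.isfinite(cost), cost, fill)
    return linear_sum_assignment(finite_cost)


# With finite costs the assignment succeeds and avoids the penalized pairs.
example_cost = np.array([[np.nan, 1.0], [2.0, np.nan]])
row_ind, col_ind = assign_with_finite_costs(example_cost)
```

If this is indeed what happens during evaluation, replacing NaN distances with a large finite cost before the assignment might be one way to avoid the crash, although I do not know whether that is the intended behavior for individuals without any matching keypoints.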