Timeout does not work when used with TensorBoardTracker

System Info

- `Accelerate` version: 1.6.0
- Platform: Linux-5.15.0-70-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /data2/houxiuquan/envs/cp311pt211/bin/accelerate
- Python version: 3.11.9
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.1.1 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch SDAA available: False
- PyTorch MUSA available: False
- System RAM: 503.30 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
        Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Here are three minimal scripts to reproduce the bug.
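
The scripts need at least two processes for `wait_for_everyone()` to actually synchronize anything. Assuming a machine with two GPUs and a script saved as `repro.py` (hypothetical filename), they can be launched with:

```shell
accelerate launch --num_processes 2 repro.py
```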

  1. In the following script, I set the timeout to 4 seconds and make the main process sleep for 8 seconds. As expected, the watchdog raises a timeout error at the second `wait_for_everyone()`.

```python
import time
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs
from torch import tensor

accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=4))],
)
accelerator.init_trackers(project_name="my_project")

accelerator.wait_for_everyone()
if accelerator.is_main_process:
    t = tensor(0).to(accelerator.device)
    time.sleep(8)  # sleep past the 4-second timeout on the main process
else:
    t = tensor(0).to(accelerator.device)
accelerator.wait_for_everyone()  # the other ranks hit the timeout here

print("All called!")
```
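
Since `wait_for_everyone()` is the collective call being guarded, a rough stdlib analogy of the watchdog behavior (not Accelerate's actual implementation, just an illustration of a timed-out collective wait) is:

```python
import threading


def wait_for_everyone(barrier: threading.Barrier, timeout: float) -> None:
    """Block until all participants arrive, or raise once `timeout` passes."""
    try:
        barrier.wait(timeout=timeout)
    except threading.BrokenBarrierError:
        raise TimeoutError("a participant did not reach the barrier in time")


# Two participants expected, but only one ever arrives -- mirroring the
# main process sleeping past the timeout in the script above.
barrier = threading.Barrier(2)
try:
    wait_for_everyone(barrier, timeout=0.2)
except TimeoutError as exc:
    print(f"watchdog fired: {exc}")
```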
  2. However, when a `TensorBoardTracker` instance is passed to the `Accelerator`, the timeout takes no effect and the script finishes successfully.

```python
import time
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs
from accelerate.tracking import TensorBoardTracker
from torch import tensor

accelerator = Accelerator(
    log_with=TensorBoardTracker(run_name="tf_log", logging_dir="tmp"),
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=4))],
)
accelerator.init_trackers(project_name="my_project")

accelerator.wait_for_everyone()
if accelerator.is_main_process:
    t = tensor(0).to(accelerator.device)
    time.sleep(8)  # sleep past the 4-second timeout on the main process
else:
    t = tensor(0).to(accelerator.device)
accelerator.wait_for_everyone()  # no timeout is raised here

print("All called!")
```
  3. If the `TensorBoardTracker` instance is replaced with the string `"tensorboard"`, the timeout becomes effective again and the script raises a timeout error as expected.

```python
import time
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs
from accelerate.utils import ProjectConfiguration
from torch import tensor

accelerator = Accelerator(
    log_with="tensorboard",
    project_config=ProjectConfiguration(
        project_dir="tmp",
        total_limit=5,
        automatic_checkpoint_naming=True,
    ),
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=4))],
)
accelerator.init_trackers(project_name="my_project")

accelerator.wait_for_everyone()
if accelerator.is_main_process:
    t = tensor(0).to(accelerator.device)
    time.sleep(8)  # sleep past the 4-second timeout on the main process
else:
    t = tensor(0).to(accelerator.device)
accelerator.wait_for_everyone()  # the other ranks hit the timeout here

print("All called!")
```

Expected behavior

All three scripts should raise a timeout error via the watchdog.