Timeout does not work when used with `TensorBoardTracker`
System Info
- `Accelerate` version: 1.6.0
- Platform: Linux-5.15.0-70-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /data2/houxiuquan/envs/cp311pt211/bin/accelerate
- Python version: 3.11.9
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.1.1 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch SDAA available: False
- PyTorch MUSA available: False
- System RAM: 503.30 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config: Not found
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
Here are three minimal scripts that reproduce the bug (run each one with more than one process, e.g. `accelerate launch --num_processes 2 script.py`).
- In the following script, I set `timeout` to 4 seconds and stop the main process for 8 seconds. As expected, the watchdog raises a timeout error at `wait_for_everyone`.
```python
import time
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs
from accelerate.utils import ProjectConfiguration
from torch import tensor

accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=4))],
)
accelerator.init_trackers(project_name="my_project")

accelerator.wait_for_everyone()
if accelerator.is_main_process:
    t = tensor(0).to(accelerator.device)
    time.sleep(8)
else:
    t = tensor(0).to(accelerator.device)
accelerator.wait_for_everyone()
print("All called!")
```
- However, when passing a `TensorBoardTracker` instance to the `Accelerator`, the `timeout` has no effect and the script finishes successfully.
```python
import time
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs
from accelerate.tracking import TensorBoardTracker
from accelerate.utils import ProjectConfiguration
from torch import tensor

accelerator = Accelerator(
    log_with=TensorBoardTracker(run_name="tf_log", logging_dir="tmp"),
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=4))],
)
accelerator.init_trackers(project_name="my_project")

accelerator.wait_for_everyone()
if accelerator.is_main_process:
    t = tensor(0).to(accelerator.device)
    time.sleep(8)
else:
    t = tensor(0).to(accelerator.device)
accelerator.wait_for_everyone()
print("All called!")
```
- If `TensorBoardTracker` is replaced with `"tensorboard"`, the `timeout` takes effect again and the script raises a timeout error as expected.
```python
import time
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs
from accelerate.utils import ProjectConfiguration
from torch import tensor

accelerator = Accelerator(
    log_with="tensorboard",
    project_config=ProjectConfiguration(
        project_dir="tmp",
        total_limit=5,
        automatic_checkpoint_naming=True,
    ),
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=4))],
)
accelerator.init_trackers(project_name="my_project")

accelerator.wait_for_everyone()
if accelerator.is_main_process:
    t = tensor(0).to(accelerator.device)
    time.sleep(8)
else:
    t = tensor(0).to(accelerator.device)
accelerator.wait_for_everyone()
print("All called!")
```
Expected behavior
All three scripts should raise a timeout error from the watchdog.
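For clarity, the behavior the scripts expect can be mimicked with a plain-Python analogy. This is not Accelerate's implementation, just an illustration of a rendezvous with a watchdog timeout, using `threading.Barrier` and delays scaled down from the scripts' 4 s/8 s:

```python
import threading
import time

# Analogy only: two "ranks" meet at a barrier with a 1-second timeout,
# while the slow rank sleeps for 2 seconds. The barrier must break, just
# as wait_for_everyone() should raise once the watchdog timeout elapses.
barrier = threading.Barrier(2, timeout=1)
timed_out = []

def rank(delay):
    time.sleep(delay)
    try:
        barrier.wait()
    except threading.BrokenBarrierError:
        # A timed-out wait breaks the barrier for every party,
        # so both the fast and the slow rank end up here.
        timed_out.append(delay)

threads = [threading.Thread(target=rank, args=(d,)) for d in (0, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"ranks that saw a broken barrier: {len(timed_out)}")  # prints 2
```

In all three scripts the main process overshoots the timeout the same way the slow rank does here, so the second `wait_for_everyone()` should fail on every process rather than return normally.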