Fix prevent duplicate GPU usage in distributed processing by ved1beta · Pull Request #3526 · huggingface/accelerate

What does this PR do?

same PR as last time added test

Test Code

from accelerate import PartialState
from accelerate.utils import gather_object
import os
import torch

# Print CUDA information
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
    print(f"Current CUDA device: {torch.cuda.current_device()}")
    print(f"CUDA device name: {torch.cuda.get_device_name()}")

# Initialize distributed state first
distributed_state = PartialState()

# Create some test data
prompts = [str(i) for i in range(4)]  # ["0", "1", "2", "3"]
batch_size = 2

# Split into batches
tokenized_prompts = [prompts[i : i + batch_size] for i in range(0, len(prompts), batch_size)]

completions_per_process = []
with distributed_state.split_between_processes(tokenized_prompts, apply_padding=True) as batched_prompts:
    for batch in batched_prompts:
        generated_text = [f"{distributed_state.device}: {t}" for t in batch]
        completions_per_process.extend(generated_text)

# Gather results from all processes
completions_gather = gather_object(completions_per_process)
print(completions_gather)

How to Test

  1. Single GPU test:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc-per-node 1 test.py
  1. CPU mode test (for distributed testing):
CUDA_VISIBLE_DEVICES="" torchrun --nproc-per-node 2 test.py

Expected Output

For single GPU test:

CUDA available: True
Number of CUDA devices: 1
Current CUDA device: 0
CUDA device name: NVIDIA GeForce RTX 3050 6GB Laptop GPU
Local rank: 0
Using device: cuda:0
['cuda:0: 0', 'cuda:0: 1', 'cuda:0: 2', 'cuda:0: 3']

Fixes #3485

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@SunMarc