`PartialState` `split_between_processes` resulting in duplicated output with `apply_padding=True`
System Info
- accelerate==1.3.0
- torch==2.2.2 and torch==2.4.0
- Python 3.8 and 3.11
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the `examples/` folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
Run the script below using `torchrun --nproc-per-node 2 script.py`:
```python
from accelerate import PartialState
from accelerate.utils import gather_object

# Start up the distributed environment without needing the Accelerator.
distributed_state = PartialState()

prompts = [str(i) for i in range(4)]
batch_size = 2
tokenized_prompts = [prompts[i : i + batch_size] for i in range(0, len(prompts), batch_size)]

completions_per_process = []
with distributed_state.split_between_processes(tokenized_prompts, apply_padding=True) as batched_prompts:
    for batch in batched_prompts:
        generated_text = [f"{distributed_state.device}: {t}" for t in batch]
        completions_per_process.extend(generated_text)

completions_gather = gather_object(completions_per_process)
print(completions_gather)

# Drop duplicates produced by apply_padding in split_between_processes
completions = completions_gather[: len(prompts)]
```
I found that when the number of prompts is divisible by `batch_size`, `completions_gather` looks like this:
["cuda 0: 0", "cuda 0: 1", "cuda 0: 0", "cuda 0: 1", "cuda 1: 2", "cuda 1: 3", "cuda 1: 2", "cuda 1: 3"]
This suggests that GPU 0 returned `["cuda 0: 0", "cuda 0: 1", "cuda 0: 0", "cuda 0: 1"]` and GPU 1 returned `["cuda 1: 2", "cuda 1: 3", "cuda 1: 2", "cuda 1: 3"]`.
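Note that this also breaks the "drop duplicates" slice at the end of the reproduction script: because the padding duplicates are interleaved per rank rather than appended at the end, slicing the gathered list to `len(prompts)` no longer recovers the original prompts. A minimal check using the gathered output shown above:

```python
# Gathered output observed with 2 processes and apply_padding=True
# (copied from the run above).
completions_gather = [
    "cuda 0: 0", "cuda 0: 1", "cuda 0: 0", "cuda 0: 1",
    "cuda 1: 2", "cuda 1: 3", "cuda 1: 2", "cuda 1: 3",
]

# The script slices to len(prompts) == 4 to drop padding, but here that
# only returns rank 0's padded output, losing GPU 1's completions.
print(completions_gather[:4])
# ['cuda 0: 0', 'cuda 0: 1', 'cuda 0: 0', 'cuda 0: 1']
```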
Looking further at the source code, I found that the issue was introduced in PR #2781:
https://github.com/huggingface/accelerate/blob/main/src/accelerate/state.py#L473
```python
if apply_padding:
    if isinstance(result, torch.Tensor):
        from accelerate.utils import pad_across_processes, send_to_device

        # The tensor needs to be on the device before we can pad it
        tensorized_result = send_to_device(result, self.device)
        result = pad_across_processes(tensorized_result, pad_index=inputs[-1])
    else:
        result += [result[-1]] * (num_samples_per_process + 1 - len(result))
return result
```
If my input is `[[0,1], [2,3]]`, `num_samples_per_process` equals 1 because I have 2 batches and 2 GPUs.
On GPU 0, the item assigned to it is `[[0,1]]`, so `(num_samples_per_process + 1 - len(result))` is 1 + 1 - 1 = 1. The final result is therefore `[[0,1]] + [[0,1]] = [[0,1], [0,1]]`, which creates the duplicates.
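The walk-through above can be reproduced without any GPUs. Below is a hypothetical standalone simulation of the split-and-pad logic (`split_with_padding`, `num_samples_per_process`, and `num_extras` mirror the names in the accelerate source; this is a sketch of my understanding, not the actual library code):

```python
# Simulate split_between_processes(..., apply_padding=True) for lists,
# following the quoted source: every rank is padded up to
# num_samples_per_process + 1 items, even when the split is already even.
def split_with_padding(inputs, num_processes, apply_padding=True):
    num_samples_per_process, num_extras = divmod(len(inputs), num_processes)
    results = []
    for rank in range(num_processes):
        start = rank * num_samples_per_process + min(rank, num_extras)
        end = start + num_samples_per_process + (1 if rank < num_extras else 0)
        result = inputs[start:end]
        if apply_padding:
            result = result + [result[-1]] * (num_samples_per_process + 1 - len(result))
        results.append(result)
    return results

# Even split: 2 batches over 2 "processes" -> every rank duplicates
# its last batch.
print(split_with_padding([[0, 1], [2, 3]], 2))
# [[[0, 1], [0, 1]], [[2, 3], [2, 3]]]
```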
It works correctly if my input is `[[0,1], [2,3], [4]]`:
- GPU 0 gets `[[0,1], [2,3]]`, and `(num_samples_per_process + 1 - len(result))` is 1 + 1 - 2 = 0, so no duplicates are created.
- GPU 1 gets `[[4]]`, and `(num_samples_per_process + 1 - len(result))` is 1 + 1 - 1 = 1, so it is padded to `[[4], [4]]`.
Maybe this line can be changed to:

```python
result += [result[-1]] * (num_samples_per_process + (1 if num_extras > 0 else 0) - len(result))
```
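A quick sketch of the proposed change in isolation (`pad_result` is a hypothetical helper; `num_extras` is assumed to be the remainder of dividing the input length by the number of processes, as in the accelerate source). Padding only targets `num_samples_per_process + 1` items when there actually are leftover samples:

```python
# Pad a rank's slice only when the overall split was uneven
# (num_extras > 0); otherwise leave it untouched.
def pad_result(result, num_samples_per_process, num_extras):
    target = num_samples_per_process + (1 if num_extras > 0 else 0)
    return result + [result[-1]] * (target - len(result))

# Even split (2 batches over 2 GPUs): no padding, no duplicates.
print(pad_result([[0, 1]], 1, 0))   # [[0, 1]]
# Uneven split (3 batches over 2 GPUs): the short rank is still
# padded so all ranks iterate the same number of times.
print(pad_result([[4]], 1, 1))      # [[4], [4]]
```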
Expected behavior
[{"cuda 0: 0", "cuda 0: 1", "cuda 1: 2", "cuda 1: 3"]