`PartialState` `split_between_processes` resulting in duplicated output with `apply_padding=True`

System Info

accelerate==1.3.0
torch==2.2.2 and torch 2.4.0
python3.8 and 3.11

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Run the script below using torchrun --nproc-per-node 2 script.py

from accelerate import PartialState
from accelerate.utils import gather_object

# Start up the distributed environment without needing the Accelerator.
distributed_state = PartialState()

prompts = [str(i) for i in range(4)]

batch_size = 2
tokenized_prompts = [prompts[i : i + batch_size] for i in range(0, len(prompts), batch_size)]

completions_per_process = []
with distributed_state.split_between_processes(tokenized_prompts, apply_padding=True) as batched_prompts:
    for batch in batched_prompts:
        generated_text = [f"{distributed_state.device}: {t}" for t in batch]
        completions_per_process.extend(generated_text)

completions_gather = gather_object(completions_per_process)
print(completions_gather)
# Drop duplicates produced by apply_padding in split_between_processes
completions = completions_gather[: len(prompts)]

I found that when the number of prompts is divisible by batch_size, completions_gather looks like this:
["cuda 0: 0", "cuda 0: 1", "cuda 0: 0", "cuda 0: 1", "cuda 1: 2", "cuda 1: 3", "cuda 1: 2", "cuda 1: 3"]

This suggests that GPU 0 returns ["cuda 0: 0", "cuda 0: 1", "cuda 0: 0", "cuda 0: 1"] and GPU 1 returns ["cuda 1: 2", "cuda 1: 3", "cuda 1: 2", "cuda 1: 3"], i.e. each GPU duplicates its own batch.
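The duplication can be reproduced without GPUs by simulating the list branch of `split_between_processes`. This is a simplified sketch of that logic (the real implementation also handles tensors and nested dicts, and `split_with_padding` is a hypothetical helper name):

```python
def split_with_padding(inputs, num_processes, process_index):
    # Simplified sketch of the list branch of PartialState.split_between_processes
    # with apply_padding=True.
    num_samples_per_process, num_extras = divmod(len(inputs), num_processes)
    start = process_index * num_samples_per_process + min(process_index, num_extras)
    end = start + num_samples_per_process + (1 if process_index < num_extras else 0)
    result = list(inputs[start:end])
    # The padding line in question: every process is padded up to
    # num_samples_per_process + 1 items, even when the split was already even.
    result += [result[-1]] * (num_samples_per_process + 1 - len(result))
    return result

# 4 prompts, batch_size 2 -> 2 batches split across 2 simulated processes
batches = [["0", "1"], ["2", "3"]]
gathered = []
for rank in range(2):
    gathered.extend(split_with_padding(batches, 2, rank))
print(gathered)  # each rank's single batch appears twice
```

With 2 batches on 2 processes, each rank holds one batch but is padded to two, so the gathered list contains every batch twice, matching the observed output.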


Looking further into the source code, I found that the issue was introduced by PR #2781:

https://github.com/huggingface/accelerate/blob/main/src/accelerate/state.py#L473

if apply_padding:
    if isinstance(result, torch.Tensor):
        from accelerate.utils import pad_across_processes, send_to_device

        # The tensor needs to be on the device before we can pad it
        tensorized_result = send_to_device(result, self.device)
        result = pad_across_processes(tensorized_result, pad_index=inputs[-1])
    else:
        result += [result[-1]] * (num_samples_per_process + 1 - len(result))
return result

If my input is [[0,1], [2,3]], num_samples_per_process equals 1 because there are 2 batches and 2 GPUs (divmod(2, 2) gives num_samples_per_process=1, num_extras=0).

On GPU 0, the assigned slice is [[0,1]], so (num_samples_per_process + 1 - len(result)) is 1+1-1=1. The final result is therefore [[0,1]] + 1 * [[0,1]] = [[0,1], [0,1]], which creates the duplicates.

It works correctly if my input is [[0,1], [2,3], [4]]:

  • In this case GPU 0 gets [[0,1], [2,3]], (num_samples_per_process + 1 - len(result)) is 1+1-2=0. It doesn't create any duplicates.
  • GPU 1 gets [[4]], and (num_samples_per_process + 1 - len(result)) is 1+1-1=1. Therefore it got padded to [[4], [4]].

Maybe this line can be changed to
result += [result[-1]] * (num_samples_per_process + (1 if num_extras>0 else 0) - len(result))
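A minimal sketch of the split logic with this proposed change applied (the `split_with_padding_fixed` name is hypothetical; only the padding line differs from the current implementation):

```python
def split_with_padding_fixed(inputs, num_processes, process_index):
    # Same slicing as PartialState.split_between_processes (list branch).
    num_samples_per_process, num_extras = divmod(len(inputs), num_processes)
    start = process_index * num_samples_per_process + min(process_index, num_extras)
    end = start + num_samples_per_process + (1 if process_index < num_extras else 0)
    result = list(inputs[start:end])
    # Proposed fix: only pad up to num_samples_per_process + 1 when some
    # processes actually received an extra item (num_extras > 0).
    result += [result[-1]] * (num_samples_per_process + (1 if num_extras > 0 else 0) - len(result))
    return result

# Even split: no padding, no duplicates
print(split_with_padding_fixed([[0, 1], [2, 3]], 2, 0))  # [[0, 1]]
# Uneven split: only the short rank gets padded
print(split_with_padding_fixed([[0, 1], [2, 3], [4]], 2, 1))  # [[4], [4]]
```

Under this change, the divisible case returns one batch per rank with no padding, while the uneven case still pads the shorter rank so all processes iterate the same number of steps.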

Expected behavior

["cuda 0: 0", "cuda 0: 1", "cuda 1: 2", "cuda 1: 3"]