Sample without replacement option when interleaving datasets by radulescupetru · Pull Request #7786 · huggingface/datasets

Right now, the interleave_datasets function, when called with probabilities, always samples with replacement, so an example from an already exhausted dataset can be drawn again. This PR adds the ability to sample without replacement.

import datasets

# Create two small datasets with disjoint values to test exhaustion
data_a = [{"value": i, "source": "A"} for i in range(5)]
data_b = [{"value": i, "source": "B"} for i in range(10, 15)]

ds_a = datasets.Dataset.from_list(data_a).to_iterable_dataset()
ds_b = datasets.Dataset.from_list(data_b).to_iterable_dataset()

# Interleave with probabilities
ds_interleaved = datasets.interleave_datasets(
    [ds_a, ds_b],
    probabilities=[0.6, 0.4],
    seed=42,
    stopping_strategy="all_exhausted",
    sample_with_replacement=True,
)
for i, example in enumerate(ds_interleaved):
    print(f"Sample:{i}: value:{example['value']:02d} source:{example['source']}")

Output with sample_with_replacement=True:

Sample:0: value:10 source:B
Sample:1: value:00 source:A
Sample:2: value:11 source:B
Sample:3: value:12 source:B
Sample:4: value:01 source:A
Sample:5: value:13 source:B
Sample:6: value:14 source:B
Sample:7: value:10 source:B
Sample:8: value:02 source:A
Sample:9: value:03 source:A
Sample:10: value:04 source:A

Note that the example with value:10 source:B is sampled twice (Sample:0 and Sample:7).

With sample_with_replacement=False, the same loop prints:

Sample:0: value:10 source:B
Sample:1: value:00 source:A
Sample:2: value:11 source:B
Sample:3: value:12 source:B
Sample:4: value:01 source:A
Sample:5: value:13 source:B
Sample:6: value:14 source:B
Sample:7: value:02 source:A
Sample:8: value:03 source:A
Sample:9: value:04 source:A

Note that we don't see any repeated items.
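
The difference between the two modes can be sketched in plain Python. This is only a toy model under stated assumptions, not the implementation in this PR: the interleave helper, its with_replacement argument, and the list-based sources are all hypothetical, and the random stream differs from the one datasets uses, so the exact order will not match the output above. It assumes every source is non-empty and that sampling stops once every source has been fully seen.

```python
import random


def interleave(sources, probabilities, seed, with_replacement):
    """Toy model of probability-based interleaving that stops once every
    source has been fully seen. Illustration only, not the datasets code."""
    if any(len(s) == 0 for s in sources):
        raise ValueError("this sketch assumes every source is non-empty")
    rng = random.Random(seed)
    positions = [0] * len(sources)     # items drawn so far from each source
    seen_all = [False] * len(sources)  # True once a source has been fully read
    out = []
    while not all(seen_all):
        if with_replacement:
            # Exhausted sources stay eligible and wrap around to their start.
            candidates = list(range(len(sources)))
        else:
            # Exhausted sources are dropped, so no item can repeat.
            candidates = [i for i, done in enumerate(seen_all) if not done]
        weights = [probabilities[i] for i in candidates]
        idx = rng.choices(candidates, weights=weights)[0]
        source = sources[idx]
        out.append(source[positions[idx] % len(source)])
        positions[idx] += 1
        if positions[idx] >= len(source):
            seen_all[idx] = True
    return out


data_a = [("A", i) for i in range(5)]
data_b = [("B", i) for i in range(10, 15)]

with_repl = interleave([data_a, data_b], [0.6, 0.4], seed=42, with_replacement=True)
without_repl = interleave([data_a, data_b], [0.6, 0.4], seed=42, with_replacement=False)

# Without replacement the result always has exactly len(data_a) + len(data_b) items;
# with replacement it has at least that many.
print(len(with_repl), len(without_repl))
```

In this sketch the only difference between the modes is the candidate list: without replacement, a source is removed from the draw as soon as its last item has been yielded, and the remaining probabilities are used as relative weights.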