Releases · huggingface/trl

v0.28.0

Features

Experimental

Fixes

Documentation and Examples

Deprecations

CI Improvements

Miscellaneous

What's Changed

  • ⬆️ Bump dev version by @qgallouedec in #4835
  • Support triggering CI via push to ci-* branches by @albertvillanova in #4840
  • Revert CI hotfix pinning transformers 4.57.4 after tiny mo...

Read more

v0.27.2

v0.27.1

v0.27.0

Features

  • Add vllm_group_port argument to GRPO, RLOO and OnlineDPO configuration by @pointerhacker in #4545
  • Preserve truncated tokens in BFD packing by @qgallouedec in #4632
  • Support async reward functions and parallelize calls to reward functions by @pramodith in #4567
  • RLOO supports async rewards by @pramodith in #4718
  • Support vLLM 0.12.0 by @jiqing-feng in #4117
  • feat: DeepSeek V3.2 Off-policy sequence masking by @casinca in #4689
  • 🎭 Up to 50% less VRAM during forward with forward_masked_logits function by @qgallouedec in #4729
  • [GRPO] Add a config to limit the number of tool calling iterations by @pramodith in #4761
  • Switch gradient checkpointing default to use_reentrant=False (PyTorch recommended) by @qgallouedec in #4811
  • Add support for GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization by @nbasyl in #4785
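The gradient-checkpointing change above can also be opted into explicitly. A minimal sketch, assuming the standard `gradient_checkpointing_kwargs` passthrough that TRL configs inherit from transformers' `TrainingArguments`:

```python
from trl import SFTConfig

# Sketch: request PyTorch's recommended non-reentrant checkpointing
# explicitly; from this release it is the default.
training_args = SFTConfig(
    output_dir="out",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
```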

Experimental

  • Move AutoModelForCausalLMWithValueHead and AutoModelForSeq2SeqLMWithValueHead to experimental by @qgallouedec in #4654
  • Move DPODataCollatorWithPadding to experimental.utils by @qgallouedec in #4667
  • Move DataCollatorForChatML to experimental.utils by @qgallouedec in #4668
  • Move add_bos_token_if_needed and add_eos_token_if_needed to experimental.utils by @qgallouedec in #4674
  • Move truncate_right and SIMPLE_CHAT_TEMPLATE to experimental.utils by @qgallouedec in #4677
  • Move prepare_model_for_kbit_training, enable_gradient_checkpointing, prepare_peft_model to experimental.utils by @qgallouedec in #4704
  • Move get_reward function to experimental.utils by @qgallouedec in #4683
  • Remove experimental imports from testing_utils by @albertvillanova in #4727
  • ORPO: Avoid catastrophic cancellation in loss function by @hartmans in #4763
  • Refactor KTO [1/N]: Modernize model initialization by @albertvillanova in #4783
  • [GOLD] add probability merging fix to implement chain rule by @kashif in #4765
  • Refactor KTO coordinated with DPO [a/N]: Remove encoder-decoder support by @albertvillanova in #4792
  • Refactor KTO coordinated with DPO [b/N]: Simplify truncation logic by @albertvillanova in #4808
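The ORPO fix above targets catastrophic cancellation in the loss. The actual patch is not reproduced here, but the numerical issue is easy to illustrate: ORPO's log-odds involve a term of the form log(1 - exp(logp)), which loses all precision when the log-probability is very close to 0. A minimal sketch with hypothetical helper names (not TRL code):

```python
import math

def log1m_exp_naive(logp: float) -> float:
    """log(1 - exp(logp)), computed naively: when logp is near 0
    (p near 1), exp(logp) rounds to exactly 1.0 and the subtraction
    cancels catastrophically."""
    return math.log(1.0 - math.exp(logp))

def log1m_exp_stable(logp: float) -> float:
    """Same quantity via expm1, which stays accurate near logp == 0."""
    return math.log(-math.expm1(logp))

# For logp = -1e-18, exp(logp) rounds to 1.0, so the naive form tries
# to take log(0.0); the stable form returns log(1e-18) ≈ -41.45.
```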

Fixes

  • Accounting for case num_generations_eval=1 in the calculation of the advantage by @qgallouedec in #4662
  • Fix vLLM error for tools usage not supported when running GRPO training by @apalmas-saifh in #4663
  • Fix GRPO config validation in case num_generations_eval is specified and different than num_generations by @apalmas-saifh in #4682
  • Fix top_k default value to 0 for disabling top-k filtering by @albertvillanova in #4695
  • Include generation_config for tiny model uploads by @qgallouedec in #4643
  • Fix KeyError with transformers 5.0.0+ where push_to_hub_token is removed by @Manodeepray in #4691
  • Overwrite model default generation config used by model.generate by @albertvillanova in #4647
  • Fix: handle multiple tool calls in qwen3_schema by @mattbui in #4709
  • Fix bugs when using multi-gpu: dataset streaming for offline trainers + dtype initialization by @kaixuanliu in #3950
  • Ensure llm-blender is importable with transformers >= v5 by @albertvillanova in #4781
  • Monkey patch for HybridCache in Liger-Kernel with transformers v5 by @qgallouedec in #4798
  • [fix] GRPOTrainer: proper access args by @carlyou in #4801
  • Fix vllm compat patches to be applied only to affected versions by @albertvillanova in #4815
  • fix bug when sft calc outputs.token_accuracy by @kaixuanliu in #4814
  • fix xpu vllm client server by @jiqing-feng in #4780

Documentation and Examples

Deprecations

CI Improvements

Read more

v0.26.2

v0.26.1

v0.26.0

Features

🕵️‍♂️ GRPO: Agent training

GRPOTrainer now supports training agents using tools. This allows language models to interact with external functions or APIs during training.

from datasets import Dataset
from trl import GRPOTrainer

def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b


dataset = Dataset.from_list(
    [
        {"prompt": [{"role": "user", "content": "What is 3 multiplied by 4?"}], "answer": 12},
        {"prompt": [{"role": "user", "content": "Calculate 7 times 8."}], "answer": 56},
        {"prompt": [{"role": "user", "content": "Find the product of 5 and 6."}], "answer": 30},
        {"prompt": [{"role": "user", "content": "What do you get when you multiply 9 by 9?"}], "answer": 81},
        {"prompt": [{"role": "user", "content": "Compute 12 multiplied by 11."}], "answer": 132},
        {"prompt": [{"role": "user", "content": "What is 15 times 14?"}], "answer": 210},
    ]
)

def accuracy(completions, answer, **kwargs):
    predictions = [completion[-1]["content"] for completion in completions]
    rewards = [float(str(ans) in pred) for pred, ans in zip(predictions, answer)]
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=dataset,
    tools=[multiply],
    reward_funcs=accuracy,
)
trainer.train()

by @qgallouedec in #4300

ScaleRL: Add CISPO Loss

CISPO loss was first introduced in the MiniMax-M1 paper; the ScaleRL paper subsequently showed that CISPO scales best in terms of performance and efficiency as models are trained for longer.

GRPOTrainer now supports the CISPO loss using loss_type="cispo" in the GRPOConfig.

by @pramodith in #4495
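Enabling it is a one-line config change. A minimal sketch, assuming the rest of the GRPO setup is unchanged:

```python
from trl import GRPOConfig

# Sketch: switch the GRPO objective to CISPO; all other arguments
# keep their defaults.
training_args = GRPOConfig(output_dir="out", loss_type="cispo")
```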

Add vLLM quantization option for colocate

When the input model is quantized using bitsandbytes, vLLM will now also use quantization when in colocate mode.

by @sergiopaniego in #4496
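A minimal sketch of the setup this applies to, assuming the policy model is loaded in 4-bit via `model_init_kwargs` (the exact plumbing may differ from the PR):

```python
from transformers import BitsAndBytesConfig
from trl import GRPOConfig

# Sketch: when the policy model is quantized with bitsandbytes,
# the colocated vLLM engine now picks up the quantization as well.
training_args = GRPOConfig(
    output_dir="out",
    use_vllm=True,
    vllm_mode="colocate",
    model_init_kwargs={"quantization_config": BitsAndBytesConfig(load_in_4bit=True)},
)
```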

Reasoning reward

TRL now includes a reasoning reward function:

from trl.rewards import reasoning_accuracy_reward

solutions = [r"\frac{1}{3}", r"\frac{1}{3}", r"\frac{1}{3}"]
completions = [
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{3}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{2}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content with partial answers \boxed{\frac{1}{3}} but no final answer",
        }
    ],
]
reasoning_accuracy_reward(completions, solutions)  # [1.0, 0.0, 0.0] 

Like any other reward function, it can be used in GRPOTrainer or RLOOTrainer.

from trl import GRPOTrainer
from trl.rewards import reasoning_accuracy_reward

trainer = GRPOTrainer(
    ...,
    reward_funcs=reasoning_accuracy_reward,
)

by @lewtun in #4563

Add shuffle_dataset option to SFTTrainer

You can now shuffle the dataset in SFTTrainer by setting the shuffle_dataset argument to True in SFTConfig. This is useful when the dataset features high similarity between consecutive samples.

from trl import SFTConfig

training_args = SFTConfig(shuffle_dataset=True)

by @qgallouedec in #4564

Add SAPO Loss in GRPO

Soft Adaptive Policy Optimization (SAPO) replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO.

You can now use SAPO loss in GRPOTrainer by setting loss_type="sapo" in the GRPOConfig.

by @pramodith in #4600
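The contrast between hard clipping and a smooth gate can be illustrated in isolation. A minimal sketch using a generic temperature-controlled Gaussian gate (illustrative only, not SAPO's actual formula):

```python
import math

def hard_clip(ratio: float, eps: float = 0.2) -> float:
    """PPO/GSPO-style hard clipping: the update is cut off abruptly
    once the importance ratio leaves the [1 - eps, 1 + eps] band."""
    return min(max(ratio, 1.0 - eps), 1.0 + eps)

def soft_gate(ratio: float, tau: float = 0.1) -> float:
    """A generic smooth, temperature-controlled gate (NOT SAPO's exact
    formula): returns a weight in (0, 1] that decays continuously as
    the ratio drifts off-policy, instead of a hard cutoff."""
    return math.exp(-((ratio - 1.0) ** 2) / tau)
```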

Other Features

Experimental

Fixes

Documentation and Examples

Read more

v0.25.1

v0.25.0

Features

  • 💤 Switch to sleep level=2 and split wake-ups in GRPO and RLOO trainers by @xxrjun in #4296
  • Added custom prepare_model_for_kbit_training to save VRAM by @sergiopaniego in #4335
  • Add add_generation_prompt to processor_kwargs in GRPO and RLOO trainer by @qgallouedec in #4361
  • Add support for Trackio completions logging in GRPOTrainer by @taha-yassine in #4359
  • Support chat_template_kwargs by @pramodith in #4350
  • GRPO: ScaleRL -> Support casting LM Head to FP32 by @pramodith in #4303
  • Support casting to fp32 when word embeddings are tied to lm_head by @pramodith in #4446
  • 💬 Add chat to vLLM client and server, update trainer calls by @qgallouedec in #4450

Experimental

  • 🚚 Move BCO to trl.experimental by @qgallouedec in #4312
  • 👑 [experimental] GOLD Trainer by @kashif in #4349
  • Add PAPOTrainer for preference-based optimization by @SolarWindRider in #4334
  • [GFPO] fix the GFPO loss calculation error caused by unmodified old_per_token_logps by @Peter-Chou in #4454
  • 🕹️ Add rollout function for OpenEnv integration by @lewtun in #4310

Fixes

Documentation and Examples

Deprecations

What's Changed

Read more

v0.24.0

Features

Fixes

  • [Online-DPO] fix the completion_len == max_new_tokens crash by @kashif in #4193
  • Fix entropy and accuracy calculation for prompt_tuning techniques. by @pramodith in #4196
  • Fix prompt-completion labeling with add_generation_prompt and warning by @behroozazarkhalili in #4201
  • 🌡️ Have vLLM return processed (temperature scaled) log probs by @YonatanGideoni in #4163
  • Fix handling of f_divergence_type in DPO by @albertvillanova in #4171
  • ⚡ Fix Flash Attention x Padding-Free loss by @qgallouedec in #4170
  • Pass required token_type_ids by @albertvillanova in #4148
  • 👩‍🦯 Fix usage of VLM using text only by @SamuelBarryCS in #4080
  • ⚓ [vllm] ensure MASTER_ADDR/MASTER_PORT are set safely by @kashif in #4057
  • 📤 Fix a dataset loading bug in scripts by @singing-cat in #4124
  • ๐Ÿฏ fix: use_liger_kernel with IterableDataset by @jue-jue-zi in #4087
  • [GKD] Fix batchmean reduce op in GKDTrainer's loss by @cmpatino in #4105
  • Fix get_peft_model() so that prepare_model_for_kbit_training does not reapply to an instance of PeftModel, thus freezing all the layers by @Hoesu in #4081
  • Aux loss is already included in the loss returned by Transformers by @pramodith in #4078
  • ♨️ [GRPO] Fix potential hang in get_high_entropy_mask by @akakakakakaa in #4041
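The MASTER_ADDR/MASTER_PORT fix above is about not clobbering an existing launcher setup. A minimal illustration of the "set safely" pattern (hypothetical helper name, not the actual TRL patch):

```python
import os

def ensure_master_env(addr: str = "127.0.0.1", port: str = "29500") -> None:
    """Fill in the distributed-rendezvous variables only when absent,
    so values already exported by torchrun/accelerate are never
    overwritten."""
    os.environ.setdefault("MASTER_ADDR", addr)
    os.environ.setdefault("MASTER_PORT", port)
```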

Documentation

Deprecations

Experimental

What's Changed

Read more