This release includes multiple important bugfixes (`SFTTrainer`, `PPOTrainer`) and extends the current `DataCollatorForCompletionOnlyLM` to support chat-like training.
DPO Trainer
The DPO algorithm (Direct Preference Optimization) was introduced by Rafailov et al. in [this paper](https://arxiv.org/abs/2305.18290) and provides a way of optimizing a policy directly on preference data, without having to train a separate reward model. The `DPOTrainer` is now part of the TRL library for anyone who wants to use it, thanks to the amazing contributors!
* DPO Trainer by kashif in https://github.com/lvwerra/trl/pull/416
* [DPO] make sure all the concated batches are on same device by kashif in https://github.com/lvwerra/trl/pull/528
* [DPO] remove response/pairs from the DPO side by kashif in https://github.com/lvwerra/trl/pull/540
* [DPO] remove unnecessary batch size arg to Collator by kashif in https://github.com/lvwerra/trl/pull/554
* [`DPO`] Resolve logging for DPOTrainer by tomaarsen in https://github.com/lvwerra/trl/pull/570
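For context, the DPO objective penalizes the policy when the reference-relative log-probability of the preferred (chosen) response falls below that of the rejected one. The following is a minimal scalar sketch of that loss in plain Python, not the `DPOTrainer` implementation; the per-response log-probabilities are assumed to already be summed over tokens:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares the policy-vs-reference log-ratios of
    the chosen and rejected responses."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # plain log-sigmoid; fine for a scalar illustration
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

With equal log-probabilities the margin is zero and the loss is `log 2`; increasing the policy's log-probability on the chosen response lowers the loss.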
What's Changed
* Reward trainer multi-gpu eval bug by rlindskog in https://github.com/lvwerra/trl/pull/513
* Use local process index for `_get_current_device()` by lewtun in https://github.com/lvwerra/trl/pull/515
Extending the `DataCollatorForCompletionOnlyLM`
You can now mask out the user prompts with the `DataCollatorForCompletionOnlyLM` data collator and train only on chat completions. Check out the PR below or the appropriate section of the documentation to learn more about it!
* Introducing DataCollatorForChatCompletionOnlyLM by gaetanlop in https://github.com/lvwerra/trl/pull/456
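To illustrate the idea (this is a simplified sketch of the masking logic, not TRL's actual collator code): tokens outside assistant completions get the label `-100`, which the language-modeling loss ignores, so gradients only flow through completion tokens. The marker token sequences here stand in for the tokenized instruction/response templates:

```python
IGNORE_INDEX = -100  # label value ignored by the LM loss

def find_spans(ids, marker):
    """Return start indices where the marker token sequence occurs."""
    n = len(marker)
    return [i for i in range(len(ids) - n + 1) if ids[i:i + n] == marker]

def mask_prompts(input_ids, instruction_marker, response_marker):
    """Build labels that keep only assistant completions.

    Everything is masked except spans that start right after a response
    marker and run until the next instruction marker (or sequence end),
    so the loss is computed on completions only.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start in find_spans(input_ids, response_marker):
        begin = start + len(response_marker)
        nexts = [i for i in find_spans(input_ids, instruction_marker) if i >= begin]
        end = nexts[0] if nexts else len(input_ids)
        labels[begin:end] = input_ids[begin:end]
    return labels
```

Applied to a multi-turn sequence, each assistant turn keeps its labels while both user turns and the template markers themselves are masked out.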
Important bug fixes
Multiple bugs in the supported trainers have been raised by the community and fixed in the PRs below:
* [`core`] Fix offline case by younesbelkada in https://github.com/lvwerra/trl/pull/538
* Relax reward trainer constraint by younesbelkada in https://github.com/lvwerra/trl/pull/539
* ADD: num_proc to SFTTrainer by BramVanroy in https://github.com/lvwerra/trl/pull/547
* [`SFTTrainer`] Add warning for wrong padding_side by younesbelkada in https://github.com/lvwerra/trl/pull/550
* Minor typo and whitespace fixes by tmm1 in https://github.com/lvwerra/trl/pull/559
* [`SFTTrainer`] Add epochs and num steps on CLI by younesbelkada in https://github.com/lvwerra/trl/pull/562
* Add `DataCollatorForCompletionOnlyLM` in the docs by younesbelkada in https://github.com/lvwerra/trl/pull/565
* Add comment to explain how the sentiment pipeline is used to run the … by jvhoffbauer in https://github.com/lvwerra/trl/pull/555
* Fix model output dim in reward trainer example by liutianlin0121 in https://github.com/lvwerra/trl/pull/566
* Computes the KL penalty using the entire distribution by edbeeching in https://github.com/lvwerra/trl/pull/541
* Add missing max_seq_length arg to example sft_trainer.py by SharkWipf in https://github.com/lvwerra/trl/pull/585
* [`PPO`] fix corner cases with PPO batch size and forward_batch_size by younesbelkada in https://github.com/lvwerra/trl/pull/563
* Update the example sft_trainer.py by ZeusFSX in https://github.com/lvwerra/trl/pull/587
* docs: Replace SFTTrainer with RewardTrainer in comment by tomaarsen in https://github.com/lvwerra/trl/pull/589
* Fix comparison in DataCollatorForCompletionOnlyLM (588) by RyujiTamaki in https://github.com/lvwerra/trl/pull/594
* refactor grad accum by vwxyzjn in https://github.com/lvwerra/trl/pull/546
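One of the fixes above computes the KL penalty from the entire output distribution rather than from sampled tokens only. A minimal sketch of a full-distribution KL divergence between two categorical distributions given their logits (plain Python for illustration; the actual PR operates on per-token logit tensors):

```python
import math

def full_kl(p_logits, q_logits):
    """KL(P || Q) summed over the whole vocabulary, from raw logits."""
    def log_softmax(logits):
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        return [x - lse for x in logits]
    logp = log_softmax(p_logits)
    logq = log_softmax(q_logits)
    # sum_i P(i) * (log P(i) - log Q(i))
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq))
```

Unlike a single-sample estimate, this is exact for the given distributions: it is zero for identical logits and strictly positive otherwise.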
Big refactor of examples and documentation
The examples and documentation have been refactored; check the PRs below for more details:
* [`examples`] Big refactor of examples and documentation by younesbelkada in https://github.com/lvwerra/trl/pull/509
* [`examples`] Fix sentiment nit by younesbelkada in https://github.com/lvwerra/trl/pull/517
* [`examples`] make the sft script more modulable by younesbelkada in https://github.com/lvwerra/trl/pull/543
* Add `use_auth_token` arg to sft_trainer example by corey-lambda in https://github.com/lvwerra/trl/pull/544
New Contributors
* rlindskog made their first contribution in https://github.com/lvwerra/trl/pull/513
* corey-lambda made their first contribution in https://github.com/lvwerra/trl/pull/544
* tmm1 made their first contribution in https://github.com/lvwerra/trl/pull/559
* jvhoffbauer made their first contribution in https://github.com/lvwerra/trl/pull/555
* liutianlin0121 made their first contribution in https://github.com/lvwerra/trl/pull/566
* SharkWipf made their first contribution in https://github.com/lvwerra/trl/pull/585
* ZeusFSX made their first contribution in https://github.com/lvwerra/trl/pull/587
* gaetanlop made their first contribution in https://github.com/lvwerra/trl/pull/456
* RyujiTamaki made their first contribution in https://github.com/lvwerra/trl/pull/594
**Full Changelog**: https://github.com/lvwerra/trl/compare/v0.4.7...v0.5.0