This release includes multiple important bugfixes (`SFTTrainer`, `PPOTrainer`) and extends the current `DataCollatorForCompletionOnlyLM` to support chat-like training.
DPO Trainer
The DPO algorithm (Direct Preference Optimization) was introduced by Rafailov et al. in [this paper](https://arxiv.org/abs/2305.18290) and provides a way of optimizing a policy directly on preference data, without having to train a separate reward model. The `DPOTrainer` is now part of the TRL library for anyone who wants to use it, thanks to the amazing contributors!
* DPO Trainer by kashif in https://github.com/lvwerra/trl/pull/416
* [DPO] make sure all the concated batches are on same device by kashif in https://github.com/lvwerra/trl/pull/528
* [DPO] remove response/pairs from the DPO side by kashif in https://github.com/lvwerra/trl/pull/540
* [DPO] remove unnecessary batch size arg to Collator by kashif in https://github.com/lvwerra/trl/pull/554
* [`DPO`] Resolve logging for DPOTrainer by tomaarsen in https://github.com/lvwerra/trl/pull/570
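For context, the DPO objective penalizes the policy when the reference-relative log-probability of the preferred (chosen) response falls below that of the rejected one. The following is a minimal scalar sketch of that loss in plain Python, not the `DPOTrainer` implementation; the per-response log-probabilities are assumed to already be summed over tokens:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares the policy-vs-reference log-ratios of
    the chosen and rejected responses."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # plain log-sigmoid; fine for a scalar illustration
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

With equal log-probabilities the margin is zero and the loss is `log 2`; increasing the policy's log-probability on the chosen response lowers the loss.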
What's Changed
* Reward trainer multi-gpu eval bug by rlindskog in https://github.com/lvwerra/trl/pull/513
* Use local process index for `_get_current_device()` by lewtun in https://github.com/lvwerra/trl/pull/515
Extending the `DataCollatorForCompletionOnlyLM`
You can now mask out the user prompts with the `DataCollatorForCompletionOnlyLM` data collator and train only on chat completions. Check out the PR below or the appropriate section of the documentation to learn more about it!
* Introducing DataCollatorForChatCompletionOnlyLM by gaetanlop in https://github.com/lvwerra/trl/pull/456
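To illustrate the idea (this is a simplified sketch of the masking logic, not TRL's actual collator code): tokens outside assistant completions get the label `-100`, which the language-modeling loss ignores, so gradients only flow through completion tokens. The marker token sequences here stand in for the tokenized instruction/response templates:

```python
IGNORE_INDEX = -100  # label value ignored by the LM loss

def find_spans(ids, marker):
    """Return start indices where the marker token sequence occurs."""
    n = len(marker)
    return [i for i in range(len(ids) - n + 1) if ids[i:i + n] == marker]

def mask_prompts(input_ids, instruction_marker, response_marker):
    """Build labels that keep only assistant completions.

    Everything is masked except spans that start right after a response
    marker and run until the next instruction marker (or sequence end),
    so the loss is computed on completions only.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start in find_spans(input_ids, response_marker):
        begin = start + len(response_marker)
        nexts = [i for i in find_spans(input_ids, instruction_marker) if i >= begin]
        end = nexts[0] if nexts else len(input_ids)
        labels[begin:end] = input_ids[begin:end]
    return labels
```

Applied to a multi-turn sequence, each assistant turn keeps its labels while both user turns and the template markers themselves are masked out.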
Important bug fixes
Multiple bugs in the supported trainers have been raised by the community and fixed in the PRs below:
* [`core`] Fix offline case by younesbelkada in https://github.com/lvwerra/trl/pull/538
* Relax reward trainer constraint by younesbelkada in https://github.com/lvwerra/trl/pull/539
* ADD: num_proc to SFTTrainer by BramVanroy in https://github.com/lvwerra/trl/pull/547
* [`SFTTrainer`] Add warning for wrong padding_side by younesbelkada in https://github.com/lvwerra/trl/pull/550
* Minor typo and whitespace fixes by tmm1 in https://github.com/lvwerra/trl/pull/559
* [`SFTTrainer`] Add epochs and num steps on CLI by younesbelkada in https://github.com/lvwerra/trl/pull/562
* Add `DataCollatorForCompletionOnlyLM` in the docs by younesbelkada in https://github.com/lvwerra/trl/pull/565
* Add comment to explain how the sentiment pipeline is used to run the … by jvhoffbauer in https://github.com/lvwerra/trl/pull/555
* Fix model output dim in reward trainer example by liutianlin0121 in https://github.com/lvwerra/trl/pull/566
* Computes the KL penalty using the entire distribution by edbeeching in https://github.com/lvwerra/trl/pull/541
* Add missing max_seq_length arg to example sft_trainer.py by SharkWipf in https://github.com/lvwerra/trl/pull/585
* [`PPO`] fix corner cases with PPO batch size and forward_batch_size by younesbelkada in https://github.com/lvwerra/trl/pull/563
* Update the example sft_trainer.py by ZeusFSX in https://github.com/lvwerra/trl/pull/587
* docs: Replace SFTTrainer with RewardTrainer in comment by tomaarsen in https://github.com/lvwerra/trl/pull/589
* Fix comparison in DataCollatorForCompletionOnlyLM (588) by RyujiTamaki in https://github.com/lvwerra/trl/pull/594
* refactor grad accum by vwxyzjn in https://github.com/lvwerra/trl/pull/546
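One of the fixes above computes the KL penalty from the entire output distribution rather than from sampled tokens only. A minimal sketch of a full-distribution KL divergence between two categorical distributions given their logits (plain Python for illustration; the actual PR operates on per-token logit tensors):

```python
import math

def full_kl(p_logits, q_logits):
    """KL(P || Q) summed over the whole vocabulary, from raw logits."""
    def log_softmax(logits):
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        return [x - lse for x in logits]
    logp = log_softmax(p_logits)
    logq = log_softmax(q_logits)
    # sum_i P(i) * (log P(i) - log Q(i))
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq))
```

Unlike a single-sample estimate, this is exact for the given distributions: it is zero for identical logits and strictly positive otherwise.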
Big refactor of examples and documentation
The examples and documentation have been refactored; check the PRs below for more details:
* [`examples`] Big refactor of examples and documentation by younesbelkada in https://github.com/lvwerra/trl/pull/509
* [`examples`] Fix sentiment nit by younesbelkada in https://github.com/lvwerra/trl/pull/517
* [`examples`] make the sft script more modulable by younesbelkada in https://github.com/lvwerra/trl/pull/543
* Add `use_auth_token` arg to sft_trainer example by corey-lambda in https://github.com/lvwerra/trl/pull/544
New Contributors
* rlindskog made their first contribution in https://github.com/lvwerra/trl/pull/513
* corey-lambda made their first contribution in https://github.com/lvwerra/trl/pull/544
* tmm1 made their first contribution in https://github.com/lvwerra/trl/pull/559
* jvhoffbauer made their first contribution in https://github.com/lvwerra/trl/pull/555
* liutianlin0121 made their first contribution in https://github.com/lvwerra/trl/pull/566
* SharkWipf made their first contribution in https://github.com/lvwerra/trl/pull/585
* ZeusFSX made their first contribution in https://github.com/lvwerra/trl/pull/587
* gaetanlop made their first contribution in https://github.com/lvwerra/trl/pull/456
* RyujiTamaki made their first contribution in https://github.com/lvwerra/trl/pull/594
**Full Changelog**: https://github.com/lvwerra/trl/compare/v0.4.7...v0.5.0