
Commit 9f8ef40

[ORPO] Move ORPOTrainer to experimental (#4480)
1 parent 3bb5d76 · commit 9f8ef40

File tree: 13 files changed (+1286, -1217 lines)


docs/source/_toctree.yml

Lines changed: 2 additions & 2 deletions

@@ -64,8 +64,6 @@
   title: GRPO
 - local: kto_trainer
   title: KTO
-- local: orpo_trainer
-  title: ORPO
 - local: prm_trainer
   title: PRM
 - local: reward_trainer
@@ -115,6 +113,8 @@
   title: MiniLLM
 - local: nash_md_trainer
   title: Nash-MD
+- local: orpo_trainer
+  title: ORPO
 - local: papo_trainer
   title: PAPO
 - local: ppo_trainer

docs/source/community_tutorials.md

Lines changed: 1 addition & 1 deletion

@@ -15,7 +15,7 @@ Community tutorials are made by active members of the Hugging Face community who
 | Instruction tuning | [`SFTTrainer`] | Fine-tuning Google Gemma LLMs using ChatML format with QLoRA | [Philipp Schmid](https://huggingface.co/philschmid) | [Link](https://www.philschmid.de/fine-tune-google-gemma) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/deep-learning-pytorch-huggingface/blob/main/training/gemma-lora-example.ipynb) |
 | Structured Generation | [`SFTTrainer`] | Fine-tuning Llama-2-7B to generate Persian product catalogs in JSON using QLoRA and PEFT | [Mohammadreza Esmaeilian](https://huggingface.co/Mohammadreza) | [Link](https://huggingface.co/learn/cookbook/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/cookbook/blob/main/notebooks/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb) |
 | Preference Optimization | [`DPOTrainer`] | Align Mistral-7b using Direct Preference Optimization for human preference alignment | [Maxime Labonne](https://huggingface.co/mlabonne) | [Link](https://mlabonne.github.io/blog/posts/Fine_tune_Mistral_7b_with_DPO.html) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mlabonne/llm-course/blob/main/Fine_tune_a_Mistral_7b_model_with_DPO.ipynb) |
-| Preference Optimization | [`ORPOTrainer`] | Fine-tuning Llama 3 with ORPO combining instruction tuning and preference alignment | [Maxime Labonne](https://huggingface.co/mlabonne) | [Link](https://mlabonne.github.io/blog/posts/2024-04-19_Fine_tune_Llama_3_with_ORPO.html) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1eHNWg9gnaXErdAa8_mcvjMupbSS6rDvi) |
+| Preference Optimization | [`experimental.orpo.ORPOTrainer`] | Fine-tuning Llama 3 with ORPO combining instruction tuning and preference alignment | [Maxime Labonne](https://huggingface.co/mlabonne) | [Link](https://mlabonne.github.io/blog/posts/2024-04-19_Fine_tune_Llama_3_with_ORPO.html) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1eHNWg9gnaXErdAa8_mcvjMupbSS6rDvi) |
 | Instruction tuning | [`SFTTrainer`] | How to fine-tune open LLMs in 2025 with Hugging Face | [Philipp Schmid](https://huggingface.co/philschmid) | [Link](https://www.philschmid.de/fine-tune-llms-in-2025) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/deep-learning-pytorch-huggingface/blob/main/training/fine-tune-llms-in-2025.ipynb) |

 ### Videos

docs/source/dataset_formats.md

Lines changed: 6 additions & 6 deletions

@@ -387,20 +387,20 @@ Choosing the right dataset type depends on the task you are working on and the s

 | Trainer | Expected dataset type |
 | --- | --- |
-| [`experimental.bco.BCOTrainer`] | [Unpaired preference](#unpaired-preference) or [Preference (explicit prompt recommended)](#preference) |
-| [`experimental.cpo.CPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
 | [`DPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
-| [`experimental.gkd.GKDTrainer`] | [Prompt-completion](#prompt-completion) |
 | [`GRPOTrainer`] | [Prompt-only](#prompt-only) |
 | [`KTOTrainer`] | [Unpaired preference](#unpaired-preference) or [Preference (explicit prompt recommended)](#preference) |
-| [`experimental.nash_md.NashMDTrainer`] | [Prompt-only](#prompt-only) |
 | [`OnlineDPOTrainer`] | [Prompt-only](#prompt-only) |
-| [`ORPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
-| [`experimental.ppo.PPOTrainer`] | Tokenized language modeling |
 | [`PRMTrainer`] | [Stepwise supervision](#stepwise-supervision) |
 | [`RewardTrainer`] | [Preference (implicit prompt recommended)](#preference) |
 | [`RLOOTrainer`] | [Prompt-only](#prompt-only) |
 | [`SFTTrainer`] | [Language modeling](#language-modeling) or [Prompt-completion](#prompt-completion) |
+| [`experimental.bco.BCOTrainer`] | [Unpaired preference](#unpaired-preference) or [Preference (explicit prompt recommended)](#preference) |
+| [`experimental.cpo.CPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
+| [`experimental.gkd.GKDTrainer`] | [Prompt-completion](#prompt-completion) |
+| [`experimental.nash_md.NashMDTrainer`] | [Prompt-only](#prompt-only) |
+| [`experimental.orpo.ORPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
+| [`experimental.ppo.PPOTrainer`] | Tokenized language modeling |
 | [`experimental.xpo.XPOTrainer`] | [Prompt-only](#prompt-only) |

 ## Using any dataset with TRL: preprocessing and conversion
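For context on what "Preference (explicit prompt recommended)" means for the relocated trainer, a minimal record in TRL's standard preference format looks roughly like the sketch below; the keys follow the preference-dataset documentation linked in the table, while the concrete strings are invented for illustration and are not part of this commit.

```python
# Hypothetical explicit-prompt preference record in TRL's standard format.
# The keys ("prompt", "chosen", "rejected") come from the TRL dataset docs;
# the values are made-up examples.
example = {
    "prompt": "What color is the sky?",
    "chosen": "The sky is blue.",
    "rejected": "The sky is green.",
}
```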

docs/source/example_overview.md

Lines changed: 1 addition & 1 deletion

@@ -58,7 +58,7 @@ Scripts are maintained in the [`trl/scripts`](https://github.com/huggingface/trl
 | [`examples/scripts/openenv/catch.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/openenv/catch.py) | Simple script to run GRPO training via the [`GRPOTrainer`] with OpenEnv's Catch environment (OpenSpiel) and vLLM |
 | [`examples/scripts/openenv/echo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/openenv/echo.py) | Simple script to run GRPO training via the [`GRPOTrainer`] with OpenEnv's Echo environment and vLLM. |
 | [`examples/scripts/openenv/wordle.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/openenv/wordle.py) | Simple script to run GRPO training via the [`GRPOTrainer`] with OpenEnv's Wordle environment and vLLM. |
-| [`examples/scripts/orpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/orpo.py) | This script shows how to use the [`ORPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
+| [`examples/scripts/orpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/orpo.py) | This script shows how to use the [`experimental.orpo.ORPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
 | [`examples/scripts/ppo/ppo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo.py) | This script shows how to use the [`experimental.ppo.PPOTrainer`] to fine-tune a model to improve its ability to continue text with positive sentiment or physically descriptive language. |
 | [`examples/scripts/ppo/ppo_tldr.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo_tldr.py) | This script shows how to use the [`experimental.ppo.PPOTrainer`] to fine-tune a model to improve its ability to generate TL;DR summaries. |
 | [`examples/scripts/prm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/prm.py) | This script shows how to use the [`PRMTrainer`] to fine-tune a Process-supervised Reward Model (PRM). |

docs/source/index.md

Lines changed: 1 addition & 1 deletion

@@ -41,10 +41,10 @@ Below is the current list of TRL trainers, organized by method type (⚡️ = vL

 - [`SFTTrainer`]
 - [`DPOTrainer`]
-- [`ORPOTrainer`]
 - [`KTOTrainer`]
 - [`experimental.bco.BCOTrainer`] 🧪
 - [`experimental.cpo.CPOTrainer`] 🧪
+- [`experimental.orpo.ORPOTrainer`] 🧪

 ### Knowledge distillation

docs/source/orpo_trainer.md

Lines changed: 5 additions & 5 deletions

@@ -34,7 +34,7 @@ Below is the script to train the model:
 ```python
 # train_orpo.py
 from datasets import load_dataset
-from trl import ORPOConfig, ORPOTrainer
+from trl.experimental.orpo import ORPOConfig, ORPOTrainer
 from transformers import AutoModelForCausalLM, AutoTokenizer

 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
@@ -79,9 +79,9 @@ Here are some other factors to consider when choosing a programming language for

 ## Expected dataset type

-ORPO requires a [preference dataset](dataset_formats#preference). The [`ORPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
+ORPO requires a [preference dataset](dataset_formats#preference). The [`experimental.orpo.ORPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.

-Although the [`ORPOTrainer`] supports both explicit and implicit prompts, we recommend using explicit prompts. If provided with an implicit prompt dataset, the trainer will automatically extract the prompt from the `"chosen"` and `"rejected"` columns. For more information, refer to the [preference style](dataset_formats#preference) section.
+Although the [`experimental.orpo.ORPOTrainer`] supports both explicit and implicit prompts, we recommend using explicit prompts. If provided with an implicit prompt dataset, the trainer will automatically extract the prompt from the `"chosen"` and `"rejected"` columns. For more information, refer to the [preference style](dataset_formats#preference) section.

 ## Example script

@@ -121,11 +121,11 @@ While training and evaluating, we record the following reward metrics:

 ## ORPOTrainer

-[[autodoc]] ORPOTrainer
+[[autodoc]] experimental.orpo.ORPOTrainer
     - train
     - save_model
     - push_to_hub

 ## ORPOConfig

-[[autodoc]] ORPOConfig
+[[autodoc]] experimental.orpo.ORPOConfig
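For users of the documented quick start, the only functional change is the import path. Below is a minimal sketch of the updated script, assuming the trainer keeps TRL's usual `processing_class` argument and using `trl-lib/ultrafeedback_binarized` as a stand-in preference dataset; the dataset choice and the config options shown are assumptions for illustration, not part of this diff.

```python
# train_orpo.py (minimal sketch with the new experimental import path)
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from trl.experimental.orpo import ORPOConfig, ORPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# Assumed preference dataset with explicit "prompt"/"chosen"/"rejected" columns.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = ORPOConfig(output_dir="Qwen2-0.5B-ORPO")
trainer = ORPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```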

examples/scripts/orpo.py

Lines changed: 2 additions & 1 deletion

@@ -63,7 +63,8 @@
 from datasets import load_dataset
 from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser

-from trl import ModelConfig, ORPOConfig, ORPOTrainer, ScriptArguments, get_peft_config
+from trl import ModelConfig, ScriptArguments, get_peft_config
+from trl.experimental.orpo import ORPOConfig, ORPOTrainer


 # Enable logging in a Hugging Face Space
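Third-party code that still imports ORPO from the top-level `trl` namespace will hit an `ImportError` once it upgrades past this commit. One possible migration pattern that works on both sides of the move is a guarded import; this shim is a hypothetical suggestion, not something the commit adds.

```python
# Hypothetical compatibility shim for code that must support TRL versions
# released both before and after the move to trl.experimental.
try:
    from trl.experimental.orpo import ORPOConfig, ORPOTrainer  # new location
except ImportError:
    from trl import ORPOConfig, ORPOTrainer  # old location (pre-move releases)
```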

tests/test_orpo_trainer.py renamed to tests/experimental/test_orpo_trainer.py

Lines changed: 2 additions & 2 deletions

@@ -17,9 +17,9 @@
 from datasets import load_dataset
 from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

-from trl import ORPOConfig, ORPOTrainer
+from trl.experimental.orpo import ORPOConfig, ORPOTrainer

-from .testing_utils import TrlTestCase, require_peft
+from ..testing_utils import TrlTestCase, require_peft


 class TestORPOTrainer(TrlTestCase):

trl/experimental/orpo/__init__.py (new file)

Lines changed: 19 additions & 0 deletions

@@ -0,0 +1,19 @@
# Copyright 2020-2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from .orpo_config import ORPOConfig
from .orpo_trainer import ORPOTrainer


__all__ = ["ORPOConfig", "ORPOTrainer"]
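The `__init__.py` only re-exports the two public symbols, so the subpackage and its submodules expose the same objects. A small, hypothetical check based solely on the imports declared above:

```python
# Hypothetical check that the package-level re-export and the submodule
# attribute resolve to the same class.
from trl.experimental.orpo import ORPOTrainer
from trl.experimental.orpo.orpo_trainer import ORPOTrainer as ORPOTrainerDirect

assert ORPOTrainer is ORPOTrainerDirect
```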
trl/experimental/orpo/orpo_config.py (new file)

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
# Copyright 2020-2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dataclasses import dataclass, field
from typing import Any

from transformers import TrainingArguments


@dataclass
class ORPOConfig(TrainingArguments):
    r"""
    Configuration class for the [`experimental.orpo.ORPOTrainer`].

    This class includes only the parameters that are specific to ORPO training. For a full list of training arguments,
    please refer to the [`~transformers.TrainingArguments`] documentation. Note that default values in this class may
    differ from those in [`~transformers.TrainingArguments`].

    Using [`~transformers.HfArgumentParser`] we can turn this class into
    [argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
    command line.

    Parameters:
        max_length (`int` or `None`, *optional*, defaults to `1024`):
            Maximum length of the sequences (prompt + completion) in the batch. This argument is required if you want
            to use the default data collator.
        max_prompt_length (`int` or `None`, *optional*, defaults to `512`):
            Maximum length of the prompt. This argument is required if you want to use the default data collator.
        max_completion_length (`int`, *optional*):
            Maximum length of the completion. This argument is required if you want to use the default data collator
            and your model is an encoder-decoder.
        beta (`float`, *optional*, defaults to `0.1`):
            Parameter controlling the relative ratio loss weight in the ORPO loss. In the
            [paper](https://huggingface.co/papers/2403.07691), it is denoted by λ. In the
            [code](https://github.com/xfactlab/orpo), it is denoted by `alpha`.
        disable_dropout (`bool`, *optional*, defaults to `True`):
            Whether to disable dropout in the model.
        label_pad_token_id (`int`, *optional*, defaults to `-100`):
            Label pad token id. This argument is required if you want to use the default data collator.
        padding_value (`int`, *optional*):
            Padding value to use. If `None`, the padding value of the tokenizer is used.
        truncation_mode (`str`, *optional*, defaults to `"keep_end"`):
            Truncation mode to use when the prompt is too long. Possible values are `"keep_end"` or `"keep_start"`.
            This argument is required if you want to use the default data collator.
        generate_during_eval (`bool`, *optional*, defaults to `False`):
            If `True`, generates and logs completions from the model to W&B or Comet during evaluation.
        is_encoder_decoder (`bool`, *optional*):
            When using the `model_init` argument (callable) to instantiate the model instead of the `model` argument,
            you need to specify if the model returned by the callable is an encoder-decoder model.
        model_init_kwargs (`dict[str, Any]`, *optional*):
            Keyword arguments to pass to `AutoModelForCausalLM.from_pretrained` when instantiating the model from a
            string.
        dataset_num_proc (`int`, *optional*):
            Number of processes to use for processing the dataset.
    """

    _VALID_DICT_FIELDS = TrainingArguments._VALID_DICT_FIELDS + ["model_init_kwargs"]

    # Parameters whose default values are overridden from TrainingArguments
    learning_rate: float = field(
        default=1e-6,
        metadata={"help": "The initial learning rate for AdamW."},
    )
    logging_steps: float = field(
        default=10,
        metadata={
            "help": "Log every X updates steps. Should be an integer or a float in range `[0,1)`. If smaller than 1, "
            "will be interpreted as ratio of total training steps."
        },
    )
    gradient_checkpointing: bool = field(
        default=True,
        metadata={
            "help": "If True, use gradient checkpointing to save memory at the expense of slower backward pass."
        },
    )
    bf16: bool | None = field(
        default=None,
        metadata={
            "help": "Whether to use bf16 (mixed) precision instead of 32-bit. Requires Ampere or higher NVIDIA "
            "architecture or Intel XPU or using CPU (use_cpu) or Ascend NPU. If not set, it defaults to `True` if "
            "`fp16` is not set."
        },
    )
    # Transformers 4.57.0 introduced a bug that caused the dtype of `lr_scheduler_kwargs` to be unparsable. This issue
    # was fixed in https://github.com/huggingface/transformers/pull/41322, but the fix has not yet been released. We
    # add a temporary workaround here, which can be removed once the fix is available—likely in Transformers 4.57.2.
    lr_scheduler_kwargs: dict | str | None = field(
        default=None,
        metadata={
            "help": "Additional parameters for the lr_scheduler, such as {'num_cycles': 1} for cosine with hard "
            "restarts."
        },
    )

    max_length: int | None = field(
        default=1024,
        metadata={"help": "Maximum length of the sequences (prompt + completion) in the batch."},
    )
    max_prompt_length: int | None = field(
        default=512,
        metadata={
            "help": "Maximum length of the prompt. This argument is required if you want to use the default data "
            "collator and your model is an encoder-decoder."
        },
    )
    max_completion_length: int | None = field(
        default=None,
        metadata={
            "help": "Maximum length of the completion. This argument is required if you want to use the default data "
            "collator and your model is an encoder-decoder."
        },
    )
    beta: float = field(
        default=0.1,
        metadata={
            "help": "Parameter controlling the relative ratio loss weight in the ORPO loss. In the paper, it is "
            "denoted by λ."
        },
    )
    disable_dropout: bool = field(
        default=True,
        metadata={"help": "Whether to disable dropout in the model."},
    )
    label_pad_token_id: int = field(
        default=-100,
        metadata={
            "help": "Label pad token id. This argument is required if you want to use the default data collator."
        },
    )
    padding_value: int | None = field(
        default=None,
        metadata={"help": "Padding value to use. If `None`, the padding value of the tokenizer is used."},
    )
    truncation_mode: str = field(
        default="keep_end",
        metadata={
            "help": "Truncation mode to use when the prompt is too long.",
            "choices": ["keep_end", "keep_start"],
        },
    )
    generate_during_eval: bool = field(
        default=False,
        metadata={"help": "If `True`, generates and logs completions from the model to W&B during evaluation."},
    )
    is_encoder_decoder: bool | None = field(
        default=None,
        metadata={
            "help": "When using the `model_init` argument (callable) to instantiate the model instead of the `model` "
            "argument, you need to specify if the model returned by the callable is an encoder-decoder model."
        },
    )
    model_init_kwargs: dict[str, Any] | None = field(
        default=None,
        metadata={
            "help": "Keyword arguments to pass to `AutoModelForCausalLM.from_pretrained` when instantiating the model "
            "from a string."
        },
    )
    dataset_num_proc: int | None = field(
        default=None,
        metadata={"help": "Number of processes to use for processing the dataset."},
    )

    def __post_init__(self):
        self.bf16 = not (self.fp16) if self.bf16 is None else self.bf16

        super().__post_init__()
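Relative to `TrainingArguments`, the config mainly overrides a few defaults (learning rate, logging cadence, gradient checkpointing) and resolves `bf16` in `__post_init__`: when `bf16` is left unset, it is enabled whenever `fp16` is not requested. A small, hypothetical illustration of how those defaults resolve, assuming a CUDA setup where both bf16 and fp16 are usable:

```python
from trl.experimental.orpo import ORPOConfig

# Defaults overridden from TrainingArguments, per the field definitions above.
args = ORPOConfig(output_dir="orpo-output")
print(args.learning_rate)  # 1e-6
print(args.beta)           # 0.1
print(args.max_length)     # 1024
print(args.bf16)           # True: bf16 was unset and fp16 defaults to False

# Requesting fp16 instead leaves bf16 off after __post_init__.
args_fp16 = ORPOConfig(output_dir="orpo-output", fp16=True)
print(args_fp16.bf16)      # False
```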
