# Trainer Classes

## Trainer

TRL supports PPO (Proximal Policy Optimization) with an implementation that largely follows the structure introduced in the paper “Fine-Tuning Language Models from Human Preferences” by D. Ziegler et al. \[[paper](https://arxiv.org/pdf/1909.08593.pdf), [code](https://github.com/openai/lm-human-preferences)]. The trainer and model classes are largely inspired by the `transformers.Trainer` and `transformers.AutoModel` classes, adapted for RL. We also support a `RewardTrainer` that can be used to train a reward model.

### PPOConfig

#### class trl.PPOConfig

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ppo_config.py#L30)

```
(
  exp_name: str = 'doc-buil',
  seed: int = 0,
  log_with: typing.Union[typing.Literal['wandb', 'tensorboard'], NoneType] = None,
  task_name: typing.Optional[str] = None,
  model_name: typing.Optional[str] = None,
  query_dataset: typing.Optional[str] = None,
  reward_model: typing.Optional[str] = None,
  remove_unused_columns: bool = True,
  tracker_kwargs: dict = <factory>,
  accelerator_kwargs: dict = <factory>,
  project_kwargs: dict = <factory>,
  tracker_project_name: str = 'trl',
  push_to_hub_if_best_kwargs: dict = <factory>,
  steps: int = 20000,
  learning_rate: float = 1e-05,
  adap_kl_ctrl: bool = True,
  init_kl_coef: typing.Optional[float] = 0.2,
  kl_penalty: typing.Literal['kl', 'abs', 'mse', 'full'] = 'kl',
  target: typing.Optional[float] = 6,
  horizon: typing.Optional[float] = 10000,
  gamma: float = 1,
  lam: float = 0.95,
  cliprange: float = 0.2,
  cliprange_value: float = 0.2,
  vf_coef: float = 0.1,
  batch_size: int = 256,
  forward_batch_size: typing.Optional[int] = None,
  mini_batch_size: int = 1,
  gradient_accumulation_steps: int = 1,
  world_size: typing_extensions.Annotated[int, Suppress] = None,
  ppo_epochs: int = 4,
  max_grad_norm: typing.Optional[float] = None,
  optimize_cuda_cache: bool = False,
  early_stopping: bool = False,
  target_kl: float = 1,
  compare_steps: int = 1,
  ratio_threshold: float = 10.0,
  use_score_scaling: bool = False,
  use_score_norm: bool = False,
  score_clip: typing.Optional[float] = None,
  is_encoder_decoder: typing.Union[typing_extensions.Annotated[bool, Suppress], NoneType] = None,
  is_peft_model: typing.Union[typing_extensions.Annotated[bool, Suppress], NoneType] = None,
  backward_batch_size: typing_extensions.Annotated[int, Suppress] = None,
  global_backward_batch_size: typing_extensions.Annotated[int, Suppress] = None,
  global_batch_size: typing_extensions.Annotated[int, Suppress] = None
)
```

Configuration class for PPOTrainer

### PPOTrainer

#### class trl.PPOTrainer

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ppo_trainer.py#L107)

```
(
  config: PPOConfig = None,
  model: PreTrainedModelWrapper = None,
  ref_model: typing.Optional[trl.models.modeling_base.PreTrainedModelWrapper] = None,
  tokenizer: PreTrainedTokenizerBase = None,
  dataset: typing.Union[torch.utils.data.dataset.Dataset, datasets.arrow_dataset.Dataset, NoneType] = None,
  optimizer: typing.Optional[torch.optim.optimizer.Optimizer] = None,
  data_collator: typing.Optional[typing.Callable] = None,
  num_shared_layers: typing.Optional[int] = None,
  lr_scheduler: typing.Optional[torch.optim.lr_scheduler._LRScheduler] = None
)
```

Parameters

* **config** (`PPOConfig`) — Configuration object for PPOTrainer. Check the documentation of `PPOConfig` for more details.
* **model** (`PreTrainedModelWrapper`) — Model to be optimized, a BOINC AI transformer model with a value head. Check the documentation of `PreTrainedModelWrapper` for more details.
* **ref\_model** (`PreTrainedModelWrapper`, *optional*) — Reference model used for the KL penalty, a BOINC AI transformer model with a causal language modeling head. Check the documentation of `PreTrainedModelWrapper` for more details. If no reference model is provided, the trainer will create a reference model with the same architecture as the model to be optimized, with shared layers.
* **tokenizer** (`PreTrainedTokenizerBase`) — Tokenizer used for encoding the data. Check the documentation of `transformers.PreTrainedTokenizer` and `transformers.PreTrainedTokenizerFast` for more details.
* **dataset** (Union\[`torch.utils.data.Dataset`, `datasets.Dataset`], *optional*) — PyTorch dataset or BOINC AI dataset, used to create a PyTorch dataloader. If no dataset is provided, the dataloader must be created outside the trainer; users then need to design their own dataloader and make sure the batch size matches the one specified in the configuration object.
* **optimizer** (`torch.optim.Optimizer`, *optional*) — Optimizer used for training. If no optimizer is provided, the trainer will create an Adam optimizer with the learning rate specified in the configuration object.
* **data\_collator** (`DataCollatorForLanguageModeling`, *optional*) — Data collator used for training and passed along to the dataloader.
* **num\_shared\_layers** (`int`, *optional*) — Number of layers to be shared between the model and the reference model, if no reference model is passed. If no number is provided, all the layers will be shared.
* **lr\_scheduler** (`torch.optim.lr_scheduler`, *optional*) — Learning rate scheduler used for training.

The PPOTrainer uses Proximal Policy Optimization to optimize language models. Note that this trainer is heavily inspired by the original OpenAI learning-to-summarize work: <https://github.com/openai/summarize-from-feedback>
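As a sketch, the typical wiring looks like the following. The model name, hyperparameters, and the omission of an explicit `ref_model` are illustrative choices, not requirements:

```python
# Sketch: constructing a PPOTrainer. Non-executed outline; model name and
# hyperparameters are illustrative.
def build_ppo_trainer(model_name: str = "gpt2"):
    from transformers import AutoTokenizer
    from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

    config = PPOConfig(batch_size=16, mini_batch_size=4, learning_rate=1e-5)
    model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token

    # ref_model is omitted: the trainer creates a frozen reference copy
    # with shared layers automatically.
    return PPOTrainer(config=config, model=model, tokenizer=tokenizer)
```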

**batched\_forward\_pass**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ppo_trainer.py#L912)

```
(
  model: PreTrainedModelWrapper,
  queries: Tensor,
  responses: Tensor,
  model_inputs: dict,
  return_logits: bool = False,
  response_masks: typing.Optional[torch.Tensor] = None
) → (tuple)
```

Parameters

* **queries** (`torch.LongTensor`) — List of tensors containing the encoded queries, shape (`batch_size`, `query_length`)
* **responses** (`torch.LongTensor`) — List of tensors containing the encoded responses, shape (`batch_size`, `response_length`)
* **return\_logits** (`bool`, *optional*, defaults to `False`) — Whether to return all\_logits. Set to `False` if logits are not needed to reduce memory consumption.

Returns

(tuple)

* all\_logprobs (`torch.FloatTensor`): Log probabilities of the responses, shape (`batch_size`, `response_length`)
* all\_ref\_logprobs (`torch.FloatTensor`): Log probabilities of the responses, shape (`batch_size`, `response_length`)
* all\_values (`torch.FloatTensor`): Values of the responses, shape (`batch_size`, `response_length`)

Calculate model outputs in multiple batches.

**compute\_rewards**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ppo_trainer.py#L1049)

```
(
  scores: FloatTensor,
  logprobs: FloatTensor,
  ref_logprobs: FloatTensor,
  masks: LongTensor
)
```

Parameters

* **scores** (`torch.FloatTensor`) — Scores from the reward model, shape (`batch_size`)
* **logprobs** (`torch.FloatTensor`) — Log probabilities of the model, shape (`batch_size`, `response_length`)
* **ref\_logprobs** (`torch.FloatTensor`) — Log probabilities of the reference model, shape (`batch_size`, `response_length`)

Compute per token rewards from scores and KL-penalty.
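Conceptually, each response token is penalized for diverging from the reference model, and the reward-model score is credited at the last non-masked token. A pure-Python sketch of that logic under the default `'kl'` penalty (a simplification of the actual batched implementation):

```python
def per_token_rewards(score, logprobs, ref_logprobs, mask, kl_coef=0.2):
    """Combine a sequence-level score with a per-token KL penalty.

    score: float reward-model score for the whole response
    logprobs / ref_logprobs: per-token log probabilities (lists of floats)
    mask: 1 for valid response tokens, 0 for padding
    """
    rewards = []
    last_valid = max(i for i, m in enumerate(mask) if m == 1)
    for i, (lp, ref_lp, m) in enumerate(zip(logprobs, ref_logprobs, mask)):
        kl = lp - ref_lp              # per-token KL estimate ('kl' penalty)
        r = -kl_coef * kl * m         # penalize divergence from the reference
        if i == last_valid:
            r += score                # score is credited to the final token
        rewards.append(r)
    return rewards
```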

**create\_model\_card**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ppo_trainer.py#L1355)

```
( path: str, model_name: typing.Optional[str] = 'TRL Model' )
```

Parameters

* **path** (`str`) — The path to save the model card to.
* **model\_name** (`str`, *optional*) — The name of the model, defaults to `TRL Model`.

Creates and saves a model card for a TRL model.

**gather\_stats**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ppo_trainer.py#L868)

( stats ) → `dict[str, Any]`

Parameters

* **stats** (`dict[str, Any]`) — A dictionary of stats to be gathered. The stats should contain torch tensors.

Returns

`dict[str, Any]`

A dictionary of stats with the tensors gathered.

Gather stats from all processes. Useful in the context of distributed training.

**generate**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ppo_trainer.py#L417)

```
(
  query_tensor: typing.Union[torch.Tensor, typing.List[torch.Tensor]],
  length_sampler: typing.Callable = None,
  batch_size: int = 4,
  return_prompt: bool = True,
  **generation_kwargs
) → torch.LongTensor
```

Parameters

* **query\_tensor** (`torch.LongTensor`) — A tensor of shape (`seq_len`) containing query tokens or a list of tensors of shape (`seq_len`).
* **generation\_kwargs** (dict\[str, Any]) — Keyword arguments for generation.
* **length\_sampler** (`Callable`, *optional*) — Callable that returns the number of newly generated tokens.
* **batch\_size** (`int`, *optional*) — Batch size used for generation, defaults to `4`.
* **return\_prompt** (`bool`, *optional*) — If set to `False` the prompt is not returned but only the newly generated tokens, defaults to `True`.

Returns

`torch.LongTensor`

A tensor of shape (`batch_size`, `gen_len`) containing response tokens.

Generate a response from the model given a query tensor, by calling the model’s `generate` method.
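A usage sketch (assumes an already-constructed trainer and tokenizer; the generation keyword arguments are illustrative and follow the usual `transformers` generation API):

```python
# Sketch only: encode a query, generate, and decode the new response tokens.
def generate_response(ppo_trainer, tokenizer, query_txt: str):
    query_tensor = tokenizer.encode(query_txt, return_tensors="pt")[0]
    response_tensor = ppo_trainer.generate(
        query_tensor,
        return_prompt=False,               # keep only the newly generated tokens
        max_new_tokens=32,                 # illustrative generation kwargs
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(response_tensor[0])
```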

**log\_stats**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ppo_trainer.py#L1274)

```
(
  stats: dict,
  batch: dict,
  rewards: typing.List[torch.FloatTensor],
  columns_to_log: typing.List[str] = ['query', 'response']
)
```

Parameters

* **stats** (dict\[str, Any]) — A dictionary of training stats.
* **batch** (dict\[str, Any]) — A dictionary of batch data, this contains the queries and responses.
* **rewards** (`List[torch.FloatTensor]`) — A list of tensors containing the rewards.

A function that logs all the training stats. Call it at the end of each epoch.

**loss**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ppo_trainer.py#L1122)

```
(
  old_logprobs: FloatTensor,
  values: FloatTensor,
  logits: FloatTensor,
  vpreds: FloatTensor,
  logprobs: FloatTensor,
  mask: LongTensor,
  advantages: FloatTensor,
  returns: FloatTensor
)
```

Parameters

* **old\_logprobs** (`torch.FloatTensor`) — Log probabilities of the model at sampling time, shape (`batch_size`, `response_length`)
* **values** (`torch.FloatTensor`) — Values of the value head at sampling time, shape (`batch_size`, `response_length`)
* **logits** (`torch.FloatTensor`) — Logits of the model, shape (`batch_size`, `response_length`, `vocab_size`)
* **vpreds** (`torch.FloatTensor`) — Current values of the value head, shape (`batch_size`, `response_length`)
* **logprobs** (`torch.FloatTensor`) — Current log probabilities of the model, shape (`batch_size`, `response_length`)
* **mask** (`torch.LongTensor`) — Mask of the valid response tokens, shape (`batch_size`, `response_length`)
* **advantages** (`torch.FloatTensor`) — Advantages of the responses, shape (`batch_size`, `response_length`)
* **returns** (`torch.FloatTensor`) — Returns of the responses, shape (`batch_size`, `response_length`)

Calculate policy and value losses.
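For intuition, the policy term is the standard PPO clipped surrogate and the value term is a clipped squared error against the returns. A scalar, single-token sketch (the actual implementation operates on masked tensors and also tracks entropy and KL statistics):

```python
import math

def ppo_losses(old_logprob, logprob, advantage, value_old, vpred, ret,
               cliprange=0.2, cliprange_value=0.2, vf_coef=0.1):
    # Policy: clipped probability-ratio surrogate (maximized, so negated here).
    ratio = math.exp(logprob - old_logprob)
    pg_loss = max(-advantage * ratio,
                  -advantage * min(max(ratio, 1 - cliprange), 1 + cliprange))

    # Value: squared error on both the clipped and unclipped predictions.
    vpred_clipped = min(max(vpred, value_old - cliprange_value),
                        value_old + cliprange_value)
    vf_loss = 0.5 * max((vpred - ret) ** 2, (vpred_clipped - ret) ** 2)

    return pg_loss, vf_loss, pg_loss + vf_coef * vf_loss
```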

**prepare\_dataloader**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ppo_trainer.py#L362)

```
(
  dataset: typing.Union[torch.utils.data.dataset.Dataset, datasets.arrow_dataset.Dataset],
  data_collator = None
) → torch.utils.data.DataLoader
```

Parameters

* **dataset** (Union\[`torch.utils.data.Dataset`, `datasets.Dataset`]) — PyTorch dataset or BOINC AI dataset. If a BOINC AI dataset is passed, the dataset will be preprocessed by removing the columns that are not used by the model.
* **data\_collator** (Optional\[function]) — Data collator function.

Returns

`torch.utils.data.DataLoader`

PyTorch dataloader

Prepare the dataloader for training.

**record\_step\_stats**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ppo_trainer.py#L1211)

```
( kl_coef: float, **data ) → stats (dict)
```

Parameters

* **kl\_coef** (`float`) — KL coefficient
* **data** (`dict`) — Dictionary of training step data

Returns

stats (`dict`)

Dictionary of training step statistics

Record training step statistics.

**step**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ppo_trainer.py#L579)

```
(
  queries: typing.List[torch.LongTensor],
  responses: typing.List[torch.LongTensor],
  scores: typing.List[torch.FloatTensor],
  response_masks: typing.Optional[typing.List[torch.LongTensor]] = None
) → dict[str, Any]
```

Parameters

* **queries** (`List[torch.LongTensor]`) — List of tensors containing the encoded queries of shape (`query_length`)
* **responses** (`List[torch.LongTensor]`) — List of tensors containing the encoded responses of shape (`response_length`)
* **scores** (`List[torch.FloatTensor]`) — List of tensors containing the scores.
* **response\_masks** (`List[torch.LongTensor]`, *optional*) — List of tensors containing masks of the response tokens.

Returns

`dict[str, Any]`

A summary of the training statistics

Run a PPO optimisation step given a list of queries, model responses, and rewards.
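Putting the pieces together, a sketch of the outer training loop as used in the TRL examples (`reward_fn` is a placeholder for your reward model; the generation kwargs are illustrative):

```python
# Sketch: one pass over the dataloader, generating responses, scoring them,
# and running a PPO step per batch. Non-executed outline.
def ppo_epoch(ppo_trainer, tokenizer, reward_fn, generation_kwargs):
    import torch

    for batch in ppo_trainer.dataloader:
        query_tensors = batch["input_ids"]

        # Generate one response per query (only new tokens are kept).
        response_tensors = ppo_trainer.generate(
            query_tensors, return_prompt=False, **generation_kwargs
        )
        batch["response"] = tokenizer.batch_decode(response_tensors)

        # Score with an external reward model; one scalar tensor per sample.
        scores = [torch.tensor(reward_fn(q, r))
                  for q, r in zip(batch["query"], batch["response"])]

        # PPO optimization step, then logging.
        stats = ppo_trainer.step(query_tensors, response_tensors, scores)
        ppo_trainer.log_stats(stats, batch, scores)
```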

**train\_minibatch**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ppo_trainer.py#L1003)

```
(
  old_logprobs: FloatTensor,
  values: FloatTensor,
  logprobs: FloatTensor,
  logits: FloatTensor,
  vpreds: FloatTensor,
  mask: LongTensor,
  advantages: FloatTensor,
  returns: FloatTensor
) → train_stats (dict[str, torch.Tensor])
```

Parameters

* **old\_logprobs** (`torch.FloatTensor`) — Log probabilities of the model at sampling time, shape (`batch_size`, `response_length`)
* **values** (`torch.FloatTensor`) — Values of the value head at sampling time, shape (`batch_size`, `response_length`)
* **logprobs** (`torch.FloatTensor`) — Current log probabilities of the model, shape (`batch_size`, `response_length`)
* **logits** (`torch.FloatTensor`) — Logits of the model, shape (`batch_size`, `response_length`, `vocab_size`)
* **vpreds** (`torch.FloatTensor`) — Current values of the value head, shape (`batch_size`, `response_length`)
* **mask** (`torch.LongTensor`) — Mask of the valid response tokens, shape (`batch_size`, `response_length`)
* **advantages** (`torch.FloatTensor`) — Advantages of the responses, shape (`batch_size`, `response_length`)
* **returns** (`torch.FloatTensor`) — Returns of the responses, shape (`batch_size`, `response_length`)

Returns

train\_stats (dict\[str, `torch.Tensor`])

Dictionary of training statistics

Train one PPO minibatch.

### RewardConfig

#### class trl.RewardConfig

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/training_configs.py#L23)

```
(
  output_dir: str, overwrite_output_dir: bool = False, do_train: bool = False,
  do_eval: bool = False, do_predict: bool = False,
  evaluation_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'no',
  prediction_loss_only: bool = False, per_device_train_batch_size: int = 8,
  per_device_eval_batch_size: int = 8, per_gpu_train_batch_size: typing.Optional[int] = None,
  per_gpu_eval_batch_size: typing.Optional[int] = None, gradient_accumulation_steps: int = 1,
  eval_accumulation_steps: typing.Optional[int] = None, eval_delay: typing.Optional[float] = 0,
  learning_rate: float = 5e-05, weight_decay: float = 0.0, adam_beta1: float = 0.9,
  adam_beta2: float = 0.999, adam_epsilon: float = 1e-08, max_grad_norm: float = 1.0,
  num_train_epochs: float = 3.0, max_steps: int = -1,
  lr_scheduler_type: typing.Union[transformers.trainer_utils.SchedulerType, str] = 'linear',
  warmup_ratio: float = 0.0, warmup_steps: int = 0, log_level: typing.Optional[str] = 'passive',
  log_level_replica: typing.Optional[str] = 'warning', log_on_each_node: bool = True,
  logging_dir: typing.Optional[str] = None,
  logging_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps',
  logging_first_step: bool = False, logging_steps: float = 500, logging_nan_inf_filter: bool = True,
  save_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps',
  save_steps: float = 500, save_total_limit: typing.Optional[int] = None,
  save_safetensors: typing.Optional[bool] = False, save_on_each_node: bool = False,
  no_cuda: bool = False, use_cpu: bool = False, use_mps_device: bool = False, seed: int = 42,
  data_seed: typing.Optional[int] = None, jit_mode_eval: bool = False, use_ipex: bool = False,
  bf16: bool = False, fp16: bool = False, fp16_opt_level: str = 'O1',
  half_precision_backend: str = 'auto', bf16_full_eval: bool = False, fp16_full_eval: bool = False,
  tf32: typing.Optional[bool] = None, local_rank: int = -1,
  ddp_backend: typing.Optional[str] = None, tpu_num_cores: typing.Optional[int] = None,
  tpu_metrics_debug: bool = False,
  debug: typing.Union[str, typing.List[transformers.debug_utils.DebugOption]] = '',
  dataloader_drop_last: bool = False, eval_steps: typing.Optional[float] = None,
  dataloader_num_workers: int = 0, past_index: int = -1, run_name: typing.Optional[str] = None,
  disable_tqdm: typing.Optional[bool] = None, remove_unused_columns: typing.Optional[bool] = True,
  label_names: typing.Optional[typing.List[str]] = None,
  load_best_model_at_end: typing.Optional[bool] = False,
  metric_for_best_model: typing.Optional[str] = None,
  greater_is_better: typing.Optional[bool] = None, ignore_data_skip: bool = False,
  fsdp: typing.Union[typing.List[transformers.trainer_utils.FSDPOption], str, NoneType] = '',
  fsdp_min_num_params: int = 0, fsdp_config: typing.Optional[str] = None,
  fsdp_transformer_layer_cls_to_wrap: typing.Optional[str] = None,
  deepspeed: typing.Optional[str] = None, label_smoothing_factor: float = 0.0,
  optim: typing.Union[transformers.training_args.OptimizerNames, str] = 'adamw_torch',
  optim_args: typing.Optional[str] = None, adafactor: bool = False, group_by_length: bool = False,
  length_column_name: typing.Optional[str] = 'length',
  report_to: typing.Optional[typing.List[str]] = None,
  ddp_find_unused_parameters: typing.Optional[bool] = None,
  ddp_bucket_cap_mb: typing.Optional[int] = None,
  ddp_broadcast_buffers: typing.Optional[bool] = None, dataloader_pin_memory: bool = True,
  skip_memory_metrics: bool = True, use_legacy_prediction_loop: bool = False,
  push_to_hub: bool = False, resume_from_checkpoint: typing.Optional[str] = None,
  hub_model_id: typing.Optional[str] = None,
  hub_strategy: typing.Union[transformers.trainer_utils.HubStrategy, str] = 'every_save',
  hub_token: typing.Optional[str] = None, hub_private_repo: bool = False,
  hub_always_push: bool = False, gradient_checkpointing: typing.Optional[bool] = True,
  include_inputs_for_metrics: bool = False, fp16_backend: str = 'auto',
  push_to_hub_model_id: typing.Optional[str] = None,
  push_to_hub_organization: typing.Optional[str] = None,
  push_to_hub_token: typing.Optional[str] = None, mp_parameters: str = '',
  auto_find_batch_size: bool = False, full_determinism: bool = False,
  torchdynamo: typing.Optional[str] = None, ray_scope: typing.Optional[str] = 'last',
  ddp_timeout: typing.Optional[int] = 1800, torch_compile: bool = False,
  torch_compile_backend: typing.Optional[str] = None,
  torch_compile_mode: typing.Optional[str] = None,
  dispatch_batches: typing.Optional[bool] = None,
  include_tokens_per_second: typing.Optional[bool] = False,
  max_length: typing.Optional[int] = None
)
```

Parameters

* **max\_length** (`int`, *optional*, defaults to `None`) — The maximum length of the sequences in the batch. This argument is required if you want to use the default data collator.
* **gradient\_checkpointing** (`bool`, *optional*, defaults to `True`) — If True, use gradient checkpointing to save memory at the expense of slower backward pass.

RewardConfig collects all training arguments related to the [RewardTrainer](https://huggingface.co/docs/trl/v0.7.2/en/reward_trainer#trl.RewardTrainer) class.

Using `HfArgumentParser` we can turn this class into [argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the command line.
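For example, a sketch of parsing a `RewardConfig` from the command line (assumes `trl` and `transformers` are installed; the flags shown are illustrative):

```python
# Sketch: expose RewardConfig fields as command-line flags via HfArgumentParser.
def parse_reward_config(argv):
    from transformers import HfArgumentParser
    from trl import RewardConfig

    parser = HfArgumentParser(RewardConfig)
    (config,) = parser.parse_args_into_dataclasses(args=argv)
    return config

# e.g. parse_reward_config(["--output_dir", "reward_model", "--max_length", "512"])
```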

### RewardTrainer

#### class trl.RewardTrainer

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/reward_trainer.py#L35)

```
(
  model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module] = None,
  args: typing.Optional[trl.trainer.training_configs.RewardConfig] = None,
  data_collator: typing.Optional[DataCollator] = None,
  train_dataset: typing.Optional[datasets.arrow_dataset.Dataset] = None,
  eval_dataset: typing.Union[datasets.arrow_dataset.Dataset, typing.Dict[str, datasets.arrow_dataset.Dataset], NoneType] = None,
  tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None,
  model_init: typing.Union[typing.Callable[[], transformers.modeling_utils.PreTrainedModel], NoneType] = None,
  compute_metrics: typing.Union[typing.Callable[[transformers.trainer_utils.EvalPrediction], typing.Dict], NoneType] = None,
  callbacks: typing.Optional[typing.List[transformers.trainer_callback.TrainerCallback]] = None,
  optimizers: typing.Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None),
  preprocess_logits_for_metrics: typing.Union[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor], NoneType] = None,
  max_length: typing.Optional[int] = None,
  peft_config: typing.Optional[typing.Dict] = None
)
```

The RewardTrainer can be used to train your custom Reward Model. It is a subclass of the `transformers.Trainer` class and inherits all of its attributes and methods. It is recommended to use an `AutoModelForSequenceClassification` as the reward model. The reward model should be trained on a dataset of paired examples, where each example is a tuple of two sequences. The reward model should be trained to predict which example in the pair is more relevant to the task at hand.

The reward trainer expects a very specific format for the dataset. The dataset should contain at least four entries if you don’t use the default `RewardDataCollatorWithPadding` data collator. The entries should be named

* `input_ids_chosen`
* `attention_mask_chosen`
* `input_ids_rejected`
* `attention_mask_rejected`

Optionally, you can also pass a `margin` entry to the dataset. This entry should contain the margin used to modulate the loss of the reward model as outlined in <https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/>. If you don’t pass a margin, no margin will be used.
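A sketch of the expected preprocessing, with a toy whitespace tokenizer standing in for a real `transformers` tokenizer:

```python
def to_reward_example(tokenize, chosen: str, rejected: str, margin=None):
    """Map a (chosen, rejected) pair to the entries RewardTrainer expects.

    `tokenize` stands in for a real tokenizer and must return a dict with
    'input_ids' and 'attention_mask'.
    """
    chosen_enc, rejected_enc = tokenize(chosen), tokenize(rejected)
    example = {
        "input_ids_chosen": chosen_enc["input_ids"],
        "attention_mask_chosen": chosen_enc["attention_mask"],
        "input_ids_rejected": rejected_enc["input_ids"],
        "attention_mask_rejected": rejected_enc["attention_mask"],
    }
    if margin is not None:
        example["margin"] = margin  # optional, modulates the loss (Llama 2 style)
    return example

# Toy whitespace "tokenizer" for illustration only.
def toy_tokenize(text):
    ids = [hash(w) % 1000 for w in text.split()]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}
```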

### SFTTrainer

#### class trl.SFTTrainer

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/sft_trainer.py#L42)

```
(
  model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module, str] = None,
  args: TrainingArguments = None,
  data_collator: typing.Optional[DataCollator] = None,
  train_dataset: typing.Optional[datasets.arrow_dataset.Dataset] = None,
  eval_dataset: typing.Union[datasets.arrow_dataset.Dataset, typing.Dict[str, datasets.arrow_dataset.Dataset], NoneType] = None,
  tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None,
  model_init: typing.Union[typing.Callable[[], transformers.modeling_utils.PreTrainedModel], NoneType] = None,
  compute_metrics: typing.Union[typing.Callable[[transformers.trainer_utils.EvalPrediction], typing.Dict], NoneType] = None,
  callbacks: typing.Optional[typing.List[transformers.trainer_callback.TrainerCallback]] = None,
  optimizers: typing.Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None),
  preprocess_logits_for_metrics: typing.Union[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor], NoneType] = None,
  peft_config: typing.Optional[typing.Dict] = None,
  dataset_text_field: typing.Optional[str] = None,
  packing: typing.Optional[bool] = False,
  formatting_func: typing.Optional[typing.Callable] = None,
  max_seq_length: typing.Optional[int] = None,
  infinite: typing.Optional[bool] = False,
  num_of_sequences: typing.Optional[int] = 1024,
  chars_per_token: typing.Optional[float] = 3.6,
  dataset_num_proc: typing.Optional[int] = None,
  dataset_batch_size: int = 1000
)
```

Parameters

* **model** (Union\[`transformers.PreTrainedModel`, `nn.Module`, `str`]) — The model to train, can be a `PreTrainedModel`, a `torch.nn.Module` or a string with the model name to load from cache or download. The model can be also converted to a `PeftModel` if a `PeftConfig` object is passed to the `peft_config` argument.
* **args** (Optional[transformers.TrainingArguments](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/trainer#transformers.TrainingArguments)) — The arguments to tweak for training. Please refer to the official documentation of `transformers.TrainingArguments` for more information.
* **data\_collator** (Optional\[`transformers.DataCollator`]) — The data collator to use for training.
* **train\_dataset** (Optional[datasets.Dataset](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Dataset)) — The dataset to use for training. We recommend users to use `trl.trainer.ConstantLengthDataset` to create their dataset.
* **eval\_dataset** (Optional\[Union\[`datasets.Dataset`, Dict\[`str`, `datasets.Dataset`]]]) — The dataset to use for evaluation. We recommend users to use `trl.trainer.ConstantLengthDataset` to create their dataset.
* **tokenizer** (Optional[transformers.PreTrainedTokenizer](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizer)) — The tokenizer to use for training. If not specified, the tokenizer associated to the model will be used.
* **model\_init** (`Callable[[], transformers.PreTrainedModel]`) — The model initializer to use for training. If None is specified, the default model initializer will be used.
* **compute\_metrics** (`Callable[[transformers.EvalPrediction], Dict]`, *optional*, defaults to `compute_accuracy`) — The metrics to use for evaluation. If no metrics are specified, the default metric (`compute_accuracy`) will be used.
* **callbacks** (`List[transformers.TrainerCallback]`) — The callbacks to use for training.
* **optimizers** (`Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]`) — The optimizer and scheduler to use for training.
* **preprocess\_logits\_for\_metrics** (`Callable[[torch.Tensor, torch.Tensor], torch.Tensor]`) — The function to use to preprocess the logits before computing the metrics.
* **peft\_config** (`Optional[PeftConfig]`) — The PeftConfig object to use to initialize the PeftModel.
* **dataset\_text\_field** (`Optional[str]`) — The name of the text field of the dataset, in case this is passed by a user, the trainer will automatically create a `ConstantLengthDataset` based on the `dataset_text_field` argument.
* **formatting\_func** (`Optional[Callable]`) — The formatting function to be used for creating the `ConstantLengthDataset`.
* **max\_seq\_length** (`Optional[int]`) — The maximum sequence length to use for the `ConstantLengthDataset` and for automatically creating the dataset. Defaults to `512`.
* **infinite** (`Optional[bool]`) — Whether to use an infinite dataset or not. Defaults to `False`.
* **num\_of\_sequences** (`Optional[int]`) — The number of sequences to use for the `ConstantLengthDataset`. Defaults to `1024`.
* **chars\_per\_token** (`Optional[float]`) — The number of characters per token to use for the `ConstantLengthDataset`. Defaults to `3.6`. You can check how this is computed in the stack-llama example: [https://github.com/huggingface/trl/blob/08f550674c553c36c51d1027613c29f14f3676a5/examples/stack\_llama/scripts/supervised\_finetuning.py#L53](https://github.com/huggingface/trl/blob/08f550674c553c36c51d1027613c29f14f3676a5/examples/stack_llama/scripts/supervised_finetuning.py#L53).
* **packing** (`Optional[bool]`) — Used only in case `dataset_text_field` is passed. This argument is used by the `ConstantLengthDataset` to pack the sequences of the dataset.
* **dataset\_num\_proc** (`Optional[int]`) — The number of workers to use to tokenize the data. Only used when `packing=False`. Defaults to None.
* **dataset\_batch\_size** (`int`) — The number of examples to tokenize per batch. If batch\_size <= 0 or batch\_size == None, tokenize the full dataset as a single batch. Defaults to 1000.

Class definition of the Supervised Finetuning Trainer (SFT Trainer). This class is a wrapper around the `transformers.Trainer` class and inherits all of its attributes and methods. The trainer takes care of properly initializing the PeftModel in case a user passes a `PeftConfig` object.
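A minimal usage sketch (the model name, `text` column, and hyperparameters are illustrative):

```python
# Sketch: supervised fine-tuning on a dataset with a "text" column.
# Non-executed outline; names are illustrative.
def build_sft_trainer(dataset, model_name: str = "facebook/opt-350m"):
    from trl import SFTTrainer

    return SFTTrainer(
        model=model_name,            # a string works; the model is loaded for you
        train_dataset=dataset,
        dataset_text_field="text",   # triggers ConstantLengthDataset creation
        max_seq_length=512,
        packing=True,                # pack short examples into full sequences
    )
```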

### DPOTrainer

#### class trl.DPOTrainer

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/dpo_trainer.py#L46)

```
(
  model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module] = None,
  ref_model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module, NoneType] = None,
  beta: float = 0.1,
  args: TrainingArguments = None,
  data_collator: typing.Optional[DataCollator] = None,
  label_pad_token_id: int = -100,
  padding_value: int = 0,
  truncation_mode: str = 'keep_end',
  train_dataset: typing.Optional[datasets.arrow_dataset.Dataset] = None,
  eval_dataset: typing.Union[datasets.arrow_dataset.Dataset, typing.Dict[str, datasets.arrow_dataset.Dataset], NoneType] = None,
  tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None,
  model_init: typing.Union[typing.Callable[[], transformers.modeling_utils.PreTrainedModel], NoneType] = None,
  callbacks: typing.Optional[typing.List[transformers.trainer_callback.TrainerCallback]] = None,
  optimizers: typing.Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None),
  preprocess_logits_for_metrics: typing.Union[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor], NoneType] = None,
  max_length: typing.Optional[int] = None,
  max_prompt_length: typing.Optional[int] = None,
  max_target_length: typing.Optional[int] = None,
  peft_config: typing.Optional[typing.Dict] = None,
  is_encoder_decoder: typing.Optional[bool] = None,
  disable_dropout: bool = True,
  generate_during_eval: bool = False,
  compute_metrics: typing.Union[typing.Callable[[transformers.trainer_utils.EvalLoopOutput], typing.Dict], NoneType] = None
)
```

Parameters

* **model** (`transformers.PreTrainedModel`) — The model to train, preferably an `AutoModelForSequenceClassification`.
* **ref\_model** (`PreTrainedModelWrapper`) — BOINC AI transformer model with a causal language modeling head. Used for implicit reward computation and loss. If no reference model is provided, the trainer will create a reference model with the same architecture as the model to be optimized.
* **beta** (`float`, defaults to 0.1) — The beta factor in DPO loss. Higher beta means less divergence from the initial policy.
* **args** (`transformers.TrainingArguments`) — The arguments to use for training.
* **data\_collator** (`transformers.DataCollator`) — The data collator to use for training. If None is specified, the default data collator (`DPODataCollatorWithPadding`) will be used which will pad the sequences to the maximum length of the sequences in the batch, given a dataset of paired sequences.
* **label\_pad\_token\_id** (`int`, defaults to `-100`) — The label pad token id. This argument is required if you want to use the default data collator.
* **padding\_value** (`int`, defaults to `0`) — The padding value. This argument is required if you want to use the default data collator.
* **truncation\_mode** (`str`, defaults to `keep_end`) — The truncation mode to use, either `keep_end` or `keep_start`. This argument is required if you want to use the default data collator.
* **train\_dataset** (`datasets.Dataset`) — The dataset to use for training.
* **eval\_dataset** (`datasets.Dataset`) — The dataset to use for evaluation.
* **tokenizer** (`transformers.PreTrainedTokenizerBase`) — The tokenizer to use for training. This argument is required if you want to use the default data collator.
* **model\_init** (`Callable[[], transformers.PreTrainedModel]`) — The model initializer to use for training. If None is specified, the default model initializer will be used.
* **callbacks** (`List[transformers.TrainerCallback]`) — The callbacks to use for training.
* **optimizers** (`Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]`) — The optimizer and scheduler to use for training.
* **preprocess\_logits\_for\_metrics** (`Callable[[torch.Tensor, torch.Tensor], torch.Tensor]`) — The function to use to preprocess the logits before computing the metrics.
* **max\_length** (`int`, defaults to `None`) — The maximum length of the sequences in the batch. This argument is required if you want to use the default data collator.
* **max\_prompt\_length** (`int`, defaults to `None`) — The maximum length of the prompt. This argument is required if you want to use the default data collator.
* **max\_target\_length** (`int`, defaults to `None`) — The maximum length of the target. This argument is required if you want to use the default data collator and your model is an encoder-decoder.
* **peft\_config** (`Dict`, defaults to `None`) — The PEFT configuration to use for training. If you pass a PEFT configuration, the model will be wrapped in a PEFT model.
* **is\_encoder\_decoder** (`Optional[bool]`, `optional`, defaults to `None`) — If no model is provided, we need to know whether `model_init` returns an encoder-decoder model.
* **disable\_dropout** (`bool`, defaults to `True`) — Whether or not to disable dropouts in `model` and `ref_model`.
* **generate\_during\_eval** (`bool`, defaults to `False`) — Whether to sample and log generations during evaluation step.
* **compute\_metrics** (`Callable[[EvalPrediction], Dict]`, *optional*) — The function to use to compute the metrics. Must take an `EvalPrediction` and return a dictionary mapping strings to metric values.

Initialize DPOTrainer.

**concatenated\_forward**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/dpo_trainer.py#L399)

( model: Module, batch: typing.Dict\[str, typing.Union\[typing.List, torch.LongTensor]] )

Run the given model on the given batch of inputs, concatenating the chosen and rejected inputs together.

We do this to avoid doing two forward passes, because it’s faster for FSDP.

**concatenated\_inputs**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/dpo_trainer.py#L289)

( batch: typing.Dict\[str, typing.Union\[typing.List, torch.LongTensor]] )

Concatenate the chosen and rejected inputs into a single tensor.
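
A plain-list sketch of the idea (the key names and `pad_token_id` are illustrative; the real method operates on padded torch tensors and also handles labels and attention masks):

```python
def concatenated_inputs(batch, pad_token_id=0):
    """Stack chosen and rejected sequences along the batch axis,
    padding every sequence to the longest one (plain-list sketch)."""
    chosen = batch["chosen_input_ids"]
    rejected = batch["rejected_input_ids"]
    max_len = max(len(seq) for seq in chosen + rejected)

    def pad(seq):
        return seq + [pad_token_id] * (max_len - len(seq))

    # Chosen examples come first, rejected examples follow.
    return {"concatenated_input_ids": [pad(seq) for seq in chosen + rejected]}
```

The resulting batch is twice as large, which is what lets `concatenated_forward` score both responses in a single pass.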

**dpo\_loss**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/dpo_trainer.py#L328)

( policy\_chosen\_logps: FloatTensor, policy\_rejected\_logps: FloatTensor, reference\_chosen\_logps: FloatTensor, reference\_rejected\_logps: FloatTensor, reference\_free: bool = False ) → A tuple of three tensors

Returns

A tuple of three tensors

(losses, chosen\_rewards, rejected\_rewards). The losses tensor contains the DPO loss for each example in the batch. The chosen\_rewards and rejected\_rewards tensors contain the rewards for the chosen and rejected responses, respectively.

Compute the DPO loss for a batch of policy and reference model log probabilities.
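
In TRL the temperature `beta` lives on the trainer rather than in this signature; the sketch below exposes it as an argument for illustration. A minimal plain-float sketch of the loss and the implied rewards:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps,
             beta=0.1, reference_free=False):
    """Per-example DPO loss plus implied rewards (plain-float sketch)."""
    losses, chosen_rewards, rejected_rewards = [], [], []
    for pc, pr, rc, rr in zip(policy_chosen_logps, policy_rejected_logps,
                              reference_chosen_logps, reference_rejected_logps):
        pi_logratio = pc - pr
        # With reference_free=True the reference model is treated as uniform.
        ref_logratio = 0.0 if reference_free else rc - rr
        logits = pi_logratio - ref_logratio
        losses.append(-math.log(sigmoid(beta * logits)))      # -logsigmoid
        chosen_rewards.append(beta * (pc - rc))
        rejected_rewards.append(beta * (pr - rr))
    return losses, chosen_rewards, rejected_rewards
```

When the policy prefers the chosen response more strongly than the reference does, `logits` is positive and the loss drops below `log(2)`.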

**evaluation\_loop**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/dpo_trainer.py#L590)

( dataloader: DataLoader, description: str, prediction\_loss\_only: typing.Optional\[bool] = None, ignore\_keys: typing.Optional\[typing.List\[str]] = None, metric\_key\_prefix: str = 'eval' )

Overriding built-in evaluation loop to store metrics for each batch. Prediction/evaluation loop, shared by `Trainer.evaluate()` and `Trainer.predict()`.

Works both with or without labels.

**get\_batch\_metrics**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/dpo_trainer.py#L437)

( model, batch: typing.Dict\[str, typing.Union\[typing.List, torch.LongTensor]], train\_eval: typing.Literal\['train', 'eval'] = 'train' )

Compute the DPO loss and other metrics for the given batch of inputs for train or test.

**get\_batch\_samples**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/dpo_trainer.py#L510)

( model, batch: typing.Dict\[str, torch.LongTensor] )

Generate samples from the model and reference model for the given batch of inputs.

**log**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/dpo_trainer.py#L639)

( logs: typing.Dict\[str, float] )

Parameters

* **logs** (`Dict[str, float]`) — The values to log.

Log `logs` on the various objects watching training, including stored metrics.

### DDPOConfig

#### class trl.DDPOConfig

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ddpo_config.py#L12)

( exp\_name: str = 'doc-buil', run\_name: typing.Optional\[str] = '', seed: int = 0, log\_with: typing.Union\[typing.Literal\['wandb', 'tensorboard'], NoneType] = None, tracker\_kwargs: dict = \<factory>, accelerator\_kwargs: dict = \<factory>, project\_kwargs: dict = \<factory>, tracker\_project\_name: str = 'trl', logdir: str = 'logs', num\_epochs: int = 100, save\_freq: int = 1, num\_checkpoint\_limit: int = 5, mixed\_precision: str = 'fp16', allow\_tf32: bool = True, resume\_from: typing.Optional\[str] = '', sample\_num\_steps: int = 50, sample\_eta: float = 1.0, sample\_guidance\_scale: float = 5.0, sample\_batch\_size: int = 1, sample\_num\_batches\_per\_epoch: int = 2, train\_batch\_size: int = 1, train\_use\_8bit\_adam: bool = False, train\_learning\_rate: float = 0.0003, train\_adam\_beta1: float = 0.9, train\_adam\_beta2: float = 0.999, train\_adam\_weight\_decay: float = 0.0001, train\_adam\_epsilon: float = 1e-08, train\_gradient\_accumulation\_steps: int = 1, train\_max\_grad\_norm: float = 1.0, train\_num\_inner\_epochs: int = 1, train\_cfg: bool = True, train\_adv\_clip\_max: float = 5, train\_clip\_range: float = 0.0001, train\_timestep\_fraction: float = 1.0, per\_prompt\_stat\_tracking: bool = False, per\_prompt\_stat\_tracking\_buffer\_size: int = 16, per\_prompt\_stat\_tracking\_min\_count: int = 16, async\_reward\_computation: bool = False, max\_workers: int = 2, negative\_prompts: typing.Optional\[str] = '' )

Configuration class for DDPOTrainer

### DDPOTrainer

#### class trl.DDPOTrainer

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ddpo_trainer.py#L34)

( config: DDPOConfig, reward\_function: typing.Callable\[\[torch.Tensor, typing.Tuple\[str], typing.Tuple\[typing.Any]], torch.Tensor], prompt\_function: typing.Callable\[\[], typing.Tuple\[str, typing.Any]], sd\_pipeline: DDPOStableDiffusionPipeline, image\_samples\_hook: typing.Union\[typing.Callable\[\[typing.Any, typing.Any, typing.Any], typing.Any], NoneType] = None )

Parameters

* **config** (`DDPOConfig`) — Configuration object for DDPOTrainer. Check the documentation of `PPOConfig` for more details.
* **reward\_function** (`Callable[[torch.Tensor, Tuple[str], Tuple[Any]], torch.Tensor]`) — Reward function to be used for training.
* **prompt\_function** (`Callable[[], Tuple[str, Any]]`) — Function that generates prompts to guide the model.
* **sd\_pipeline** (`DDPOStableDiffusionPipeline`) — Stable Diffusion pipeline to be used for training.
* **image\_samples\_hook** (`Optional[Callable[[Any, Any, Any], Any]]`) — Hook called to log images.

The DDPOTrainer uses Denoising Diffusion Policy Optimization to optimise diffusion models. Note that this trainer is heavily inspired by the work at <https://github.com/kvablack/ddpo-pytorch>. As of now, only Stable Diffusion based pipelines are supported.

**calculate\_loss**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ddpo_trainer.py#L311)

( latents, timesteps, next\_latents, log\_probs, advantages, embeds )

Parameters

* **latents** (torch.Tensor) — The latents sampled from the diffusion model, shape: \[batch\_size, num\_steps, …]
* **timesteps** (torch.Tensor) — The timesteps sampled from the diffusion model, shape: \[batch\_size]
* **next\_latents** (torch.Tensor) — The next latents sampled from the diffusion model, shape: \[batch\_size, num\_steps, …]
* **log\_probs** (torch.Tensor) — The log probabilities of the latents, shape: \[batch\_size]
* **advantages** (torch.Tensor) — The advantages of the latents, shape: \[batch\_size]
* **embeds** (torch.Tensor) — The embeddings of the prompts, shape: \[2\*batch\_size or batch\_size, …] Note: the “or” is because if train\_cfg is True, the expectation is that negative prompts are concatenated to the embeds

Calculate the loss for a batch of unpacked samples
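
At its core, `calculate_loss` optimises a PPO-style clipped surrogate objective (cf. `train_clip_range` in `DDPOConfig`). A dependency-free sketch of that objective on plain floats (the function name is illustrative; the real method operates on torch tensors and also clips advantages per `train_adv_clip_max`):

```python
import math

def clipped_ppo_loss(log_probs, old_log_probs, advantages, clip_range=1e-4):
    """Mean PPO-style clipped surrogate loss (plain-float sketch)."""
    losses = []
    for lp, old_lp, adv in zip(log_probs, old_log_probs, advantages):
        ratio = math.exp(lp - old_lp)          # importance weight
        unclipped = -adv * ratio
        clipped = -adv * min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        losses.append(max(unclipped, clipped))  # pessimistic (clipped) bound
    return sum(losses) / len(losses)
```

When the new policy drifts far from the sampling policy, the clipped branch dominates and the gradient through `ratio` is cut off, which is what keeps each inner epoch's update small.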

**step**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ddpo_trainer.py#L205)

( epoch: int, global\_step: int ) → global\_step (int)

Parameters

* **epoch** (int) — The current epoch.
* **global\_step** (int) — The current global step.

Returns

global\_step (int)

The updated global step.

Perform a single step of training.

Side Effects:

* Model weights are updated
* Logs the statistics to the accelerator trackers.
* If `self.image_samples_callback` is not None, it will be called with the prompt\_image\_pairs, global\_step, and the accelerator tracker.

**train**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/trainer/ddpo_trainer.py#L565)

( epochs: typing.Optional\[int] = None )

Train the model for a given number of epochs.

### set\_seed

**trl.set\_seed**

[\<source>](https://github.com/huggingface/trl/blob/v0.7.2/trl/core.py#L234)

( seed: int )

Parameters

* **seed** (`int`) — The seed to set.

Helper function for reproducible behavior to set the seed in `random`, `numpy`, and `torch`.
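
For example, seeding only the stdlib `random` component (the actual helper also seeds `numpy` and `torch`, which are omitted here to keep the sketch dependency-free):

```python
import random

def set_seed(seed: int) -> None:
    """Seed Python's `random` module for reproducible sampling.
    TRL's helper additionally seeds `numpy` and `torch`."""
    random.seed(seed)

set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
# Reseeding restores the exact same random stream, so first == second.
```

Call this once at the start of a run so that sampling, shuffling, and initialisation are repeatable across executions.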

