Quickstart
Fine-tuning a language model via PPO consists of roughly three steps:
Rollout: The language model generates a response or continuation based on a query, which could be, for example, the start of a sentence.
Evaluation: The query and response are evaluated with a function, model, human feedback, or some combination of them. The important thing is that this process should yield a scalar value for each query/response pair. The optimization will aim at maximizing this value.
Optimization: This is the most complex part. In the optimization step, the query/response pairs are used to calculate the log-probabilities of the tokens in the sequences. This is done with both the model being trained and a reference model, which is usually the pre-trained model before fine-tuning. The KL-divergence between the two outputs is used as an additional reward signal to make sure the generated responses don't deviate too far from the reference language model. The active language model is then trained with PPO.
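To make the role of the KL term in the optimization step more concrete, here is a minimal sketch of how a scalar score and a KL penalty could be combined into per-token rewards. It assumes the common single-sample KL estimate (the log-probability under the trained model minus the log-probability under the reference model), a hypothetical penalty coefficient `beta`, and that the scalar score is credited to the last token. The function name and all values are illustrative and are not taken from the library.

```python
def kl_shaped_rewards(logprobs, ref_logprobs, score, beta=0.1):
    """Combine a scalar score with a per-token KL penalty.

    logprobs     -- log-probabilities of the response tokens under the trained model
    ref_logprobs -- log-probabilities of the same tokens under the reference model
    score        -- scalar value from the evaluation step
    beta         -- hypothetical KL penalty coefficient (an assumption, not a library default)
    """
    if not logprobs or len(logprobs) != len(ref_logprobs):
        raise ValueError("expected two non-empty sequences of equal length")

    # Per-token KL estimate: positive when the trained model assigns more
    # probability to a token than the reference model does.
    rewards = [-beta * (lp - ref_lp) for lp, ref_lp in zip(logprobs, ref_logprobs)]

    # Assumption: the scalar score is added to the final token of the response.
    rewards[-1] += score
    return rewards


if __name__ == "__main__":
    # Made-up log-probabilities for a three-token response.
    print(kl_shaped_rewards([-1.0, -2.0, -0.5], [-1.5, -2.0, -1.0], score=1.0))
```

With this shaping, a response that drifts away from the reference model pays a penalty on every token where the two models disagree, so a high score alone is not enough to justify a large deviation.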
The full process is illustrated in the following figure:
The following code illustrates the steps above.
In general, you would run steps 3-6 in a for-loop over many diverse queries. You can find more realistic examples in the examples section.
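The shape of such a loop can be sketched with a toy example that needs no language model. Everything here is hypothetical: the "policy" is a softmax over three fixed candidate continuations, the reward function is hand-written, and the update is a plain REINFORCE-style policy-gradient step swapped in for PPO to keep the sketch short. It only shows how rollout, evaluation, and optimization repeat over many queries.

```python
import math
import random

# Hypothetical candidate continuations standing in for generated text.
RESPONSES = ["park", "office", "moon"]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rollout(logits, rng):
    """Rollout: sample the index of one response from the current policy."""
    return rng.choices(range(len(RESPONSES)), weights=softmax(logits), k=1)[0]


def evaluate(query, response):
    """Evaluation: a made-up scalar reward for a query/response pair."""
    return 1.0 if response == "park" else 0.0


def optimize(logits, action, reward, lr=0.5):
    """Optimization: one REINFORCE-style step (a stand-in for PPO, not PPO itself)."""
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


def train(queries, steps_per_query=50, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(RESPONSES)
    for query in queries:
        for _ in range(steps_per_query):
            action = rollout(logits, rng)
            reward = evaluate(query, RESPONSES[action])
            logits = optimize(logits, action, reward)
    return softmax(logits)


if __name__ == "__main__":
    print(train(["This morning I went to the "] * 4))
```

In a real setup each of the three functions would be replaced by the corresponding library call, and the queries would come from a dataset rather than a repeated string.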
After training an AutoModelForCausalLMWithValueHead, you can directly use the model in transformers.
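As a minimal sketch of what this could look like, the helper below loads a saved checkpoint with the standard `AutoModelForCausalLM` class from transformers. The helper name and the path `my-fine-tuned-model-ppo` are hypothetical, and the sketch assumes the trained model was previously written to that location (for example with `save_pretrained`).

```python
def load_for_inference(model_path):
    """Load a fine-tuned checkpoint as a plain causal language model.

    model_path -- local directory or Hub repository id (hypothetical in this sketch)
    """
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModelForCausalLM

    # Assumption: the checkpoint at model_path was saved after PPO training.
    return AutoModelForCausalLM.from_pretrained(model_path)


if __name__ == "__main__":
    model = load_for_inference("my-fine-tuned-model-ppo")  # hypothetical path
```

Loading the checkpoint this way gives a regular generation model; the value head used during training is not part of it.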
You can also load your model with AutoModelForCausalLMWithValueHead if you want to use the value head, for example to continue training.
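A sketch of this variant, under the assumption that the installed trl version exposes `AutoModelForCausalLMWithValueHead` at the package top level (its location may differ between releases). The helper name and the checkpoint path are hypothetical.

```python
def load_with_value_head(model_path):
    """Load a fine-tuned checkpoint together with its value head.

    model_path -- local directory or Hub repository id (hypothetical in this sketch)
    """
    # Imported lazily; assumes a trl release that still ships this class here.
    from trl import AutoModelForCausalLMWithValueHead

    return AutoModelForCausalLMWithValueHead.from_pretrained(model_path)


if __name__ == "__main__":
    model = load_with_value_head("my-fine-tuned-model-ppo")  # hypothetical path
```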