Customize the generation strategy
Text generation is essential to many NLP tasks, such as open-ended text generation, summarization, translation, and more. It also plays a role in a variety of mixed-modality applications that have text as an output like speech-to-text and vision-to-text. Some of the models that can generate text include GPT2, XLNet, OpenAI GPT, CTRL, TransformerXL, XLM, Bart, T5, GIT, Whisper.
Check out a few examples that use the generate() method to produce text outputs for different tasks:
Note that the inputs to the generate method depend on the model’s modality. They are returned by the model’s preprocessor class, such as AutoTokenizer or AutoProcessor. If a model’s preprocessor creates more than one kind of input, pass all the inputs to generate(). You can learn more about the individual model’s preprocessor in the corresponding model’s documentation.
The process of selecting output tokens to generate text is known as decoding, and you can customize the decoding strategy that the generate() method will use. Modifying a decoding strategy does not change the values of any trainable parameters. However, it can have a noticeable impact on the quality of the generated output. It can help reduce repetition in the text and make it more coherent.
This guide describes:
default generation configuration
common decoding strategies and their main parameters
saving and sharing custom generation configurations with your fine-tuned model on the Hub
Default text generation configuration
A decoding strategy for a model is defined in its generation configuration. When using pre-trained models for inference within a pipeline(), the models call the PreTrainedModel.generate() method that applies a default generation configuration under the hood. The default configuration is also used when no custom configuration has been saved with the model.
When you load a model explicitly, you can inspect the generation configuration that comes with it through model.generation_config.
Printing out the model.generation_config reveals only the values that are different from the default generation configuration, and does not list any of the default values.
The default generation configuration limits the size of the output combined with the input prompt to a maximum of 20 tokens to avoid running into resource limitations. The default decoding strategy is greedy search, the simplest decoding strategy: at each step it picks the token with the highest probability as the next token. For many tasks and small output sizes this works well. However, when used to generate longer outputs, greedy search can start producing highly repetitive results.
Customize text generation
You can override any generation_config setting by passing the parameters and their values directly to the generate method.
Even if the default decoding strategy mostly works for your task, you can still tweak a few things. Some of the commonly adjusted parameters include:
max_new_tokens: the maximum number of tokens to generate. In other words, the size of the output sequence, not including the tokens in the prompt. As an alternative to using the output’s length as a stopping criterion, you can choose to stop generation whenever the full generation exceeds some amount of time. To learn more, check StoppingCriteria.
num_beams: by specifying a number of beams higher than 1, you effectively switch from greedy search to beam search. This strategy evaluates several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with lower-probability initial tokens and would’ve been ignored by greedy search.
do_sample: if set to True, this parameter enables decoding strategies such as multinomial sampling, beam-search multinomial sampling, Top-K sampling and Top-p sampling. All these strategies select the next token from the probability distribution over the entire vocabulary with various strategy-specific adjustments.
num_return_sequences: the number of sequence candidates to return for each input. This option is only available for the decoding strategies that support multiple sequence candidates, e.g. variations of beam search and sampling. Decoding strategies like greedy search and contrastive search return a single output sequence.
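The Top-K and Top-p adjustments mentioned under do_sample can be illustrated with a minimal sketch. It assumes a made-up next-token distribution stored in a plain dictionary; the helper names top_k_filter and top_p_filter are hypothetical and are not part of the library, which applies the equivalent filtering internally.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches p, then renormalize."""
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}


# Toy next-token distribution (made-up numbers, not from a real model).
next_token_probs = {"the": 0.5, "a": 0.25, "dog": 0.125, "cat": 0.125}

top_k = top_k_filter(next_token_probs, k=2)    # keeps "the" and "a"
top_p = top_p_filter(next_token_probs, p=0.8)  # keeps "the", "a" and "dog"
print(top_k)
print(top_p)
```

Sampling then draws the next token from the filtered distribution instead of the full vocabulary.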
Save a custom decoding strategy with your model
If you would like to share your fine-tuned model with a specific generation configuration, you can:
Create a GenerationConfig class instance
Specify the decoding strategy parameters
Save your generation configuration with GenerationConfig.save_pretrained(), making sure to leave its config_file_name argument empty
Set push_to_hub to True to upload your config to the model’s repo
You can also store several generation configurations in a single directory, making use of the config_file_name argument in GenerationConfig.save_pretrained(). You can later instantiate them with GenerationConfig.from_pretrained(). This is useful if you want to store several generation configurations for a single model (e.g. one for creative text generation with sampling, and one for summarization with beam search). You must have the right Hub permissions to add configuration files to a model.
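As a rough sketch of storing two configurations side by side, the snippet below saves them to a temporary local directory rather than to the Hub, so no upload permissions are needed. The file names creative_config.json and summarization_config.json and the parameter values are illustrative assumptions, not recommended settings.

```python
import tempfile

from transformers import GenerationConfig

# One configuration for creative text generation with sampling (values are illustrative).
sampling_config = GenerationConfig(do_sample=True, top_k=50, temperature=0.7, max_new_tokens=64)
# One configuration for summarization with beam search (values are illustrative).
beam_config = GenerationConfig(num_beams=4, early_stopping=True, max_new_tokens=64)

with tempfile.TemporaryDirectory() as save_dir:
    # config_file_name lets several configurations live in the same directory.
    sampling_config.save_pretrained(save_dir, config_file_name="creative_config.json")
    beam_config.save_pretrained(save_dir, config_file_name="summarization_config.json")

    # Reload each configuration by its file name.
    reloaded_sampling = GenerationConfig.from_pretrained(save_dir, config_file_name="creative_config.json")
    reloaded_beam = GenerationConfig.from_pretrained(save_dir, config_file_name="summarization_config.json")

print(reloaded_sampling.do_sample, reloaded_sampling.top_k)
print(reloaded_beam.num_beams)
```

A reloaded configuration can then be passed to generate() through its generation_config argument.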
Streaming
The generate() method supports streaming through its streamer input. The streamer input is compatible with any instance of a class that has the following methods: put() and end(). Internally, put() is used to push new tokens and end() is used to flag the end of text generation.
The API for the streamer classes is still under development and may change in the future.
In practice, you can craft your own streaming class for all sorts of purposes! We also have basic streaming classes ready for you to use. For example, you can use the TextStreamer class to stream the output of generate() to your screen, one word at a time.
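To make the put() and end() contract concrete, here is a minimal sketch of a custom streaming class. TokenCollector and fake_generate are hypothetical names: fake_generate only stands in for generate() so the example runs without a model, and it pushes plain Python lists, whereas the real method is assumed to push tensors of token ids.

```python
class TokenCollector:
    """Hypothetical streamer that stores every chunk pushed to it."""

    def __init__(self):
        self.chunks = []
        self.finished = False

    def put(self, value):
        # Called each time new tokens become available.
        self.chunks.append(value)

    def end(self):
        # Called once, when text generation is over.
        self.finished = True


def fake_generate(token_ids, streamer):
    """Stand-in for generate(): pushes tokens one by one, then flags the end."""
    for token_id in token_ids:
        streamer.put([token_id])
    streamer.end()


collector = TokenCollector()
fake_generate([101, 7592, 2088], collector)  # arbitrary example token ids
print(collector.chunks, collector.finished)
```

Any object exposing these two methods could be passed as the streamer input in the same way.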
Decoding strategies
Certain combinations of the generate() parameters, and ultimately generation_config, can be used to enable specific decoding strategies. If you are new to this concept, we recommend reading this blog post that illustrates how common decoding strategies work.
Here, we’ll show some of the parameters that control the decoding strategies and illustrate how you can use them.
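As a quick orientation, the parameter combinations covered in this guide can be summarized in a small helper. describe_strategy is a hypothetical function written for this guide; it mirrors the rules stated in the sections that follow and is not the library's actual dispatch logic.

```python
def describe_strategy(num_beams=1, do_sample=False, num_beam_groups=1,
                      penalty_alpha=None, top_k=None):
    """Return the name of the decoding strategy that a parameter combination
    selects, following the rules described in this guide (illustrative only)."""
    uses_contrastive = (
        penalty_alpha is not None and penalty_alpha > 0
        and top_k is not None and top_k > 1
        and num_beams == 1 and not do_sample
    )
    if uses_contrastive:
        return "contrastive search"
    if num_beam_groups > 1:
        return "diverse beam search"
    if num_beams > 1:
        return "beam-search multinomial sampling" if do_sample else "beam search"
    return "multinomial sampling" if do_sample else "greedy search"


print(describe_strategy())                              # greedy search
print(describe_strategy(do_sample=True))                # multinomial sampling
print(describe_strategy(num_beams=4))                   # beam search
print(describe_strategy(num_beams=4, do_sample=True))   # beam-search multinomial sampling
print(describe_strategy(penalty_alpha=0.6, top_k=4))    # contrastive search
```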
Greedy Search
generate uses greedy search decoding by default so you don’t have to pass any parameters to enable it. This means that num_beams is set to 1 and do_sample is set to False.
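The idea behind greedy search can be sketched without a real model. The snippet assumes a toy "model" that maps the last token to a made-up next-token distribution; TOY_MODEL and greedy_search are hypothetical names used only for illustration.

```python
# Toy bigram "model": last token -> next-token probabilities (made-up numbers).
TOY_MODEL = {
    "<s>": {"I": 0.6, "We": 0.4},
    "I": {"like": 0.7, "am": 0.3},
    "We": {"like": 1.0},
    "like": {"dogs": 0.55, "cats": 0.45},
    "am": {"</s>": 1.0},
    "dogs": {"</s>": 1.0},
    "cats": {"</s>": 1.0},
}


def greedy_search(model, start="<s>", end="</s>", max_new_tokens=10):
    """At every step, append the single most probable next token."""
    tokens = [start]
    for _ in range(max_new_tokens):
        distribution = model[tokens[-1]]
        next_token = max(distribution, key=distribution.get)
        tokens.append(next_token)
        if next_token == end:
            break
    return tokens


greedy_output = greedy_search(TOY_MODEL)
print(greedy_output)  # ['<s>', 'I', 'like', 'dogs', '</s>']
```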
Contrastive search
The contrastive search decoding strategy was proposed in the 2022 paper A Contrastive Framework for Neural Text Generation. It demonstrates superior results for generating non-repetitive yet coherent long outputs. To learn how contrastive search works, check out this blog post. The two main parameters that enable and control the behavior of contrastive search are penalty_alpha and top_k.
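The roles of penalty_alpha and top_k can be illustrated with a single selection step. This sketch assumes the scoring rule from the paper: among the top_k most probable candidates, pick the one maximizing (1 - penalty_alpha) * probability - penalty_alpha * (maximum cosine similarity to the context representations). The probabilities and two-dimensional vectors are made up, and contrastive_step is a hypothetical helper; a real model would compute the candidate representations itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_step(probs, candidate_vectors, context_vectors, penalty_alpha=0.6, top_k=2):
    """Pick the next token among the top_k candidates, trading model confidence
    against similarity to what is already in the context."""
    candidates = sorted(probs, key=probs.get, reverse=True)[:top_k]

    def score(token):
        degeneration_penalty = max(cosine(candidate_vectors[token], h) for h in context_vectors)
        return (1 - penalty_alpha) * probs[token] - penalty_alpha * degeneration_penalty

    return max(candidates, key=score)


# Made-up next-token probabilities and token representations.
probs = {"dog": 0.5, "park": 0.4, "xylophone": 0.1}
candidate_vectors = {"dog": [1.0, 0.0], "park": [0.0, 1.0], "xylophone": [0.7, 0.7]}
context_vectors = [[1.0, 0.0]]  # the context already contains something "dog"-like

with_penalty = contrastive_step(probs, candidate_vectors, context_vectors, penalty_alpha=0.6, top_k=2)
without_penalty = contrastive_step(probs, candidate_vectors, context_vectors, penalty_alpha=0.0, top_k=2)
print(with_penalty, without_penalty)  # park dog
```

With penalty_alpha set to 0 the step reduces to picking the most probable candidate, which is why the second call behaves like greedy search.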
Multinomial sampling
As opposed to greedy search that always chooses a token with the highest probability as the next token, multinomial sampling (also called ancestral sampling) randomly selects the next token based on the probability distribution over the entire vocabulary given by the model. Every token with a non-zero probability has a chance of being selected, thus reducing the risk of repetition.
To enable multinomial sampling, set do_sample=True and num_beams=1.
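A minimal sketch of the sampling step, assuming a made-up next-token distribution. sample_next_token is a hypothetical helper; the temperature handling (raising each probability to the power 1 / temperature before renormalizing) is equivalent to dividing the logits by the temperature before the softmax.

```python
import random


def sample_next_token(probs, temperature=1.0, rng=random):
    """Draw one token at random according to the (temperature-adjusted) distribution."""
    tokens = list(probs)
    # random.choices normalizes the weights, so no explicit renormalization is needed.
    weights = [probs[token] ** (1.0 / temperature) for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is reproducible
probs = {"dogs": 0.5, "cats": 0.3, "birds": 0.2}  # made-up numbers

samples = [sample_next_token(probs, rng=rng) for _ in range(1000)]
counts = {token: samples.count(token) for token in probs}
print(counts)
```

Every token with a non-zero probability shows up in the counts, roughly in proportion to its probability, whereas greedy search would return "dogs" every time.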
Beam-search decoding
Unlike greedy search, beam-search decoding keeps several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with lower probability initial tokens and would’ve been ignored by the greedy search.
To enable this decoding strategy, set num_beams (the number of hypotheses to keep track of) to a value greater than 1.
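The following sketch shows how keeping several hypotheses can recover a sequence that greedy search misses. It assumes a toy bigram "model" with made-up probabilities; BEAM_TOY_MODEL and beam_search are hypothetical names, and the implementation omits refinements such as length penalties.

```python
import math

# Toy bigram "model": last token -> next-token probabilities (made-up numbers).
BEAM_TOY_MODEL = {
    "<s>": {"The": 0.6, "A": 0.4},
    "The": {"dog": 0.4, "cat": 0.3, "bird": 0.3},
    "A": {"dog": 0.9, "cat": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
    "bird": {"</s>": 1.0},
}


def beam_search(model, num_beams=2, start="<s>", end="</s>", max_new_tokens=10):
    """Keep the num_beams best partial sequences (by total log-probability) at each step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end:
                candidates.append((tokens, score))  # finished hypotheses carry over
                continue
            for token, p in model[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(tokens[-1] == end for tokens, _ in beams):
            break
    return beams[0]


beam_tokens, beam_score = beam_search(BEAM_TOY_MODEL, num_beams=2)
greedy_tokens, greedy_score = beam_search(BEAM_TOY_MODEL, num_beams=1)  # one beam behaves like greedy search
print(beam_tokens, round(math.exp(beam_score), 2))      # ['<s>', 'A', 'dog', '</s>'] 0.36
print(greedy_tokens, round(math.exp(greedy_score), 2))  # ['<s>', 'The', 'dog', '</s>'] 0.24
```

Here the best sequence starts with the less probable first token "A", so only the search with two beams finds it.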
Beam-search multinomial sampling
As the name implies, this decoding strategy combines beam search with multinomial sampling. To use it, set num_beams to a value greater than 1 and set do_sample=True.
Diverse beam search decoding
The diverse beam search decoding strategy is an extension of the beam search strategy that allows for generating a more diverse set of beam sequences to choose from. To learn how it works, refer to Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models. This approach has three main parameters: num_beams, num_beam_groups, and diversity_penalty. The diversity penalty ensures the outputs are distinct across groups, and beam search is used within each group.
This guide illustrates the main parameters that enable various decoding strategies. More advanced parameters exist for the generate method, which give you even further control over its behavior. For the complete list of the available parameters, refer to the API documentation.
Assisted Decoding
Assisted decoding is a modification of the decoding strategies above that uses an assistant model with the same tokenizer (ideally a much smaller model) to greedily generate a few candidate tokens. The main model then validates the candidate tokens in a single forward pass, which speeds up the decoding process. Currently, assisted decoding supports only greedy search and sampling, and it does not support batched inputs. To learn more about assisted decoding, check this blog post.
To enable assisted decoding, pass a model as the assistant_model argument.
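The draft-then-validate loop can be sketched with two toy "models" represented as plain functions that return the next token for a sequence. All names here (main_next, draft_next, assisted_greedy, plain_greedy) are hypothetical, and the sketch is a simplification: it checks candidates one at a time, whereas a real main model is assumed to score all candidates in a single forward pass.

```python
def main_next(tokens):
    """Toy main model: always counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Toy assistant: agrees with the main model except right after token 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


def plain_greedy(next_fn, prompt, max_new_tokens):
    """Reference: ordinary greedy decoding, one token per model call."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_fn(tokens))
    return tokens


def assisted_greedy(main_fn, draft_fn, prompt, max_new_tokens, num_candidates=3):
    """Assistant drafts num_candidates tokens; the main model keeps the matching
    prefix and replaces the first mismatch with its own token."""
    tokens = list(prompt)
    target_length = len(prompt) + max_new_tokens
    rounds = 0  # number of validation rounds (one main-model pass each, in a real setup)
    while len(tokens) < target_length:
        rounds += 1
        draft = list(tokens)
        for _ in range(num_candidates):
            draft.append(draft_fn(draft))
        for candidate in draft[len(tokens):]:
            expected = main_fn(tokens)
            tokens.append(expected)  # always keep the main model's choice
            if expected != candidate or len(tokens) == target_length:
                break
    return tokens, rounds


assisted_tokens, rounds = assisted_greedy(main_next, draft_next, prompt=[0], max_new_tokens=8)
print(assisted_tokens, rounds)  # [0, 1, 2, 3, 4, 5, 6, 7, 8] 3
```

The output matches ordinary greedy decoding with the main model, but the eight new tokens are produced in three validation rounds instead of eight sequential steps.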
When using assisted decoding with sampling methods, you can use the temperature argument to control the randomness just like in multinomial sampling. However, in assisted decoding, reducing the temperature also helps improve latency.