Generation with LLMs
LLMs, or Large Language Models, are the key component behind text generation. In a nutshell, they consist of large pretrained transformer models trained to predict the next word (or, more precisely, token) given some input text. Since they predict one token at a time, you need to do something more elaborate to generate new sentences other than just calling the model: you need to do autoregressive generation.
Autoregressive generation is the inference-time procedure of iteratively calling a model with its own generated outputs, given a few initial inputs. In 🤗 Transformers, this is handled by the generate() method, which is available to all models with generative capabilities.
This tutorial will show you how to:
Generate text with an LLM
Avoid common pitfalls
Next steps to help you get the most out of your LLM
Before you begin, make sure you have all the necessary libraries installed:
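The exact requirements depend on your setup; the examples below assume 🤗 Transformers plus accelerate (used by device_map) and bitsandbytes (used by the 4-bit loading), which you can install along these lines:

```bash
pip install transformers accelerate bitsandbytes
```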
A language model trained for causal language modeling takes a sequence of text tokens as input and returns the probability distribution for the next token.
A critical aspect of autoregressive generation with LLMs is how to select the next token from this probability distribution. Anything goes in this step as long as you end up with a token for the next iteration. This means it can be as simple as selecting the most likely token from the probability distribution or as complex as applying a dozen transformations before sampling from the resulting distribution.
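As a minimal sketch of that selection step (using gpt2 purely as a small illustrative checkpoint, not part of this tutorial's setup), you can look at the logits the model returns for the next position and either pick greedily or sample:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is used here only because it is small; any causal LM behaves the same way
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[:, -1, :]  # scores for the next token
probs = torch.softmax(next_token_logits, dim=-1)

greedy_token = torch.argmax(probs, dim=-1)                # pick the most likely token
sampled_token = torch.multinomial(probs, num_samples=1)   # or sample from the distribution
print(tokenizer.decode(greedy_token))
```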
The process described above is repeated iteratively until some stopping condition is reached. Ideally, the stopping condition is dictated by the model, which should learn when to output an end-of-sequence (EOS) token. If this is not the case, generation stops when some predefined maximum length is reached.
Properly setting up the token selection step and the stopping condition is essential to make your model behave as you'd expect on your task. That is why we have a GenerationConfig file associated with each model, which contains a good default generative parameterization and is loaded alongside your model.
Let's talk code!

If you're interested in basic LLM usage, our high-level Pipeline interface is a great starting point. However, LLMs often require advanced features like quantization and fine control of the token selection step, which is best done through generate(). Autoregressive generation with LLMs is also resource-intensive and should be executed on a GPU for adequate throughput.
First, you need to load the model.
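For example, assuming a causal LM checkpoint such as mistralai/Mistral-7B-v0.1 (any generative checkpoint you have access to works the same way):

```python
from transformers import AutoModelForCausalLM

# Example checkpoint; load_in_4bit requires the bitsandbytes package and a CUDA GPU
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True
)
```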
You'll notice two flags in the from_pretrained call:

device_map ensures the model is moved to your GPU(s)
load_in_4bit applies 4-bit dynamic quantization to massively reduce the resource requirements
There are other ways to initialize a model, but this is a good baseline to begin with an LLM.
Next, you need to preprocess your text input with a tokenizer.
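A sketch of that preprocessing step, assuming the same example checkpoint and a CUDA device:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda")
```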
The model_inputs variable holds the tokenized text input, as well as the attention mask. While generate() does its best effort to infer the attention mask when it is not passed, we recommend passing it whenever possible for optimal results.

Finally, call the generate() method to return the generated tokens, which should be converted to text before printing.
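For example (continuing from the snippets above; the decoded continuation will vary by model):

```python
# `model` and `model_inputs` come from the previous snippets
generated_ids = model.generate(**model_inputs)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```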
And that's it! In a few lines of code, you can harness the power of an LLM.
There are many generation strategies, and sometimes the default values may not be appropriate for your use case. If your outputs aren't aligned with what you're expecting, we've created a list of the most common pitfalls and how to avoid them.
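The snippets below assume a setup along these lines (same example checkpoint as above; the pad token assignment matters for the padding discussion further down):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # many LLMs ship without a pad token by default
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True
)
```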
If not specified in the GenerationConfig file, generate returns up to 20 tokens by default. We highly recommend manually setting max_new_tokens in your generate call to control the maximum number of new tokens it can return. Keep in mind LLMs (more precisely, decoder-only models) also return the input prompt as part of the output.
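Continuing from the setup above, a sketch of the difference (outputs will vary):

```python
model_inputs = tokenizer(["A sequence of numbers: 1, 2"], return_tensors="pt").to("cuda")

# Default: at most ~20 new tokens are generated
generated_ids = model.generate(**model_inputs)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

# Explicitly cap the number of newly generated tokens
generated_ids = model.generate(**model_inputs, max_new_tokens=50)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```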
By default, and unless specified in the GenerationConfig file, generate selects the most likely token at each iteration (greedy decoding). Depending on your task, this may be undesirable; creative tasks like chatbots or writing an essay benefit from sampling. On the other hand, input-grounded tasks like audio transcription or translation benefit from greedy decoding. Enable sampling with do_sample=True, and you can learn more about this topic in this blog post.
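For example (again continuing from the setup above), the same prompt with greedy decoding versus sampling:

```python
from transformers import set_seed

set_seed(42)  # only to make the sampled output reproducible
model_inputs = tokenizer(["I am a cat."], return_tensors="pt").to("cuda")

# Greedy decoding (the default): always picks the most likely next token
generated_ids = model.generate(**model_inputs)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

# Sampling: draws the next token from the probability distribution
generated_ids = model.generate(**model_inputs, do_sample=True)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```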
LLMs are decoder-only architectures, meaning they continue to iterate on your input prompt. If your inputs do not have the same length, they need to be padded. Since LLMs are not trained to continue from pad tokens, your input needs to be left-padded. Make sure you also don't forget to pass the attention mask to generate!
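A sketch of the effect, continuing from the setup above and contrasting a tokenizer loaded with padding_side="right" versus padding_side="left" (left padding keeps the prompt adjacent to the newly generated tokens):

```python
# Right padding: the model is asked to continue from pad tokens,
# so the continuation of the shorter prompt degrades
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="right")
tokenizer.pad_token = tokenizer.eos_token
model_inputs = tokenizer(
    ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
).to("cuda")
generated_ids = model.generate(**model_inputs)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

# Left padding: generation continues directly from the prompt tokens
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model_inputs = tokenizer(
    ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt"
).to("cuda")
generated_ids = model.generate(**model_inputs)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```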
While the autoregressive generation process is relatively straightforward, making the most out of your LLM can be a challenging endeavor because there are many moving parts. For your next steps to help you dive deeper into LLM usage and understanding:

Guide on how to control different generation methods, how to set up the generation configuration file, and how to stream the output;
API reference on GenerationConfig, generate(), and generate-related classes;
Open LLM Leaderboard, which focuses on the quality of the open-source models;
Open LLM-Perf Leaderboard, which focuses on LLM throughput;
Guide on dynamic quantization, which shows you how to drastically reduce your memory requirements;
text-generation-inference, a production-ready server for LLMs;
optimum, an extension of 🤗 Transformers that optimizes for specific hardware devices.