Text Generation


Generation

Each framework has a generate method for text generation implemented in its respective GenerationMixin class:

  • PyTorch generate() is implemented in GenerationMixin.

  • TensorFlow generate() is implemented in TFGenerationMixin.

  • Flax/JAX generate() is implemented in FlaxGenerationMixin.

Regardless of your framework of choice, you can parameterize the generate method with a GenerationConfig class instance. Please refer to this class for the complete list of generation parameters, which control the behavior of the generation method.

To learn how to inspect a model’s generation configuration, what the defaults are, how to change the parameters ad hoc, and how to create and save a customized generation configuration, refer to the text generation strategies guide. The guide also explains how to use related features, like token streaming.
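
For a quick first look (a minimal sketch using the gpt2 checkpoint that also appears in the examples below), you can inspect the generation configuration attached to a model and override individual parameters for a single call:

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")

>>> # every model carries a default GenerationConfig
>>> model.generation_config.max_length
20

>>> # parameters passed to generate() override the defaults for this call only
>>> inputs = tokenizer("The quick brown fox", return_tensors="pt")
>>> outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False, pad_token_id=tokenizer.eos_token_id)
>>> text = tokenizer.batch_decode(outputs, skip_special_tokens=True)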

GenerationConfig

class transformers.GenerationConfig

( **kwargs )

Parameters that control the length of the output

  • max_length (int, optional, defaults to 20) — The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. Its effect is overridden by max_new_tokens, if also set.

  • max_new_tokens (int, optional) — The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt.

  • min_length (int, optional, defaults to 0) — The minimum length of the sequence to be generated. Corresponds to the length of the input prompt + min_new_tokens. Its effect is overridden by min_new_tokens, if also set.

  • min_new_tokens (int, optional) — The minimum numbers of tokens to generate, ignoring the number of tokens in the prompt.

  • early_stopping (bool or str, optional, defaults to False) — Controls the stopping condition for beam-based methods, like beam-search. It accepts the following values: True, where the generation stops as soon as there are num_beams complete candidates; False, where a heuristic is applied and the generation stops when it is very unlikely to find better candidates; "never", where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm).

  • max_time (float, optional) — The maximum amount of time you allow the computation to run for, in seconds. Generation will still finish the current pass after the allocated time has passed.

Parameters that control the generation strategy used

  • do_sample (bool, optional, defaults to False) — Whether or not to use sampling; use greedy decoding otherwise.

  • num_beams (int, optional, defaults to 1) — Number of beams for beam search. 1 means no beam search.

  • penalty_alpha (float, optional) — The value that balances the model confidence and the degeneration penalty in contrastive search decoding.

  • use_cache (bool, optional, defaults to True) — Whether or not the model should use the past key/values attentions (if applicable to the model) to speed up decoding.

Parameters for manipulation of the model output logits

  • temperature (float, optional, defaults to 1.0) — The value used to modulate the next token probabilities.

  • top_k (int, optional, defaults to 50) — The number of highest probability vocabulary tokens to keep for top-k-filtering.

  • top_p (float, optional, defaults to 1.0) — If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.

  • diversity_penalty (float, optional, defaults to 0.0) — This value is subtracted from a beam’s score if it generates a token same as any beam from other group at a particular time. Note that diversity_penalty is only effective if group beam search is enabled.

  • encoder_repetition_penalty (float, optional, defaults to 1.0) — The parameter for encoder_repetition_penalty. An exponential penalty on sequences that are not in the original input. 1.0 means no penalty.

  • length_penalty (float, optional, defaults to 1.0) — Exponential penalty to the length that is used with beam-based generation. It is applied as an exponent to the sequence length, which in turn is used to divide the score of the sequence. Since the score is the log likelihood of the sequence (i.e. negative), length_penalty > 0.0 promotes longer sequences, while length_penalty < 0.0 encourages shorter sequences.

  • no_repeat_ngram_size (int, optional, defaults to 0) — If set to int > 0, all ngrams of that size can only occur once.

  • renormalize_logits (bool, optional, defaults to False) — Whether to renormalize the logits after applying all the logits processors or warpers (including the custom ones). It’s highly recommended to set this flag to True as the search algorithms suppose the score logits are normalized but some logit processors or warpers break the normalization.

  • constraints (List[Constraint], optional) — Custom constraints that can be added to the generation to ensure that the output will contain the use of certain tokens as defined by Constraint objects, in the most sensible way possible.

  • forced_eos_token_id (Union[int, List[int]], optional, defaults to model.config.forced_eos_token_id) — The id of the token to force as the last generated token when max_length is reached. Optionally, use a list to set multiple end-of-sequence tokens.

  • remove_invalid_values (bool, optional, defaults to model.config.remove_invalid_values) — Whether to remove possible nan and inf outputs of the model to prevent the generation method to crash. Note that using remove_invalid_values can slow down generation.

  • exponential_decay_length_penalty (tuple(int, float), optional) — This tuple adds an exponentially increasing length penalty after a certain number of tokens have been generated. The tuple shall consist of (start_index, decay_factor), where start_index indicates where the penalty starts and decay_factor represents the factor of exponential decay.

  • suppress_tokens (List[int], optional) — A list of tokens that will be suppressed at generation. The SuppressTokens logit processor will set their log probs to -inf so that they are not sampled.

  • begin_suppress_tokens (List[int], optional) — A list of tokens that will be suppressed at the beginning of the generation. The SuppressBeginTokens logit processor will set their log probs to -inf so that they are not sampled.

  • forced_decoder_ids (List[List[int]], optional) — A list of pairs of integers which indicates a mapping from generation indices to token indices that will be forced before sampling. For example, [[1, 123]] means the second generated token will always be a token of index 123.

  • guidance_scale (float, optional) — The guidance scale for classifier free guidance (CFG). CFG is enabled by setting guidance_scale > 1. Higher guidance scale encourages the model to generate samples that are more closely linked to the input prompt, usually at the expense of poorer quality.

  • low_memory (bool, optional) — Switch to sequential topk for contrastive search to reduce peak memory. Used with contrastive search.

Parameters that define the output variables of `generate`

  • num_return_sequences (int, optional, defaults to 1) — The number of independently computed returned sequences for each element in the batch.

  • output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.

  • output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.

  • output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.

Special tokens that can be used at generation time

  • pad_token_id (int, optional) — The id of the padding token.

  • bos_token_id (int, optional) — The id of the beginning-of-sequence token.

  • eos_token_id (Union[int, List[int]], optional) — The id of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.

Generation parameters exclusive to encoder-decoder models

  • encoder_no_repeat_ngram_size (int, optional, defaults to 0) — If set to int > 0, all ngrams of that size that occur in the encoder_input_ids cannot occur in the decoder_input_ids.

  • decoder_start_token_id (int, optional) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token.

Wild card

Class that holds a configuration for a generation task. A generate call supports the following generation methods for text-decoder, text-to-text, speech-to-text, and vision-to-text models:

  • greedy decoding by calling greedy_search() if num_beams=1 and do_sample=False

  • contrastive search by calling contrastive_search() if penalty_alpha>0 and top_k>1

  • multinomial sampling by calling sample() if num_beams=1 and do_sample=True

  • beam-search decoding by calling beam_search() if num_beams>1 and do_sample=False

  • beam-search multinomial sampling by calling beam_sample() if num_beams>1 and do_sample=True

  • diverse beam-search decoding by calling group_beam_search() if num_beams>1 and num_beam_groups>1

  • constrained beam-search decoding by calling constrained_beam_search() if constraints!=None or force_words_ids!=None

  • assisted decoding by calling assisted_decoding(), if assistant_model is passed to .generate()
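
To make the parameter groups above concrete, here is a minimal sketch (assuming the gpt2 checkpoint used elsewhere on this page) that bundles a few of them into a GenerationConfig and passes it to generate(); with num_beams=1 and do_sample=True this selects multinomial sampling:

>>> from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")

>>> generation_config = GenerationConfig(
...     max_new_tokens=30,
...     do_sample=True,
...     top_k=50,
...     temperature=0.7,
...     num_return_sequences=2,
...     pad_token_id=model.config.eos_token_id,
... )

>>> inputs = tokenizer("Today is a beautiful day, and", return_tensors="pt")
>>> outputs = model.generate(**inputs, generation_config=generation_config)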

from_pretrained

Parameters

  • pretrained_model_name (str or os.PathLike) — This can be either:

    • a string, the model id of a pretrained model configuration hosted inside a model repo on boincai.com. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.

    • a path to a directory containing a configuration file saved using the save_pretrained() method, e.g., ./my_model_directory/.

  • config_file_name (str or os.PathLike, optional, defaults to "generation_config.json") — Name of the generation configuration JSON file to be loaded from pretrained_model_name.

  • cache_dir (str or os.PathLike, optional) — Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.

  • force_download (bool, optional, defaults to False) — Whether or not to force to (re-)download the configuration files and override the cached versions if they exist.

  • resume_download (bool, optional, defaults to False) — Whether or not to delete an incompletely received file. Attempts to resume the download if such a file exists.

  • proxies (Dict[str, str], optional) — A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.

  • token (str or bool, optional) — The token to use as HTTP bearer authorization for remote files. If True, or not specified, will use the token generated when running boincai-cli login (stored in ~/.boincai).

  • revision (str, optional, defaults to "main") — The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on boincai.com, so revision can be any identifier allowed by git.

    To test a pull request you made on the Hub, you can pass `revision="refs/pr/<pr_number>"`.

  • return_unused_kwargs (bool, optional, defaults to False) — If False, then this function returns just the final configuration object.

    If True, then this function returns a Tuple(config, unused_kwargs) where unused_kwargs is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: i.e., the part of kwargs which has not been used to update config and is otherwise ignored.

  • subfolder (str, optional, defaults to "") — In case the relevant files are located inside a subfolder of the model repo on boincai.com, you can specify the folder name here.

  • kwargs (Dict[str, Any], optional) — The values in kwargs of any keys which are configuration attributes will be used to override the loaded values. Behavior concerning key/value pairs whose keys are not configuration attributes is controlled by the return_unused_kwargs keyword parameter.

Returns

GenerationConfig

The configuration object instantiated from this pretrained model.

Examples:


>>> from transformers import GenerationConfig

>>> # Download configuration from boincai.com and cache.
>>> generation_config = GenerationConfig.from_pretrained("gpt2")

>>> # E.g. config was saved using *save_pretrained('./test/saved_model/')*
>>> generation_config.save_pretrained("./test/saved_model/")
>>> generation_config = GenerationConfig.from_pretrained("./test/saved_model/")

>>> # You can also specify configuration names to your generation configuration file
>>> generation_config.save_pretrained("./test/saved_model/", config_file_name="my_configuration.json")
>>> generation_config = GenerationConfig.from_pretrained("./test/saved_model/", "my_configuration.json")

>>> # If you'd like to try a minor variation to an existing configuration, you can also pass generation
>>> # arguments to `.from_pretrained()`. Be mindful that typos and unused arguments will be ignored
>>> generation_config, unused_kwargs = GenerationConfig.from_pretrained(
...     "gpt2", top_k=1, foo=False, do_sample=True, return_unused_kwargs=True
... )
>>> generation_config.top_k
1

>>> unused_kwargs
{'foo': False}

from_model_config

Parameters

  • model_config (PretrainedConfig) — The model config that will be used to instantiate the generation config.

Returns

GenerationConfig

The configuration object instantiated from those parameters.
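
As a brief usage sketch (the gpt2 checkpoint here is only an example), a model config can be converted into a standalone generation config:

>>> from transformers import AutoConfig, GenerationConfig

>>> # build a generation config from a (possibly legacy) model config
>>> model_config = AutoConfig.from_pretrained("gpt2")
>>> generation_config = GenerationConfig.from_model_config(model_config)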

save_pretrained

( save_directory: typing.Union[str, os.PathLike], config_file_name: typing.Union[str, os.PathLike, NoneType] = None, push_to_hub: bool = False, **kwargs )

Parameters

  • save_directory (str or os.PathLike) — Directory where the configuration JSON file will be saved (will be created if it does not exist).

  • config_file_name (str or os.PathLike, optional, defaults to "generation_config.json") — Name of the generation configuration JSON file to be saved in save_directory.

  • push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the BOINC AI model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
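
A minimal usage sketch (the directory and file names below are placeholders):

>>> from transformers import GenerationConfig

>>> generation_config = GenerationConfig(max_new_tokens=50, do_sample=True, top_p=0.9)
>>> # writes generation_config.json into the directory (created if it does not exist)
>>> generation_config.save_pretrained("./my_generation_config")
>>> # a custom file name can also be given via config_file_name
>>> generation_config.save_pretrained("./my_generation_config", config_file_name="sampling_config.json")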

GenerationMixin

class transformers.GenerationMixin

( )

generate

Parameters

  • inputs (torch.Tensor of varying shape depending on the modality, optional) — The sequence used as a prompt for the generation or as model inputs to the encoder. If None the method initializes it with bos_token_id and a batch size of 1. For decoder-only models inputs should be in the format of input_ids. For encoder-decoder models inputs can represent any of input_ids, input_values, input_features, or pixel_values.

  • logits_processor (LogitsProcessorList, optional) — Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users.

  • stopping_criteria (StoppingCriteriaList, optional) — Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criteria is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users.

  • synced_gpus (bool, optional) — Whether to continue running the while loop until max_length. Unless overridden, this flag will be set to True in a DeepSpeed ZeRO Stage 3 multi-GPU environment to avoid hanging if one GPU finishes generating before the others; otherwise it will be set to False.

  • assistant_model (PreTrainedModel, optional) — An assistant model that can be used to accelerate generation. The assistant model must have the exact same tokenizer. The acceleration is achieved when forecasting candidate tokens with the assistant model is much faster than running generation with the model you’re calling generate from. As such, the assistant model should be much smaller.

  • streamer (BaseStreamer, optional) — Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids) and the streamer is responsible for any further processing.

  • negative_prompt_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — The negative prompt needed for some processors such as CFG. The batch size must match the input batch size. This is an experimental feature, subject to breaking API changes in future versions.

  • negative_prompt_attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional) — Attention_mask for negative_prompt_ids.

  • kwargs (Dict[str, Any], optional) — Ad hoc parametrization of generation_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder-specific kwargs should not be prefixed and decoder-specific kwargs should be prefixed with decoder_.

Returns

ModelOutput or torch.LongTensor

A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a torch.LongTensor containing the generated token ids.

Generates sequences of token ids for models with a language modeling head.

Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model’s default generation configuration. You can override any generation_config by passing the corresponding parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True).
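
For example (a minimal sketch, again with the gpt2 checkpoint), the per-call override described above looks like this:

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")

>>> inputs = tokenizer("It might be possible to", return_tensors="pt")
>>> # num_beams and do_sample override the model's default generation_config for this call only
>>> outputs = model.generate(**inputs, num_beams=4, do_sample=True, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
>>> text = tokenizer.batch_decode(outputs, skip_special_tokens=True)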

compute_transition_scores

( sequences: Tensor, scores: typing.Tuple[torch.Tensor], beam_indices: typing.Optional[torch.Tensor] = None, normalize_logits: bool = False ) → torch.Tensor

Parameters

  • sequences (torch.LongTensor) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.

  • scores (tuple(torch.FloatTensor)) — Transition scores for each vocabulary token at each generation step. Beam transition scores consist of the log probabilities of tokens conditioned on the log softmax of previously generated tokens. Tuple of torch.FloatTensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size*num_beams, config.vocab_size).

  • beam_indices (torch.LongTensor, optional) — Beam indices of the generated token id at each generation step. torch.LongTensor of shape (batch_size*num_return_sequences, sequence_length). Only required if num_beams>1 at generate time.

  • normalize_logits (bool, optional, defaults to False) — Whether to normalize the logits (which, for legacy reasons, may be unnormalized).

Returns

torch.Tensor

A torch.Tensor of shape (batch_size*num_return_sequences, sequence_length) containing the transition scores (logits)

Computes the transition scores of sequences given the generation scores (and beam indices, if beam search was used). This is a convenient method to quickly obtain the scores of the selected tokens at generation time.

Examples:


>>> from transformers import GPT2Tokenizer, AutoModelForCausalLM
>>> import numpy as np

>>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")
>>> tokenizer.pad_token_id = tokenizer.eos_token_id
>>> inputs = tokenizer(["Today is"], return_tensors="pt")

>>> # Example 1: Print the scores for each token generated with Greedy Search
>>> outputs = model.generate(**inputs, max_new_tokens=5, return_dict_in_generate=True, output_scores=True)
>>> transition_scores = model.compute_transition_scores(
...     outputs.sequences, outputs.scores, normalize_logits=True
... )
>>> # input_length is the length of the input prompt for decoder-only models, like the GPT family, and 1 for
>>> # encoder-decoder models, like BART or T5.
>>> input_length = 1 if model.config.is_encoder_decoder else inputs.input_ids.shape[1]
>>> generated_tokens = outputs.sequences[:, input_length:]
>>> for tok, score in zip(generated_tokens[0], transition_scores[0]):
...     # | token | token string | logits | probability
...     print(f"| {tok:5d} | {tokenizer.decode(tok):8s} | {score.numpy():.3f} | {np.exp(score.numpy()):.2%}")
|   262 |  the     | -1.414 | 24.33%
|  1110 |  day     | -2.609 | 7.36%
|   618 |  when    | -2.010 | 13.40%
|   356 |  we      | -1.859 | 15.58%
|   460 |  can     | -2.508 | 8.14%

>>> # Example 2: Reconstruct the sequence scores from Beam Search
>>> outputs = model.generate(
...     **inputs,
...     max_new_tokens=5,
...     num_beams=4,
...     num_return_sequences=4,
...     return_dict_in_generate=True,
...     output_scores=True,
... )
>>> transition_scores = model.compute_transition_scores(
...     outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=False
... )
>>> # If you sum the generated tokens' scores and apply the length penalty, you'll get the sequence scores.
>>> # Tip: recomputing the scores is only guaranteed to match with `normalize_logits=False`. Depending on the
>>> # use case, you might want to recompute it with `normalize_logits=True`.
>>> output_length = input_length + np.sum(transition_scores.numpy() < 0, axis=1)
>>> length_penalty = model.generation_config.length_penalty
>>> reconstructed_scores = transition_scores.sum(axis=1) / (output_length**length_penalty)
>>> print(np.allclose(outputs.sequences_scores, reconstructed_scores))
True

greedy_search

( input_ids: LongTensor, logits_processor: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None, stopping_criteria: typing.Optional[transformers.generation.stopping_criteria.StoppingCriteriaList] = None, max_length: typing.Optional[int] = None, pad_token_id: typing.Optional[int] = None, eos_token_id: typing.Union[int, typing.List[int], NoneType] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, output_scores: typing.Optional[bool] = None, return_dict_in_generate: typing.Optional[bool] = None, synced_gpus: bool = False, streamer: typing.Optional[ForwardRef('BaseStreamer')] = None, **model_kwargs )

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.

  • max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.

  • pad_token_id (int, optional) — The id of the padding token.

  • eos_token_id (Union[int, List[int]], optional) — The id of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.

  • output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.

  • output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.

  • output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.

  • synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3)

  • streamer (BaseStreamer, optional) — Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids) and the streamer is responsible for any further processing. model_kwargs — Additional model specific keyword arguments will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using greedy decoding and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:


>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForCausalLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     StoppingCriteriaList,
...     MaxLengthCriteria,
... )

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")

>>> # set pad_token_id to eos_token_id because GPT2 does not have a PAD token
>>> model.generation_config.pad_token_id = model.generation_config.eos_token_id

>>> input_prompt = "It might be possible to"
>>> input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [
...         MinLengthLogitsProcessor(10, eos_token_id=model.generation_config.eos_token_id),
...     ]
... )
>>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])

>>> outputs = model.greedy_search(
...     input_ids, logits_processor=logits_processor, stopping_criteria=stopping_criteria
... )

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
["It might be possible to get a better understanding of the nature of the problem, but it's not"]

sample

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.

  • max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.

  • pad_token_id (int, optional) — The id of the padding token.

  • eos_token_id (Union[int, List[int]], optional) — The id of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.

  • output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.

  • output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.

  • output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.

  • synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3)

  • streamer (BaseStreamer, optional) — Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids) and the streamer is responsible for any further processing. model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Returns

ModelOutput or torch.LongTensor

A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a torch.LongTensor containing the generated token ids.

Generates sequences of token ids for models with a language modeling head using multinomial sampling and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:


>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForCausalLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     TopKLogitsWarper,
...     TemperatureLogitsWarper,
...     StoppingCriteriaList,
...     MaxLengthCriteria,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")

>>> # set pad_token_id to eos_token_id because GPT2 does not have a PAD token
>>> model.config.pad_token_id = model.config.eos_token_id
>>> model.generation_config.pad_token_id = model.config.eos_token_id

>>> input_prompt = "Today is a beautiful day, and"
>>> input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [
...         MinLengthLogitsProcessor(15, eos_token_id=model.generation_config.eos_token_id),
...     ]
... )
>>> # instantiate logits processors
>>> logits_warper = LogitsProcessorList(
...     [
...         TopKLogitsWarper(50),
...         TemperatureLogitsWarper(0.7),
...     ]
... )

>>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])

>>> torch.manual_seed(0)
>>> outputs = model.sample(
...     input_ids,
...     logits_processor=logits_processor,
...     logits_warper=logits_warper,
...     stopping_criteria=stopping_criteria,
... )

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Today is a beautiful day, and we must do everything possible to make it a day of celebration.']

beam_search

( input_ids: LongTensor, beam_scorer: BeamScorer, logits_processor: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None, stopping_criteria: typing.Optional[transformers.generation.stopping_criteria.StoppingCriteriaList] = None, max_length: typing.Optional[int] = None, pad_token_id: typing.Optional[int] = None, eos_token_id: typing.Union[int, typing.List[int], NoneType] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, output_scores: typing.Optional[bool] = None, return_dict_in_generate: typing.Optional[bool] = None, synced_gpus: bool = False, **model_kwargs )

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.

  • max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.

  • pad_token_id (int, optional) — The id of the padding token.

  • eos_token_id (Union[int, List[int]], optional) — The id of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.

  • output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.

  • output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.

  • output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.

  • synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3) model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using beam search decoding and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:


>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSeq2SeqLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     BeamSearchScorer,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids


>>> # let's run beam search using 3 beams
>>> num_beams = 3
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id

>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
...     "encoder_outputs": model.get_encoder()(
...         encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
...     )
... }

>>> # instantiate beam scorer
>>> beam_scorer = BeamSearchScorer(
...     batch_size=1,
...     num_beams=num_beams,
...     device=model.device,
... )

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [
...         MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id),
...     ]
... )

>>> outputs = model.beam_search(input_ids, beam_scorer, logits_processor=logits_processor, **model_kwargs)

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt bist du?']

beam_sample

( input_ids: LongTensor, beam_scorer: BeamScorer, logits_processor: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None, stopping_criteria: typing.Optional[transformers.generation.stopping_criteria.StoppingCriteriaList] = None, logits_warper: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None, max_length: typing.Optional[int] = None, pad_token_id: typing.Optional[int] = None, eos_token_id: typing.Union[int, typing.List[int], NoneType] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, output_scores: typing.Optional[bool] = None, return_dict_in_generate: typing.Optional[bool] = None, synced_gpus: bool = False, **model_kwargs )

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.

  • max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.

  • pad_token_id (int, optional) — The id of the padding token.

  • eos_token_id (Union[int, List[int]], optional) — The id of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.

  • output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.

  • output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.

  • output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.

  • synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3) model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using beam search multinomial sampling and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:


>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSeq2SeqLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     TopKLogitsWarper,
...     TemperatureLogitsWarper,
...     BeamSearchScorer,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids

>>> # let's run beam search using 3 beams
>>> num_beams = 3
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id

>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
...     "encoder_outputs": model.get_encoder()(
...         encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
...     )
... }

>>> # instantiate beam scorer
>>> beam_scorer = BeamSearchScorer(
...     batch_size=1,
...     max_length=model.config.max_length,
...     num_beams=num_beams,
...     device=model.device,
... )

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id)]
... )
>>> # instantiate logits processors
>>> logits_warper = LogitsProcessorList(
...     [
...         TopKLogitsWarper(50),
...         TemperatureLogitsWarper(0.7),
...     ]
... )

>>> outputs = model.beam_sample(
...     input_ids, beam_scorer, logits_processor=logits_processor, logits_warper=logits_warper, **model_kwargs
... )

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt bist du?']

contrastive_search

( input_ids: LongTensor, top_k: typing.Optional[int] = 1, penalty_alpha: typing.Optional[float] = 0, logits_processor: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None, logits_warper: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None, stopping_criteria: typing.Optional[transformers.generation.stopping_criteria.StoppingCriteriaList] = None, pad_token_id: typing.Optional[int] = None, eos_token_id: typing.Union[int, typing.List[int], NoneType] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, output_scores: typing.Optional[bool] = None, return_dict_in_generate: typing.Optional[bool] = None, synced_gpus: bool = False, streamer: typing.Optional[ForwardRef('BaseStreamer')] = None, sequential: typing.Optional[bool] = None, **model_kwargs )

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.

  • top_k (int, optional, defaults to 1) — The size of the candidate set that is used to re-rank for contrastive search

  • penalty_alpha (float, optional, defaults to 0) — The degeneration penalty for contrastive search; activate when it is larger than 0

  • pad_token_id (int, optional) — The id of the padding token.

  • eos_token_id (Union[int, List[int]], optional) — The id of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.

  • output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.

  • output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.

  • output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.

  • synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3)

  • streamer (BaseStreamer, optional) — Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids) and the streamer is responsible for any further processing.

  • sequential (bool, optional) — Switches topk hidden state computation from parallel to sequential to reduce memory if True. model_kwargs — Additional model specific keyword arguments will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using contrastive search and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:


>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForCausalLM,
...     StoppingCriteriaList,
...     MaxLengthCriteria,
... )

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
>>> model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
>>> # set pad_token_id to eos_token_id because OPT does not have a PAD token
>>> model.config.pad_token_id = model.config.eos_token_id
>>> input_prompt = "DeepMind Company is"
>>> input_ids = tokenizer(input_prompt, return_tensors="pt")
>>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=64)])
>>> outputs = model.contrastive_search(
...     **input_ids, penalty_alpha=0.6, top_k=4, stopping_criteria=stopping_criteria
... )
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['DeepMind Company is a company that focuses on the development and commercialization of artificial intelligence (AI). DeepMind’s mission is to help people understand and solve problems that are difficult to solve in the world today.\n\nIn this post, we talk about the benefits of deep learning in business and how it']

group_beam_search

( input_ids: LongTensor, beam_scorer: BeamScorer, logits_processor: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None, stopping_criteria: typing.Optional[transformers.generation.stopping_criteria.StoppingCriteriaList] = None, max_length: typing.Optional[int] = None, pad_token_id: typing.Optional[int] = None, eos_token_id: typing.Union[int, typing.List[int], NoneType] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, output_scores: typing.Optional[bool] = None, return_dict_in_generate: typing.Optional[bool] = None, synced_gpus: bool = False, **model_kwargs )

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.

  • max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.

  • pad_token_id (int, optional) — The id of the padding token.

  • eos_token_id (Union[int, List[int]], optional) — The id of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.

  • output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.

  • output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.

  • output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.

  • synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3)

    model_kwargs — Additional model specific kwargs that will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using diverse beam search decoding and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:


>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSeq2SeqLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     HammingDiversityLogitsProcessor,
...     BeamSearchScorer,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids


>>> # let's run diverse beam search using 6 beams
>>> num_beams = 6
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id

>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
...     "encoder_outputs": model.get_encoder()(
...         encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
...     )
... }

>>> # instantiate beam scorer
>>> beam_scorer = BeamSearchScorer(
...     batch_size=1,
...     max_length=model.config.max_length,
...     num_beams=num_beams,
...     device=model.device,
...     num_beam_groups=3,
... )

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [
...         HammingDiversityLogitsProcessor(5.5, num_beams=6, num_beam_groups=3),
...         MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id),
...     ]
... )

>>> outputs = model.group_beam_search(
...     input_ids, beam_scorer, logits_processor=logits_processor, **model_kwargs
... )

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt bist du?']

constrained_beam_search

( input_ids: LongTensor, constrained_beam_scorer: ConstrainedBeamSearchScorer, logits_processor: typing.Optional[transformers.generation.logits_process.LogitsProcessorList] = None, stopping_criteria: typing.Optional[transformers.generation.stopping_criteria.StoppingCriteriaList] = None, max_length: typing.Optional[int] = None, pad_token_id: typing.Optional[int] = None, eos_token_id: typing.Union[int, typing.List[int], NoneType] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, output_scores: typing.Optional[bool] = None, return_dict_in_generate: typing.Optional[bool] = None, synced_gpus: typing.Optional[bool] = None, **model_kwargs )

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.

  • max_length (int, optional, defaults to 20) — DEPRECATED. Use logits_processor or stopping_criteria directly to cap the number of generated tokens. The maximum length of the sequence to be generated.

  • pad_token_id (int, optional) — The id of the padding token.

  • eos_token_id (Union[int, List[int]], optional) — The id of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.

  • output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.

  • output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.

  • output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.

  • synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed for ZeRO stage 3)

    model_kwargs — Additional model specific kwargs will be forwarded to the forward function of the model. If model is an encoder-decoder model the kwargs should include encoder_outputs.

Generates sequences of token ids for models with a language modeling head using constrained beam search decoding and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.

Examples:

>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSeq2SeqLM,
...     LogitsProcessorList,
...     MinLengthLogitsProcessor,
...     ConstrainedBeamSearchScorer,
...     PhrasalConstraint,
... )
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

>>> encoder_input_str = "translate English to German: How old are you?"
>>> encoder_input_ids = tokenizer(encoder_input_str, return_tensors="pt").input_ids


>>> # let's run beam search using 3 beams
>>> num_beams = 3
>>> # define decoder start token ids
>>> input_ids = torch.ones((num_beams, 1), device=model.device, dtype=torch.long)
>>> input_ids = input_ids * model.config.decoder_start_token_id

>>> # add encoder_outputs to model keyword arguments
>>> model_kwargs = {
...     "encoder_outputs": model.get_encoder()(
...         encoder_input_ids.repeat_interleave(num_beams, dim=0), return_dict=True
...     )
... }

>>> constraint_str = "Sie"
>>> constraint_token_ids = tokenizer.encode(constraint_str)[:-1]  # slice to remove eos token
>>> constraints = [PhrasalConstraint(token_ids=constraint_token_ids)]


>>> # instantiate beam scorer
>>> beam_scorer = ConstrainedBeamSearchScorer(
...     batch_size=1, num_beams=num_beams, device=model.device, constraints=constraints
... )

>>> # instantiate logits processors
>>> logits_processor = LogitsProcessorList(
...     [
...         MinLengthLogitsProcessor(5, eos_token_id=model.config.eos_token_id),
...     ]
... )

>>> outputs = model.constrained_beam_search(
...     input_ids, beam_scorer, constraints=constraints, logits_processor=logits_processor, **model_kwargs
... )

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Wie alt sind Sie?']
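
The same constraint can usually be expressed through the high-level generate() call with force_words_ids, which builds the constrained beam scorer internally. A minimal sketch, reusing the tokenizer and model from above (the tokenizer call with add_special_tokens=False plays the same role as the eos-stripping slice in the example):

>>> force_words_ids = [tokenizer("Sie", add_special_tokens=False).input_ids]
>>> outputs = model.generate(
...     encoder_input_ids,
...     num_beams=3,
...     force_words_ids=force_words_ids,
... )
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)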

TFGenerationMixin

class transformers.TFGenerationMixin

( )

  • greedy decoding by calling greedy_search() if num_beams=1 and do_sample=False

  • contrastive search by calling contrastive_search() if penalty_alpha>0 and top_k>1

  • multinomial sampling by calling sample() if num_beams=1 and do_sample=True

  • beam-search decoding by calling beam_search() if num_beams>1

generate

( inputs: typing.Optional[tensorflow.python.framework.ops.Tensor] = Nonegeneration_config: typing.Optional[transformers.generation.configuration_utils.GenerationConfig] = Nonelogits_processor: typing.Optional[transformers.generation.tf_logits_process.TFLogitsProcessorList] = Noneseed = None**kwargs ) → ModelOutput or tf.Tensor

Parameters

  • inputs (tf.Tensor of varying shape depending on the modality, optional) — The sequence used as a prompt for the generation or as model inputs to the encoder. If None the method initializes it with bos_token_id and a batch size of 1. For decoder-only models inputs should be in the format of input_ids. For encoder-decoder models inputs can represent any of input_ids, input_values, input_features, or pixel_values.

  • logits_processor (LogitsProcessorList, optional) — Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users.

  • seed (List[int], optional) — Random seed to control sampling, containing two integers, used when do_sample is True. See the seed argument from stateless functions in tf.random.

  • kwargs (Dict[str, Any], optional) — Ad hoc parametrization of generate_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with decoder_.

Returns

ModelOutput or tf.Tensor

A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a tf.Tensor.

Generates sequences of token ids for models with a language modeling head.

Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model’s default generation configuration. You can override any generation_config by passing the corresponding parameters to generate, e.g. .generate(inputs, num_beams=4, do_sample=True).
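
For example, a short sketch of a TensorFlow generation call with a few overrides, assuming a causal LM checkpoint such as gpt2 (the sampling parameters and seed values are illustrative):

>>> from transformers import AutoTokenizer, TFAutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = TFAutoModelForCausalLM.from_pretrained("gpt2")
>>> inputs = tokenizer(["Today is"], return_tensors="tf")

>>> # sample with a fixed stateless seed (a list of two integers, see the seed argument above)
>>> outputs = model.generate(**inputs, max_new_tokens=10, do_sample=True, top_k=50, seed=[42, 0])
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)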

compute_transition_scores

( sequences: Tensorscores: typing.Tuple[tensorflow.python.framework.ops.Tensor]beam_indices: typing.Optional[tensorflow.python.framework.ops.Tensor] = Nonenormalize_logits: bool = False ) → tf.Tensor

Parameters

  • sequences (tf.Tensor) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.

  • scores (tuple(tf.Tensor)) — Transition scores for each vocabulary token at each generation step. Beam transition scores consist of log probabilities of tokens conditioned on log softmax of previously generated tokens. Tuple of tf.Tensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size*num_beams, config.vocab_size).

  • beam_indices (tf.Tensor, optional) — Beam indices of generated token id at each generation step. tf.Tensor of shape (batch_size*num_return_sequences, sequence_length). Only required if num_beams>1 at generate-time.

  • normalize_logits (bool, optional, defaults to False) — Whether to normalize the logits (which, for legacy reasons, may be unnormalized).

Returns

tf.Tensor

A tf.Tensor of shape (batch_size*num_return_sequences, sequence_length) containing the transition scores (logits)

Computes the transition scores of sequences given the generation scores (and beam indices, if beam search was used). This is a convenient method to quickly obtain the scores of the selected tokens at generation time.

Examples:

>>> from transformers import GPT2Tokenizer, TFAutoModelForCausalLM
>>> import numpy as np

>>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> model = TFAutoModelForCausalLM.from_pretrained("gpt2")
>>> tokenizer.pad_token_id = tokenizer.eos_token_id
>>> inputs = tokenizer(["Today is"], return_tensors="tf")

>>> # Example 1: Print the scores for each token generated with Greedy Search
>>> outputs = model.generate(**inputs, max_new_tokens=5, return_dict_in_generate=True, output_scores=True)
>>> transition_scores = model.compute_transition_scores(
...     outputs.sequences, outputs.scores, normalize_logits=True
... )
>>> # input_length is the length of the input prompt for decoder-only models, like the GPT family, and 1 for
>>> # encoder-decoder models, like BART or T5.
>>> input_length = 1 if model.config.is_encoder_decoder else inputs.input_ids.shape[1]
>>> generated_tokens = outputs.sequences[:, input_length:]
>>> for tok, score in zip(generated_tokens[0], transition_scores[0]):
...     # | token | token string | logits | probability
...     print(f"| {tok:5d} | {tokenizer.decode(tok):8s} | {score.numpy():.3f} | {np.exp(score.numpy()):.2%}")
|   262 |  the     | -1.413 | 24.33%
|  1110 |  day     | -2.609 | 7.36%
|   618 |  when    | -2.009 | 13.41%
|   356 |  we      | -1.859 | 15.58%
|   460 |  can     | -2.508 | 8.14%

>>> # Example 2: Reconstruct the sequence scores from Beam Search
>>> outputs = model.generate(
...     **inputs,
...     max_new_tokens=5,
...     num_beams=4,
...     num_return_sequences=4,
...     return_dict_in_generate=True,
...     output_scores=True,
... )
>>> transition_scores = model.compute_transition_scores(
...     outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=False
... )
>>> # If you sum the generated tokens' scores and apply the length penalty, you'll get the sequence scores.
>>> # Tip: recomputing the scores is only guaranteed to match with `normalize_logits=False`. Depending on the
>>> # use case, you might want to recompute it with `normalize_logits=True`.
>>> output_length = input_length + np.sum(transition_scores.numpy() < 0, axis=1)
>>> length_penalty = model.generation_config.length_penalty
>>> reconstructed_scores = np.sum(transition_scores, axis=1) / (output_length**length_penalty)
>>> print(np.allclose(outputs.sequences_scores, reconstructed_scores))
True

FlaxGenerationMixin

class transformers.FlaxGenerationMixin

( )

  • greedy decoding by calling _greedy_search() if num_beams=1 and do_sample=False

  • multinomial sampling by calling _sample() if num_beams=1 and do_sample=True

  • beam-search decoding by calling _beam_search() if num_beams>1 and do_sample=False

generate

( input_ids: Arraygeneration_config: typing.Optional[transformers.generation.configuration_utils.GenerationConfig] = Noneprng_key: typing.Optional[jax.Array] = Nonetrace: bool = Trueparams: typing.Union[typing.Dict[str, jax.Array], NoneType] = Nonelogits_processor: typing.Optional[transformers.generation.flax_logits_process.FlaxLogitsProcessorList] = None**kwargs )

Parameters

  • input_ids (jnp.ndarray of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.

  • trace (bool, optional, defaults to True) — Whether to trace generation. Setting trace=False should only be used for debugging and will lead to a considerably slower runtime.

  • params (Dict[str, jnp.ndarray], optional) — Optionally, the model parameters can be passed; this can be useful for parallelized generation.

  • logits_processor (FlaxLogitsProcessorList, optional) — Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users.

  • kwargs (Dict[str, Any], optional) — Ad hoc parametrization of generate_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with decoder_.

Generates sequences of token ids for models with a language modeling head.
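
A minimal sketch of a Flax generation call, assuming t5-base ships Flax weights and using greedy decoding (pass prng_key=jax.random.PRNGKey(0) together with do_sample=True for sampling):

>>> from transformers import AutoTokenizer, FlaxAutoModelForSeq2SeqLM

>>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
>>> model = FlaxAutoModelForSeq2SeqLM.from_pretrained("t5-base")
>>> inputs = tokenizer("translate English to German: How old are you?", return_tensors="np")

>>> # greedy decoding; the returned object carries the generated ids in .sequences
>>> outputs = model.generate(inputs.input_ids, max_length=20)
>>> tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)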
