Transformers
  • ๐ŸŒGET STARTED
    • Transformers
    • Quick tour
    • Installation
  • ๐ŸŒTUTORIALS
    • Run inference with pipelines
    • Write portable code with AutoClass
    • Preprocess data
    • Fine-tune a pretrained model
    • Train with a script
    • Set up distributed training with BOINC AI Accelerate
    • Load and train adapters with BOINC AI PEFT
    • Share your model
    • Agents
    • Generation with LLMs
  • ๐ŸŒTASK GUIDES
    • ๐ŸŒNATURAL LANGUAGE PROCESSING
      • Text classification
      • Token classification
      • Question answering
      • Causal language modeling
      • Masked language modeling
      • Translation
      • Summarization
      • Multiple choice
    • ๐ŸŒAUDIO
      • Audio classification
      • Automatic speech recognition
    • ๐ŸŒCOMPUTER VISION
      • Image classification
      • Semantic segmentation
      • Video classification
      • Object detection
      • Zero-shot object detection
      • Zero-shot image classification
      • Depth estimation
    • ๐ŸŒMULTIMODAL
      • Image captioning
      • Document Question Answering
      • Visual Question Answering
      • Text to speech
    • ๐ŸŒGENERATION
      • Customize the generation strategy
    • ๐ŸŒPROMPTING
      • Image tasks with IDEFICS
  • ๐ŸŒDEVELOPER GUIDES
    • Use fast tokenizers from BOINC AI Tokenizers
    • Run inference with multilingual models
    • Use model-specific APIs
    • Share a custom model
    • Templates for chat models
    • Run training on Amazon SageMaker
    • Export to ONNX
    • Export to TFLite
    • Export to TorchScript
    • Benchmarks
    • Notebooks with examples
    • Community resources
    • Custom Tools and Prompts
    • Troubleshoot
  • ๐ŸŒPERFORMANCE AND SCALABILITY
    • Overview
    • ๐ŸŒEFFICIENT TRAINING TECHNIQUES
      • Methods and tools for efficient training on a single GPU
      • Multiple GPUs and parallelism
      • Efficient training on CPU
      • Distributed CPU training
      • Training on TPUs
      • Training on TPU with TensorFlow
      • Training on Specialized Hardware
      • Custom hardware for training
      • Hyperparameter Search using Trainer API
    • ๐ŸŒOPTIMIZING INFERENCE
      • Inference on CPU
      • Inference on one GPU
      • Inference on many GPUs
      • Inference on Specialized Hardware
    • Instantiating a big model
    • Troubleshooting
    • XLA Integration for TensorFlow Models
    • Optimize inference using `torch.compile()`
  • ๐ŸŒCONTRIBUTE
    • How to contribute to transformers?
    • How to add a model to BOINC AI Transformers?
    • How to convert a BOINC AI Transformers model to TensorFlow?
    • How to add a pipeline to BOINC AI Transformers?
    • Testing
    • Checks on a Pull Request
  • ๐ŸŒCONCEPTUAL GUIDES
    • Philosophy
    • Glossary
    • What BOINC AI Transformers can do
    • How BOINC AI Transformers solve tasks
    • The Transformer model family
    • Summary of the tokenizers
    • Attention mechanisms
    • Padding and truncation
    • BERTology
    • Perplexity of fixed-length models
    • Pipelines for webserver inference
    • Model training anatomy
  • ๐ŸŒAPI
    • ๐ŸŒMAIN CLASSES
      • Agents and Tools
      • ๐ŸŒAuto Classes
        • Extending the Auto Classes
        • AutoConfig
        • AutoTokenizer
        • AutoFeatureExtractor
        • AutoImageProcessor
        • AutoProcessor
        • Generic model classes
          • AutoModel
          • TFAutoModel
          • FlaxAutoModel
        • Generic pretraining classes
          • AutoModelForPreTraining
          • TFAutoModelForPreTraining
          • FlaxAutoModelForPreTraining
        • Natural Language Processing
          • AutoModelForCausalLM
          • TFAutoModelForCausalLM
          • FlaxAutoModelForCausalLM
          • AutoModelForMaskedLM
          • TFAutoModelForMaskedLM
          • FlaxAutoModelForMaskedLM
          • AutoModelForMaskGenerationge
          • TFAutoModelForMaskGeneration
          • AutoModelForSeq2SeqLM
          • TFAutoModelForSeq2SeqLM
          • FlaxAutoModelForSeq2SeqLM
          • AutoModelForSequenceClassification
          • TFAutoModelForSequenceClassification
          • FlaxAutoModelForSequenceClassification
          • AutoModelForMultipleChoice
          • TFAutoModelForMultipleChoice
          • FlaxAutoModelForMultipleChoice
          • AutoModelForNextSentencePrediction
          • TFAutoModelForNextSentencePrediction
          • FlaxAutoModelForNextSentencePrediction
          • AutoModelForTokenClassification
          • TFAutoModelForTokenClassification
          • FlaxAutoModelForTokenClassification
          • AutoModelForQuestionAnswering
          • TFAutoModelForQuestionAnswering
          • FlaxAutoModelForQuestionAnswering
          • AutoModelForTextEncoding
          • TFAutoModelForTextEncoding
        • Computer vision
          • AutoModelForDepthEstimation
          • AutoModelForImageClassification
          • TFAutoModelForImageClassification
          • FlaxAutoModelForImageClassification
          • AutoModelForVideoClassification
          • AutoModelForMaskedImageModeling
          • TFAutoModelForMaskedImageModeling
          • AutoModelForObjectDetection
          • AutoModelForImageSegmentation
          • AutoModelForImageToImage
          • AutoModelForSemanticSegmentation
          • TFAutoModelForSemanticSegmentation
          • AutoModelForInstanceSegmentation
          • AutoModelForUniversalSegmentation
          • AutoModelForZeroShotImageClassification
          • TFAutoModelForZeroShotImageClassification
          • AutoModelForZeroShotObjectDetection
        • Audio
          • AutoModelForAudioClassification
          • AutoModelForAudioFrameClassification
          • TFAutoModelForAudioFrameClassification
          • AutoModelForCTC
          • AutoModelForSpeechSeq2Seq
          • TFAutoModelForSpeechSeq2Seq
          • FlaxAutoModelForSpeechSeq2Seq
          • AutoModelForAudioXVector
          • AutoModelForTextToSpectrogram
          • AutoModelForTextToWaveform
        • Multimodal
          • AutoModelForTableQuestionAnswering
          • TFAutoModelForTableQuestionAnswering
          • AutoModelForDocumentQuestionAnswering
          • TFAutoModelForDocumentQuestionAnswering
          • AutoModelForVisualQuestionAnswering
          • AutoModelForVision2Seq
          • TFAutoModelForVision2Seq
          • FlaxAutoModelForVision2Seq
      • Callbacks
      • Configuration
      • Data Collator
      • Keras callbacks
      • Logging
      • Models
      • Text Generation
      • ONNX
      • Optimization
      • Model outputs
      • Pipelines
      • Processors
      • Quantization
      • Tokenizer
      • Trainer
      • DeepSpeed Integration
      • Feature Extractor
      • Image Processor
    • ๐ŸŒMODELS
      • ๐ŸŒTEXT MODELS
        • ALBERT
        • BART
        • BARThez
        • BARTpho
        • BERT
        • BertGeneration
        • BertJapanese
        • Bertweet
        • BigBird
        • BigBirdPegasus
        • BioGpt
        • Blenderbot
        • Blenderbot Small
        • BLOOM
        • BORT
        • ByT5
        • CamemBERT
        • CANINE
        • CodeGen
        • CodeLlama
        • ConvBERT
        • CPM
        • CPMANT
        • CTRL
        • DeBERTa
        • DeBERTa-v2
        • DialoGPT
        • DistilBERT
        • DPR
        • ELECTRA
        • Encoder Decoder Models
        • ERNIE
        • ErnieM
        • ESM
        • Falcon
        • FLAN-T5
        • FLAN-UL2
        • FlauBERT
        • FNet
        • FSMT
        • Funnel Transformer
        • GPT
        • GPT Neo
        • GPT NeoX
        • GPT NeoX Japanese
        • GPT-J
        • GPT2
        • GPTBigCode
        • GPTSAN Japanese
        • GPTSw3
        • HerBERT
        • I-BERT
        • Jukebox
        • LED
        • LLaMA
        • LLama2
        • Longformer
        • LongT5
        • LUKE
        • M2M100
        • MarianMT
        • MarkupLM
        • MBart and MBart-50
        • MEGA
        • MegatronBERT
        • MegatronGPT2
        • Mistral
        • mLUKE
        • MobileBERT
        • MPNet
        • MPT
        • MRA
        • MT5
        • MVP
        • NEZHA
        • NLLB
        • NLLB-MoE
        • Nystrรถmformer
        • Open-Llama
        • OPT
        • Pegasus
        • PEGASUS-X
        • Persimmon
        • PhoBERT
        • PLBart
        • ProphetNet
        • QDQBert
        • RAG
        • REALM
        • Reformer
        • RemBERT
        • RetriBERT
        • RoBERTa
        • RoBERTa-PreLayerNorm
        • RoCBert
        • RoFormer
        • RWKV
        • Splinter
        • SqueezeBERT
        • SwitchTransformers
        • T5
        • T5v1.1
        • TAPEX
        • Transformer XL
        • UL2
        • UMT5
        • X-MOD
        • XGLM
        • XLM
        • XLM-ProphetNet
        • XLM-RoBERTa
        • XLM-RoBERTa-XL
        • XLM-V
        • XLNet
        • YOSO
      • ๐ŸŒVISION MODELS
        • BEiT
        • BiT
        • Conditional DETR
        • ConvNeXT
        • ConvNeXTV2
        • CvT
        • Deformable DETR
        • DeiT
        • DETA
        • DETR
        • DiNAT
        • DINO V2
        • DiT
        • DPT
        • EfficientFormer
        • EfficientNet
        • FocalNet
        • GLPN
        • ImageGPT
        • LeViT
        • Mask2Former
        • MaskFormer
        • MobileNetV1
        • MobileNetV2
        • MobileViT
        • MobileViTV2
        • NAT
        • PoolFormer
        • Pyramid Vision Transformer (PVT)
        • RegNet
        • ResNet
        • SegFormer
        • SwiftFormer
        • Swin Transformer
        • Swin Transformer V2
        • Swin2SR
        • Table Transformer
        • TimeSformer
        • UperNet
        • VAN
        • VideoMAE
        • Vision Transformer (ViT)
        • ViT Hybrid
        • ViTDet
        • ViTMAE
        • ViTMatte
        • ViTMSN
        • ViViT
        • YOLOS
      • ๐ŸŒAUDIO MODELS
        • Audio Spectrogram Transformer
        • Bark
        • CLAP
        • EnCodec
        • Hubert
        • MCTCT
        • MMS
        • MusicGen
        • Pop2Piano
        • SEW
        • SEW-D
        • Speech2Text
        • Speech2Text2
        • SpeechT5
        • UniSpeech
        • UniSpeech-SAT
        • VITS
        • Wav2Vec2
        • Wav2Vec2-Conformer
        • Wav2Vec2Phoneme
        • WavLM
        • Whisper
        • XLS-R
        • XLSR-Wav2Vec2
      • ๐ŸŒMULTIMODAL MODELS
        • ALIGN
        • AltCLIP
        • BLIP
        • BLIP-2
        • BridgeTower
        • BROS
        • Chinese-CLIP
        • CLIP
        • CLIPSeg
        • Data2Vec
        • DePlot
        • Donut
        • FLAVA
        • GIT
        • GroupViT
        • IDEFICS
        • InstructBLIP
        • LayoutLM
        • LayoutLMV2
        • LayoutLMV3
        • LayoutXLM
        • LiLT
        • LXMERT
        • MatCha
        • MGP-STR
        • Nougat
        • OneFormer
        • OWL-ViT
        • Perceiver
        • Pix2Struct
        • Segment Anything
        • Speech Encoder Decoder Models
        • TAPAS
        • TrOCR
        • TVLT
        • ViLT
        • Vision Encoder Decoder Models
        • Vision Text Dual Encoder
        • VisualBERT
        • X-CLIP
      • ๐ŸŒREINFORCEMENT LEARNING MODELS
        • Decision Transformer
        • Trajectory Transformer
      • ๐ŸŒTIME SERIES MODELS
        • Autoformer
        • Informer
        • Time Series Transformer
      • ๐ŸŒGRAPH MODELS
        • Graphormer
  • ๐ŸŒINTERNAL HELPERS
    • Custom Layers and Utilities
    • Utilities for pipelines
    • Utilities for Tokenizers
    • Utilities for Trainer
    • Utilities for Generation
    • Utilities for Image Processors
    • Utilities for Audio processing
    • General Utilities
    • Utilities for Time Series
Powered by GitBook
On this page
  • ByT5
  • Overview
  • ByT5Tokenizer
  1. API
  2. MODELS
  3. TEXT MODELS

ByT5

PreviousBORTNextCamemBERT

Last updated 1 year ago

ByT5

Overview

The ByT5 model was presented in by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.

The abstract from the paper is the following:

Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.

This model was contributed by . The original code can be found .

ByT5โ€™s architecture is based on the T5v1.1 model, so one can refer to . They only differ in how inputs should be prepared for the model, see the code examples below.

Since ByT5 was pre-trained unsupervisedly, thereโ€™s no real advantage to using a task prefix during single-task fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.

Example

ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:

Copied

>>> from transformers import T5ForConditionalGeneration
>>> import torch

>>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

>>> num_special_tokens = 3
>>> # Model has 3 special tokens which take up the input ids 0,1,2 of ByT5.
>>> # => Need to shift utf-8 character encodings by 3 before passing ids to model.

>>> input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + num_special_tokens

>>> labels = torch.tensor([list("La vie est comme une boรฎte de chocolat.".encode("utf-8"))]) + num_special_tokens

>>> loss = model(input_ids, labels=labels).loss
>>> loss.item()
2.66

For batched inference and training it is however recommended to make use of the tokenizer:

Copied

>>> from transformers import T5ForConditionalGeneration, AutoTokenizer

>>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
>>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

>>> model_inputs = tokenizer(
...     ["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt"
... )
>>> labels_dict = tokenizer(
...     ["La vie est comme une boรฎte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt"
... )
>>> labels = labels_dict.input_ids

>>> loss = model(**model_inputs, labels=labels).loss
>>> loss.item()
17.9

Copied

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-base")

>>> input_ids_prompt = "The dog chases a ball in the park."
>>> input_ids = tokenizer(input_ids_prompt).input_ids

>>> # Note that we cannot add "{extra_id_...}" to the string directly
>>> # as the Byte tokenizer would incorrectly merge the tokens
>>> # For ByT5, we need to work directly on the character level
>>> # Contrary to T5, ByT5 does not use sentinel tokens for masking, but instead
>>> # uses final utf character ids.
>>> # UTF-8 is represented by 8 bits and ByT5 has 3 special tokens.
>>> # => There are 2**8+2 = 259 input ids and mask tokens count down from index 258.
>>> # => mask to "The dog [258]a ball [257]park."

>>> input_ids = torch.tensor([input_ids[:8] + [258] + input_ids[14:21] + [257] + input_ids[28:]])
>>> input_ids
tensor([[ 87, 107, 104,  35, 103, 114, 106,  35, 258,  35, 100,  35, 101, 100, 111, 111, 257,  35, 115, 100, 117, 110,  49,   1]])

>>> # ByT5 produces only one char at a time so we need to produce many more output characters here -> set `max_length=100`.
>>> output_ids = model.generate(input_ids, max_length=100)[0].tolist()
>>> output_ids
[0, 258, 108, 118,  35, 119, 107, 104,  35, 114, 113, 104,  35, 122, 107, 114,  35, 103, 114, 104, 118, 257,  35, 108, 113,  35, 119, 107, 104,  35, 103, 108, 118, 102, 114, 256, 108, 113,  35, 119, 107, 104, 35, 115, 100, 117, 110,  49,  35,  87, 107, 104,  35, 103, 114, 106, 35, 108, 118,  35, 119, 107, 104,  35, 114, 113, 104,  35, 122, 107, 114,  35, 103, 114, 104, 118,  35, 100,  35, 101, 100, 111, 111,  35, 108, 113, 255,  35, 108, 113,  35, 119, 107, 104,  35, 115, 100, 117, 110,  49]

>>> # ^- Note how 258 descends to 257, 256, 255

>>> # Now we need to split on the sentinel tokens, let's write a short loop for this
>>> output_ids_list = []
>>> start_token = 0
>>> sentinel_token = 258
>>> while sentinel_token in output_ids:
...     split_idx = output_ids.index(sentinel_token)
...     output_ids_list.append(output_ids[start_token:split_idx])
...     start_token = split_idx
...     sentinel_token -= 1

>>> output_ids_list.append(output_ids[start_token:])
>>> output_string = tokenizer.batch_decode(output_ids_list)
>>> output_string
['<pad>', 'is the one who does', ' in the disco', 'in the park. The dog is the one who does a ball in', ' in the park.']

ByT5Tokenizer

class transformers.ByT5Tokenizer

( eos_token = '</s>'unk_token = '<unk>'pad_token = '<pad>'extra_ids = 125additional_special_tokens = None**kwargs )

Parameters

  • eos_token (str, optional, defaults to "</s>") โ€” The end of sequence token.

    When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.

  • unk_token (str, optional, defaults to "<unk>") โ€” The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • pad_token (str, optional, defaults to "<pad>") โ€” The token used for padding, for example when batching sequences of different lengths.

  • additional_special_tokens (List[str], optional) โ€” Additional special tokens used by the tokenizer.

Construct a ByT5 tokenizer. ByT5 simply uses raw bytes utf-8 encoding.

build_inputs_with_special_tokens

( token_ids_0: typing.List[int]token_ids_1: typing.Optional[typing.List[int]] = None ) โ†’ List[int]

Parameters

  • token_ids_0 (List[int]) โ€” List of IDs to which the special tokens will be added.

  • token_ids_1 (List[int], optional) โ€” Optional second list of IDs for sequence pairs.

Returns

List[int]

Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. A sequence has the following format:

  • single sequence: X </s>

  • pair of sequences: A </s> B </s>

convert_tokens_to_string

( tokens )

Converts a sequence of tokens (string) in a single string.

create_token_type_ids_from_sequences

( token_ids_0: typing.List[int]token_ids_1: typing.Optional[typing.List[int]] = None ) โ†’ List[int]

Parameters

  • token_ids_0 (List[int]) โ€” List of IDs.

  • token_ids_1 (List[int], optional) โ€” Optional second list of IDs for sequence pairs.

Returns

List[int]

List of zeros.

Create a mask from the two sequences passed to be used in a sequence-pair classification task. ByT5 does not make use of token type ids, therefore a list of zeros is returned.

get_special_tokens_mask

( token_ids_0: typing.List[int]token_ids_1: typing.Optional[typing.List[int]] = Nonealready_has_special_tokens: bool = False ) โ†’ List[int]

Parameters

  • token_ids_0 (List[int]) โ€” List of IDs.

  • token_ids_1 (List[int], optional) โ€” Optional second list of IDs for sequence pairs.

  • already_has_special_tokens (bool, optional, defaults to False) โ€” Whether or not the token list is already formatted with special tokens for the model.

Returns

List[int]

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.

Similar to , ByT5 was trained on the span-mask denoising task. However, since the model works directly on characters, the pretraining task is a bit different. Letโ€™s corrupt some characters of the input sentence "The dog chases a ball in the park." and ask ByT5 to predict them for us.

extra_ids (int, optional, defaults to 100) โ€” Add a number of extra ids added to the end of the vocabulary for use as sentinels. These tokens are accessible as โ€œid{%d}>โ€ where โ€{%d}โ€ is a number between 0 and extra_ids-1. Extra tokens are indexed from the end of the vocabulary up to beginning (โ€œโ€ is the last token in the vocabulary like in ByT5 preprocessing see ).

This tokenizer inherits from which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

List of with the appropriate special tokens.

See for all details.

๐ŸŒ
๐ŸŒ
๐ŸŒ
ByT5: Towards a token-free future with pre-trained byte-to-byte models
patrickvonplaten
here
T5v1.1โ€™s documentation page
T5
<source>
here
PreTrainedTokenizer
<source>
input IDs
<source>
<source>
<source>
ByT5Tokenizer