ByT5

Overview

The ByT5 model was presented in ByT5: Towards a token-free future with pre-trained byte-to-byte models by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.

The abstract from the paper is the following:

Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.

This model was contributed by patrickvonplaten. The original code can be found here.

ByT5's architecture is based on the T5v1.1 model, so one can refer to T5v1.1's documentation page. The two only differ in how inputs must be prepared for the model; see the code examples below.

Since ByT5 was pre-trained without supervision, there's no real advantage to using a task prefix during single-task fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.
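For multi-task fine-tuning, the prefix is simply prepended to the raw input text before encoding. A minimal sketch (the prefix strings are illustrative):

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

>>> # Prepend a short task prefix to each example; ByT5 encodes it as plain UTF-8 bytes.
>>> batch = [
...     "translate English to French: Life is great.",
...     "summarize: ByT5 operates directly on raw bytes.",
... ]
>>> inputs = tokenizer(batch, padding="longest", return_tensors="pt")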

Example

ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:


>>> from transformers import T5ForConditionalGeneration
>>> import torch

>>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

>>> num_special_tokens = 3
>>> # Model has 3 special tokens which take up the input ids 0,1,2 of ByT5.
>>> # => Need to shift utf-8 character encodings by 3 before passing ids to model.

>>> input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + num_special_tokens

>>> labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + num_special_tokens

>>> loss = model(input_ids, labels=labels).loss
>>> loss.item()
2.66

For batched inference and training, however, it is recommended to use the tokenizer:

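A minimal sketch of batched usage, assuming the same google/byt5-small checkpoint as above (the second sentence pair is illustrative):

>>> from transformers import T5ForConditionalGeneration, AutoTokenizer

>>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
>>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

>>> # The tokenizer applies the +3 special-token offset and pads the batch.
>>> model_inputs = tokenizer(
...     ["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt"
... )
>>> labels = tokenizer(
...     ["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt"
... ).input_ids

>>> # For real training you would typically replace the pad ids (0) in labels with -100
>>> # so that padding does not contribute to the loss.
>>> loss = model(**model_inputs, labels=labels).loss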

Similar to T5, ByT5 was pre-trained on the span-mask denoising task. However, since the model works directly on bytes, the pretraining task is a bit different. Let's corrupt some characters of the input sentence "The dog chases a ball in the park." and ask ByT5 to predict them for us.

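A sketch of such a setup is shown below. It relies on the convention, described in the ByT5 paper, of reusing the final byte ids as sentinels, so mask tokens count down from id 258 (byte 255 plus the 3 special tokens); the chosen span boundaries are illustrative:

>>> import torch
>>> from transformers import AutoTokenizer, T5ForConditionalGeneration

>>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
>>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

>>> input_ids = tokenizer("The dog chases a ball in the park.").input_ids

>>> # Replace two character spans with sentinel ids counting down from 258:
>>> # "The dog [258] a ball[257] park."
>>> corrupted = torch.tensor([input_ids[:8] + [258] + input_ids[14:21] + [257] + input_ids[28:]])

>>> # The model is expected to fill in the masked characters after each sentinel.
>>> output_ids = model.generate(corrupted, max_length=100)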

ByT5Tokenizer

class transformers.ByT5Tokenizer


( eos_token = '</s>', unk_token = '<unk>', pad_token = '<pad>', extra_ids = 125, additional_special_tokens = None, **kwargs )

Parameters

  • eos_token (str, optional, defaults to "</s>") — The end of sequence token.

    When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.

  • unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.

  • extra_ids (int, optional, defaults to 125) — Adds a number of extra ids to the end of the vocabulary for use as sentinels. These tokens are accessible as "<extra_id_{%d}>", where "{%d}" is a number between 0 and extra_ids-1. Extra tokens are indexed from the end of the vocabulary up to the beginning ("<extra_id_0>" is the last token in the vocabulary, as in ByT5 preprocessing, see here).

  • additional_special_tokens (List[str], optional) — Additional special tokens used by the tokenizer.

Construct a ByT5 tokenizer. ByT5 simply uses raw UTF-8 byte encoding.

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
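As a quick illustration of the byte-level vocabulary (a small sketch; constructing the tokenizer with default arguments mirrors the released checkpoints):

>>> from transformers import ByT5Tokenizer

>>> tokenizer = ByT5Tokenizer()

>>> # Each UTF-8 byte is shifted by 3 to make room for pad (0), eos (1) and unk (2);
>>> # the eos id is appended at the end.
>>> tokenizer("hello").input_ids
[107, 104, 111, 111, 114, 1]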

build_inputs_with_special_tokens


( token_ids_0: typing.List[int], token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]

Parameters

  • token_ids_0 (List[int]) — List of IDs to which the special tokens will be added.

  • token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.

Returns

List[int]

List of input IDs with the appropriate special tokens.

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A sequence has the following format (see the example after the list):

  • single sequence: X </s>

  • pair of sequences: A </s> B </s>
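For instance (a short sketch; ByT5's "</s>" token has id 1):

>>> from transformers import ByT5Tokenizer

>>> tokenizer = ByT5Tokenizer()

>>> ids = tokenizer("hi", add_special_tokens=False).input_ids  # [107, 108]
>>> # A single sequence only gets the eos id appended: X </s>
>>> tokenizer.build_inputs_with_special_tokens(ids)
[107, 108, 1]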

convert_tokens_to_string


( tokens )

Converts a sequence of tokens (strings) into a single string.
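For example (a sketch; ByT5 tokens are the single byte characters produced by tokenize()):

>>> from transformers import ByT5Tokenizer

>>> tokenizer = ByT5Tokenizer()

>>> tokens = tokenizer.tokenize("hello")  # ['h', 'e', 'l', 'l', 'o']
>>> tokenizer.convert_tokens_to_string(tokens)
'hello'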

create_token_type_ids_from_sequences


( token_ids_0: typing.List[int], token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]

Parameters

  • token_ids_0 (List[int]) — List of IDs.

  • token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.

Returns

List[int]

List of zeros.

Create a mask from the two sequences passed to be used in a sequence-pair classification task. ByT5 does not make use of token type ids, therefore a list of zeros is returned.
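For example (a sketch; the mask length includes the eos position that would be appended):

>>> from transformers import ByT5Tokenizer

>>> tokenizer = ByT5Tokenizer()

>>> # Two sequence ids plus the eos slot -> three zeros
>>> tokenizer.create_token_type_ids_from_sequences([107, 108])
[0, 0, 0]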

get_special_tokens_mask


( token_ids_0: typing.List[int], token_ids_1: typing.Optional[typing.List[int]] = None, already_has_special_tokens: bool = False ) → List[int]

Parameters

  • token_ids_0 (List[int]) — List of IDs.

  • token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.

  • already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model.

Returns

List[int]

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.
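For example (a sketch over a short id list without special tokens):

>>> from transformers import ByT5Tokenizer

>>> tokenizer = ByT5Tokenizer()

>>> # Sequence positions are marked 0; the eos slot that would be added is marked 1.
>>> tokenizer.get_special_tokens_mask([107, 108])
[0, 0, 1]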

See ByT5Tokenizer for all details.
