Decoders

BPEDecoder

class tokenizers.decoders.BPEDecoder

( suffix = '</w>' )

Parameters

  • suffix (str, optional, defaults to </w>): The suffix that was used to characterize an end-of-word. This suffix will be replaced by whitespace during decoding.

BPEDecoder Decoder
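
To illustrate the suffix handling described above, here is a minimal pure-Python sketch of the decoding step. The function name bpe_decode is hypothetical and not part of the tokenizers library, and stripping the trailing space is an assumption; the library's own implementation may differ in such details.

```python
def bpe_decode(tokens, suffix="</w>"):
    """Join BPE tokens, turning each end-of-word suffix into a space.

    Illustrative sketch only, not the library implementation.
    """
    # Concatenate the pieces, then turn every end-of-word marker into a space.
    text = "".join(tokens).replace(suffix, " ")
    # Assumption: the space left by the final suffix is dropped.
    return text.rstrip()


if __name__ == "__main__":
    print(bpe_decode(["hel", "lo</w>", "wor", "ld</w>"]))
```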

ByteLevel

class tokenizers.decoders.ByteLevel

( )

ByteLevel Decoder

This decoder is to be used in tandem with the ByteLevel PreTokenizer.
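
The sketch below shows why the two components belong together: the pre-tokenizer represents every byte as a printable character, and the decoder inverts that table. It assumes the GPT-2 style byte-to-character mapping; the helper names bytes_to_unicode and byte_level_decode are hypothetical and chosen for illustration only.

```python
def bytes_to_unicode():
    """Build a byte -> printable character table (GPT-2 style, assumed here)."""
    # Bytes that are already printable keep their own code point.
    printable = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte is shifted to a code point above 255.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


def byte_level_decode(tokens):
    """Map each character back to its byte, then decode the bytes as UTF-8."""
    char_to_byte = {ch: b for b, ch in bytes_to_unicode().items()}
    raw = bytes(char_to_byte[ch] for ch in "".join(tokens))
    # Invalid byte sequences are replaced rather than raising (an assumption).
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # "\u0120" is the character that stands in for a space byte in this table.
    print(byte_level_decode(["Hello", "\u0120world"]))
```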

CTC

class tokenizers.decoders.CTC

( pad_token = '<pad>', word_delimiter_token = '|', cleanup = True )

Parameters

  • pad_token (str, optional, defaults to <pad>): The pad token used by CTC to delimit a new token.

  • word_delimiter_token (str, optional, defaults to |): The word delimiter token. It will be replaced by a space.

  • cleanup (bool, optional, defaults to True): Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.

CTC Decoder
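
The roles of pad_token and word_delimiter_token can be sketched in pure Python as follows. This is an illustrative approximation of CTC-style decoding (collapse repeats, drop pads, turn delimiters into spaces), not the library's code; ctc_decode is a hypothetical name and the cleanup step is deliberately left out.

```python
from itertools import groupby


def ctc_decode(tokens, pad_token="<pad>", word_delimiter_token="|"):
    """Collapse repeated tokens, drop pad tokens, and restore spaces.

    Illustrative sketch only; the cleanup option is not modeled here.
    """
    # 1. Consecutive duplicates count as a single emission.
    collapsed = [token for token, _ in groupby(tokens)]
    # 2. Pad tokens only separate emissions, so they are removed.
    kept = [token for token in collapsed if token != pad_token]
    # 3. The word delimiter marks a word boundary.
    return "".join(kept).replace(word_delimiter_token, " ").strip()


if __name__ == "__main__":
    frames = ["<pad>", "h", "h", "e", "l", "l", "<pad>", "l", "o", "o",
              "|", "|", "w", "o", "r", "l", "d", "<pad>"]
    print(ctc_decode(frames))
```

Note that the pad token between the two runs of "l" is what keeps the double letter from being collapsed into one.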

Metaspace

class tokenizers.decoders.Metaspace

( )

Parameters

  • replacement (str, optional, defaults to ▁): The replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (same as in SentencePiece).

  • add_prefix_space (bool, optional, defaults to True): Whether to add a space to the first word if there isn't already one. This lets us treat "hello" exactly like "say hello".

Metaspace Decoder
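
A minimal pure-Python sketch of how the two parameters above might interact during decoding is shown below. The name metaspace_decode is hypothetical, and treating add_prefix_space as "remove the leading space that was added before encoding" is an assumption about the decoding side, not a statement of the library's implementation.

```python
def metaspace_decode(tokens, replacement="\u2581", add_prefix_space=True):
    """Turn the meta symbol back into spaces.

    Illustrative sketch only; "\u2581" is the default meta symbol (U+2581).
    """
    if len(replacement) != 1:
        raise ValueError("replacement must be exactly one character")
    text = "".join(tokens).replace(replacement, " ")
    # Assumption: a prefix space added before encoding is removed again here.
    if add_prefix_space and text.startswith(" "):
        text = text[1:]
    return text


if __name__ == "__main__":
    print(metaspace_decode(["\u2581Hello", "\u2581my", "\u2581fri", "end"]))
```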

WordPiece

class tokenizers.decoders.WordPiece

( prefix = '##', cleanup = True )

Parameters

  • prefix (str, optional, defaults to ##): The prefix to use for subwords that are not a beginning-of-word.

  • cleanup (bool, optional, defaults to True): Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.

WordPiece Decoder
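
The effect of prefix and cleanup can be sketched in pure Python as follows. The function wordpiece_decode is hypothetical, and the cleanup rules shown are a small illustrative subset chosen for this example; the library's actual rule set may be different.

```python
def wordpiece_decode(tokens, prefix="##", cleanup=True):
    """Merge continuation subwords and optionally tidy the result.

    Illustrative sketch only, not the library implementation.
    """
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            # A continuation piece is glued onto the previous word.
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    text = " ".join(words)
    if cleanup:
        # Assumed subset of rules: no space before common punctuation...
        for mark in (".", ",", "!", "?"):
            text = text.replace(" " + mark, mark)
        # ...and re-attached abbreviated English forms.
        for form in ("n't", "'m", "'s", "'ve", "'re"):
            text = text.replace(" " + form, form)
    return text


if __name__ == "__main__":
    print(wordpiece_decode(["token", "##izer", "##s", "are", "fun", "!"]))
```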
