Decoders
BPEDecoder
class tokenizers.decoders.BPEDecoder
( suffix = '</w>' )
Parameters
suffix (str, optional, defaults to </w>): The suffix that was used to characterize an end-of-word. This suffix will be replaced by whitespace during decoding.
BPEDecoder Decoder
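The suffix replacement described above can be illustrated with a small pure-Python sketch. It is not the library's implementation: the helper name `bpe_decode` is hypothetical, and stripping the trailing space left by the last end-of-word marker is an assumption about the desired output.

```python
def bpe_decode(tokens, suffix="</w>"):
    """Join BPE tokens, turning each end-of-word suffix into a space.

    Hypothetical helper for illustration only, not the tokenizers API.
    """
    text = "".join(tokens).replace(suffix, " ")
    # Assumption: the space produced by the final end-of-word marker is dropped.
    return text.rstrip(" ")


print(bpe_decode(["hel", "lo</w>", "wor", "ld</w>"]))  # hello world
```

With the real library, the equivalent object would be constructed as `tokenizers.decoders.BPEDecoder(suffix="</w>")` and attached to a tokenizer.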
ByteLevel
class tokenizers.decoders.ByteLevel
( )
ByteLevel Decoder
This decoder is to be used in tandem with the ByteLevel PreTokenizer.
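To see why this decoder must be paired with the ByteLevel PreTokenizer, the sketch below reverses a GPT-2 style byte-to-character mapping in pure Python. It assumes the pre-tokenizer used that mapping (each byte shown as one printable character, with a space appearing as "Ġ"); the helper names `bytes_to_unicode` and `byte_level_decode` are illustrative, not part of the tokenizers API.

```python
def bytes_to_unicode():
    """Build the GPT-2 style table mapping each byte to a printable character."""
    # Bytes that are already printable keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to an unused code point above 255.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def byte_level_decode(tokens):
    """Map each character back to its byte, then decode the bytes as UTF-8."""
    char_to_byte = {ch: b for b, ch in bytes_to_unicode().items()}
    raw = bytes(char_to_byte[ch] for ch in "".join(tokens))
    return raw.decode("utf-8", errors="replace")


print(byte_level_decode(["Hello", "\u0120world"]))  # Hello world
```

Tokens produced by a different pre-tokenizer would not follow this table, which is why the two components are meant to be used together.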
CTC
class tokenizers.decoders.CTC
( pad_token = '<pad>', word_delimiter_token = '|', cleanup = True )
Parameters
pad_token (str, optional, defaults to <pad>): The pad token used by CTC to delimit a new token.
word_delimiter_token (str, optional, defaults to |): The word delimiter token. It will be replaced by a space.
cleanup (bool, optional, defaults to True): Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.
CTC Decoder
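The roles of `pad_token` and `word_delimiter_token` can be sketched in pure Python. This is a simplified illustration of standard CTC collapsing, not the library's code: the function name `ctc_decode` is hypothetical and the `cleanup` step is left out.

```python
from itertools import groupby


def ctc_decode(tokens, pad_token="<pad>", word_delimiter_token="|"):
    """Collapse a CTC token sequence into text (illustrative sketch)."""
    # 1. Collapse runs of identical consecutive tokens.
    collapsed = [tok for tok, _ in groupby(tokens)]
    # 2. Drop pad tokens; they only separate genuinely repeated tokens.
    kept = [tok for tok in collapsed if tok != pad_token]
    # 3. Replace word delimiters with spaces and join everything.
    text = "".join(" " if tok == word_delimiter_token else tok for tok in kept)
    return text.strip()


frames = ["<pad>", "h", "h", "e", "l", "l", "<pad>", "l", "o", "o",
          "|", "|", "w", "o", "r", "l", "d", "<pad>"]
print(ctc_decode(frames))  # hello world
```

Note how the `<pad>` between the two `l` runs is what preserves the double letter in "hello".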
Metaspace
class tokenizers.decoders.Metaspace
( )
Parameters
replacement (str, optional, defaults to ▁): The replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (same as in SentencePiece).
add_prefix_space (bool, optional, defaults to True): Whether to add a space to the first word if there isn't already one. This lets us treat hello exactly like say hello.
Metaspace Decoder
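A minimal pure-Python sketch of the idea follows. It assumes decoding reverses the pre-tokenization step: every replacement character becomes a space, and the space that `add_prefix_space` introduced before the first word is removed. The name `metaspace_decode` is hypothetical.

```python
def metaspace_decode(tokens, replacement="\u2581", add_prefix_space=True):
    """Turn the meta symbol back into spaces (illustrative sketch)."""
    text = "".join(tokens).replace(replacement, " ")
    # Assumption: the leading space was added artificially, so drop it.
    if add_prefix_space and text.startswith(" "):
        text = text[1:]
    return text


tokens = ["\u2581Hello", "\u2581my", "\u2581fri", "end"]
print(metaspace_decode(tokens))  # Hello my friend
```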
WordPiece
class tokenizers.decoders.WordPiece
( prefix = '##', cleanup = True )
Parameters
prefix (str, optional, defaults to ##): The prefix to use for subwords that are not a beginning-of-word.
cleanup (bool, optional, defaults to True): Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.
WordPiece Decoder
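The effect of `prefix` and `cleanup` can be sketched in pure Python. This is an illustration under stated assumptions, not the library's implementation: `wordpiece_decode` and `cleanup_text` are hypothetical names, and the cleanup shown covers only spaces before a few punctuation marks, not the abbreviated English forms.

```python
import re


def cleanup_text(text):
    """Assumed subset of the cleanup: remove the space before . , ! ?"""
    return re.sub(r" ([.,!?])", r"\1", text)


def wordpiece_decode(tokens, prefix="##", cleanup=True):
    """Merge continuation subwords into words, then join words with spaces."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            # Continuation piece: glue it onto the previous word.
            words[-1] += token[len(prefix):]
        else:
            # Beginning-of-word piece: start a new word.
            words.append(token)
    text = " ".join(words)
    return cleanup_text(text) if cleanup else text


print(wordpiece_decode(["un", "##believ", "##able", "story"]))  # unbelievable story
print(wordpiece_decode(["hello", ",", "wor", "##ld", "!"]))     # hello, world!
```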