Decoders

BPEDecoder

class tokenizers.decoders.BPEDecoder

( suffix = '</w>' )

Parameters

suffix (str, optional, defaults to </w>) — The suffix that was used to characterize an end-of-word. This suffix will be replaced by whitespace during decoding.

BPEDecoder Decoder
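
To make the suffix handling concrete, here is a minimal pure-Python sketch of the behavior described above. It is an illustration only, not the library's implementation: the function name bpe_decode is hypothetical, and the trailing-space handling is an assumption.

```python
def bpe_decode(tokens, suffix="</w>"):
    """Illustrative stand-in for BPEDecoder (hypothetical helper, not the library code)."""
    # Concatenate the sub-word tokens, then turn each end-of-word suffix into a space.
    text = "".join(tokens).replace(suffix, " ")
    # Assumption: the space left behind by the final word's suffix is dropped.
    return text.rstrip(" ")


print(bpe_decode(["hel", "lo</w>", "wor", "ld</w>"]))
```

With the library installed, the corresponding call would be the decoder's decode method on the same list of tokens, for example BPEDecoder(suffix="</w>").decode(tokens).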

ByteLevel

class tokenizers.decoders.ByteLevel

( )

ByteLevel Decoder

This decoder is to be used in tandem with the ByteLevel PreTokenizer.
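
The pairing with the ByteLevel PreTokenizer matters because that pre-tokenizer rewrites every raw byte as a printable character, and the decoder has to invert the mapping. The sketch below assumes the byte-to-character table popularized by GPT-2 (for instance, a space becomes Ġ). The helper names are hypothetical and this is not the library's code.

```python
def bytes_to_unicode():
    """Build the byte -> printable character table (assumed to match GPT-2's)."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are shifted to code points starting at 256.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Inverse table used for decoding: printable character -> original byte.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def byte_level_decode(tokens):
    """Illustrative stand-in for the ByteLevel decoder (hypothetical helper)."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in "".join(tokens))
    # Invalid UTF-8 sequences are replaced rather than raising an error.
    return raw.decode("utf-8", errors="replace")


print(byte_level_decode(["Hello", "Ġworld"]))
```

Tokens produced by any other pre-tokenizer would not go through this table, which is why the two components are meant to be used together.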

CTC

class tokenizers.decoders.CTC

( pad_token = '<pad>', word_delimiter_token = '|', cleanup = True )

Parameters

pad_token (str, optional, defaults to <pad>) — The pad token used by CTC to delimit a new token.
word_delimiter_token (str, optional, defaults to |) — The word delimiter token. It will be replaced by a space.
cleanup (bool, optional, defaults to True) — Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.

CTC Decoder
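
The roles of the pad token and the word delimiter are easier to see in a small example. The sketch below follows the usual CTC collapsing recipe: merge consecutive repeats, drop the pad token, then turn the word delimiter into a space. The function name ctc_decode is hypothetical, the cleanup step is left out for brevity, and this is not the library's implementation.

```python
from itertools import groupby


def ctc_decode(tokens, pad_token="<pad>", word_delimiter_token="|"):
    """Illustrative stand-in for the CTC decoder (hypothetical helper, no cleanup step)."""
    # 1. Collapse runs of identical consecutive tokens into a single token.
    collapsed = [tok for tok, _ in groupby(tokens)]
    # 2. Drop the pad token, which only separates genuine repeats.
    kept = [tok for tok in collapsed if tok != pad_token]
    # 3. Replace the word delimiter with a space.
    return "".join(kept).replace(word_delimiter_token, " ").strip()


print(ctc_decode(["h", "h", "<pad>", "i", "|", "y", "o", "o", "u", "<pad>"]))
print(ctc_decode(["h", "e", "l", "l", "<pad>", "l", "o"]))
```

The second call shows why the pad token is needed: without it between the two "l" tokens, the double letter in "hello" would collapse into one.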

Metaspace

class tokenizers.decoders.Metaspace

( replacement = '▁', add_prefix_space = True )

Parameters

replacement (str, optional, defaults to ▁) — The replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (Same as in SentencePiece).
add_prefix_space (bool, optional, defaults to True) — Whether to add a space to the first word if there isn’t already one. This lets us treat "hello" exactly like "say hello".

Metaspace Decoder
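
As a rough illustration of the two parameters, the sketch below maps the replacement character back to spaces and, when add_prefix_space is set, removes the single leading space that was added in front of the first word. The function name metaspace_decode is hypothetical and the leading-space handling is an assumption about the decoding direction, not the library's code.

```python
def metaspace_decode(tokens, replacement="▁", add_prefix_space=True):
    """Illustrative stand-in for the Metaspace decoder (hypothetical helper)."""
    # Every meta symbol marks a position where a space originally stood.
    text = "".join(tokens).replace(replacement, " ")
    # Assumption: the space added before the first word is removed again on decoding.
    if add_prefix_space and text.startswith(" "):
        text = text[1:]
    return text


print(metaspace_decode(["▁Hello", "▁my", "▁friend"]))
```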

WordPiece

class tokenizers.decoders.WordPiece

( prefix = '##', cleanup = True )

Parameters

prefix (str, optional, defaults to ##) — The prefix to use for subwords that are not a beginning-of-word.
cleanup (bool, optional, defaults to True) — Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.

WordPiece Decoder
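
The interplay between prefix and cleanup can be sketched as follows. Continuation pieces carrying the prefix are glued onto the previous token, other tokens start a new word, and cleanup then removes a few spacing artifacts. The function name wordpiece_decode is hypothetical and the cleanup table is a simplified subset chosen for illustration, not the library's exact rule set.

```python
# Simplified, assumed subset of cleanup rules: (artifact, replacement).
CLEANUP_RULES = (
    (" .", "."),
    (" ,", ","),
    (" !", "!"),
    (" ?", "?"),
    (" n't", "n't"),
    (" 's", "'s"),
)


def wordpiece_decode(tokens, prefix="##", cleanup=True):
    """Illustrative stand-in for the WordPiece decoder (hypothetical helper)."""
    pieces = []
    for i, tok in enumerate(tokens):
        if i == 0:
            # The first token always starts the text.
            pieces.append(tok)
        elif tok.startswith(prefix):
            # Continuation piece: strip the prefix and attach to the previous token.
            pieces.append(tok[len(prefix):])
        else:
            # Beginning of a new word: separate it with a space.
            pieces.append(" " + tok)
    text = "".join(pieces)
    if cleanup:
        for artifact, fixed in CLEANUP_RULES:
            text = text.replace(artifact, fixed)
    return text


print(wordpiece_decode(["un", "##believ", "##able", "!"]))
print(wordpiece_decode(["un", "##believ", "##able", "!"], cleanup=False))
```

Comparing the two calls shows what cleanup contributes: the pieces are merged in both cases, but only the first call removes the space before the exclamation mark.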

Last updated 1 year ago