Normalizers
Normalizers
PythonRustNode
BertNormalizer
class tokenizers.normalizers.BertNormalizer
( clean_text = Truehandle_chinese_chars = Truestrip_accents = Nonelowercase = True )
Parameters
clean_text (
bool
, optional, defaults toTrue
) — Whether to clean the text, by removing any control characters and replacing all whitespaces by the classic one.handle_chinese_chars (
bool
, optional, defaults toTrue
) — Whether to handle chinese chars by putting spaces around them.strip_accents (
bool
, optional) — Whether to strip all accents. If this option is not specified (ie == None), then it will be determined by the value for lowercase (as in the original Bert).lowercase (
bool
, optional, defaults toTrue
) — Whether to lowercase.
BertNormalizer
Takes care of normalizing raw text before giving it to a Bert model. This includes cleaning the text, handling accents, chinese chars and lowercasing
Lowercase
class tokenizers.normalizers.Lowercase
( )
Lowercase Normalizer
NFC
class tokenizers.normalizers.NFC
( )
NFC Unicode Normalizer
NFD
class tokenizers.normalizers.NFD
( )
NFD Unicode Normalizer
NFKC
class tokenizers.normalizers.NFKC
( )
NFKC Unicode Normalizer
NFKD
class tokenizers.normalizers.NFKD
( )
NFKD Unicode Normalizer
Nmt
class tokenizers.normalizers.Nmt
( )
Nmt normalizer
Normalizer
class tokenizers.normalizers.Normalizer
( )
Base class for all normalizers
This class is not supposed to be instantiated directly. Instead, any implementation of a Normalizer will return an instance of this class when instantiated.
normalize
( normalized )
Parameters
normalized (
NormalizedString
) — The normalized string on which to apply this Normalizer
Normalize a NormalizedString
in-place
This method allows to modify a NormalizedString
to keep track of the alignment information. If you just want to see the result of the normalization on a raw string, you can use normalize_str()
normalize_str
( sequence ) → str
Parameters
sequence (
str
) — A string to normalize
Returns
str
A string after normalization
Normalize the given string
This method provides a way to visualize the effect of a Normalizer but it does not keep track of the alignment information. If you need to get/convert offsets, you can use normalize()
Precompiled
class tokenizers.normalizers.Precompiled
( precompiled_charsmap )
Precompiled normalizer Don’t use manually it is used for compatiblity for SentencePiece.
Replace
class tokenizers.normalizers.Replace
( patterncontent )
Replace normalizer
Sequence
class tokenizers.normalizers.Sequence
( )
Parameters
normalizers (
List[Normalizer]
) — A list of Normalizer to be run as a sequence
Allows concatenating multiple other Normalizer as a Sequence. All the normalizers run in sequence in the given order
Strip
class tokenizers.normalizers.Strip
( left = Trueright = True )
Strip normalizer
StripAccents
class tokenizers.normalizers.StripAccents
( )
StripAccents normalizer
Last updated