Normalizers
BertNormalizer
class tokenizers.normalizers.BertNormalizer
( clean_text = True, handle_chinese_chars = True, strip_accents = None, lowercase = True )
Parameters
clean_text (bool, optional, defaults to True): Whether to clean the text by removing any control characters and replacing all whitespace characters with a standard space.
handle_chinese_chars (bool, optional, defaults to True): Whether to handle Chinese characters by putting spaces around them.
strip_accents (bool, optional): Whether to strip all accents. If this option is not specified (i.e. None), it is determined by the value of lowercase (as in the original BERT).
lowercase (bool, optional, defaults to True): Whether to lowercase.
BertNormalizer
Takes care of normalizing raw text before giving it to a BERT model. This includes cleaning the text, handling accents and Chinese characters, and lowercasing.
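As a minimal sketch of how these options interact (assuming the tokenizers package is installed; the sample string is only illustrative), the defaults lowercase the text and, because strip_accents is left unset, strip accents as well:

```python
from tokenizers.normalizers import BertNormalizer

text = "Héllò Wörld"

# Defaults: lowercase=True, and strip_accents=None follows lowercase.
default_out = BertNormalizer().normalize_str(text)
print(default_out)

# With lowercase=False and strip_accents unset, accents are kept as well.
cased_out = BertNormalizer(lowercase=False).normalize_str(text)
print(cased_out)

# strip_accents can also be set explicitly, independently of lowercase.
cased_no_accents = BertNormalizer(lowercase=False, strip_accents=True).normalize_str(text)
print(cased_no_accents)
```

Setting strip_accents explicitly is useful when casing should be preserved but accents should not.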
Lowercase
class tokenizers.normalizers.Lowercase
( )
Lowercase Normalizer
NFC
class tokenizers.normalizers.NFC
( )
NFC Unicode Normalizer
NFD
class tokenizers.normalizers.NFD
( )
NFD Unicode Normalizer
NFKC
class tokenizers.normalizers.NFKC
( )
NFKC Unicode Normalizer
NFKD
class tokenizers.normalizers.NFKD
( )
NFKD Unicode Normalizer
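The four Unicode normalizers above differ in whether they compose or decompose characters, and in whether they also apply compatibility mappings. A short sketch, assuming the tokenizers package is installed (the sample characters are chosen for illustration):

```python
from tokenizers.normalizers import NFC, NFD, NFKC, NFKD

composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent
ligature = "\ufb01"     # the "fi" ligature, a compatibility character

nfd_out = NFD().normalize_str(composed)        # canonical decomposition
nfc_out = NFC().normalize_str(decomposed)      # canonical composition
nfkc_out = NFKC().normalize_str(ligature)      # compatibility mapping, then composition
nfkd_out = NFKD().normalize_str(ligature)      # compatibility decomposition
nfc_ligature = NFC().normalize_str(ligature)   # canonical forms keep the ligature

print([hex(ord(c)) for c in nfd_out])
print(nfkc_out, nfkd_out)
```

The canonical forms (NFC, NFD) only change how accented characters are encoded, while the compatibility forms (NFKC, NFKD) also replace characters such as ligatures with their plain equivalents.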
Nmt
class tokenizers.normalizers.Nmt
( )
Nmt normalizer
Normalizer
class tokenizers.normalizers.Normalizer
( )
Base class for all normalizers
This class is not supposed to be instantiated directly. Instead, any implementation of a Normalizer will return an instance of this class when instantiated.
normalize
( normalized )
Parameters
normalized (NormalizedString): The normalized string on which to apply this Normalizer
Normalize a NormalizedString in-place.
This method modifies a NormalizedString in place, which keeps track of the alignment information. If you just want to see the result of the normalization on a raw string, you can use normalize_str().
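A minimal sketch of in-place normalization, assuming the tokenizers package is installed and that NormalizedString can be built directly from a str and exposes original and normalized attributes (as in the Python bindings at the time of writing):

```python
from tokenizers import NormalizedString
from tokenizers.normalizers import Lowercase

normalized = NormalizedString("Hello WORLD")

# normalize() returns nothing useful; it mutates `normalized` in place.
Lowercase().normalize(normalized)

print(normalized.original)    # the untouched input text
print(normalized.normalized)  # the text after normalization
```

Because the object keeps both versions of the text, offsets in the normalized text can still be related back to the original input.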
normalize_str
( sequence ) → str
Parameters
sequence (str): A string to normalize
Returns
str: A string after normalization
Normalize the given string
This method provides a way to visualize the effect of a Normalizer but it does not keep track of the alignment information. If you need to get/convert offsets, you can use normalize()
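For a quick check of what a normalizer does to a raw string, a sketch along these lines should work (assuming the tokenizers package is installed; Lowercase is used only as a simple stand-in for any normalizer):

```python
from tokenizers.normalizers import Lowercase

# normalize_str() takes a plain str and returns a new plain str.
lowered = Lowercase().normalize_str("Hello WORLD")
print(lowered)
```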
Precompiled
class tokenizers.normalizers.Precompiled
( precompiled_charsmap )
Precompiled normalizer. Don't use manually: it is used for compatibility with SentencePiece.
Replace
class tokenizers.normalizers.Replace
( pattern, content )
Replace normalizer
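A short sketch of Replace, assuming the tokenizers package is installed. The pattern can be a plain string or a tokenizers.Regex, and every match is replaced by content; the sample inputs are illustrative:

```python
from tokenizers import Regex
from tokenizers.normalizers import Replace

# Plain string pattern: every "-" becomes a space.
dashes_removed = Replace("-", " ").normalize_str("state-of-the-art")
print(dashes_removed)

# Regex pattern: collapse each run of whitespace into a single space.
collapsed = Replace(Regex(r"\s+"), " ").normalize_str("a   b\t\tc")
print(collapsed)
```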
Sequence
class tokenizers.normalizers.Sequence
( )
Parameters
normalizers (List[Normalizer]): A list of Normalizer to be run as a sequence
Allows concatenating multiple other Normalizers as a Sequence. All the normalizers run in sequence, in the given order.
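As an illustration of chaining, the following sketch (assuming the tokenizers package is installed; the particular combination is just one plausible pipeline) decomposes characters, strips the resulting accent marks, then lowercases:

```python
from tokenizers.normalizers import NFD, Lowercase, Sequence, StripAccents

# Order matters: NFD must run before StripAccents so that accents
# exist as separate combining marks that can be removed.
normalizer = Sequence([NFD(), StripAccents(), Lowercase()])

out = normalizer.normalize_str("Héllò hôw are ü?")
print(out)
```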
Strip
class tokenizers.normalizers.Strip
( left = True, right = True )
Strip normalizer
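A minimal sketch of the left and right flags, assuming the tokenizers package is installed:

```python
from tokenizers.normalizers import Strip

padded = "  hello  "

# Defaults strip whitespace on both sides.
both = Strip().normalize_str(padded)

# Only strip leading whitespace; trailing whitespace is kept.
left_only = Strip(left=True, right=False).normalize_str(padded)

print(repr(both), repr(left_only))
```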
StripAccents
class tokenizers.normalizers.StripAccents
( )
StripAccents normalizer
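StripAccents removes combining marks, so it is usually paired with a decomposing normalizer such as NFD. The sketch below (assuming the tokenizers package is installed) shows the difference on a precomposed character:

```python
from tokenizers.normalizers import NFD, Sequence, StripAccents

composed = "\u00e9"  # "é" stored as a single code point

# On its own there is no separate combining mark to remove.
alone = StripAccents().normalize_str(composed)

# After NFD the accent is a separate combining mark and gets stripped.
after_nfd = Sequence([NFD(), StripAccents()]).normalize_str(composed)

print(repr(alone), repr(after_nfd))
```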