clean_text (bool, optional, defaults to True): Whether to clean the text by removing control characters and replacing every whitespace character with a regular space.
handle_chinese_chars (bool, optional, defaults to True): Whether to handle Chinese characters by putting spaces around them.
strip_accents (bool, optional): Whether to strip all accents. If this option is not specified (i.e. left as None), it is determined by the value of lowercase (as in the original Bert).
lowercase (bool, optional, defaults to True): Whether to lowercase the text.
BertNormalizer
Takes care of normalizing raw text before giving it to a Bert model. This includes cleaning the text, handling accents and Chinese characters, and lowercasing.
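As a minimal sketch of how the parameters above interact, the snippet below builds two BertNormalizer instances and applies them to a sample string with normalize_str. The sample sentence and variable names are illustrative assumptions, not part of the reference.

```python
from tokenizers.normalizers import BertNormalizer

# Defaults: lowercase=True and strip_accents left unspecified (None),
# so accent stripping follows the value of lowercase.
uncased = BertNormalizer()
uncased_result = uncased.normalize_str("Héllò hôw are ü?")

# With lowercase=False and strip_accents still unspecified,
# neither lowercasing nor accent stripping is applied.
cased = BertNormalizer(lowercase=False)
cased_result = cased.normalize_str("Héllò")

print(uncased_result)
print(cased_result)
```

Passing strip_accents explicitly (True or False) decouples accent stripping from lowercasing.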
Lowercase
class tokenizers.normalizers.Lowercase
( )
Lowercase Normalizer
NFC
class tokenizers.normalizers.NFC
( )
NFC Unicode Normalizer
NFD
class tokenizers.normalizers.NFD
( )
NFD Unicode Normalizer
NFKC
class tokenizers.normalizers.NFKC
( )
NFKC Unicode Normalizer
NFKD
class tokenizers.normalizers.NFKD
( )
NFKD Unicode Normalizer
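The four Unicode normalizers above differ in whether they compose or decompose characters (C vs. D) and whether they also fold compatibility characters (the K forms). A small sketch, using illustrative sample characters chosen for this example:

```python
from tokenizers.normalizers import NFC, NFD, NFKC, NFKD

composed = "\u00e9"   # "é" as a single code point
ligature = "\ufb01"   # the "fi" ligature, a compatibility character

# NFD splits "é" into "e" followed by a combining acute accent.
decomposed = NFD().normalize_str(composed)

# NFC recombines the pair into the single code point.
recomposed = NFC().normalize_str(decomposed)

# The K forms additionally replace compatibility characters
# such as ligatures with their plain equivalents.
compat_composed = NFKC().normalize_str(ligature)
compat_decomposed = NFKD().normalize_str(ligature)

print(len(composed), len(decomposed), len(recomposed))
print(compat_composed, compat_decomposed)
```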
Nmt
class tokenizers.normalizers.Nmt
( )
Nmt normalizer
Normalizer
class tokenizers.normalizers.Normalizer
( )
Base class for all normalizers
This class is not supposed to be instantiated directly. Instead, any implementation of a Normalizer will return an instance of this class when instantiated.
normalize
( normalized )
Parameters
normalized (NormalizedString): The normalized string on which to apply this Normalizer
Normalize a NormalizedString in-place
This method modifies a NormalizedString in place while keeping track of the alignment information. If you just want to see the result of the normalization on a raw string, you can use normalize_str().
normalize_str
( sequence ) → str
Parameters
sequence (str): A string to normalize
Returns
str
A string after normalization
Normalize the given string
This method provides a way to visualize the effect of a Normalizer, but it does not keep track of the alignment information. If you need to get or convert offsets, use normalize() instead.
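The difference between the two methods can be sketched as follows, using the Lowercase normalizer as a stand-in for any Normalizer. This assumes that NormalizedString can be imported from the top-level tokenizers package, constructed from a plain str, and that it exposes original and normalized attributes; the sample text is illustrative.

```python
from tokenizers import NormalizedString
from tokenizers.normalizers import Lowercase

normalizer = Lowercase()

# normalize_str: takes a str, returns a new str, keeps no alignment info.
preview = normalizer.normalize_str("HELLO")

# normalize: mutates a NormalizedString in place and returns nothing useful.
# The object keeps both the original text and the normalized text.
tracked = NormalizedString("HELLO")
normalizer.normalize(tracked)

print(preview)
print(tracked.original, tracked.normalized)
```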
Precompiled
class tokenizers.normalizers.Precompiled
( precompiled_charsmap )
Precompiled normalizer. Don't use it manually; it exists for compatibility with SentencePiece.
Replace
class tokenizers.normalizers.Replace
( pattern, content )
Replace normalizer
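A short sketch of the Replace normalizer, assuming that a plain string pattern is matched literally and that a regular expression can be supplied by wrapping it in the Regex helper from the top-level tokenizers package. The patterns and inputs are illustrative.

```python
from tokenizers import Regex
from tokenizers.normalizers import Replace

# Literal pattern: every occurrence of "foo" is replaced by "bar".
literal = Replace("foo", "bar")
literal_result = literal.normalize_str("foo baz foo")

# Regex pattern: collapse each run of whitespace into a single space.
collapse_spaces = Replace(Regex(r"\s+"), " ")
regex_result = collapse_spaces.normalize_str("a   b\t\tc")

print(literal_result)
print(regex_result)
```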
Sequence
class tokenizers.normalizers.Sequence
( )
Parameters
normalizers (List[Normalizer]): A list of Normalizer instances to be run as a sequence
Chains multiple Normalizer instances into a single Sequence. The normalizers run one after another, in the order given.
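As a sketch of chaining, the snippet below combines NFD and Lowercase with StripAccents, a normalizer from the same module that is not described in this section. Order matters here: NFD must run first so that accents become separate combining characters that StripAccents can then remove. The sample sentence is illustrative.

```python
from tokenizers.normalizers import NFD, Lowercase, Sequence, StripAccents

# Each normalizer receives the output of the previous one.
normalizer = Sequence([NFD(), StripAccents(), Lowercase()])

sequence_result = normalizer.normalize_str("Héllò hôw are ü?")
print(sequence_result)
```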