AutoTokenizer

AutoTokenizer

class transformers.AutoTokenizer

<source>

( )

This is a generic tokenizer class that will be instantiated as one of the tokenizer classes of the library when created with the AutoTokenizer.from_pretrained() class method.

This class cannot be instantiated directly using __init__() (throws an error).

from_pretrained

<source>

( pretrained_model_name_or_path*inputs**kwargs )

Parameters

  • pretrained_model_name_or_path (str or os.PathLike) — Can be either:

    • A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.

    • A path to a directory containing vocabulary files required by the tokenizer, for instance saved using the save_pretrained() method, e.g., ./my_model_directory/.

    • A path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (like Bert or XLNet), e.g.: ./my_model_directory/vocab.txt. (Not applicable to all derived classes)

  • inputs (additional positional arguments, optional) — Will be passed along to the Tokenizer __init__() method.

  • config (PretrainedConfig, optional) — The configuration object used to determine the tokenizer class to instantiate.

  • cache_dir (str or os.PathLike, optional) — Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.

  • force_download (bool, optional, defaults to False) — Whether or not to force the (re-)download the model weights and configuration files and override the cached versions if they exist.

  • resume_download (bool, optional, defaults to False) — Whether or not to delete incompletely received files. Will attempt to resume the download if such a file exists.

  • proxies (Dict[str, str], optional) — A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.

  • revision (str, optional, defaults to "main") — The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git.

  • subfolder (str, optional) — In case the relevant files are located inside a subfolder of the model repo on huggingface.co (e.g. for facebook/rag-token-base), specify it here.

  • use_fast (bool, optional, defaults to True) — Use a fast Rust-based tokenizer if it is supported for a given model. If a fast tokenizer is not available for a given model, a normal Python-based tokenizer is returned instead.

  • tokenizer_type (str, optional) — Tokenizer type to be loaded.

  • trust_remote_code (bool, optional, defaults to False) — Whether or not to allow for custom models defined on the Hub in their own modeling files. This option should only be set to True for repositories you trust and in which you have read the code, as it will execute code present on the Hub on your local machine.

  • kwargs (additional keyword arguments, optional) — Will be passed to the Tokenizer __init__() method. Can be used to set special tokens like bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, additional_special_tokens. See parameters in the __init__() for more details.

Instantiate one of the tokenizer classes of the library from a pretrained model vocabulary.

The tokenizer class to instantiate is selected based on the model_type property of the config object (either passed as an argument or loaded from pretrained_model_name_or_path if possible), or when it’s missing, by falling back to using pattern matching on pretrained_model_name_or_path:

Examples:

Copied

>>> from transformers import AutoTokenizer

>>> # Download vocabulary from huggingface.co and cache.
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

>>> # Download vocabulary from huggingface.co (user-uploaded) and cache.
>>> tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")

>>> # If vocabulary files are in a directory (e.g. tokenizer was saved using *save_pretrained('./test/saved_model/')*)
>>> # tokenizer = AutoTokenizer.from_pretrained("./test/bert_saved_model/")

>>> # Download vocabulary from huggingface.co and define model-specific arguments
>>> tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)

register

<source>

( config_classslow_tokenizer_class = Nonefast_tokenizer_class = Noneexist_ok = False )

Parameters

  • config_class (PretrainedConfig) — The configuration corresponding to the model to register.

  • slow_tokenizer_class (PretrainedTokenizer, optional) — The slow tokenizer to register.

  • fast_tokenizer_class (PretrainedTokenizerFast, optional) — The fast tokenizer to register.

Register a new tokenizer in this mapping.

Last updated