GPT-Sw3

Overview

The GPT-Sw3 model was first proposed in Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren.

Since that first paper, the authors have extended their work and trained new models on their new 1.2TB corpus, The Nordic Pile.

GPT-Sw3 is a collection of large decoder-only pretrained transformer language models that were developed by AI Sweden in collaboration with RISE and the WASP WARA for Media and Language. GPT-Sw3 has been trained on a dataset containing 320B tokens in Swedish, Norwegian, Danish, Icelandic, English, and programming code. The model was pretrained using a causal language modeling (CLM) objective utilizing the NeMo Megatron GPT implementation.

This model was contributed by AI Sweden.

The implementation uses the GPT2Model coupled with our GPTSw3Tokenizer. This means that AutoTokenizer and AutoModelForCausalLM map to our tokenizer implementation and the corresponding GPT2 model implementation respectively. Note that sentencepiece is required to use our tokenizer and can be installed with: pip install transformers[sentencepiece] or pip install sentencepiece

Example usage:

>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("AI-Sweden/gpt-sw3-356m")
>>> model = AutoModelForCausalLM.from_pretrained("AI-Sweden/gpt-sw3-356m")

>>> input_ids = tokenizer("Träd är fina för att", return_tensors="pt")["input_ids"]

>>> generated_token_ids = model.generate(inputs=input_ids, max_new_tokens=10, do_sample=True)[0]

>>> print(tokenizer.decode(generated_token_ids))
Träd är fina för att de är färgstarka. Men ibland är det fint

GPTSw3Tokenizer

class transformers.GPTSw3Tokenizer

( vocab_file, do_lower_case = False, remove_space = False, keep_accents = False, pad_token = None, unk_token = None, eos_token = None, bos_token = None, sp_model_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None, **kwargs )

Parameters

  • vocab_file (str) — SentencePiece file (generally has a .spm extension) that contains the vocabulary necessary to instantiate a tokenizer.

  • do_lower_case (bool, optional, defaults to False) — Whether or not to lowercase the input when tokenizing.

  • remove_space (bool, optional, defaults to False) — Whether or not to strip the text when tokenizing (removing excess spaces before and after the string).

  • keep_accents (bool, optional, defaults to False) — Whether or not to keep accents when tokenizing.

  • bos_token (str, optional) — The beginning-of-sequence token, which can be used for downstream tasks; it was not seen during pretraining. If not provided, defaults to '<s>' or '<|endoftext|>', depending on model size.

  • eos_token (str, optional) — The end-of-sequence token seen during pretraining. If not provided, defaults to '<|endoftext|>'.

  • unk_token (str, optional) — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. If not provided, defaults to '<unk>'.

  • pad_token (str, optional) — The token used for padding, for example when batching sequences of different lengths. If not provided, defaults to '<pad>' or '<unk>', depending on model size.

  • sp_model_kwargs (dict, optional) — Will be passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece can be used, among other things, to set:

    • enable_sampling: Enable subword regularization.

    • nbest_size: Sampling parameters for unigram. Invalid for BPE-Dropout.

      • nbest_size = {0,1}: No sampling is performed.

      • nbest_size > 1: samples from the nbest_size results.

      • nbest_size < 0: assumes that nbest_size is infinite and samples from all hypotheses (the lattice) using the forward-filtering-and-backward-sampling algorithm.

    • alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.

  • sp_model (SentencePieceProcessor) — The SentencePiece processor that is used for every conversion (string, tokens and IDs).

  • whitespaces (set) — The whitespaces that are replaced in the whitespace normalization in preprocessing.

  • non_printing_characters_re (Pattern) — The compiled regular expression to remove non-printing characters in preprocessing.
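The nbest_size and alpha options described above can be easier to follow with a toy model. The sketch below is not SentencePiece's implementation: the function name sample_segmentation, the candidate list, and the log-probabilities are all made up for illustration, and it enumerates candidates explicitly, whereas SentencePiece samples from the lattice when nbest_size < 0. It only mirrors the documented behavior: no sampling for 0 or 1, sampling among the top n for n > 1, and sampling among all candidates for negative values, with alpha smoothing the distribution.

```python
import math
import random


def sample_segmentation(scored, nbest_size=-1, alpha=0.1, rng=None):
    """Pick one segmentation from (tokens, log_prob) candidates.

    Toy illustration of the nbest_size / alpha semantics only.
    """
    rng = rng or random.Random()
    # Rank candidates from most to least likely.
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    if nbest_size in (0, 1):
        # No sampling: always return the single best segmentation.
        return ranked[0][0]
    # nbest_size > 1 keeps the top-n candidates; nbest_size < 0 keeps all.
    pool = ranked[:nbest_size] if nbest_size > 1 else ranked
    # alpha smooths the distribution: weight is exp(alpha * log_prob).
    weights = [math.exp(alpha * log_prob) for _, log_prob in pool]
    return rng.choices(pool, weights=weights, k=1)[0][0]


# Hypothetical segmentations of "Svenska" with invented log-probabilities.
candidates = [
    (["▁Sven", "ska"], -2.0),
    (["▁Svenska"], -1.0),
    (["▁S", "venska"], -5.0),
]

print(sample_segmentation(candidates, nbest_size=1))
print(sample_segmentation(candidates, nbest_size=2, rng=random.Random(0)))
```

With the real tokenizer, these options would be passed as a dictionary through sp_model_kwargs, for example {"enable_sampling": True, "nbest_size": -1, "alpha": 0.1}.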

Construct a GPTSw3 tokenizer. Based on SentencePiece.

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
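The whitespaces and non_printing_characters_re attributes listed among the parameters describe a preprocessing step that runs before SentencePiece sees the text. The following is a minimal sketch of that idea, assuming non-printing characters are removed first and the listed whitespace variants are then replaced with a plain space. WHITESPACES, NON_PRINTING_RE, and preprocess are hypothetical, abbreviated stand-ins, not the tokenizer's actual character sets or method.

```python
import re

# Hypothetical, abbreviated stand-ins for the tokenizer's attributes.
# No-break space, thin space, ideographic space.
WHITESPACES = {"\u00a0", "\u2009", "\u3000"}
# Control characters (except tab and newline), DEL, and zero-width space.
NON_PRINTING_RE = re.compile(r"[\x00-\x08\x0b-\x1f\x7f\u200b]")


def preprocess(text):
    """Remove non-printing characters, then normalize whitespace variants."""
    # Step 1: drop characters matched by the non-printing pattern.
    text = NON_PRINTING_RE.sub("", text)
    # Step 2: replace each listed whitespace variant with an ASCII space.
    return "".join(" " if ch in WHITESPACES else ch for ch in text)


# Input contains a no-break space, a zero-width space, and a bell character.
cleaned = preprocess("Svenska\u00a0är\u200b kul!\x07")
print(repr(cleaned))
```

Normalizing the input this way means that visually identical strings map to the same token IDs regardless of which whitespace code point the source text happened to use.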

Example usage:

>>> from transformers import GPTSw3Tokenizer

>>> tokenizer = GPTSw3Tokenizer.from_pretrained("AI-Sweden/gpt-sw3-126m")
>>> tokenizer("Svenska är kul!")["input_ids"]
[1814, 377, 3617, 63504]

save_vocabulary

( save_directory: str, filename_prefix: typing.Optional[str] = None )
