GPTSAN Japanese
GPTSAN-japanese
Overview
The GPTSAN-japanese model was released in the repository by Toshiyuki Sakamoto (tanreinama).
GPTSAN is a Japanese language model using Switch Transformer. It has the same structure as the model introduced as Prefix LM in the T5 paper, and supports both Text Generation and Masked Language Modeling tasks. These basic tasks can likewise be fine-tuned for translation or summarization.
Generation
The generate() method can be used to generate text using the GPTSAN-Japanese model.
GPTSAN Features
GPTSAN has some unique features. It has a model structure of Prefix-LM. It works as a shifted Masked Language Model for Prefix Input tokens. Un-prefixed inputs behave like normal generative models. The Spout vector is a GPTSAN specific input. Spout is pre-trained with random inputs, but you can specify a class of text or an arbitrary vector during fine-tuning. This allows you to indicate the tendency of the generated text. GPTSAN has a sparse Feed Forward based on Switch-Transformer. You can also add other layers and train them partially. See the original GPTSAN repository for details.
Prefix-LM Model
GPTSAN has the structure of the model named Prefix-LM in the T5 paper. (The original GPTSAN repository calls it hybrid.) In GPTSAN, the Prefix part of Prefix-LM, that is, the span of input positions in which every token can attend to tokens both before and after it, can have any length. The length can also differ for each item in a batch. It is determined by the text passed as prefix_text to the tokenizer. The tokenizer returns the mask of the Prefix part of Prefix-LM as token_type_ids. The model treats the positions where token_type_ids is 1 as the Prefix part, that is, those inputs can refer to tokens both before and after them.
Tips:
Specifying the Prefix part is done with a mask passed to self-attention. When token_type_ids=None or all zero, it is equivalent to a regular causal mask.
For example:

x_token = tokenizer("アイウエ")
input_ids:      | SOT | SEG | ア | イ | ウ | エ |
token_type_ids: | 1   | 0   | 0  | 0  | 0  | 0  |
prefix_lm_mask:
SOT | 1 0 0 0 0 0 |
SEG | 1 1 0 0 0 0 |
ア  | 1 1 1 0 0 0 |
イ  | 1 1 1 1 0 0 |
ウ  | 1 1 1 1 1 0 |
エ  | 1 1 1 1 1 1 |

x_token = tokenizer("", prefix_text="アイウエ")
input_ids:      | SOT | ア | イ | ウ | エ | SEG |
token_type_ids: | 1   | 1  | 1  | 1  | 1  | 0   |
prefix_lm_mask:
SOT | 1 1 1 1 1 0 |
ア  | 1 1 1 1 1 0 |
イ  | 1 1 1 1 1 0 |
ウ  | 1 1 1 1 1 0 |
エ  | 1 1 1 1 1 0 |
SEG | 1 1 1 1 1 1 |

x_token = tokenizer("ウエ", prefix_text="アイ")
input_ids:      | SOT | ア | イ | SEG | ウ | エ |
token_type_ids: | 1   | 1  | 1  | 0   | 0  | 0  |
prefix_lm_mask:
SOT | 1 1 1 0 0 0 |
ア  | 1 1 1 0 0 0 |
イ  | 1 1 1 0 0 0 |
SEG | 1 1 1 1 0 0 |
ウ  | 1 1 1 1 1 0 |
エ  | 1 1 1 1 1 1 |
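The three masks shown here follow one rule: position i may attend to position j when j is not after i (the causal part) or when j belongs to the Prefix part. The following is a minimal pure-Python sketch of that rule, written only to reproduce the tables in this section. The function name build_prefix_lm_mask is hypothetical and is not part of the library, which may compute the mask differently.

```python
from typing import List


def build_prefix_lm_mask(token_type_ids: List[int]) -> List[List[int]]:
    """Return a square mask where mask[i][j] == 1 means position i may attend to j."""
    n = len(token_type_ids)
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            causal = j <= i                      # regular left-to-right visibility
            in_prefix = token_type_ids[j] == 1   # prefix columns are visible from every row
            row.append(1 if (causal or in_prefix) else 0)
        mask.append(row)
    return mask


# token_type_ids of the third example: | SOT | ア | イ | SEG | ウ | エ |
for row in build_prefix_lm_mask([1, 1, 1, 0, 0, 0]):
    print(" ".join(str(v) for v in row))
```

With token_type_ids set to all zeros the function returns a lower-triangular matrix, which matches the statement that an all-zero mask is equivalent to a regular causal mask.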
Spout Vector
A Spout Vector is a special vector for controlling text generation. This vector is treated as the first embedding in self-attention to bring external attention to the generated tokens. In the pre-trained model published from Tanrei/GPTSAN-japanese, the Spout Vector is a 128-dimensional vector that passes through 8 fully connected layers in the model and is projected into the space acting as external attention. The Spout Vector projected by the fully connected layers is split to be passed to all self-attentions.
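The splitting step can be pictured as cutting one flat projected vector into a chunk per layer. The sketch below is illustrative only: the function split_spout is hypothetical, and the layout it assumes (one key chunk followed by one value chunk of size d_model per layer) is an assumption, since this page does not describe the actual memory layout used by the model.

```python
from typing import List, Tuple


def split_spout(
    projected: List[float], num_layers: int, d_model: int
) -> List[Tuple[List[float], List[float]]]:
    """Split a flat projected vector into one (key, value) pair per layer.

    Assumed layout (not confirmed by the documentation):
    [layer0 key | layer0 value | layer1 key | layer1 value | ...]
    """
    expected = num_layers * 2 * d_model
    if len(projected) != expected:
        raise ValueError(f"expected {expected} values, got {len(projected)}")
    pairs = []
    for layer in range(num_layers):
        start = layer * 2 * d_model
        key = projected[start : start + d_model]
        value = projected[start + d_model : start + 2 * d_model]
        pairs.append((key, value))
    return pairs


# Tiny illustrative sizes, not the real model dimensions.
flat = [float(i) for i in range(2 * 2 * 3)]
for layer_index, (key, value) in enumerate(split_spout(flat, num_layers=2, d_model=3)):
    print(layer_index, key, value)
```

In this picture each self-attention layer receives its own chunk, which it can treat as one extra position to attend to in front of the real tokens.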
GPTSanJapaneseConfig
class transformers.GPTSanJapaneseConfig
( vocab_size = 36000, max_position_embeddings = 1280, d_model = 1024, d_ff = 8192, d_ext = 4096, d_spout = 128, num_switch_layers = 10, num_ext_layers = 0, num_heads = 16, num_experts = 16, expert_capacity = 128, dropout_rate = 0.0, layer_norm_epsilon = 1e-05, router_bias = False, router_jitter_noise = 0.0, router_dtype = 'float32', router_ignore_padding_tokens = False, output_hidden_states = False, output_attentions = False, initializer_factor = 0.002, output_router_logits = False, use_cache = True, separator_token_id = 35998, pad_token_id = 35995, eos_token_id = 35999, **kwargs )
Parameters
vocab_size (int, optional, defaults to 36000): Vocabulary size of the GPTSANJapanese model. Defines the number of different tokens that can be represented by the input_ids passed when calling GPTSanJapaneseModel.
max_position_embeddings (int, optional, defaults to 1280): The maximum sequence length that this model might ever be used with.
d_model (int, optional, defaults to 1024): Size of the encoder layers and the pooler layer.
d_ff (int, optional, defaults to 8192): Size of the intermediate feed forward layer in each SwitchTransformersBlock.
d_ext (int, optional, defaults to 4096): Size of the intermediate feed forward layer in each of the Extra-layers.
d_spout (int, optional, defaults to 128): Size of the spout vector.
num_switch_layers (int, optional, defaults to 10): Number of layers in the Switch Transformer layer.
num_ext_layers (int, optional, defaults to 0): Number of layers in the Extra-layers.
num_heads (int, optional, defaults to 16): Number of attention heads for each attention layer in the Transformer encoder.
num_experts (int, optional, defaults to 16): Number of experts for each SwitchTransformer layer.
expert_capacity (int, optional, defaults to 128): Number of tokens that can be stored in each expert. If set to 1, the model will behave like a regular Transformer.
dropout_rate (float, optional, defaults to 0.0): The ratio for all dropout layers.
layer_norm_epsilon (float, optional, defaults to 1e-5): The epsilon used by the layer normalization layers.
router_bias (bool, optional, defaults to False): Whether to add a bias to the router.
router_jitter_noise (float, optional, defaults to 0.0): Amount of noise to add to the router. Set it to 0.0 during prediction, or to a small value (usually 1e-2) during training.
router_dtype (str, optional, defaults to "float32"): The dtype used for the routers. It is preferable to keep the dtype as "float32", as specified in the selective precision discussion in the paper.
router_ignore_padding_tokens (bool, optional, defaults to False): Whether to ignore padding tokens when routing.
output_hidden_states (bool, optional, defaults to False): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
output_attentions (bool, optional, defaults to False): Whether or not to return the attentions tensors of all attention layers.
initializer_factor (float, optional, defaults to 0.002): A factor for initializing all weight matrices.
output_router_logits (bool, optional, defaults to False): Whether or not to return the router logits of all experts.
use_cache (bool, optional, defaults to True): Whether or not the model should return the last key/values attentions (not used by all models).
This is the configuration class to store the configuration of a GPTSanJapaneseModel. It is used to instantiate a GPTSANJapanese model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the GPTSANJapanese Tanrei/GPTSAN-japanese architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
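To make the role of num_experts and expert_capacity more concrete, the sketch below shows top-1 routing with a per-expert capacity limit in the style of Switch Transformer. It is a simplified pure-Python illustration and not the library's router: the function route_top1 and the convention of marking overflowing tokens with -1 are assumptions made for this example.

```python
from typing import List


def route_top1(router_logits: List[List[float]], expert_capacity: int) -> List[int]:
    """Assign each token to its highest-scoring expert, or -1 if that expert is full."""
    num_experts = len(router_logits[0])
    load = [0] * num_experts  # tokens already accepted by each expert
    assignments = []
    for logits in router_logits:
        expert = max(range(num_experts), key=lambda e: logits[e])
        if load[expert] < expert_capacity:
            load[expert] += 1
            assignments.append(expert)
        else:
            # Over capacity: in this sketch the token is not processed by any expert.
            assignments.append(-1)
    return assignments


# Four tokens, two experts (hypothetical logits).
logits = [[2.0, 0.1], [1.5, 0.3], [0.2, 0.9], [3.0, 0.0]]
print(route_top1(logits, expert_capacity=2))  # prints [0, 0, 1, -1]
```

With a larger capacity, for example the default of 128, none of the four tokens in this example would be dropped.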
GPTSanJapaneseTokenizer
class transformers.GPTSanJapaneseTokenizer
( vocab_file, emoji_file, unk_token = '<|nottoken|>', pad_token = '<|separator|>', bos_token = '<|startoftext|>', eos_token = '<|endoftext|>', sep_token = '<|segmenter|>', do_clean_text = False, **kwargs )
Parameters
vocab_file (str): File containing the vocabulary.
emoji_file (str): File containing the emoji.
unk_token (str, optional, defaults to "<|nottoken|>"): The token used for unknown characters.
pad_token (str, optional, defaults to "<|separator|>"): The token used for padding.
bos_token (str, optional, defaults to "<|startoftext|>"): The beginning of sequence token.
eos_token (str, optional, defaults to "<|endoftext|>"): The end of sequence token.
sep_token (str, optional, defaults to "<|segmenter|>"): A special token that separates the prefix part from the general input part.
do_clean_text (bool, optional, defaults to False): Whether or not to clean text for URL, EMAIL, TEL, Japanese DATE and Japanese PRICE.
This tokenizer is based on GPTNeoXJapaneseTokenizer and has the following modifications:
Decoding byte0~byte255 tokens correctly
Added bagofword token handling
Return token_type_ids for Prefix-LM model
The bagofword token represents a repetition of the previous token and is converted to 3 consecutive tokens when decoding. In addition, the original Japanese special Sub-Word-Encoding has been released in this repository (https://github.com/tanreinama/Japanese-BPEEncoder_V2).
The token_type_ids is a mask indicating the prefix input position of the Prefix-LM model. To specify a prefix position, specify a prefix input for prefix_text, or specify a sentence of the prefix part and the part after it as a text pair of batch input.
Example:
Example for Prefix-LM:
Example for batch encode:
convert_tokens_to_string
( tokens )
Converts a sequence of tokens (strings) into a single string.
create_token_type_ids_from_sequences
( token_ids_0: typing.List[int], token_ids_1: typing.Optional[typing.List[int]] = None )
The tokenizer returns token_type_ids as separators between the Prefix part and the rest. token_type_ids is 1 for the Prefix part and 0 for the rest of the token.
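The rule can be sketched in a few lines: every token in front of the separator (SEG) belongs to the Prefix part and gets 1, and the separator and everything after it get 0. The helper below is a hypothetical illustration, not the tokenizer's own method. The separator id 35998 is the separator_token_id default from GPTSanJapaneseConfig, while the SOT id and the character ids are made up for the example.

```python
from typing import List, Optional

SEPARATOR_TOKEN_ID = 35998  # separator_token_id default from GPTSanJapaneseConfig
SOT = 0                     # hypothetical id standing in for the start-of-text token


def sketch_token_type_ids(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    sep_id: int = SEPARATOR_TOKEN_ID,
) -> List[int]:
    """Return 1 for every position before the separator and 0 from the separator on."""
    ids = token_ids_0 + (token_ids_1 or [])
    # Without a separator there is no Prefix part, so everything is 0.
    prefix_len = ids.index(sep_id) if sep_id in ids else 0
    return [1] * prefix_len + [0] * (len(ids) - prefix_len)


# | SOT | ア | イ | SEG | ウ | エ | with made-up ids 10..13 for the characters
print(sketch_token_type_ids([SOT, 10, 11, SEPARATOR_TOKEN_ID, 12, 13]))
# prints [1, 1, 1, 0, 0, 0]
```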
Example:
GPTSanJapaneseModel
class transformers.GPTSanJapaneseModel
( config: GPTSanJapaneseConfig )
Parameters
config (GPTSanJapaneseConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare GPTSAN-japanese Model transformer outputting raw hidden-states without any specific head on top.
The GPTSAN-japanese model was proposed in General-purpose Swich transformer based Japanese language model
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
( input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.FloatTensor] = None, token_type_ids: typing.Optional[torch.FloatTensor] = None, spout: typing.Optional[torch.FloatTensor] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, head_mask: typing.Optional[torch.FloatTensor] = None, use_cache: typing.Optional[bool] = False, inputs_embeds: typing.Optional[torch.FloatTensor] = None, decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, output_router_logits: typing.Optional[bool] = None, num_precontext: typing.Optional[torch.LongTensor] = None )
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)): Indices of input sequence tokens in the vocabulary. GPTSAN-japanese is a model that generates sentence continuations or predicts tokens at mask positions. Special tokens required for inputs to the model are automatically appended.
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
token_type_ids (torch.FloatTensor of shape (batch_size, sequence_length), optional): An input that masks the Prefix part in the Prefix-LM input. Mask values are selected in [0, 1]: 1 for tokens that are prefix input, 0 for tokens that are not prefix input.
spout (torch.Tensor of shape (batch_size, config.d_spout)): This vector is transformed through an 8-layer FFN and can be used instead of past_key_values.
past_key_values (tuple(tuple(torch.FloatTensor)) of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)): Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1].
use_cache (bool, optional): If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional): Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model's internal embedding lookup matrix.
output_attentions (bool, optional): Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.
router_logits (tuple(torch.FloatTensor), optional, returned when output_router_logits=True is passed or when config.add_router_probs=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, num_experts). Router logits of the decoder model, useful to compute the auxiliary loss for Mixture of Experts models.
num_precontext (torch.LongTensor of shape (batch_size, 1)): Length of the hybrid input tokens in the input. Tokens up to this length refer to both front and back like BERT; tokens after that refer only to the front like GPT. See also: https://github.com/tanreinama/GPTSAN/blob/main/report/model.md
The GPTSanJapaneseModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
GPTSanJapaneseForConditionalGeneration
class transformers.GPTSanJapaneseForConditionalGeneration
( config: GPTSanJapaneseConfig )
Parameters
config (GPTSanJapaneseConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare GPTSAN-japanese Model with a language modeling head.
The GPTSAN-japanese model was proposed in General-purpose Swich transformer based Japanese language model
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
( input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.FloatTensor] = None, token_type_ids: typing.Optional[torch.FloatTensor] = None, spout: typing.Optional[torch.FloatTensor] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, head_mask: typing.Optional[torch.FloatTensor] = None, use_cache: typing.Optional[bool] = False, inputs_embeds: typing.Optional[torch.FloatTensor] = None, decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, output_router_logits: typing.Optional[bool] = None, labels: typing.Optional[torch.LongTensor] = None )
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)): Indices of input sequence tokens in the vocabulary. GPTSAN-japanese is a model that generates sentence continuations or predicts tokens at mask positions. Special tokens required for inputs to the model are automatically appended.
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
token_type_ids (torch.FloatTensor of shape (batch_size, sequence_length), optional): An input that masks the Prefix part in the Prefix-LM input. Mask values are selected in [0, 1]: 1 for tokens that are prefix input, 0 for tokens that are not prefix input.
spout (torch.Tensor of shape (batch_size, config.d_spout)): This vector is transformed through an 8-layer FFN and can be used instead of past_key_values.
past_key_values (tuple(tuple(torch.FloatTensor)) of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)): Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1].
use_cache (bool, optional): If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional): Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model's internal embedding lookup matrix.
output_attentions (bool, optional): Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.
router_logits (tuple(torch.FloatTensor), optional, returned when output_router_logits=True is passed or when config.add_router_probs=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, num_experts). Router logits of the decoder model, useful to compute the auxiliary loss for Mixture of Experts models.
labels (torch.LongTensor of shape (batch_size,), optional): Labels for computing the sequence classification loss. Indices should be in [-100, 0, ..., config.vocab_size - 1]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., config.vocab_size - 1].
The GPTSanJapaneseForConditionalGeneration forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
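The handling of labels set to -100 can be illustrated with a small cross-entropy computation that skips those positions. This is a pure-Python sketch, not the model's loss code: masked_cross_entropy is a hypothetical helper, the inputs are treated as a flat list of positions, and returning 0.0 when every label is ignored is a choice made for this example.

```python
import math
from typing import List

IGNORE_INDEX = -100  # labels with this value do not contribute to the loss


def masked_cross_entropy(logits: List[List[float]], labels: List[int]) -> float:
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # masked position: skipped entirely
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[label]
        count += 1
    return total / count if count else 0.0


# Two positions over a two-token vocabulary; the second position is ignored.
loss = masked_cross_entropy([[0.0, 0.0], [5.0, -5.0]], [0, IGNORE_INDEX])
print(loss)
```

Only the first position contributes here, so the result equals the cross-entropy of a uniform prediction over two tokens.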
Example:
Text Generation with regular LM Model
Text Generation with Prefix-LM Model
Simultaneously Text Generation And Masked Language Model