NLLB-MoE
Overview
The NLLB model was presented in No Language Left Behind: Scaling Human-Centered Machine Translation by Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang.
The abstract of the paper is the following:
Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system.
Tips:
M2M100ForConditionalGeneration is the base model for both NLLB and NLLB-MoE.
The NLLB-MoE is very similar to the NLLB model, but its feed-forward layer is based on the implementation of SwitchTransformers.
The tokenizer is the same as for the NLLB models.
This model was contributed by Arthur Zucker. The original code can be found here.
Implementation differences with SwitchTransformers
The biggest difference is the way the tokens are routed. NLLB-MoE uses a top-2 gate, which means that for each input, only the two experts with the highest predicted probabilities from the gating network are selected, and the remaining experts are ignored. In SwitchTransformers, only the top-1 probabilities are computed, which means that tokens have a lower probability of being forwarded. Moreover, if a token is not routed to any expert, SwitchTransformers still adds its unmodified hidden states (kind of like a residual connection), while in NLLB-MoE's top-2 routing mechanism such tokens are masked.
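To make the contrast concrete, here is a minimal, illustrative sketch of the two gating schemes in PyTorch. It only shows the token-to-expert selection step and is not the library implementation; the shapes and the gating layer output are placeholders.

```python
import torch
import torch.nn.functional as F

# Placeholder router logits of shape (num_tokens, num_experts), normally produced
# by a linear gating layer on top of the token hidden states.
router_logits = torch.randn(8, 4)
router_probs = F.softmax(router_logits, dim=-1)

# Top-2 routing (NLLB-MoE style): each token is sent to its two best experts,
# whose outputs are later combined using these routing weights.
top2_weights, top2_experts = torch.topk(router_probs, k=2, dim=-1)

# Top-1 routing (SwitchTransformers style): each token is sent to a single expert.
top1_weight, top1_expert = router_probs.max(dim=-1)
```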
Generating with NLLB-MoE
The available checkpoints require around 350GB of storage. Make sure to use accelerate if you do not have enough RAM on your machine.
While generating the target text, set the forced_bos_token_id to the target language id. The following example shows how to translate English to French using the facebook/nllb-200-distilled-600M model.
Note that we're using the BCP-47 code for French, fra_Latn. See here for the list of all BCP-47 codes in the Flores 200 dataset.
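A minimal sketch of this workflow, assuming the checkpoint named above (the input sentence is a placeholder, and the same code applies to facebook/nllb-moe-54b):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Checkpoint mentioned in the text above; swap in "facebook/nllb-moe-54b" for the MoE model.
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

article = "Life is like a box of chocolates."  # placeholder input sentence
inputs = tokenizer(article, return_tensors="pt")

# Force the decoder to start generating French (BCP-47 code fra_Latn).
translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"), max_length=50
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
```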
Generating from any language other than English
English (eng_Latn) is set as the default language from which to translate. In order to specify that you'd like to translate from a different language, you should specify the BCP-47 code in the src_lang keyword argument of the tokenizer initialization.
See the example below for a translation from Romanian to German:
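A minimal sketch, assuming the facebook/nllb-moe-54b checkpoint and a placeholder Romanian sentence:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# src_lang tells the tokenizer that the source language is Romanian (ron_Latn).
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b", src_lang="ron_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-moe-54b")

article = "Şeful ONU spune că nu există o soluţie militară în Siria"  # placeholder input
inputs = tokenizer(article, return_tensors="pt")

# Force the decoder to generate German (BCP-47 code deu_Latn).
translated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"), max_length=30
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
```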
Documentation resources
NllbMoeConfig
class transformers.NllbMoeConfig
( vocab_size = 128112, max_position_embeddings = 1024, encoder_layers = 12, encoder_ffn_dim = 4096, encoder_attention_heads = 16, decoder_layers = 12, decoder_ffn_dim = 4096, decoder_attention_heads = 16, encoder_layerdrop = 0.05, decoder_layerdrop = 0.05, use_cache = True, is_encoder_decoder = True, activation_function = 'relu', d_model = 1024, dropout = 0.1, attention_dropout = 0.1, activation_dropout = 0.0, init_std = 0.02, decoder_start_token_id = 2, scale_embedding = True, router_bias = False, router_dtype = 'float32', router_ignore_padding_tokens = False, num_experts = 128, expert_capacity = 64, encoder_sparse_step = 4, decoder_sparse_step = 4, router_z_loss_coef = 0.001, router_aux_loss_coef = 0.001, second_expert_policy = 'all', normalize_router_prob_before_dropping = False, batch_prioritized_routing = False, moe_eval_capacity_token_fraction = 1.0, moe_token_dropout = 0.2, pad_token_id = 1, bos_token_id = 0, eos_token_id = 2, output_router_logits = False, **kwargs )
Parameters
vocab_size (int, optional, defaults to 50265): Vocabulary size of the NllbMoe model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling NllbMoeModel.
d_model (int, optional, defaults to 1024): Dimensionality of the layers and the pooler layer.
encoder_layers (int, optional, defaults to 12): Number of encoder layers.
decoder_layers (int, optional, defaults to 12): Number of decoder layers.
encoder_attention_heads (int, optional, defaults to 16): Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (int, optional, defaults to 16): Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (int, optional, defaults to 4096): Dimensionality of the "intermediate" (often named feed-forward) layer in the decoder.
encoder_ffn_dim (int, optional, defaults to 4096): Dimensionality of the "intermediate" (often named feed-forward) layer in the encoder.
activation_function (str or function, optional, defaults to "gelu"): The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.
dropout (float, optional, defaults to 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (float, optional, defaults to 0.0): The dropout ratio for the attention probabilities.
activation_dropout (float, optional, defaults to 0.0): The dropout ratio for activations inside the fully connected layer.
classifier_dropout (float, optional, defaults to 0.0): The dropout ratio for the classifier.
max_position_embeddings (int, optional, defaults to 1024): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
init_std (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
encoder_layerdrop (float, optional, defaults to 0.0): The LayerDrop probability for the encoder. See the LayerDrop paper (https://arxiv.org/abs/1909.11556) for more details.
decoder_layerdrop (float, optional, defaults to 0.0): The LayerDrop probability for the decoder. See the LayerDrop paper (https://arxiv.org/abs/1909.11556) for more details.
second_expert_policy (str, optional, defaults to "all"): The policy used for sampling the probability of each token being routed to a second expert.
normalize_router_prob_before_dropping (bool, optional, defaults to True): Whether or not to normalize the router probabilities before applying a mask based on the expert capacity (capacity dropping).
batch_prioritized_routing (bool, optional, defaults to True): Whether or not to order the tokens by their router probabilities before capacity dropping. This means that tokens with the highest probabilities are routed before tokens that might appear later in the sequence.
moe_eval_capacity_token_fraction (float, optional, defaults to 1.0): Fraction of tokens used as capacity during validation; if set to a negative value, the same capacity as during training is used. Should be in the range (0.0, 1.0].
num_experts (int, optional, defaults to 128): Number of experts for each NllbMoeSparseMlp layer.
expert_capacity (int, optional, defaults to 64): Number of tokens that can be stored in each expert.
encoder_sparse_step (int, optional, defaults to 4): Frequency of the sparse layers in the encoder. 4 means that one out of 4 layers will be sparse.
decoder_sparse_step (int, optional, defaults to 4): Frequency of the sparse layers in the decoder. 4 means that one out of 4 layers will be sparse.
router_dtype (str, optional, defaults to "float32"): The dtype used for the routers. It is preferable to keep the dtype as "float32", as specified in the selective precision discussion in the paper.
router_ignore_padding_tokens (bool, optional, defaults to False): Whether to ignore padding tokens when routing. If False, the padding tokens are not routed to any experts.
router_bias (bool, optional, defaults to False): Whether or not the classifier of the router should have a bias.
moe_token_dropout (float, optional, defaults to 0.2): Masking rate for MoE expert output masking (EOM), which is implemented via a Dropout2d on the expert outputs.
output_router_logits (bool, optional, defaults to False): Whether or not to return the router logits. Only set to True to get the auxiliary loss when training.
use_cache (bool, optional, defaults to True): Whether or not the model should return the last key/values attentions (not used by all models).
This is the configuration class to store the configuration of a NllbMoeModel. It is used to instantiate an NLLB-MoE model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the NLLB-MoE facebook/nllb-moe-54b architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
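A minimal sketch of the usual configuration pattern, instantiating a default configuration and a randomly initialized model from it:

```python
from transformers import NllbMoeConfig, NllbMoeModel

# Initialize an NLLB-MoE configuration with default values
# (similar to the facebook/nllb-moe-54b architecture).
configuration = NllbMoeConfig()

# Initialize a model (with random weights) from that configuration.
model = NllbMoeModel(configuration)

# Access the model configuration.
configuration = model.config
```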
NllbMoeTop2Router
class transformers.NllbMoeTop2Router
( config: NllbMoeConfig )
Router using a "tokens choose top-2 experts" assignment.
This router uses the same mechanism as in NLLB-MoE from the fairseq repository. Items are sorted by router_probs and then routed to their choice of expert until the expert's expert_capacity is reached. There is no guarantee that each token is processed by an expert, or that each expert receives at least one token.
The router combining weights are also returned to make sure that the states that are not updated will be masked.
route_tokens
( router_logits: Tensor, input_dtype: dtype = torch.float32, padding_mask: typing.Optional[torch.LongTensor] = None )
Computes the dispatch_mask and the dispatch_weights for each expert. The masks are adapted to the expert capacity.
forward
( hidden_states: Tensor, padding_mask: typing.Optional[torch.LongTensor] = None ) → top_1_mask (torch.Tensor of shape (batch_size, sequence_length))
Parameters
hidden_states (torch.Tensor): (batch_size, sequence_length, hidden_dim) from which router probabilities are computed.
Returns
top_1_mask (torch.Tensor of shape (batch_size, sequence_length))
Index tensor of shape [batch_size, sequence_length] corresponding to the expert selected for each token using the top-1 probabilities of the router.
router_probabilities (torch.Tensor of shape (batch_size, sequence_length, num_experts)): Probabilities for each token and expert, used for routing tokens to experts.
router_logits (torch.Tensor of shape (batch_size, sequence_length, num_experts)): Raw router logits, used later for computing the router z-loss.
The hidden states are reshaped to simplify the computation of the router probabilities (combining weights for each expert).
NllbMoeSparseMLP
class transformers.NllbMoeSparseMLP
( config: NllbMoeConfig, ffn_dim: int, expert_class: Module = <class 'transformers.models.nllb_moe.modeling_nllb_moe.NllbMoeDenseActDense'> )
Implementation of the NLLB-MoE sparse MLP module.
forward
( hidden_states: Tensor, padding_mask: typing.Optional[torch.Tensor] = False ) → hidden_states (torch.Tensor of shape (batch_size, sequence_length, hidden_dim))
Parameters
hidden_states (torch.Tensor of shape (batch_size, sequence_length, hidden_dim)): The hidden states.
padding_mask (torch.Tensor, optional, defaults to False): Attention mask. Can be in the causal form or not.
Returns
hidden_states (torch.Tensor of shape (batch_size, sequence_length, hidden_dim))
Updated hidden states.
router_logits (torch.Tensor of shape (batch_size, sequence_length, num_experts)): Needed for computing the loss.
The goal of this forward pass is to have the same number of operations as the equivalent NllbMoeDenseActDense (MLP) layer. This means that each hidden state is processed at most twice (since we are using a top-2 gating mechanism), keeping the complexity at O(batch_size x sequence_length x hidden_dim) instead of O(num_experts x batch_size x sequence_length x hidden_dim).
1- Get the router_probs from the router. The router_mask, of shape (batch_size X sequence_length, num_experts), is the boolean version of the router_probs. The inputs are masked using the router_mask.
2- Dispatch the hidden states to their associated experts. The router probabilities are used to weight the contribution of each expert when updating the masked hidden states. A minimal sketch of this dispatch step is shown below.
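The following is an illustrative sketch of that dispatch-and-combine step, not the library code; the shapes, the expert modules, and the top-2 mask construction are placeholders.

```python
import torch

# Placeholder shapes and experts for illustration only.
num_tokens, hidden_dim, num_experts = 8, 16, 4
hidden_states = torch.randn(num_tokens, hidden_dim)
router_probs = torch.softmax(torch.randn(num_tokens, num_experts), dim=-1)
experts = [torch.nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts)]

# Boolean router mask: True where a token is assigned to one of its top-2 experts.
_, top2_indices = torch.topk(router_probs, k=2, dim=-1)
router_mask = torch.zeros(num_tokens, num_experts, dtype=torch.bool)
router_mask.scatter_(1, top2_indices, True)

# Each expert only processes the tokens routed to it; outputs are combined using
# the routing probabilities, and unrouted tokens stay at zero (i.e. masked).
combined = torch.zeros_like(hidden_states)
for expert_idx, expert in enumerate(experts):
    token_idx = router_mask[:, expert_idx]
    if token_idx.any():
        expert_out = expert(hidden_states[token_idx])
        combined[token_idx] += router_probs[token_idx, expert_idx, None] * expert_out
```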
NllbMoeModel
class transformers.NllbMoeModel
( config: NllbMoeConfig )
Parameters
config (NllbMoeConfig) โ Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare NllbMoe Model outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, decoder_input_ids: typing.Optional[torch.LongTensor] = None, decoder_attention_mask: typing.Optional[torch.LongTensor] = None, head_mask: typing.Optional[torch.Tensor] = None, decoder_head_mask: typing.Optional[torch.Tensor] = None, cross_attn_head_mask: typing.Optional[torch.Tensor] = None, encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, output_router_logits: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.Seq2SeqMoEModelOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)): Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
Returns
transformers.modeling_outputs.Seq2SeqMoEModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.Seq2SeqMoEModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (NllbMoeConfig) and inputs.
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
decoder_router_logits (tuple(torch.FloatTensor), optional, returned when output_router_logits=True is passed or when config.add_router_probs=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, num_experts). Router logits of the decoder model, useful to compute the auxiliary loss for Mixture of Experts models.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
encoder_router_logits (tuple(torch.FloatTensor), optional, returned when output_router_logits=True is passed or when config.add_router_probs=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, num_experts). Router logits of the encoder model, useful to compute the auxiliary loss and the z_loss for the sparse modules.
The NllbMoeModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
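A minimal sketch of a forward pass with the bare model, assuming the facebook/nllb-moe-54b checkpoint (as a seq2seq model, it also needs decoder inputs):

```python
import torch
from transformers import AutoTokenizer, NllbMoeModel

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b")
model = NllbMoeModel.from_pretrained("facebook/nllb-moe-54b")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

# The decoder needs a start token; use the configured decoder_start_token_id.
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

outputs = model(input_ids=inputs.input_ids, decoder_input_ids=decoder_input_ids)
last_hidden_states = outputs.last_hidden_state  # (batch_size, sequence_length, hidden_size)
```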
NllbMoeForConditionalGeneration
class transformers.NllbMoeForConditionalGeneration
( config: NllbMoeConfig )
Parameters
config (NllbMoeConfig) โ Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The NllbMoe Model with a language modeling head, which can be used for summarization. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, decoder_input_ids: typing.Optional[torch.LongTensor] = None, decoder_attention_mask: typing.Optional[torch.LongTensor] = None, head_mask: typing.Optional[torch.Tensor] = None, decoder_head_mask: typing.Optional[torch.Tensor] = None, cross_attn_head_mask: typing.Optional[torch.Tensor] = None, encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, output_router_logits: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.Seq2SeqMoEOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)): Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
1 for tokens that are not masked,
0 for tokens that are masked.
decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional): Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. NllbMoe uses the eos_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values).
decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional): Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional): Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:
1 indicates the head is not masked,
0 indicates the head is masked.
decoder_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional): Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in [0, 1]:
1 indicates the head is not masked,
0 indicates the head is masked.
cross_attn_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional): Mask to nullify selected heads of the cross-attention modules in the decoder. Mask values selected in [0, 1]:
1 indicates the head is not masked,
0 indicates the head is masked.
encoder_outputs (tuple(tuple(torch.FloatTensor)), optional): Tuple consisting of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state of shape (batch_size, sequence_length, hidden_size), optional, is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional): Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model's internal embedding lookup matrix. If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.
use_cache (bool, optional): If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional): Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
output_router_logits (bool, optional): Whether or not to return the logits of all the routers. They are useful for computing the router loss and should not be returned during inference.
return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional): Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
Returns
transformers.modeling_outputs.Seq2SeqMoEOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.Seq2SeqMoEOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (NllbMoeConfig) and inputs.
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Language modeling loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
decoder_router_logits (tuple(torch.FloatTensor), optional, returned when output_router_logits=True is passed or when config.add_router_probs=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, num_experts). Router logits of the decoder model, useful to compute the auxiliary loss for Mixture of Experts models.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
encoder_router_logits (tuple(torch.FloatTensor), optional, returned when output_router_logits=True is passed or when config.add_router_probs=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, num_experts). Router logits of the encoder model, useful to compute the auxiliary loss and z_loss for Mixture of Experts models.
The NllbMoeForConditionalGeneration forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Translation example:
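A minimal translation sketch, assuming the facebook/nllb-moe-54b checkpoint and the French target code fra_Latn (the input sentence is a placeholder):

```python
from transformers import AutoTokenizer, NllbMoeForConditionalGeneration

model = NllbMoeForConditionalGeneration.from_pretrained("facebook/nllb-moe-54b")
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b")

text_to_translate = "Life is like a box of chocolates"  # placeholder input
model_inputs = tokenizer(text_to_translate, return_tensors="pt")

# Translate to French by forcing the French language token at the start of generation.
gen_tokens = model.generate(
    **model_inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn")
)
print(tokenizer.batch_decode(gen_tokens, skip_special_tokens=True))
```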