ErnieM
Overview
The ErnieM model was proposed in ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang.
The abstract from the paper is the following:
Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance in downstream cross-lingual tasks. This improvement benefits from learning a large amount of monolingual and parallel corpora. Although it is generally acknowledged that parallel corpora are critical for improving the model performance, existing methods are often constrained by the size of parallel corpora, especially for low-resource languages. In this paper, we propose ERNIE-M, a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance. Our key insight is to integrate back-translation into the pre-training process. We generate pseudo-parallel sentence pairs on a monolingual corpus to enable the learning of semantic alignments between different languages, thereby enhancing the semantic modeling of cross-lingual models. Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results in various cross-lingual downstream tasks.
Tips:
Ernie-M is a BERT-like model, i.e. a stacked Transformer encoder.
Instead of using MaskedLM for pretraining (like BERT), the authors used two novel techniques: Cross-attention Masked Language Modeling and Back-translation Masked Language Modeling. For now these two LMHead objectives are not implemented here.
It is a multilingual language model.
Next Sentence Prediction was not used in the pretraining process.
This model was contributed by Susnato Dhar. The original code can be found here.
Documentation resources
ErnieMConfig
class transformers.ErnieMConfig
( vocab_size: int = 250002, hidden_size: int = 768, num_hidden_layers: int = 12, num_attention_heads: int = 12, intermediate_size: int = 3072, hidden_act: str = 'gelu', hidden_dropout_prob: float = 0.1, attention_probs_dropout_prob: float = 0.1, max_position_embeddings: int = 514, initializer_range: float = 0.02, pad_token_id: int = 1, layer_norm_eps: float = 1e-05, classifier_dropout = None, is_decoder = False, act_dropout = 0.0, **kwargs )
Parameters
vocab_size (int, optional, defaults to 250002) — Vocabulary size of inputs_ids in ErnieMModel. It is also the vocabulary size of the token embedding matrix, and defines the number of different tokens that can be represented by the inputs_ids passed when calling ErnieMModel.
hidden_size (int, optional, defaults to 768) — Dimensionality of the embedding layer, encoder layers and pooler layer.
num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 3072) — Dimensionality of the feed-forward (ff) layer in the encoder. Input tensors to feed-forward layers are first projected from hidden_size to intermediate_size, and then projected back to hidden_size. Typically intermediate_size is larger than hidden_size.
hidden_act (str, optional, defaults to "gelu") — The non-linear activation function in the feed-forward layer. "gelu", "relu" and any other torch-supported activation functions are supported.
hidden_dropout_prob (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings and encoder.
attention_probs_dropout_prob (float, optional, defaults to 0.1) — The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention targets.
act_dropout (float, optional, defaults to 0.0) — The dropout probability used in ErnieMEncoderLayer after activation.
max_position_embeddings (int, optional, defaults to 514) — The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input sequence.
layer_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the layer normalization layers.
classifier_dropout (float, optional) — The dropout ratio for the classification head.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the normal initializer for initializing all weight matrices.
pad_token_id (int, optional, defaults to 1) — The index of the padding token in the token vocabulary.
This is the configuration class to store the configuration of an ErnieMModel. It is used to instantiate an Ernie-M model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Ernie-M
susnato/ernie-m-base_pytorch architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
A normal_initializer initializes weight matrices as normal distributions. See ErnieMPretrainedModel._init_weights() for how weights are initialized in ErnieMModel.
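A minimal sketch of the usual configuration-to-model pattern (the defaults shown above are assumed):

```python
from transformers import ErnieMConfig, ErnieMModel

# Initialize an Ernie-M configuration with the default values
# (similar to the susnato/ernie-m-base_pytorch architecture)
configuration = ErnieMConfig()

# Initialize a model with random weights from that configuration
model = ErnieMModel(configuration)

# Access the model configuration
configuration = model.config
```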
ErnieMTokenizer
class transformers.ErnieMTokenizer
( sentencepiece_model_ckpt, vocab_file = None, do_lower_case = False, encoding = 'utf8', unk_token = '[UNK]', sep_token = '[SEP]', pad_token = '[PAD]', cls_token = '[CLS]', mask_token = '[MASK]', sp_model_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None, **kwargs )
Parameters
sentencepiece_model_file (str) — The file path of the sentencepiece model.
vocab_file (str, optional) — The file path of the vocabulary.
do_lower_case (bool, optional, defaults to False) — Whether or not to lowercase the input when tokenizing.
unk_token (str, optional, defaults to "[UNK]") — A special token representing the unknown (out-of-vocabulary) token. A token that is not in the vocabulary is set to unk_token in order to be converted to an ID.
sep_token (str, optional, defaults to "[SEP]") — A special token separating two different sentences in the same input.
pad_token (str, optional, defaults to "[PAD]") — A special token used to make arrays of tokens the same size for batching purposes.
cls_token (str, optional, defaults to "[CLS]") — A special token used for sequence classification. It is the first token of the sequence when built with special tokens.
mask_token (str, optional, defaults to "[MASK]") — A special token representing a masked token. This is the token used in the masked language modeling task, in which the model tries to predict the original unmasked token.
Constructs an Ernie-M tokenizer. It uses the sentencepiece tools to split words into sub-words.
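A short usage sketch, assuming the susnato/ernie-m-base_pytorch checkpoint referenced on this page also hosts the tokenizer files:

```python
from transformers import ErnieMTokenizer

tokenizer = ErnieMTokenizer.from_pretrained("susnato/ernie-m-base_pytorch")

# Sub-word tokenization with sentencepiece; [CLS] and [SEP] are added automatically
encoded = tokenizer("ERNIE-M aligns representations across languages.")
print(encoded["input_ids"])
```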
build_inputs_with_special_tokens
( token_ids_0, token_ids_1 = None ) → List[int]
Parameters
token_ids_0 (List[int]) — List of IDs to which the special tokens will be added.
token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.
Returns
List[int]
List of input IDs with the appropriate special tokens.
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An ErnieM sequence has the following format:
single sequence:
[CLS] X [SEP]
pair of sequences:
[CLS] A [SEP] [SEP] B [SEP]
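For illustration, a hedged sketch of both formats (the token IDs passed in are arbitrary placeholders):

```python
from transformers import ErnieMTokenizer

tokenizer = ErnieMTokenizer.from_pretrained("susnato/ernie-m-base_pytorch")

# Single sequence: [CLS] X [SEP]
single = tokenizer.build_inputs_with_special_tokens(token_ids_0=[10, 11, 12])

# Pair of sequences: [CLS] A [SEP] [SEP] B [SEP]
pair = tokenizer.build_inputs_with_special_tokens(token_ids_0=[10, 11], token_ids_1=[20, 21])
```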
get_special_tokens_mask
( token_ids_0, token_ids_1 = None, already_has_special_tokens = False ) → List[int]
Parameters
token_ids_0 (List[int]) — List of IDs of the first sequence.
token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.
already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model.
Returns
List[int]
The list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer encode
method.
create_token_type_ids_from_sequences
( token_ids_0: typing.List[int], token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]
Parameters
token_ids_0 (List[int]) — The first tokenized sequence.
token_ids_1 (List[int], optional) — The second tokenized sequence.
Returns
List[int]
The token type IDs.
Create the token type IDs corresponding to the sequences passed. What are token type IDs? Should be overridden in a subclass if the model has a special way of building those.
save_vocabulary
( save_directory: str, filename_prefix: typing.Optional[str] = None )
ErnieMModel
class transformers.ErnieMModel
( config, add_pooling_layer = True )
Parameters
config (ErnieMConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare ErnieM Model transformer outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_ids: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, head_mask: typing.Optional[torch.Tensor] = None, inputs_embeds: typing.Optional[torch.Tensor] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, use_cache: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.BaseModelOutputWithPastAndCrossAttentions or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using ErnieMTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.BaseModelOutputWithPastAndCrossAttentions or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPastAndCrossAttentions or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ErnieMConfig) and inputs.
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences, of shape (batch_size, 1, hidden_size), is output.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and optionally, if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
The ErnieMModel forward method overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
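A minimal sketch of the standard feature-extraction pattern, assuming the susnato/ernie-m-base_pytorch checkpoint cited above:

```python
import torch
from transformers import ErnieMTokenizer, ErnieMModel

tokenizer = ErnieMTokenizer.from_pretrained("susnato/ernie-m-base_pytorch")
model = ErnieMModel.from_pretrained("susnato/ernie-m-base_pytorch")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state
```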
ErnieMForSequenceClassification
class transformers.ErnieMForSequenceClassification
( config )
Parameters
config (ErnieMConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
ErnieM Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_ids: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.Tensor] = None, head_mask: typing.Optional[torch.Tensor] = None, inputs_embeds: typing.Optional[torch.Tensor] = None, past_key_values: typing.Optional[typing.List[torch.Tensor]] = None, use_cache: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = True, labels: typing.Optional[torch.Tensor] = None ) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using ErnieMTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss); if config.num_labels > 1 a classification loss is computed (Cross-Entropy).
Returns
transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.SequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ErnieMConfig) and inputs.
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ErnieMForSequenceClassification forward method overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example of single-label classification:
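A hedged sketch of the usual single-label pattern; the base checkpoint has no fine-tuned classification head, so the predicted label is only meaningful after fine-tuning:

```python
import torch
from transformers import ErnieMTokenizer, ErnieMForSequenceClassification

tokenizer = ErnieMTokenizer.from_pretrained("susnato/ernie-m-base_pytorch")
model = ErnieMForSequenceClassification.from_pretrained("susnato/ernie-m-base_pytorch")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()
print(model.config.id2label[predicted_class_id])
```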
Example of multi-label classification:
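A hedged sketch of the multi-label variant; problem_type and num_labels are illustrative choices, not values from the original documentation:

```python
import torch
from transformers import ErnieMTokenizer, ErnieMForSequenceClassification

tokenizer = ErnieMTokenizer.from_pretrained("susnato/ernie-m-base_pytorch")
model = ErnieMForSequenceClassification.from_pretrained(
    "susnato/ernie-m-base_pytorch",
    problem_type="multi_label_classification",
    num_labels=3,
)

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# In the multi-label setting each label is scored independently through a sigmoid
predictions = (torch.sigmoid(logits) > 0.5).int()
```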
ErnieMForMultipleChoice
class transformers.ErnieMForMultipleChoice
( config )
Parameters
config (ErnieMConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
ErnieM Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a softmax) e.g. for RocStories/SWAG tasks.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_ids: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.Tensor] = None, head_mask: typing.Optional[torch.Tensor] = None, inputs_embeds: typing.Optional[torch.Tensor] = None, labels: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = True ) → transformers.modeling_outputs.MultipleChoiceModelOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, num_choices, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using ErnieMTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.FloatTensor of shape (batch_size, num_choices, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
position_ids (torch.LongTensor of shape (batch_size, num_choices, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, num_choices, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the multiple choice classification loss. Indices should be in [0, ..., num_choices - 1] where num_choices is the size of the second dimension of the input tensors. (See input_ids above.)
Returns
transformers.modeling_outputs.MultipleChoiceModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.MultipleChoiceModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ErnieMConfig) and inputs.
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss.
logits (torch.FloatTensor of shape (batch_size, num_choices)) — Classification scores (before SoftMax); num_choices is the second dimension of the input tensors (see input_ids above).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ErnieMForMultipleChoice forward method overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
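A hedged sketch of the multiple-choice pattern; the prompt and choice strings are illustrative, and the extra unsqueeze adds the (batch_size, num_choices, sequence_length) layout the head expects:

```python
import torch
from transformers import ErnieMTokenizer, ErnieMForMultipleChoice

tokenizer = ErnieMTokenizer.from_pretrained("susnato/ernie-m-base_pytorch")
model = ErnieMForMultipleChoice.from_pretrained("susnato/ernie-m-base_pytorch")

prompt = "In Italy, pizza served in formal settings is presented unsliced."
choice0 = "It is eaten with a fork and a knife."
choice1 = "It is eaten while held in the hand."

# Encode the prompt against each choice, then add a leading batch dimension
encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors="pt", padding=True)
inputs = {k: v.unsqueeze(0) for k, v in encoding.items() if k in ("input_ids", "attention_mask")}

with torch.no_grad():
    logits = model(**inputs).logits  # shape (batch_size, num_choices)
```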
ErnieMForTokenClassification
class transformers.ErnieMForTokenClassification
( config )
Parameters
config (ErnieMConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
ErnieM Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_ids: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.Tensor] = None, head_mask: typing.Optional[torch.Tensor] = None, inputs_embeds: typing.Optional[torch.Tensor] = None, past_key_values: typing.Optional[typing.List[torch.Tensor]] = None, output_hidden_states: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = True, labels: typing.Optional[torch.Tensor] = None ) → transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using ErnieMTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the token classification loss. Indices should be in [0, ..., config.num_labels - 1].
Returns
transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.TokenClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ErnieMConfig) and inputs.
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)) — Classification scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ErnieMForTokenClassification forward method overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
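A hedged sketch of the token-classification pattern; without fine-tuning, the per-token labels come from a randomly initialized head:

```python
import torch
from transformers import ErnieMTokenizer, ErnieMForTokenClassification

tokenizer = ErnieMTokenizer.from_pretrained("susnato/ernie-m-base_pytorch")
model = ErnieMForTokenClassification.from_pretrained("susnato/ernie-m-base_pytorch")

inputs = tokenizer("HuggingFace is based in New York City", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = logits.argmax(-1)
predicted_labels = [model.config.id2label[i.item()] for i in predicted_ids[0]]
```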
ErnieMForQuestionAnswering
class transformers.ErnieMForQuestionAnswering
( config )
Parameters
config (ErnieMConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
ErnieM Model with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of the hidden-states output to compute span start logits and span end logits).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_ids: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.Tensor] = None, head_mask: typing.Optional[torch.Tensor] = None, inputs_embeds: typing.Optional[torch.Tensor] = None, start_positions: typing.Optional[torch.Tensor] = None, end_positions: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = True ) → transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using ErnieMTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
start_positions (torch.LongTensor of shape (batch_size,), optional) — Labels for the position (index) of the start of the labelled span, used to compute the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Positions outside of the sequence are not taken into account for computing the loss.
end_positions (torch.LongTensor of shape (batch_size,), optional) — Labels for the position (index) of the end of the labelled span, used to compute the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Positions outside of the sequence are not taken into account for computing the loss.
Returns
transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.QuestionAnsweringModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ErnieMConfig) and inputs.
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Total span extraction loss, the sum of a Cross-Entropy for the start and end positions.
start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).
end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The ErnieMForQuestionAnswering forward method overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
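A hedged sketch of the extractive question-answering pattern; the span decoded from a base (not fine-tuned) checkpoint will not be meaningful:

```python
import torch
from transformers import ErnieMTokenizer, ErnieMForQuestionAnswering

tokenizer = ErnieMTokenizer.from_pretrained("susnato/ernie-m-base_pytorch")
model = ErnieMForQuestionAnswering.from_pretrained("susnato/ernie-m-base_pytorch")

question, context = "Who was Jim Henson?", "Jim Henson was a nice puppet"
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

answer_start = outputs.start_logits.argmax()
answer_end = outputs.end_logits.argmax()
answer_tokens = inputs["input_ids"][0, answer_start : answer_end + 1]
print(tokenizer.decode(answer_tokens))
```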
ErnieMForInformationExtraction
class transformers.ErnieMForInformationExtraction
( config )
Parameters
config (ErnieMConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
ErnieMForInformationExtraction is an Ernie-M Model with two linear layers on top of the hidden-states output to compute start_prob and end_prob, designed for Universal Information Extraction.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_ids: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.Tensor] = None, head_mask: typing.Optional[torch.Tensor] = None, inputs_embeds: typing.Optional[torch.Tensor] = None, start_positions: typing.Optional[torch.Tensor] = None, end_positions: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = True )
Parameters
input_ids (torch.LongTensor of shape (batch_size, num_choices, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using ErnieMTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.FloatTensor of shape (batch_size, num_choices, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
position_ids (torch.LongTensor of shape (batch_size, num_choices, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked, 0 indicates the head is masked.
inputs_embeds (torch.FloatTensor of shape (batch_size, num_choices, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
start_positions (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for the position (index) used to compute the start_positions loss. Positions outside of the sequence are not taken into account for computing the loss.
end_positions (torch.LongTensor of shape (batch_size,), optional) — Labels for the position (index) used to compute the end_positions loss. Positions outside of the sequence are not taken into account for computing the loss.
The ErnieMForInformationExtraction forward method overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
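A hedged sketch of how a prompt-style extraction query could be run with this head; the prompt and passage strings are illustrative, and the exact output fields are an assumption rather than something documented above:

```python
import torch
from transformers import ErnieMTokenizer, ErnieMForInformationExtraction

tokenizer = ErnieMTokenizer.from_pretrained("susnato/ernie-m-base_pytorch")
model = ErnieMForInformationExtraction.from_pretrained("susnato/ernie-m-base_pytorch")

# UIE-style input: the first text names what to extract, the second is the passage
inputs = tokenizer("time", "Meet me at 7 pm tomorrow", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs holds per-token start/end probabilities for the queried span
print(outputs)
```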