LLaMA
The LLaMA model was proposed in LLaMA: Open and Efficient Foundation Language Models by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. It is a collection of foundation language models ranging from 7B to 65B parameters.
The abstract from the paper is the following:
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.
Tips:
Weights for the LLaMA models can be obtained by filling out Meta AI's access request form.
After downloading the weights, they will need to be converted to the BOINC AI Transformers format using the conversion script. The script can be called with the following (example) command:
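As a sketch, assuming the weights were downloaded to /path/to/downloaded/llama/weights and the 7B variant is being converted (adjust the paths and --model_size for other sizes):

```bash
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path
```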
After conversion, the model and tokenizer can be loaded via:
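For example, assuming the conversion step above wrote the checkpoint to /output/path:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("/output/path")
model = LlamaForCausalLM.from_pretrained("/output/path")
```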
Note that executing the script requires enough CPU RAM to host the whole model in float16 precision. Even though the biggest versions come split across several checkpoints, each checkpoint contains a part of every weight of the model, so all of them must be loaded in RAM at once. For the 65B model, that works out to 130GB of RAM.
Based on the original LLaMA model, Meta AI has released some follow-up works:
A list of official BOINC AI and community (indicated by 🌎) resources to help you get started with LLaMA. If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
Text Classification
Question Answering
⚗️ Optimization
⚡️ Inference
🚀 Deploy
( vocab_size = 32000, hidden_size = 4096, intermediate_size = 11008, num_hidden_layers = 32, num_attention_heads = 32, num_key_value_heads = None, hidden_act = 'silu', max_position_embeddings = 2048, initializer_range = 0.02, rms_norm_eps = 1e-06, use_cache = True, pad_token_id = None, bos_token_id = 1, eos_token_id = 2, pretraining_tp = 1, tie_word_embeddings = False, rope_theta = 10000.0, rope_scaling = None, attention_bias = False, **kwargs )
Parameters
hidden_size (int, optional, defaults to 4096) — Dimension of the hidden representations.
intermediate_size (int, optional, defaults to 11008) — Dimension of the MLP representations.
num_hidden_layers (int, optional, defaults to 32) — Number of hidden layers in the Transformer decoder.
num_attention_heads (int, optional, defaults to 32) — Number of attention heads for each attention layer in the Transformer decoder.
num_key_value_heads (int, optional) — The number of key/value heads used to implement Grouped Query Attention (GQA). If num_key_value_heads=num_attention_heads, the model uses Multi Head Attention (MHA); if num_key_value_heads=1, the model uses Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out [this paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, it defaults to num_attention_heads.
hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the decoder.
max_position_embeddings (int, optional, defaults to 2048) — The maximum sequence length that this model might ever be used with. Llama 1 supports up to 2048 tokens, Llama 2 up to 4096, CodeLlama up to 16384.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
rms_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the rms normalization layers.
use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
tie_word_embeddings (bool, optional, defaults to False) — Whether to tie the input and output word embeddings.
rope_theta (float, optional, defaults to 10000.0) — The base period of the RoPE embeddings.
attention_bias (bool, optional, defaults to False) — Whether to use a bias in the query, key, value and output projection layers during self-attention.
Example —
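A minimal sketch of instantiating a default configuration and building a model from it (the commented-out num_key_value_heads override is only an illustration of the GQA option described above):

```python
from transformers import LlamaConfig, LlamaModel

# Initializing a LLaMA-7B style configuration (all defaults)
configuration = LlamaConfig()

# Optional: fewer key/value heads than attention heads enables Grouped Query Attention
# configuration = LlamaConfig(num_key_value_heads=8)

# Initializing a model (with random weights) from the configuration
model = LlamaModel(configuration)

# Accessing the model configuration
configuration = model.config
```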
( vocab_file, unk_token = '<unk>', bos_token = '<s>', eos_token = '</s>', pad_token = None, sp_model_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None, add_bos_token = True, add_eos_token = False, clean_up_tokenization_spaces = False, use_default_system_prompt = True, spaces_between_special_tokens = False, legacy = None, **kwargs )
Parameters
vocab_file (str) — Path to the vocabulary file.
legacy (bool, optional) — Whether or not the legacy behavior of the tokenizer should be used. Legacy refers to the behavior before the merge of #24622 and #25224, which include fixes to properly handle tokens that appear after special tokens. Tokenizing text that follows a special token differs between the two modes, as sketched below:
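A minimal sketch, assuming the publicly hosted huggyllama/llama-7b vocabulary (any LLaMA sentencepiece checkpoint works); exact token splits depend on the vocabulary, so no outputs are asserted here:

```python
from transformers import LlamaTokenizer

tok_legacy = LlamaTokenizer.from_pretrained("huggyllama/llama-7b", legacy=True)
tok_fixed = LlamaTokenizer.from_pretrained("huggyllama/llama-7b", legacy=False)

text = "<s> Hello"  # text that immediately follows a special token
print(tok_legacy.tokenize(text))  # legacy handling of the space after the special token
print(tok_fixed.tokenize(text))   # behavior after the #24622 / #25224 fixes
```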
Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding. The default padding token is unset as there is no padding token in the original model.
build_inputs_with_special_tokens
( token_ids_0, token_ids_1 = None )
get_special_tokens_mask
( token_ids_0: typing.List[int], token_ids_1: typing.Optional[typing.List[int]] = None, already_has_special_tokens: bool = False ) → List[int]
Parameters
token_ids_0 (List[int]) — List of IDs.
token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.
already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model.
Returns
List[int]
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.
create_token_type_ids_from_sequences
( token_ids_0: typing.List[int], token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]
Parameters
token_ids_0 (List[int]) — List of IDs.
token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.
Returns
List[int]
Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT sequence pair mask has the following format:
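Restored layout of that mask (0s cover the first sequence and its special tokens, 1s the second):

```
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |
```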
If token_ids_1 is None, this method only returns the first portion of the mask (0s).
save_vocabulary
( save_directory, filename_prefix: typing.Optional[str] = None ) → Tuple(str)
Parameters
save_directory (str) — The directory in which to save the vocabulary.
Returns
Tuple(str)
Paths to the files saved.
Save the vocabulary and special tokens file to a directory.
( vocab_file = None, tokenizer_file = None, clean_up_tokenization_spaces = False, unk_token = '<unk>', bos_token = '<s>', eos_token = '</s>', add_bos_token = True, add_eos_token = False, use_default_system_prompt = True, **kwargs )
Parameters
clean_up_tokenization_spaces (bool, optional, defaults to False) — Whether to clean up spaces after decoding; cleanup consists of removing potential artifacts like extra spaces.
bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
eos_token (str, optional, defaults to "</s>") — The end of sequence token.
unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.
This tokenizer notably uses ByteFallback and no normalization.
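A minimal usage sketch, assuming the hf-internal-testing/llama-tokenizer repository that hosts a converted LLaMA tokenizer (any converted checkpoint works):

```python
from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
tokenizer.encode("Hello this is a test")
```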
build_inputs_with_special_tokens
( token_ids_0, token_ids_1 = None )
get_special_tokens_mask
( token_ids_0: typing.List[int], token_ids_1: typing.Optional[typing.List[int]] = None, already_has_special_tokens: bool = False ) → A list of integers in the range [0, 1]
Parameters
token_ids_0 (List[int]) — List of IDs of the first sequence.
token_ids_1 (List[int], optional) — List of IDs of the second sequence.
already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model.
Returns
A list of integers in the range [0, 1]
1 for a special token, 0 for a sequence token.
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model or encode_plus methods.
create_token_type_ids_from_sequences
( token_ids_0: typing.List[int], token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]
Parameters
token_ids_0 (List[int]) — The first tokenized sequence.
token_ids_1 (List[int], optional) — The second tokenized sequence.
Returns
List[int]
The token type ids.
Should be overridden in a subclass if the model has a special way of building those.
update_post_processor
( )
Updates the underlying post processor with the current bos_token and eos_token.
save_vocabulary
( save_directory: str, filename_prefix: typing.Optional[str] = None )
( config: LlamaConfig )
Parameters
Transformer decoder consisting of config.num_hidden_layers layers. Each layer is a LlamaDecoderLayer.
forward
( input_ids: LongTensor = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None )
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
1 for tokens that are not masked,
0 for tokens that are masked.
If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values).
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
If past_key_values are used, the user can optionally input only the last input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
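As a hedged illustration of the use_cache / past_key_values mechanics described above (the checkpoint path is a placeholder for any converted LLaMA checkpoint):

```python
from transformers import AutoTokenizer, LlamaModel

model = LlamaModel.from_pretrained("/output/path")        # placeholder path
tokenizer = AutoTokenizer.from_pretrained("/output/path")

inputs = tokenizer("The capital of France is", return_tensors="pt")
out = model(**inputs, use_cache=True)
past = out.past_key_values  # cached key/value states for every layer

# On the next step, feed only the newest token and reuse the cache.
next_ids = tokenizer(" Paris", add_special_tokens=False, return_tensors="pt").input_ids[:, -1:]
out2 = model(input_ids=next_ids, past_key_values=past, use_cache=True)
```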
( config )
forward
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
1 for tokens that are not masked,
0 for tokens that are masked.
If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values).
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
If past_key_values are used, the user can optionally input only the last input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
Args — labels (torch.LongTensor of shape (batch_size, sequence_length), optional): Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
Returns
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
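A minimal generation sketch; the paths are placeholders for the converted weights and tokenizer produced by the conversion step described in the tips above:

```python
from transformers import AutoTokenizer, LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained("/output/path")
tokenizer = AutoTokenizer.from_pretrained("/output/path")

prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a short continuation and decode it back to text
generate_ids = model.generate(inputs.input_ids, max_length=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```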
( config )
Parameters
The LLaMA Model transformer with a sequence classification head on top (linear layer).
Since it does classification on the last token, it needs to know the position of the last token. If a pad_token_id is defined in the configuration, it finds the last token that is not a padding token in each row. If no pad_token_id is defined, it simply takes the last value in each row of the batch. Since it cannot guess the padding tokens when inputs_embeds are passed instead of input_ids, it does the same (takes the last value in each row of the batch).
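A hedged sketch of how that padding logic plays out in practice (the checkpoint path and num_labels are placeholders; LLaMA ships no padding token, so one is assigned here):

```python
import torch
from transformers import AutoTokenizer, LlamaForSequenceClassification

model = LlamaForSequenceClassification.from_pretrained("/output/path", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("/output/path")

# Reuse EOS as the padding token so batched inputs of different lengths can be padded.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # scores taken from the last non-padding token of each row
```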
forward
( input_ids: LongTensor = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None )
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
1 for tokens that are not masked,
0 for tokens that are masked.
If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values).
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
If past_key_values are used, the user can optionally input only the last input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (Mean-Square loss); if config.num_labels > 1, a classification loss is computed (Cross-Entropy).
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
The LLaMA tokenizer is a BPE model based on sentencepiece. One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of a word (e.g. “Banana”), the tokenizer does not prepend the prefix space to the string.
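A small hedged sketch of that quirk, assuming the huggyllama/llama-7b vocabulary (any LLaMA sentencepiece vocabulary behaves the same way):

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b")

ids = tokenizer.encode("Banana", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(ids))  # the first piece carries sentencepiece's "▁" word marker
print(tokenizer.decode(ids))                 # decoding does not prepend a leading space
```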
This model was contributed by zphang, with contributions from BlackSamorez. The implementation in BOINC AI Transformers is based on GPT-NeoX. The original code from the authors can be found in Meta AI's llama repository.
Llama2: Llama2 is an improved version of Llama with some architectural tweaks (Grouped Query Attention), and is pre-trained on 2 trillion tokens. Refer to the Llama2 documentation for more details.
A notebook on how to use prompt tuning to adapt the LLaMA model for a text classification task. 🌎
StackLLaMA, a blog post about how to train LLaMA to answer questions on Stack Exchange with RLHF.
A notebook on how to fine-tune the LLaMA model using the xturing library on a GPU with limited memory. 🌎
A notebook on how to run the LLaMA Model using PeftModel from the 🌎 PEFT library. 🌎
A notebook on how to load a PEFT adapter LLaMA model with LangChain. 🌎
A notebook on how to fine-tune the LLaMA model using the LoRA method via the 🌎 PEFT library with an intuitive UI. 🌎
A notebook on how to deploy the Open-LLaMA model for text generation on Amazon SageMaker. 🌎
vocab_size (int, optional, defaults to 32000) — Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling LlamaModel.
pretraining_tp (int, optional, defaults to 1) — Experimental feature. Tensor parallelism rank used during pretraining. This value is necessary to ensure exact reproducibility of the pretraining results; please refer to the tensor parallelism documentation to understand more about it.
rope_scaling (Dict, optional) — Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is {"type": strategy name, "factor": scaling factor}. When using this flag, don’t update max_position_embeddings to the expected new maximum; see the original discussion thread for more information on how these scaling strategies behave. This is an experimental feature, subject to breaking API changes in future versions.
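A small hedged sketch of that format (the factor value is illustrative only):

```python
from transformers import LlamaConfig

# Linear RoPE scaling: positions are interpolated so the model can run past its
# original max_position_embeddings at inference time.
config = LlamaConfig(rope_scaling={"type": "linear", "factor": 2.0})
```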
This is the configuration class to store the configuration of a LlamaModel. It is used to instantiate a LLaMA model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of LLaMA-7B.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.
List of token type IDs according to the given sequence(s).
vocab_file (str) — SentencePiece file (generally with a .model extension) that contains the vocabulary necessary to instantiate a tokenizer.
tokenizer_file (str) — Tokenizers file (generally with a .json extension) that contains everything needed to load the tokenizer.
If you want to change the bos_token or the eos_token, make sure to specify them when initializing the model, or call tokenizer.update_post_processor() to make sure that the post-processing is correctly done (otherwise the values of the first token and final token of an encoded sequence will not be correct). For more details, check out the post-processors documentation.
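A hedged sketch of the two options just described (the custom token strings and repository are placeholders):

```python
from transformers import LlamaTokenizerFast

# Option 1: pass the special tokens when initializing the tokenizer.
tokenizer = LlamaTokenizerFast.from_pretrained(
    "hf-internal-testing/llama-tokenizer", bos_token="<custom_bos>", eos_token="<custom_eos>"
)

# Option 2: change them afterwards and re-sync the post processor.
tokenizer.eos_token = "</s>"
tokenizer.update_post_processor()
```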
This tokenizer inherits from PreTrainedTokenizerFast, which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
Create the token type IDs corresponding to the sequences passed.
config (LlamaConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare LLaMA Model outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify it to your needs. See diagram 1 in the paper for more information on the default strategy.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
The LlamaModel forward method overrides the __call__ special method.
( input_ids: LongTensor = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → CausalLMOutputWithPast or tuple(torch.FloatTensor)
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify it to your needs. See diagram 1 in the paper for more information on the default strategy.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
CausalLMOutputWithPast or tuple(torch.FloatTensor)
A CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LlamaConfig) and inputs.
The LlamaForCausalLM forward method overrides the __call__ special method.
config (LlamaConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
LlamaForSequenceClassification uses the last token in order to do the classification, as other causal models (e.g. GPT-2) do.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify it to your needs. See diagram 1 in the paper for more information on the default strategy.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
The LlamaForSequenceClassification forward method overrides the __call__ special method.