Llama2
The Llama2 model was proposed by Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. It is a collection of foundation language models ranging from 7B to 70B parameters, with checkpoints fine-tuned for chat applications.
The abstract from the paper is the following:
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
Check out all Llama2 models
The Llama2 models were trained using `bfloat16`, but the original inference uses `float16`. The checkpoints uploaded on the hub use `torch_dtype = 'float16'`, which will be used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`.
The `dtype` of the online weights is mostly irrelevant, unless you are using `torch_dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`. The reason is that the model will first be downloaded (using the `dtype` of the checkpoints online), then it will be cast to the default `dtype` of `torch` (becomes `torch.float32`), and finally, if there is a `torch_dtype` provided in the config, it will be used.
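The resolution order described above can be sketched in plain Python. This is a simplified model, not the library's code: `resolve_dtype` and its arguments are hypothetical names, and dtypes are represented as strings so the snippet runs without `torch`.

```python
from typing import Optional


def resolve_dtype(requested: Optional[str],
                  config_dtype: Optional[str],
                  checkpoint_dtype: str) -> str:
    """Return the dtype the loaded weights end up in (simplified sketch)."""
    if requested is None:
        # No torch_dtype argument: weights are cast to torch's default dtype.
        return "float32"
    if requested == "auto":
        # "auto": use the dtype recorded in the config if there is one,
        # otherwise fall back to the dtype the checkpoint was saved in.
        return config_dtype if config_dtype is not None else checkpoint_dtype
    # An explicitly requested dtype is used as given.
    return requested


print(resolve_dtype(None, "float16", "float16"))        # float32
print(resolve_dtype("auto", "float16", "float16"))      # float16
print(resolve_dtype("bfloat16", "float16", "float16"))  # bfloat16
```

The first call shows why the dtype of the online weights is mostly irrelevant: without `torch_dtype`, the result is `float32` regardless of how the checkpoint was stored.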
Training the model in `float16` is not recommended and is known to produce `nan`; as such, the model should be trained in `bfloat16`.
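The text does not say why `float16` training produces `nan`. One commonly cited factor, stated here as an assumption, is the much smaller numeric range of `float16` compared with `bfloat16`. The sketch below computes the largest finite value of each format from its bit layout (5 exponent bits and 10 mantissa bits for `float16`, 8 and 7 for `bfloat16`); `max_finite` is a hypothetical helper.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent code is reserved for inf/nan, so the largest
    # usable exponent code is 2**exponent_bits - 2.
    max_exponent = (2 ** exponent_bits - 2) - bias
    return (2 - 2 ** -mantissa_bits) * 2.0 ** max_exponent


FLOAT16_MAX = max_finite(5, 10)   # 65504.0
BFLOAT16_MAX = max_finite(8, 7)   # roughly 3.39e38

print(FLOAT16_MAX)
print(f"{BFLOAT16_MAX:.3e}")
```

Any intermediate value above 65504 overflows to infinity in `float16`, and arithmetic on infinities can then yield `nan`.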
Tips:
Weights for the Llama2 models can be obtained by filling out
The architecture is very similar to the first Llama, with the addition of Grouped Query Attention (GQA) following this
Setting `config.pretraining_tp` to a value different than 1 will activate the more accurate but slower computation of the linear layers, which should better match the original logits.
The original model uses `pad_id = -1`, which means that there is no padding token. We can’t use the same logic here, so make sure to add a padding token using `tokenizer.add_special_tokens({"pad_token":"<pad>"})` and resize the token embedding accordingly. You should also set `model.config.pad_token_id`. The `embed_tokens` layer of the model is initialized with `self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.config.padding_idx)`, which makes sure that encoding the padding token will output zeros, so passing it when initializing is recommended.
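The padding-token tip can be collected into one helper. This is a minimal sketch: `add_pad_token` is a hypothetical name, and it only assumes that `tokenizer` and `model` are already-loaded objects exposing the methods mentioned in the tip (`add_special_tokens`, `resize_token_embeddings`, `config.pad_token_id`).

```python
def add_pad_token(tokenizer, model, pad_token: str = "<pad>") -> int:
    """Register a padding token and keep the model in sync with the tokenizer."""
    # Add the new special token to the tokenizer's vocabulary.
    tokenizer.add_special_tokens({"pad_token": pad_token})
    # Grow the embedding matrix so the new token id has a row of its own.
    model.resize_token_embeddings(len(tokenizer))
    # Tell the model which id is the padding id.
    model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer.pad_token_id
```

Calling it once right after loading, before any batching or fine-tuning, keeps the vocabulary size and the embedding size aligned.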
After filling out the form and gaining access to the model checkpoints, you should be able to use the already converted checkpoints. Otherwise, if you are converting your own model, feel free to use the conversion script. The script can be called with the following (example) command:
After conversion, the model and tokenizer can be loaded via:
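A minimal loading sketch, assuming the conversion wrote its files to a hypothetical `/output/path` directory and that the `transformers` package is installed. The wrapper function `load_converted_llama` is an illustrative name, not part of the library.

```python
def load_converted_llama(output_dir: str):
    """Load the tokenizer and causal-LM model from a converted checkpoint folder."""
    # Imported lazily so this module can be imported without transformers.
    from transformers import LlamaForCausalLM, LlamaTokenizer

    tokenizer = LlamaTokenizer.from_pretrained(output_dir)
    model = LlamaForCausalLM.from_pretrained(output_dir)
    return tokenizer, model


# Example call (hypothetical path):
# tokenizer, model = load_converted_llama("/output/path")
```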
Note that executing the script requires enough CPU RAM to host the whole model in float16 precision (even if the biggest versions come in several checkpoints, each one contains a part of each weight of the model, so we need to load them all in RAM). For the 70B model, that is roughly 140GB of RAM (70 billion parameters at 2 bytes each).
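The RAM requirement follows from simple arithmetic: each float16 parameter takes 2 bytes. The sketch below uses nominal parameter counts for the smallest and largest sizes named in the abstract; real checkpoints differ slightly, and the estimate ignores any overhead beyond the weights themselves. `float16_ram_gb` is a hypothetical helper.

```python
def float16_ram_gb(num_parameters: float) -> float:
    """Approximate RAM in GB (10**9 bytes) to hold the weights in float16."""
    bytes_per_parameter = 2  # float16 uses 16 bits per value
    return num_parameters * bytes_per_parameter / 1e9


for name, count in [("7B", 7e9), ("70B", 70e9)]:
    print(f"{name}: ~{float16_ram_gb(count):.0f} GB")
```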
A list of official BOINC AI and community (indicated by 🌎) resources to help you get started with LLaMA2. If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
Text Generation
Text Classification
⚗️ Optimization
⚡️ Inference
🚀 Deploy
( vocab_size = 32000, hidden_size = 4096, intermediate_size = 11008, num_hidden_layers = 32, num_attention_heads = 32, num_key_value_heads = None, hidden_act = 'silu', max_position_embeddings = 2048, initializer_range = 0.02, rms_norm_eps = 1e-06, use_cache = True, pad_token_id = None, bos_token_id = 1, eos_token_id = 2, pretraining_tp = 1, tie_word_embeddings = False, rope_theta = 10000.0, rope_scaling = None, attention_bias = False, **kwargs )
Parameters
hidden_size (`int`, optional, defaults to 4096) — Dimension of the hidden representations.
intermediate_size (`int`, optional, defaults to 11008) — Dimension of the MLP representations.
num_hidden_layers (`int`, optional, defaults to 32) — Number of hidden layers in the Transformer decoder.
num_attention_heads (`int`, optional, defaults to 32) — Number of attention heads for each attention layer in the Transformer decoder.
num_key_value_heads (`int`, optional) — The number of key/value heads used to implement Grouped Query Attention. If `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA); if `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out [this paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, it defaults to `num_attention_heads`.
hidden_act (`str` or `function`, optional, defaults to `"silu"`) — The non-linear activation function (function or string) in the decoder.
max_position_embeddings (`int`, optional, defaults to 2048) — The maximum sequence length that this model might ever be used with. Llama 1 supports up to 2048 tokens, Llama 2 up to 4096, CodeLlama up to 16384.
initializer_range (`float`, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
rms_norm_eps (`float`, optional, defaults to 1e-06) — The epsilon used by the RMS normalization layers.
use_cache (`bool`, optional, defaults to `True`) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if `config.is_decoder=True`.
tie_word_embeddings (`bool`, optional, defaults to `False`) — Whether to tie weight embeddings.
rope_theta (`float`, optional, defaults to 10000.0) — The base period of the RoPE embeddings.
attention_bias (`bool`, defaults to `False`) — Whether to use a bias in the query, key, value and output projection layers during self-attention.
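The `num_key_value_heads` rule and the mean-pooling conversion can be illustrated with plain lists. This is a sketch under stated assumptions: heads are represented as flat lists of floats, consecutive heads are assumed to form a group, and both function names are hypothetical.

```python
from typing import List


def attention_variant(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Name the attention variant implied by the two head counts."""
    if num_key_value_heads == num_attention_heads:
        return "MHA"
    if num_key_value_heads == 1:
        return "MQA"
    return "GQA"


def mean_pool_heads(heads: List[List[float]], num_key_value_heads: int) -> List[List[float]]:
    """Average consecutive groups of heads down to num_key_value_heads heads."""
    if len(heads) % num_key_value_heads != 0:
        raise ValueError("number of heads must be divisible by num_key_value_heads")
    group_size = len(heads) // num_key_value_heads
    pooled = []
    for g in range(num_key_value_heads):
        group = heads[g * group_size:(g + 1) * group_size]
        # Element-wise mean across the heads of this group.
        pooled.append([sum(column) / group_size for column in zip(*group)])
    return pooled


heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(attention_variant(32, 8))   # GQA
print(mean_pool_heads(heads, 2))  # [[2.0, 3.0], [6.0, 7.0]]
```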
Example —
( vocab_file, unk_token = '<unk>', bos_token = '<s>', eos_token = '</s>', pad_token = None, sp_model_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None, add_bos_token = True, add_eos_token = False, clean_up_tokenization_spaces = False, use_default_system_prompt = True, spaces_between_special_tokens = False, legacy = None, **kwargs )
Parameters
vocab_file (`str`) — Path to the vocabulary file.
legacy (`bool`, optional) — Whether or not the `legacy` behavior of the tokenizer should be used. Legacy is before the merge of #24622 and #25224, which include fixes to properly handle tokens that appear after special tokens. A simple example:
`legacy=True`:
Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding. The default padding token is unset as there is no padding token in the original model.
build_inputs_with_special_tokens
( token_ids_0, token_ids_1 = None )
get_special_tokens_mask
( token_ids_0: typing.List[int], token_ids_1: typing.Optional[typing.List[int]] = None, already_has_special_tokens: bool = False ) → List[int]
Parameters
token_ids_0 (`List[int]`) — List of IDs.
token_ids_1 (`List[int]`, optional) — Optional second list of IDs for sequence pairs.
already_has_special_tokens (`bool`, optional, defaults to `False`) — Whether or not the token list is already formatted with special tokens for the model.
Returns
`List[int]`
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer `prepare_for_model` method.
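A standalone sketch of the mask described above, assuming the tokenizer defaults shown in the constructor signature (`add_bos_token = True`, `add_eos_token = False`). The function is an illustration rather than the tokenizer's own code, and the token ids are made up.

```python
from typing import List, Optional


def special_tokens_mask(token_ids_0: List[int],
                        token_ids_1: Optional[List[int]] = None,
                        add_bos_token: bool = True,
                        add_eos_token: bool = False) -> List[int]:
    """Return 1 where a special token would be inserted, 0 for sequence tokens."""
    bos = [1] if add_bos_token else []
    eos = [1] if add_eos_token else []
    mask = bos + [0] * len(token_ids_0) + eos
    if token_ids_1 is not None:
        # The second sequence gets its own BOS/EOS markers.
        mask += bos + [0] * len(token_ids_1) + eos
    return mask


print(special_tokens_mask([15043, 3186]))                      # [1, 0, 0]
print(special_tokens_mask([15043, 3186], add_eos_token=True))  # [1, 0, 0, 1]
```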
create_token_type_ids_from_sequences
( token_ids_0: typing.List[int], token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]
Parameters
token_ids_0 (`List[int]`) — List of IDs.
token_ids_1 (`List[int]`, optional) — Optional second list of IDs for sequence pairs.
Returns
`List[int]`
Creates a mask from the two sequences passed, to be used in a sequence-pair classification task. An ALBERT sequence pair mask consists of 0s for the first sequence followed by 1s for the second sequence.
If `token_ids_1` is None, only the first portion of the mask (0s) is returned.
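The mask layout can be sketched as follows. It assumes that the special tokens added around each sequence are counted as part of that sequence's segment, and it reuses the `add_bos_token` / `add_eos_token` defaults from the constructor signature; the function name and token ids are illustrative.

```python
from typing import List, Optional


def token_type_ids(token_ids_0: List[int],
                   token_ids_1: Optional[List[int]] = None,
                   add_bos_token: bool = True,
                   add_eos_token: bool = False) -> List[int]:
    """Return 0s for the first sequence and 1s for the optional second one."""
    specials_per_sequence = int(add_bos_token) + int(add_eos_token)
    ids = [0] * (len(token_ids_0) + specials_per_sequence)
    if token_ids_1 is not None:
        ids += [1] * (len(token_ids_1) + specials_per_sequence)
    return ids


print(token_type_ids([10, 11, 12], [20, 21]))  # [0, 0, 0, 0, 1, 1, 1]
print(token_type_ids([10, 11, 12]))            # [0, 0, 0, 0]
```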
save_vocabulary
( save_directory, filename_prefix: typing.Optional[str] = None ) → Tuple(str)
Parameters
save_directory (`str`) — The directory in which to save the vocabulary.
Returns
Tuple(str)
Paths to the files saved.
Save the vocabulary and special tokens file to a directory.
( vocab_file = None, tokenizer_file = None, clean_up_tokenization_spaces = False, unk_token = '<unk>', bos_token = '<s>', eos_token = '</s>', add_bos_token = True, add_eos_token = False, use_default_system_prompt = True, **kwargs )
Parameters
clean_up_tokenization_spaces (`bool`, optional, defaults to `False`) — Whether to clean up spaces after decoding; cleanup consists of removing potential artifacts like extra spaces.
bos_token (`str`, optional, defaults to `"<s>"`) — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
eos_token (`str`, optional, defaults to `"</s>"`) — The end of sequence token.
unk_token (`str`, optional, defaults to `"<unk>"`) — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.
This uses notably ByteFallback and no normalization.
build_inputs_with_special_tokens
( token_ids_0, token_ids_1 = None )
get_special_tokens_mask
( token_ids_0: typing.List[int], token_ids_1: typing.Optional[typing.List[int]] = None, already_has_special_tokens: bool = False ) → A list of integers in the range [0, 1]
Parameters
token_ids_0 (`List[int]`) — List of ids of the first sequence.
token_ids_1 (`List[int]`, optional) — List of ids of the second sequence.
already_has_special_tokens (`bool`, optional, defaults to `False`) — Whether or not the token list is already formatted with special tokens for the model.
Returns
A list of integers in the range [0, 1]
1 for a special token, 0 for a sequence token.
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer `prepare_for_model` or `encode_plus` methods.
create_token_type_ids_from_sequences
( token_ids_0: typing.List[int], token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]
Parameters
token_ids_0 (`List[int]`) — The first tokenized sequence.
token_ids_1 (`List[int]`, optional) — The second tokenized sequence.
Returns
List[int]
The token type ids.
Should be overridden in a subclass if the model has a special way of building those.
update_post_processor
( )
Updates the underlying post processor with the current `bos_token` and `eos_token`.
save_vocabulary
( save_directory: str, filename_prefix: typing.Optional[str] = None )
( config: LlamaConfig )
Parameters
Transformer decoder consisting of `config.num_hidden_layers` layers. Each layer is a `LlamaDecoderLayer`.
forward
( input_ids: LongTensor = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None )
Parameters
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
1 for tokens that are not masked,
0 for tokens that are masked.
If `past_key_values` is used, optionally only the last `input_ids` have to be input (see `past_key_values`).
1 indicates the head is not masked,
0 indicates the head is masked.
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
past_key_values (`tuple(tuple(torch.FloatTensor))`, optional, returned when `use_cache=True` is passed or when `config.use_cache=True`) — Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)` and 2 additional tensors of shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don’t have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, optional) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model’s internal embedding lookup matrix.
use_cache (`bool`, optional) — If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).
output_attentions (`bool`, optional) — Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
Although the recipe for forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
( config )
forward
Parameters
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
1 for tokens that are not masked,
0 for tokens that are masked.
If `past_key_values` is used, optionally only the last `input_ids` have to be input (see `past_key_values`).
1 indicates the head is not masked,
0 indicates the head is masked.
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
past_key_values (`tuple(tuple(torch.FloatTensor))`, optional, returned when `use_cache=True` is passed or when `config.use_cache=True`) — Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)` and 2 additional tensors of shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don’t have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, optional) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model’s internal embedding lookup matrix.
use_cache (`bool`, optional) — If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).
output_attentions (`bool`, optional) — Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
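The effect of the `-100` label can be shown with a small pure-Python loss. This is a sketch of next-token cross-entropy for a single sequence, assuming the logits at position t are scored against the label at position t + 1; the function name and the toy inputs are illustrative.

```python
import math
from typing import List

IGNORE_INDEX = -100


def causal_lm_loss(logits: List[List[float]], labels: List[int]) -> float:
    """Mean next-token cross-entropy, skipping positions labelled IGNORE_INDEX."""
    total, count = 0.0, 0
    # Logits at position t are compared with the label at position t + 1.
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(step_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_norm - step_logits[target]
        count += 1
    return total / count if count else float("nan")


logits = [[0.0, 0.0, 0.0, 0.0]] * 3  # uniform scores over a 4-token vocabulary
labels = [-100, 2, -100]             # only one position contributes
loss = causal_lm_loss(logits, labels)
print(loss)  # equals log(4) for uniform scores
```

Setting prompt or padding positions to `-100` therefore removes them from the average without changing the shape of `labels`.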
Returns
loss (`torch.FloatTensor` of shape `(1,)`, optional, returned when `labels` is provided) — Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, optional, returned when `use_cache=True` is passed or when `config.use_cache=True`) — Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Although the recipe for forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
( config )
Parameters
The LLaMA Model transformer with a sequence classification head on top (linear layer).
Since it does classification on the last token, it needs to know the position of the last token. If a `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (takes the last value in each row of the batch).
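The last-token rule can be sketched in plain Python. This is an illustration of the described behavior, not the model's tensor code: `last_token_index` is a hypothetical helper, the pad id `0` in the example is made up, and returning `-1` for an all-padding row is a choice of this sketch.

```python
from typing import List, Optional


def last_token_index(input_ids: List[int], pad_token_id: Optional[int] = None) -> int:
    """Index of the token whose hidden state would be classified."""
    if pad_token_id is None:
        # Without a padding id there is nothing to skip: take the last position.
        return len(input_ids) - 1
    # Scan from the right for the last token that is not padding.
    for index in range(len(input_ids) - 1, -1, -1):
        if input_ids[index] != pad_token_id:
            return index
    return -1  # the row contains only padding


batch = [[1, 5, 6, 0, 0], [1, 7, 8, 9, 4]]
print([last_token_index(row, pad_token_id=0) for row in batch])  # [2, 4]
print([last_token_index(row) for row in batch])                  # [4, 4]
```

The second call shows why a missing `pad_token_id` matters for padded batches: the classifier would read a padding position in the first row.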
forward
( input_ids: LongTensor = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None )
Parameters
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
1 for tokens that are not masked,
0 for tokens that are masked.
If `past_key_values` is used, optionally only the last `input_ids` have to be input (see `past_key_values`).
1 indicates the head is not masked,
0 indicates the head is masked.
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
past_key_values (`tuple(tuple(torch.FloatTensor))`, optional, returned when `use_cache=True` is passed or when `config.use_cache=True`) — Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)` and 2 additional tensors of shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don’t have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, optional) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model’s internal embedding lookup matrix.
use_cache (`bool`, optional) — If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).
output_attentions (`bool`, optional) — Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
labels (`torch.LongTensor` of shape `(batch_size,)`, optional) — Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., config.num_labels - 1]`. If `config.num_labels == 1`, a regression loss is computed (Mean-Square loss); if `config.num_labels > 1`, a classification loss is computed (Cross-Entropy).
Although the recipe for forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
The LLaMA tokenizer is a BPE model based on sentencepiece. One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of the word (e.g. “Banana”), the tokenizer does not prepend the prefix space to the string.
This model was contributed by with contributions from . The code of the implementation in BOINC AI is based on GPT-NeoX . The original code of the authors can be found .
, a blog post about Llama 2 and how to use it with 🌎Transformers and 🌎 PEFT.
, a compilation of relevant resources to learn about LLaMA 2 and how to get started quickly.
A on how to fine-tune Llama 2 in Google Colab using QLoRA and 4-bit precision. 🌎
A on how to fine-tune the “Llama-v2-7b-guanaco” model with 4-bit QLoRA and generate Q&A datasets from PDFs. 🌎
A on how to fine-tune the Llama 2 model with QLoRa, TRL, and Korean text classification dataset. 🌎🇰🇷
, a guide to using the TRL library’s DPO method to fine tune Llama 2 on a specific dataset.
, a guide to training Llama 2 to generate instructions from inputs, transforming the model from instruction-following to instruction-giving.
A on how to fine-tune the Llama 2 model on a personal computer using QLoRa and TRL. 🌎
A on how to quantize the Llama 2 model using GPTQ from the AutoGPTQ library. 🌎
A on how to run the Llama 2 Chat Model with 4-bit quantization on a local computer or Google Colab. 🌎
, a complete guide from setup to QLoRA fine-tuning and deployment on Amazon SageMaker.
, a guide on using BOINCAI’s LLM DLC container for secure and scalable deployment.
vocab_size (`int`, optional, defaults to 32000) — Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling
pretraining_tp (`int`, optional, defaults to `1`) — Experimental feature. Tensor parallelism rank used during pretraining. Please refer to to understand more about it. This value is necessary to ensure exact reproducibility of the pretraining results. Please refer to .
rope_scaling (`Dict`, optional) — Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is `{"type": strategy name, "factor": scaling factor}`. When using this flag, don’t update `max_position_embeddings` to the expected new maximum. See the following thread for more information on how these scaling strategies behave: . This is an experimental feature, subject to breaking API changes in future versions.
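The constraints on `rope_scaling` stated above can be expressed as a small validator. This is a sketch derived only from the parameter description (two keys, a strategy of `linear` or `dynamic`, a float factor greater than 1); the function name is hypothetical and the library's own checks may differ.

```python
from typing import Any, Dict, Optional


def validate_rope_scaling(rope_scaling: Optional[Dict[str, Any]]) -> None:
    """Raise ValueError if rope_scaling does not match the documented format."""
    if rope_scaling is None:
        return  # scaling disabled
    if not isinstance(rope_scaling, dict) or set(rope_scaling) != {"type", "factor"}:
        raise ValueError('rope_scaling must be a dict with exactly the keys "type" and "factor"')
    if rope_scaling["type"] not in ("linear", "dynamic"):
        raise ValueError('rope_scaling["type"] must be "linear" or "dynamic"')
    factor = rope_scaling["factor"]
    if not isinstance(factor, float) or factor <= 1.0:
        raise ValueError('rope_scaling["factor"] must be a float greater than 1')


validate_rope_scaling({"type": "linear", "factor": 2.0})   # accepted
validate_rope_scaling({"type": "dynamic", "factor": 4.0})  # accepted
```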
This is the configuration class to store the configuration of a . It is used to instantiate an LLaMA model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the LLaMA-7B.
Configuration objects inherit from and can be used to control the model outputs. Read the documentation from for more information.
List of according to the given sequence(s).
vocab_file (str
) — file (generally has a .model extension) that contains the vocabulary necessary to instantiate a tokenizer.
tokenizer_file (str
) — file (generally has a .json extension) that contains everything needed to load the tokenizer.
If you want to change the `bos_token` or the `eos_token`, make sure to specify them when initializing the model, or call `tokenizer.update_post_processor()` to make sure that the post-processing is correctly done (otherwise the values of the first token and final token of an encoded sequence will not be correct). For more details, check out the post-processors documentation.
This tokenizer inherits from which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
Create the token type IDs corresponding to the sequences passed.
config () — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the method to load the model weights. config — LlamaConfig
The bare LLaMA Model outputting raw hidden-states without any specific head on top. This model inherits from . Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
Indices can be obtained using . See and for details.
Indices can be obtained using . See and for details.
If you want to change padding behavior, you should read `modeling_opt._prepare_decoder_attention_mask` and modify to your needs. See diagram 1 in for more information on the default strategy.
return_dict (`bool`, optional) — Whether or not to return a instead of a plain tuple.
The forward method, overrides the `__call__` special method.
( input_ids: LongTensor = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → or tuple(torch.FloatTensor)
Indices can be obtained using . See and for details.
Indices can be obtained using . See and for details.
If you want to change padding behavior, you should read `modeling_opt._prepare_decoder_attention_mask` and modify to your needs. See diagram 1 in for more information on the default strategy.
return_dict (`bool`, optional) — Whether or not to return a instead of a plain tuple.
or `tuple(torch.FloatTensor)`
A or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration () and inputs.
The forward method, overrides the `__call__` special method.
config () — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the method to load the model weights.
uses the last token in order to do the classification, as other causal models (e.g. GPT-2) do.
This model inherits from . Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
Indices can be obtained using . See and for details.
Indices can be obtained using . See and for details.
If you want to change padding behavior, you should read `modeling_opt._prepare_decoder_attention_mask` and modify to your needs. See diagram 1 in for more information on the default strategy.
return_dict (`bool`, optional) — Whether or not to return a instead of a plain tuple.
The forward method, overrides the `__call__` special method.