> For the complete documentation index, see [llms.txt](https://boinc-ai.gitbook.io/transformers/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://boinc-ai.gitbook.io/transformers/api/models/text-models/llama2.md).

# LLama2

## Llama2

### Overview

The Llama2 model was proposed in [LLaMA: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/) by Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushka rMishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing EllenTan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. It is a collection of foundation language models ranging from 7B to 70B parameters, with checkpoints finetuned for chat application!

The abstract from the paper is the following:

*In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.*

Checkout all Llama2 models [here](https://huggingface.co/models?search=llama2)

The `Llama2` models were trained using `bfloat16`, but the original inference uses `float16. The checkpoints uploaded on the hub use` torch\_dtype = ‘float16’`which will be used by the`AutoModel`API to cast the checkpoints from`torch.float32`to`torch.float16\`.

The `dtype` of the online weights is mostly irrelevant, unless you are using `torch_dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`. The reason is that the model will first be downloaded ( using the `dtype` of the checkpoints online) then it will be casted to the default `dtype` of `torch` (becomes `torch.float32`) and finally, if there is a `torch_dtype` provided in the config, it will be used.

Training the model in `float16` is not recommended and known to produce `nan`, as such the model should be trained in `bfloat16`.

Tips:

* Weights for the Llama2 models can be obtained by filling out [this form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)
* The architecture is very similar to the first Llama, with the addition of Grouped Query Attention (GQA) following this [paper](https://arxiv.org/pdf/2305.13245.pdf)
* Setting `config.pretraining_tp` to a value different than 1 will activate the more accurate but slower computation of the linear layers, which should better match the original logits.
* The original model uses `pad_id = -1` which means that there is no padding token. We can’t have the same logic, make sure to add a padding token using `tokenizer.add_special_tokens({"pad_token":"<pad>"})` and resize the token embedding accordingly. You should also set the `model.config.pad_token_id`. The `embed_tokens` layer of the model is initialized with `self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.config.padding_idx)`, which makes sure that encoding the padding token will output zeros, so passing it when initializing is recommended.
* After filling out the form and gaining access to the model checkpoints, you should be able to use the already converted checkpoints. Otherwise, if you are converting your own model, feel free to use the [conversion script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py). The script can be called with the following (example) command:

Copied

```
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path
```

* After conversion, the model and tokenizer can be loaded via:

Copied

```
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("/output/path")
model = LlamaForCausalLM.from_pretrained("/output/path")
```

Note that executing the script requires enough CPU RAM to host the whole model in float16 precision (even if the biggest versions come in several checkpoints they each contain a part of each weight of the model, so we need to load them all in RAM). For the 75B model, it’s thus 145GB of RAM needed.

* The LLaMA tokenizer is a BPE model based on [sentencepiece](https://github.com/google/sentencepiece). One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of the word (e.g. “Banana”), the tokenizer does not prepend the prefix space to the string.

This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ) with contributions from [Lysandre Debut](https://huggingface.co/lysandre). The code of the implementation in BOINC AI is based on GPT-NeoX [here](https://github.com/EleutherAI/gpt-neox). The original code of the authors can be found [here](https://github.com/facebookresearch/llama).

### Resources

A list of official BOINC AI and community (indicated by 🌎) resources to help you get started with LLaMA2. If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

* [Llama 2 is here - get it on BOINC AI](https://huggingface.co/blog/llama2), a blog post about Llama 2 and how to use it with 🌎Transformers and 🌎 PEFT.
* [LLaMA 2 - Every Resource you need](https://www.philschmid.de/llama-2), a compilation of relevant resources to learn about LLaMA 2 and how to get started quickly.

Text Generation

* A [notebook](https://colab.research.google.com/drive/1PEQyJO1-f6j0S_XJ8DV50NkpzasXkrzd?usp=sharing) on how to fine-tune Llama 2 in Google Colab using QLoRA and 4-bit precision. 🌎
* A [notebook](https://colab.research.google.com/drive/134o_cXcMe_lsvl15ZE_4Y75Kstepsntu?usp=sharing) on how to fine-tune the “Llama-v2-7b-guanaco” model with 4-bit QLoRA and generate Q\&A datasets from PDFs. 🌎

Text Classification

* A [notebook](https://colab.research.google.com/drive/1ggaa2oRFphdBmqIjSEbnb_HGkcIRC2ZB?usp=sharing) on how to fine-tune the Llama 2 model with QLoRa, TRL, and Korean text classification dataset. 🌎🇰🇷

⚗️ Optimization

* [Fine-tune Llama 2 with DPO](https://huggingface.co/blog/dpo-trl), a guide to using the TRL library’s DPO method to fine tune Llama 2 on a specific dataset.
* [Extended Guide: Instruction-tune Llama 2](https://www.philschmid.de/instruction-tune-llama-2), a guide to training Llama 2 to generate instructions from inputs, transforming the model from instruction-following to instruction-giving.
* A [notebook](https://colab.research.google.com/drive/1SYpgFpcmtIUzdE7pxqknrM4ArCASfkFQ?usp=sharing) on how to fine-tune the Llama 2 model on a personal computer using QLoRa and TRL. 🌎

⚡️ Inference

* A [notebook](https://colab.research.google.com/drive/1TC56ArKerXUpbgRy5vM3woRsbTEVNq7h?usp=sharing) on how to quantize the Llama 2 model using GPTQ from the AutoGPTQ library. 🌎
* A [notebook](https://colab.research.google.com/drive/1X1z9Q6domMKl2CnEM0QGHNwidLfR4dW2?usp=sharing) on how to run the Llama 2 Chat Model with 4-bit quantization on a local computer or Google Colab. 🌎

🚀 Deploy

* [Fine-tune LLaMA 2 (7-70B) on Amazon SageMaker](https://www.philschmid.de/sagemaker-llama2-qlora), a complete guide from setup to QLoRA fine-tuning and deployment on Amazon SageMaker.
* [Deploy Llama 2 7B/13B/70B on Amazon SageMaker](https://www.philschmid.de/sagemaker-llama-llm), a guide on using BOINCAI’s LLM DLC container for secure and scalable deployment.

### LlamaConfig

#### class transformers.LlamaConfig

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/llama/configuration_llama.py#L31)

( vocab\_size = 32000hidden\_size = 4096intermediate\_size = 11008num\_hidden\_layers = 32num\_attention\_heads = 32num\_key\_value\_heads = Nonehidden\_act = 'silu'max\_position\_embeddings = 2048initializer\_range = 0.02rms\_norm\_eps = 1e-06use\_cache = Truepad\_token\_id = Nonebos\_token\_id = 1eos\_token\_id = 2pretraining\_tp = 1tie\_word\_embeddings = Falserope\_theta = 10000.0rope\_scaling = Noneattention\_bias = False\*\*kwargs )

Parameters

* **vocab\_size** (`int`, *optional*, defaults to 32000) — Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling [LlamaModel](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/llama2#transformers.LlamaModel)
* **hidden\_size** (`int`, *optional*, defaults to 4096) — Dimension of the hidden representations.
* **intermediate\_size** (`int`, *optional*, defaults to 11008) — Dimension of the MLP representations.
* **num\_hidden\_layers** (`int`, *optional*, defaults to 32) — Number of hidden layers in the Transformer encoder.
* **num\_attention\_heads** (`int`, *optional*, defaults to 32) — Number of attention heads for each attention layer in the Transformer encoder.
* **num\_key\_value\_heads** (`int`, *optional*) — This is the number of key\_value heads that should be used to implement Grouped Query Attention. If `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if `num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by meanpooling all the original heads within that group. For more details checkout [this paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to` num\_attention\_heads\`.
* **pretraining\_tp** (`int`, *optional*, defaults to `1`) — Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is necessary to ensure exact reproducibility of the pretraining results. Please refer to [this issue](https://github.com/pytorch/pytorch/issues/76232).
* **hidden\_act** (`str` or `function`, *optional*, defaults to `"silu"`) — The non-linear activation function (function or string) in the decoder.
* **max\_position\_embeddings** (`int`, *optional*, defaults to 2048) — The maximum sequence length that this model might ever be used with. Llama 1 supports up to 2048 tokens, Llama 2 up to 4096, CodeLlama up to 16384.
* **initializer\_range** (`float`, *optional*, defaults to 0.02) — The standard deviation of the truncated\_normal\_initializer for initializing all weight matrices.
* **rms\_norm\_eps** (`float`, *optional*, defaults to 1e-12) — The epsilon used by the rms normalization layers.
* **use\_cache** (`bool`, *optional*, defaults to `True`) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if `config.is_decoder=True`.
* **tie\_word\_embeddings(`bool`,** *optional*, defaults to `False`) — Whether to tie weight embeddings
* **rope\_theta** (`float`, *optional*, defaults to 10000.0) — The base period of the RoPE embeddings.
* **rope\_scaling** (`Dict`, *optional*) — Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling strategies: linear and dynamic. Their scaling factor must be an float greater than 1. The expected format is `{"type": strategy name, "factor": scaling factor}`. When using this flag, don’t update `max_position_embeddings` to the expected new maximum. See the following thread for more information on how these scaling strategies behave: <https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/>. This is an experimental feature, subject to breaking API changes in future versions.
* **attention\_bias** (`bool`, defaults to `False`) — Whether to use a bias in the query, key, value and output projection layers during self-attention.

  Example —

This is the configuration class to store the configuration of a [LlamaModel](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/llama2#transformers.LlamaModel). It is used to instantiate an LLaMA model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the LLaMA-7B.

Configuration objects inherit from [PretrainedConfig](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/configuration#transformers.PretrainedConfig) and can be used to control the model outputs. Read the documentation from [PretrainedConfig](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/configuration#transformers.PretrainedConfig) for more information.

Copied

```
>>> from transformers import LlamaModel, LlamaConfig

>>> # Initializing a LLaMA llama-7b style configuration
>>> configuration = LlamaConfig()

>>> # Initializing a model from the llama-7b style configuration
>>> model = LlamaModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```

### LlamaTokenizer

#### class transformers.LlamaTokenizer

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/llama/tokenization_llama.py#L66)

( vocab\_fileunk\_token = '\<unk>'bos\_token = '\<s>'eos\_token = '\</s>'pad\_token = Nonesp\_model\_kwargs: typing.Union\[typing.Dict\[str, typing.Any], NoneType] = Noneadd\_bos\_token = Trueadd\_eos\_token = Falseclean\_up\_tokenization\_spaces = Falseuse\_default\_system\_prompt = Truespaces\_between\_special\_tokens = Falselegacy = None\*\*kwargs )

Parameters

* **vocab\_file** (`str`) — Path to the vocabulary file.
* **legacy** (`bool`, *optional*) — Whether or not the `legacy` behavior of the tokenizer should be used. Legacy is before the merge of #24622 and #25224 which includes fixes to properly handle tokens that appear after special tokens. A simple example:
  * `legacy=True`:

Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding. The default padding token is unset as there is no padding token in the original model.

**build\_inputs\_with\_special\_tokens**

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/llama/tokenization_llama.py#L296)

( token\_ids\_0token\_ids\_1 = None )

**get\_special\_tokens\_mask**

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/llama/tokenization_llama.py#L307)

( token\_ids\_0: typing.List\[int]token\_ids\_1: typing.Optional\[typing.List\[int]] = Nonealready\_has\_special\_tokens: bool = False ) → `List[int]`

Parameters

* **token\_ids\_0** (`List[int]`) — List of IDs.
* **token\_ids\_1** (`List[int]`, *optional*) — Optional second list of IDs for sequence pairs.
* **already\_has\_special\_tokens** (`bool`, *optional*, defaults to `False`) — Whether or not the token list is already formatted with special tokens for the model.

Returns

`List[int]`

A list of integers in the range \[0, 1]: 1 for a special token, 0 for a sequence token.

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer `prepare_for_model` method.

**create\_token\_type\_ids\_from\_sequences**

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/llama/tokenization_llama.py#L344)

( token\_ids\_0: typing.List\[int]token\_ids\_1: typing.Optional\[typing.List\[int]] = None ) → `List[int]`

Parameters

* **token\_ids\_0** (`List[int]`) — List of ids.
* **token\_ids\_1** (`List[int]`, *optional*) — Optional second list of IDs for sequence pairs.

Returns

`List[int]`

List of [token type IDs](https://huggingface.co/docs/transformers/glossary#token-type-ids) according to the given sequence(s).

Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT

sequence pair mask has the following format:

Copied

```
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |
```

if token\_ids\_1 is None, only returns the first portion of the mask (0s).

**save\_vocabulary**

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/llama/tokenization_llama.py#L269)

( save\_directoryfilename\_prefix: typing.Optional\[str] = None ) → `Tuple(str)`

Parameters

* **save\_directory** (`str`) — The directory in which to save the vocabulary.

Returns

`Tuple(str)`

Paths to the files saved.

Save the vocabulary and special tokens file to a directory.

### LlamaTokenizerFast

#### class transformers.LlamaTokenizerFast

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/llama/tokenization_llama_fast.py#L57)

( vocab\_file = Nonetokenizer\_file = Noneclean\_up\_tokenization\_spaces = Falseunk\_token = '\<unk>'bos\_token = '\<s>'eos\_token = '\</s>'add\_bos\_token = Trueadd\_eos\_token = Falseuse\_default\_system\_prompt = True\*\*kwargs )

Parameters

* **vocab\_file** (`str`) — [SentencePiece](https://github.com/google/sentencepiece) file (generally has a .model extension) that contains the vocabulary necessary to instantiate a tokenizer.
* **tokenizer\_file** (`str`) — [tokenizers](https://github.com/huggingface/tokenizers) file (generally has a .json extension) that contains everything needed to load the tokenizer.
* **clean\_up\_tokenization\_spaces** (`str`, *optional*, defaults to `False`) — Wether to cleanup spaces after decoding, cleanup consists in removing potential artifacts like extra spaces.
* **bos\_token** (`str`, *optional*, defaults to `"<s>"`) — The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
* **eos\_token** (`str`, *optional*, defaults to `"</s>"`) — The end of sequence token.
* **unk\_token** (`str`, *optional*, defaults to `"<unk>"`) — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.

This uses notably ByteFallback and no normalization.

Copied

```
from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
tokenizer.encode("Hello this is a test")
>>> [1, 15043, 445, 338, 263, 1243]
```

If you want to change the `bos_token` or the `eos_token`, make sure to specify them when initializing the model, or call `tokenizer.update_post_processor()` to make sure that the post-processing is correctly done (otherwise the values of the first token and final token of an encoded sequence will not be correct). For more details, checkout \[post-processors] ([https://boincai.com/docs/tokenizers/api/post-processors](https://huggingface.co/docs/tokenizers/api/post-processors)) documentation.

This tokenizer inherits from [PreTrainedTokenizerFast](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast) which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

**build\_inputs\_with\_special\_tokens**

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/llama/tokenization_llama_fast.py#L255)

( token\_ids\_0token\_ids\_1 = None )

**get\_special\_tokens\_mask**

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/tokenization_utils_base.py#L3754)

( token\_ids\_0: typing.List\[int]token\_ids\_1: typing.Optional\[typing.List\[int]] = Nonealready\_has\_special\_tokens: bool = False ) → A list of integers in the range \[0, 1]

Parameters

* **token\_ids\_0** (`List[int]`) — List of ids of the first sequence.
* **token\_ids\_1** (`List[int]`, *optional*) — List of ids of the second sequence.
* **already\_has\_special\_tokens** (`bool`, *optional*, defaults to `False`) — Whether or not the token list is already formatted with special tokens for the model.

Returns

A list of integers in the range \[0, 1]

1 for a special token, 0 for a sequence token.

Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer `prepare_for_model` or `encode_plus` methods.

**create\_token\_type\_ids\_from\_sequences**

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/tokenization_utils_base.py#L3289)

( token\_ids\_0: typing.List\[int]token\_ids\_1: typing.Optional\[typing.List\[int]] = None ) → `List[int]`

Parameters

* **token\_ids\_0** (`List[int]`) — The first tokenized sequence.
* **token\_ids\_1** (`List[int]`, *optional*) — The second tokenized sequence.

Returns

`List[int]`

The token type ids.

Create the token type IDs corresponding to the sequences passed. [What are token type IDs?](https://huggingface.co/docs/transformers/glossary#token-type-ids)

Should be overridden in a subclass if the model has a special way of building those.

**update\_post\_processor**

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/llama/tokenization_llama_fast.py#L142)

( )

Updates the underlying post processor with the current `bos_token` and `eos_token`.

**save\_vocabulary**

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/llama/tokenization_llama_fast.py#L182)

( save\_directory: strfilename\_prefix: typing.Optional\[str] = None )

### LlamaModel

#### class transformers.LlamaModel

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/llama/modeling_llama.py#L776)

( config: LlamaConfig )

Parameters

* **config** ([LlamaConfig](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/llama2#transformers.LlamaConfig)) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from\_pretrained()](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights. config — LlamaConfig

The bare LLaMA Model outputting raw hidden-states without any specific head on top. This model inherits from [PreTrainedModel](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

Transformer decoder consisting of *config.num\_hidden\_layers* layers. Each layer is a `LlamaDecoderLayer`

**forward**

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/llama/modeling_llama.py#L827)

( input\_ids: LongTensor = Noneattention\_mask: typing.Optional\[torch.Tensor] = Noneposition\_ids: typing.Optional\[torch.LongTensor] = Nonepast\_key\_values: typing.Optional\[typing.List\[torch.FloatTensor]] = Noneinputs\_embeds: typing.Optional\[torch.FloatTensor] = Noneuse\_cache: typing.Optional\[bool] = Noneoutput\_attentions: typing.Optional\[bool] = Noneoutput\_hidden\_states: typing.Optional\[bool] = Nonereturn\_dict: typing.Optional\[bool] = None )

Parameters

* **input\_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

  Indices can be obtained using [AutoTokenizer](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast.encode) and [PreTrainedTokenizer.**call**()](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/vits#transformers.VitsTokenizer.__call__) for details.

  [What are input IDs?](https://huggingface.co/docs/transformers/glossary#input-ids)
* **attention\_mask** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

  * 1 for tokens that are **not masked**,
  * 0 for tokens that are **masked**.

  [What are attention masks?](https://huggingface.co/docs/transformers/glossary#attention-mask)

  Indices can be obtained using [AutoTokenizer](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast.encode) and [PreTrainedTokenizer.**call**()](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/vits#transformers.VitsTokenizer.__call__) for details.

  If `past_key_values` is used, optionally only the last `input_ids` have to be input (see `past_key_values`).

  If you want to change padding behavior, you should read `modeling_opt._prepare_decoder_attention_mask` and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more information on the default strategy.

  * 1 indicates the head is **not masked**,
  * 0 indicates the head is **masked**.
* **position\_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.

  [What are position IDs?](https://huggingface.co/docs/transformers/glossary#position-ids)
* **past\_key\_values** (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`) — Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.

  Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.

  If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don’t have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
* **inputs\_embeds** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model’s internal embedding lookup matrix.
* **use\_cache** (`bool`, *optional*) — If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).
* **output\_attentions** (`bool`, *optional*) — Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
* **output\_hidden\_states** (`bool`, *optional*) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
* **return\_dict** (`bool`, *optional*) — Whether or not to return a [ModelOutput](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/output#transformers.utils.ModelOutput) instead of a plain tuple.

The [LlamaModel](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/llama2#transformers.LlamaModel) forward method, overrides the `__call__` special method.

Although the recipe for forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

### LlamaForCausalLM

#### class transformers.LlamaForCausalLM

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/llama/modeling_llama.py#L960)

( config )

**forward**

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/llama/modeling_llama.py#L990)

( input\_ids: LongTensor = Noneattention\_mask: typing.Optional\[torch.Tensor] = Noneposition\_ids: typing.Optional\[torch.LongTensor] = Nonepast\_key\_values: typing.Optional\[typing.List\[torch.FloatTensor]] = Noneinputs\_embeds: typing.Optional\[torch.FloatTensor] = Nonelabels: typing.Optional\[torch.LongTensor] = Noneuse\_cache: typing.Optional\[bool] = Noneoutput\_attentions: typing.Optional\[bool] = Noneoutput\_hidden\_states: typing.Optional\[bool] = Nonereturn\_dict: typing.Optional\[bool] = None ) → [transformers.modeling\_outputs.CausalLMOutputWithPast](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/output#transformers.modeling_outputs.CausalLMOutputWithPast) or `tuple(torch.FloatTensor)`

Parameters

* **input\_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

  Indices can be obtained using [AutoTokenizer](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast.encode) and [PreTrainedTokenizer.**call**()](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/vits#transformers.VitsTokenizer.__call__) for details.

  [What are input IDs?](https://huggingface.co/docs/transformers/glossary#input-ids)
* **attention\_mask** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

  * 1 for tokens that are **not masked**,
  * 0 for tokens that are **masked**.

  [What are attention masks?](https://huggingface.co/docs/transformers/glossary#attention-mask)

  Indices can be obtained using [AutoTokenizer](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast.encode) and [PreTrainedTokenizer.**call**()](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/vits#transformers.VitsTokenizer.__call__) for details.

  If `past_key_values` is used, optionally only the last `input_ids` have to be input (see `past_key_values`).

  If you want to change padding behavior, you should read `modeling_opt._prepare_decoder_attention_mask` and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more information on the default strategy.

  * 1 indicates the head is **not masked**,
  * 0 indicates the head is **masked**.
* **position\_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.

  [What are position IDs?](https://huggingface.co/docs/transformers/glossary#position-ids)
* **past\_key\_values** (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`) — Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.

  Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.

  If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don’t have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
* **inputs\_embeds** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model’s internal embedding lookup matrix.
* **use\_cache** (`bool`, *optional*) — If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).
* **output\_attentions** (`bool`, *optional*) — Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
* **output\_hidden\_states** (`bool`, *optional*) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
* **return\_dict** (`bool`, *optional*) — Whether or not to return a [ModelOutput](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/output#transformers.utils.ModelOutput) instead of a plain tuple.

  Args — labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

Returns

[transformers.modeling\_outputs.CausalLMOutputWithPast](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/output#transformers.modeling_outputs.CausalLMOutputWithPast) or `tuple(torch.FloatTensor)`

A [transformers.modeling\_outputs.CausalLMOutputWithPast](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/output#transformers.modeling_outputs.CausalLMOutputWithPast) or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration ([LlamaConfig](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/llama2#transformers.LlamaConfig)) and inputs.

* **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) — Language modeling loss (for next-token prediction).
* **logits** (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
* **past\_key\_values** (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`) — Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`)

  Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
* **hidden\_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

  Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
* **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`.

  Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The [LlamaForCausalLM](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/llama2#transformers.LlamaForCausalLM) forward method, overrides the `__call__` special method.

Although the recipe for forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

Copied

```
>>> from transformers import AutoTokenizer, LlamaForCausalLM

>>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
>>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```

### LlamaForSequenceClassification

#### class transformers.LlamaForSequenceClassification

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/llama/modeling_llama.py#L1139)

( config )

Parameters

* **config** ([LlamaConfig](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/llama2#transformers.LlamaConfig)) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from\_pretrained()](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.

The LLaMa Model transformer with a sequence classification head on top (linear layer).

[LlamaForSequenceClassification](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/llama2#transformers.LlamaForSequenceClassification) uses the last token in order to do the classification, as other causal models (e.g. GPT-2) do.

Since it does classification on the last token, it requires to know the position of the last token. If a `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in each row of the batch).

This model inherits from [PreTrainedModel](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

**forward**

[\<source>](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/llama/modeling_llama.py#L1155)

( input\_ids: LongTensor = Noneattention\_mask: typing.Optional\[torch.Tensor] = Noneposition\_ids: typing.Optional\[torch.LongTensor] = Nonepast\_key\_values: typing.Optional\[typing.List\[torch.FloatTensor]] = Noneinputs\_embeds: typing.Optional\[torch.FloatTensor] = Nonelabels: typing.Optional\[torch.LongTensor] = Noneuse\_cache: typing.Optional\[bool] = Noneoutput\_attentions: typing.Optional\[bool] = Noneoutput\_hidden\_states: typing.Optional\[bool] = Nonereturn\_dict: typing.Optional\[bool] = None )

Parameters

* **input\_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

  Indices can be obtained using [AutoTokenizer](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast.encode) and [PreTrainedTokenizer.**call**()](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/vits#transformers.VitsTokenizer.__call__) for details.

  [What are input IDs?](https://huggingface.co/docs/transformers/glossary#input-ids)
* **attention\_mask** (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

  * 1 for tokens that are **not masked**,
  * 0 for tokens that are **masked**.

  [What are attention masks?](https://huggingface.co/docs/transformers/glossary#attention-mask)

  Indices can be obtained using [AutoTokenizer](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/auto#transformers.AutoTokenizer). See [PreTrainedTokenizer.encode()](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast.encode) and [PreTrainedTokenizer.**call**()](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/vits#transformers.VitsTokenizer.__call__) for details.

  If `past_key_values` is used, optionally only the last `input_ids` have to be input (see `past_key_values`).

  If you want to change padding behavior, you should read `modeling_opt._prepare_decoder_attention_mask` and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more information on the default strategy.

  * 1 indicates the head is **not masked**,
  * 0 indicates the head is **masked**.
* **position\_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.

  [What are position IDs?](https://huggingface.co/docs/transformers/glossary#position-ids)
* **past\_key\_values** (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`) — Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.

  Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.

  If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don’t have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
* **inputs\_embeds** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model’s internal embedding lookup matrix.
* **use\_cache** (`bool`, *optional*) — If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).
* **output\_attentions** (`bool`, *optional*) — Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
* **output\_hidden\_states** (`bool`, *optional*) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
* **return\_dict** (`bool`, *optional*) — Whether or not to return a [ModelOutput](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/output#transformers.utils.ModelOutput) instead of a plain tuple.
* **labels** (`torch.LongTensor` of shape `(batch_size,)`, *optional*) — Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).

The [LlamaForSequenceClassification](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/llama2#transformers.LlamaForSequenceClassification) forward method, overrides the `__call__` special method.

Although the recipe for forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://boinc-ai.gitbook.io/transformers/api/models/text-models/llama2.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
