Persimmon
Overview
The Persimmon model was created by ADEPT, and authored by Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani.
The authors introduced Persimmon-8B, a decoder model based on the classic transformers architecture, with query and key normalization. Persimmon-8B is a fully permissively-licensed model with approximately 8 billion parameters, released under the Apache license. Some of the key attributes of Persimmon-8B are long context size (16K), performance, and capabilities for multimodal extensions.
The authors showcase their approach to model evaluation, focusing on practical text generation, mirroring how users interact with language models. The work also includes a comparative analysis, pitting Persimmon-8B against other prominent models (MPT 7B Instruct and Llama 2 Base 7B 1-Shot) across various evaluation tasks. The results demonstrate Persimmon-8B's competitive performance, even with limited training data.
In terms of model details, the work outlines the architecture and training methodology of Persimmon-8B, providing insights into its design choices, sequence length, and dataset composition. The authors present fast inference code that outperforms traditional implementations through operator fusion and CUDA graph utilization while maintaining code coherence. They express their anticipation of how the community will leverage this contribution to drive innovation, hinting at further upcoming releases as part of an ongoing series of developments.
The Persimmon models were trained using bfloat16, but the original inference uses float16. The checkpoints uploaded on the hub use torch_dtype = 'float16', which will be used by the AutoModel API to cast the checkpoints from torch.float32 to torch.float16.
The dtype of the online weights is mostly irrelevant unless you are using torch_dtype="auto" when initializing a model using model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto"). The reason is that the model will first be downloaded (using the dtype of the checkpoints online) and then cast to the default dtype of torch (torch.float32). Users should specify the torch_dtype they want; if they don't, it will be torch.float32.
Fine-tuning the model in float16 is not recommended and is known to produce nan; the model should therefore be fine-tuned in bfloat16.
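The float16 failure mode largely comes down to range: float16 cannot represent magnitudes above 65504, whereas bfloat16 keeps the 8-bit exponent of float32 and gives up mantissa precision instead. The following is a minimal standard-library sketch of that difference, not code from the Persimmon release. The helper names `fits_in_float16` and `to_bfloat16` are illustrative, and bfloat16 is emulated here by truncating a float32 (real hardware usually rounds to nearest).

```python
import struct


def fits_in_float16(x: float) -> bool:
    """Return True if x can be packed as an IEEE half-precision float."""
    try:
        struct.pack(">e", x)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    Assumption: plain truncation, which is simpler than the rounding that
    real bfloat16 conversions normally perform.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


# A value that is routine for an activation or gradient, yet above 65504.
print(fits_in_float16(70000.0))  # False
print(to_bfloat16(70000.0))      # 69632.0 (coarser, but still finite)
```

In practice this is the motivation for passing torch_dtype=torch.bfloat16 to from_pretrained when fine-tuning, rather than relying on the float16 dtype stored with the checkpoints.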
Tips:
- To convert the model, clone the original repository with git clone https://github.com/persimmon-ai-labs/adept-inference, then download the checkpoints:
git clone https://github.com/persimmon-ai-labs/adept-inference
wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_base_model_release.tar
tar -xvf 8b_base_model_release.tar
python src/transformers/models/persimmon/convert_persimmon_weights_to_hf.py --input_dir /path/to/downloaded/persimmon/weights/ --output_dir /output/path \
    --pt_model_path /path/to/8b_chat_model_release/iter_0001251/mp_rank_00/model_optim_rng.pt \
    --ada_lib_path /path/to/adept-inference
For the chat model:
wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_chat_model_release.tar
tar -xvf 8b_chat_model_release.tar
Thereafter, models can be loaded via:
from transformers import PersimmonForCausalLM, AutoTokenizer
model = PersimmonForCausalLM.from_pretrained("/output/path")
tokenizer = AutoTokenizer.from_pretrained("/output/path")
This model was contributed by ArthurZ. The original code can be found at https://github.com/persimmon-ai-labs/adept-inference.
- Persimmon uses a sentencepiece-based tokenizer with a Unigram model. It supports byte fallback, which is only available in tokenizers==0.14.0 for the fast tokenizer. The LlamaTokenizer is used as it is a standard wrapper around sentencepiece. The chat template will be updated with the templating functions in a follow-up PR.
- The authors suggest using the following prompt format for the chat mode: f"human: {prompt}\n\nadept:"
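The suggested chat format can be wrapped in a small helper so that every request is built the same way. This is only a sketch; the function name `build_chat_prompt` is illustrative and is not part of the library.

```python
def build_chat_prompt(prompt: str) -> str:
    """Wrap a user message in the prompt format suggested for the chat mode."""
    return f"human: {prompt}\n\nadept:"


print(repr(build_chat_prompt("Hey, what should I eat for dinner?")))
# 'human: Hey, what should I eat for dinner?\n\nadept:'
```

The resulting string is what would be passed to the tokenizer before calling generate.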
PersimmonConfig
class transformers.PersimmonConfig
( vocab_size = 262144, hidden_size = 4096, intermediate_size = 16384, num_hidden_layers = 36, num_attention_heads = 64, hidden_act = 'relu2', max_position_embeddings = 16384, initializer_range = 0.02, layer_norm_eps = 1e-05, use_cache = True, tie_word_embeddings = False, rope_theta = 25000.0, rope_scaling = None, qk_layernorm = True, hidden_dropout = 0.0, attention_dropout = 0.0, partial_rotary_factor = 0.5, pad_token_id = None, bos_token_id = 1, eos_token_id = 2, **kwargs )
Parameters
- vocab_size (int, optional, defaults to 262144): Vocabulary size of the Persimmon model. Defines the number of different tokens that can be represented by the input_ids passed when calling PersimmonModel.
- hidden_size (int, optional, defaults to 4096): Dimension of the hidden representations.
- intermediate_size (int, optional, defaults to 16384): Dimension of the MLP representations.
- num_hidden_layers (int, optional, defaults to 36): Number of hidden layers in the Transformer decoder.
- num_attention_heads (int, optional, defaults to 64): Number of attention heads for each attention layer in the Transformer decoder.
- hidden_act (str or function, optional, defaults to "relu2"): The non-linear activation function (function or string) in the decoder.
- max_position_embeddings (int, optional, defaults to 16384): The maximum sequence length that this model might ever be used with.
- initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- layer_norm_eps (float, optional, defaults to 1e-5): The epsilon used by the rms normalization layers.
- use_cache (bool, optional, defaults to True): Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
- tie_word_embeddings (bool, optional, defaults to False): Whether to tie weight embeddings.
- rope_theta (float, optional, defaults to 25000.0): The base period of the RoPE embeddings.
- rope_scaling (Dict, optional): Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is {"type": strategy name, "factor": scaling factor}. When using this flag, don't update max_position_embeddings to the expected new maximum. See the following thread for more information on how these scaling strategies behave: https://www.reddit.com/r/LocalPersimmon/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an experimental feature, subject to breaking API changes in future versions.
- qk_layernorm (bool, optional, defaults to True): Whether or not to normalize the queries and keys after projecting the hidden states.
- hidden_dropout (float, optional, defaults to 0.0): The dropout ratio after applying the MLP to the hidden states.
- attention_dropout (float, optional, defaults to 0.0): The dropout ratio after computing the attention scores.
- partial_rotary_factor (float, optional, defaults to 0.5): Fraction of the query and key dimensions that will have rotary embeddings applied.
This is the configuration class to store the configuration of a PersimmonModel. It is used to instantiate a Persimmon model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of adept/persimmon-8b-base.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
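To make two of the parameters above concrete, the sketch below derives the per-head dimension and the number of rotary dimensions from the default values, and checks a rope_scaling dictionary against the documented format. It is an illustration only: the function names are hypothetical, and applying partial_rotary_factor to the per-head dimension is an assumption about how the factor is used rather than a quote of the library code.

```python
from typing import Optional, Tuple


def rotary_dims(hidden_size: int, num_attention_heads: int,
                partial_rotary_factor: float) -> Tuple[int, int]:
    """Return (head_dim, rotary_ndims) for the given configuration values."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    head_dim = hidden_size // num_attention_heads
    # Assumption: the factor scales the per-head dimension.
    return head_dim, int(head_dim * partial_rotary_factor)


def check_rope_scaling(rope_scaling: Optional[dict]) -> None:
    """Validate the documented {"type": ..., "factor": ...} format."""
    if rope_scaling is None:
        return
    if rope_scaling.get("type") not in ("linear", "dynamic"):
        raise ValueError("type must be 'linear' or 'dynamic'")
    factor = rope_scaling.get("factor")
    if not isinstance(factor, float) or factor <= 1.0:
        raise ValueError("factor must be a float greater than 1")


# Defaults from the signature: hidden_size=4096, num_attention_heads=64,
# partial_rotary_factor=0.5.
print(rotary_dims(4096, 64, 0.5))  # (64, 32)
check_rope_scaling({"type": "linear", "factor": 2.0})  # passes silently
```

With the defaults, each head has 64 dimensions and, under the stated assumption, half of them receive rotary position information.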
Example:
>>> from transformers import PersimmonModel, PersimmonConfig
>>> # Initializing a Persimmon persimmon-8b style configuration
>>> configuration = PersimmonConfig()
PersimmonModel
class transformers.PersimmonModel
( config: PersimmonConfig )
Parameters
- config (PersimmonConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Persimmon Model outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
Transformer decoder consisting of config.num_hidden_layers layers. Each layer is a PersimmonDecoderLayer.
forward
( input_ids: LongTensor = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None )
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)): Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
  Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values). If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.num_hidden_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding. If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional): If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional): Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.
The PersimmonModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
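The relationship between input_ids and attention_mask described in the parameter list can be sketched without any library. The helper below right-pads a batch of token id lists and builds the matching mask (1 for real tokens, 0 for padding). The name `pad_batch` and the pad id of 0 are hypothetical; in normal use the tokenizer produces both tensors.

```python
from typing import List, Tuple


def pad_batch(sequences: List[List[int]],
              pad_token_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad sequences to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        # 1 marks tokens that are attended to, 0 marks padding positions.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8]], pad_token_id=0)
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

Both lists have shape (batch_size, sequence_length), matching the shapes documented for input_ids and attention_mask.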
PersimmonForCausalLM
class transformers.PersimmonForCausalLM
( config )
forward
( input_ids: LongTensor = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)): Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
  Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values). If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.num_hidden_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding. If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional): If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional): Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional): Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PersimmonConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.num_hidden_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The PersimmonForCausalLM forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
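The effect of -100 in labels is easiest to see in a small worked loss computation. The sketch below scores one sequence in plain Python. It assumes the usual causal language modeling convention in which the logits at position t are compared with the label at position t+1, and it skips every position whose label is -100. The function name `causal_lm_loss` is illustrative; the model computes the equivalent quantity on tensors.

```python
import math
from typing import List

IGNORE_INDEX = -100  # label value that excludes a position from the loss


def causal_lm_loss(logits: List[List[float]], labels: List[int]) -> float:
    """Mean next-token cross-entropy for one sequence, ignoring -100 labels."""
    total, count = 0.0, 0
    # Position t predicts token t + 1: drop the last logits and the first label.
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(step_logits)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in step_logits))
        total += log_norm - step_logits[target]
        count += 1
    if count == 0:
        raise ValueError("every label is ignored, so the loss is undefined")
    return total / count


# Uniform logits over a 4-token vocabulary; only one position is scored.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
print(causal_lm_loss(uniform, [-100, 2, -100]))  # log(4), about 1.386
```

Setting prompt positions to -100 in this way restricts the loss to the tokens the model is expected to produce.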
Example:
>>> from transformers import AutoTokenizer, PersimmonForCausalLM
>>> model = PersimmonForCausalLM.from_pretrained("adept/persimmon-8b-base")
>>> tokenizer = AutoTokenizer.from_pretrained("adept/persimmon-8b-base")
>>> prompt = "human: Hey, what should I eat for dinner?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
'human: Hey, what should I eat for dinner?\n\ncat: 🐱\n\nhuman: 😐\n\n'
PersimmonForSequenceClassification
class transformers.PersimmonForSequenceClassification
( config )
Parameters
- config (PersimmonConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Persimmon transformer with a sequence classification head on top (linear layer).
PersimmonForSequenceClassification uses the last token in order to do the classification, as other causal models (e.g. GPT-2) do.
Since it does classification on the last token, it needs to know the position of the last token. If a pad_token_id is defined in the configuration, it finds the last token that is not a padding token in each row. If no pad_token_id is defined, it simply takes the last value in each row of the batch. Since it cannot guess the padding tokens when inputs_embeds are passed instead of input_ids, it does the same (takes the last value in each row of the batch).
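The last-token rule described above can be sketched for a single row of token ids. This is an illustration of the described behavior and not the library's implementation; the function name `last_token_index` and the pad id of 0 are hypothetical.

```python
from typing import List, Optional


def last_token_index(input_ids: List[int], pad_token_id: Optional[int]) -> int:
    """Return the position whose hidden state would feed the classifier."""
    if pad_token_id is None:
        # Without a pad token there is nothing to skip: use the final position.
        return len(input_ids) - 1
    # Walk backwards until a non-padding token is found.
    for index in range(len(input_ids) - 1, -1, -1):
        if input_ids[index] != pad_token_id:
            return index
    raise ValueError("row contains only padding tokens")


print(last_token_index([5, 6, 7, 0, 0], pad_token_id=0))     # 2
print(last_token_index([5, 6, 7, 0, 0], pad_token_id=None))  # 4
```

The second call shows why a missing pad_token_id matters for padded batches: the classifier would read the hidden state of a padding position.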
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_ids: LongTensor = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None )
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)): Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
  Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values). If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.num_hidden_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding. If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional): If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional): Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.
- labels (torch.LongTensor of shape (batch_size,), optional): Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (Mean-Square loss). If config.num_labels > 1, a classification loss is computed (Cross-Entropy).
The PersimmonForSequenceClassification forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
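The rule for labels (regression when config.num_labels == 1, classification otherwise) can be sketched as a single function over plain Python lists. The name `classification_loss` is illustrative, and the code mirrors only the rule as documented here, not the library's source.

```python
import math
from typing import List, Sequence


def classification_loss(logits: List[List[float]], labels: Sequence[float],
                        num_labels: int) -> float:
    """Mean-square loss if num_labels == 1, otherwise mean cross-entropy."""
    if num_labels == 1:
        # Regression: each row holds one predicted value.
        squared = [(row[0] - target) ** 2 for row, target in zip(logits, labels)]
        return sum(squared) / len(squared)
    total = 0.0
    for row, target in zip(logits, labels):
        # Numerically stable log-sum-exp over the class scores.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[int(target)]
    return total / len(logits)


print(classification_loss([[1.0], [3.0]], [0.0, 1.0], num_labels=1))  # 2.5
print(classification_loss([[0.0, 0.0]], [1], num_labels=2))  # log(2), about 0.693
```

In the regression case the labels are real-valued targets, while in the classification case they are class indices in [0, ..., config.num_labels - 1].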
