Perceiver
Last updated
Last updated
The Perceiver IO model was proposed in Perceiver IO: A General Architecture for Structured Inputs & Outputs by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.
Perceiver IO is a generalization of Perceiver to handle arbitrary outputs in addition to arbitrary inputs. The original Perceiver only produced a single classification label. In addition to classification labels, Perceiver IO can produce (for example) language, optical flow, and multimodal videos with audio. This is done using the same building blocks as the original Perceiver. The computational complexity of Perceiver IO is linear in the input and output size and the bulk of the processing occurs in the latent space, allowing us to process inputs and outputs that are much larger than can be handled by standard Transformers. This means, for example, Perceiver IO can do BERT-style masked language modeling directly using bytes instead of tokenized inputs.
The abstract from the paper is the following:
The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without sacrificing the original’s appealing properties by learning to flexibly query the model’s latent space to produce outputs of arbitrary size and semantics. Perceiver IO still decouples model depth from data size and still scales linearly with data size, but now with respect to both input and output sizes. The full Perceiver IO model achieves strong results on tasks with highly structured output spaces, such as natural language and visual understanding, StarCraft II, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT baseline on the GLUE language benchmark without the need for input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation.
Here’s a TLDR explaining how Perceiver works:
The main problem with the self-attention mechanism of the Transformer is that the time and memory requirements scale quadratically with the sequence length. Hence, models like BERT and RoBERTa are limited to a max sequence length of 512 tokens. Perceiver aims to solve this issue by, instead of performing self-attention on the inputs, perform it on a set of latent variables, and only use the inputs for cross-attention. In this way, the time and memory requirements don’t depend on the length of the inputs anymore, as one uses a fixed amount of latent variables, like 256 or 512. These are randomly initialized, after which they are trained end-to-end using backpropagation.
Internally, PerceiverModel will create the latents, which is a tensor of shape (batch_size, num_latents, d_latents)
. One must provide inputs
(which could be text, images, audio, you name it!) to the model, which it will use to perform cross-attention with the latents. The output of the Perceiver encoder is a tensor of the same shape. One can then, similar to BERT, convert the last hidden states of the latents to classification logits by averaging along the sequence dimension, and placing a linear layer on top of that to project the d_latents
to num_labels
.
This was the idea of the original Perceiver paper. However, it could only output classification logits. In a follow-up work, PerceiverIO, they generalized it to let the model also produce outputs of arbitrary size. How, you might ask? The idea is actually relatively simple: one defines outputs of an arbitrary size, and then applies cross-attention with the last hidden states of the latents, using the outputs as queries, and the latents as keys and values.
So let’s say one wants to perform masked language modeling (BERT-style) with the Perceiver. As the Perceiver’s input length will not have an impact on the computation time of the self-attention layers, one can provide raw bytes, providing inputs
of length 2048 to the model. If one now masks out certain of these 2048 tokens, one can define the outputs
as being of shape: (batch_size, 2048, 768)
. Next, one performs cross-attention with the final hidden states of the latents to update the outputs
tensor. After cross-attention, one still has a tensor of shape (batch_size, 2048, 768)
. One can then place a regular language modeling head on top, to project the last dimension to the vocabulary size of the model, i.e. creating logits of shape (batch_size, 2048, 262)
(as Perceiver uses a vocabulary size of 262 byte IDs).
Perceiver IO architecture. Taken from the original paper
This model was contributed by nielsr. The original code can be found here.
Tips:
The quickest way to get started with the Perceiver is by checking the tutorial notebooks.
Refer to the blog post if you want to fully understand how the model works and is implemented in the library. Note that the models available in the library only showcase some examples of what you can do with the Perceiver. There are many more use cases, including question answering, named-entity recognition, object detection, audio classification, video classification, etc.
Note:
Perceiver does not work with torch.nn.DataParallel
due to a bug in PyTorch, see issue #36035
( logits: FloatTensor = Nonelast_hidden_state: FloatTensor = Nonehidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = Noneattentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = Nonecross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
logits (torch.FloatTensor
of shape (batch_size, num_labels)
) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
last_hidden_state (torch.FloatTensor
of shape (batch_size, sequence_length, hidden_size)
) — Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
. Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Base class for Perceiver base model’s outputs, with potential hidden states, attentions and cross-attentions.
( logits: FloatTensor = Nonecross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
logits (torch.FloatTensor
of shape (batch_size, num_labels)
) — Output of the basic decoder.
cross_attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Base class for Perceiver decoder outputs, with potential cross-attentions.
( loss: typing.Optional[torch.FloatTensor] = Nonelogits: FloatTensor = Nonehidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = Noneattentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = Nonecross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Masked language modeling (MLM) loss.
logits (torch.FloatTensor
of shape (batch_size, sequence_length, config.vocab_size)
) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
. Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, num_latents, num_latents)
. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Base class for Perceiver’s masked language model outputs.
( loss: typing.Optional[torch.FloatTensor] = Nonelogits: FloatTensor = Nonehidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = Noneattentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = Nonecross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor
of shape (batch_size, config.num_labels)
) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
. Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Base class for Perceiver’s outputs of sequence/image classification models, optical flow and multimodal autoencoding.
( num_latents = 256d_latents = 1280d_model = 768num_blocks = 1num_self_attends_per_block = 26num_self_attention_heads = 8num_cross_attention_heads = 8qk_channels = Nonev_channels = Nonecross_attention_shape_for_attention = 'kv'self_attention_widening_factor = 1cross_attention_widening_factor = 1hidden_act = 'gelu'attention_probs_dropout_prob = 0.1initializer_range = 0.02layer_norm_eps = 1e-12use_query_residual = Truevocab_size = 262max_position_embeddings = 2048image_size = 56train_size = [368, 496]num_frames = 16audio_samples_per_frame = 1920samples_per_patch = 16output_shape = [1, 16, 224, 224]output_num_channels = 512_label_trainable_num_channels = 1024**kwargs )
Parameters
num_latents (int
, optional, defaults to 256) — The number of latents.
d_latents (int
, optional, defaults to 1280) — Dimension of the latent embeddings.
d_model (int
, optional, defaults to 768) — Dimension of the inputs. Should only be provided in case [PerceiverTextPreprocessor] is used or no preprocessor is provided.
num_blocks (int
, optional, defaults to 1) — Number of blocks in the Transformer encoder.
num_self_attends_per_block (int
, optional, defaults to 26) — The number of self-attention layers per block.
num_self_attention_heads (int
, optional, defaults to 8) — Number of attention heads for each self-attention layer in the Transformer encoder.
num_cross_attention_heads (int
, optional, defaults to 8) — Number of attention heads for each cross-attention layer in the Transformer encoder.
qk_channels (int
, optional) — Dimension to project the queries + keys before applying attention in the cross-attention and self-attention layers of the encoder. Will default to preserving the dimension of the queries if not specified.
v_channels (int
, optional) — Dimension to project the values before applying attention in the cross-attention and self-attention layers of the encoder. Will default to preserving the dimension of the queries if not specified.
cross_attention_shape_for_attention (str
, optional, defaults to 'kv'
) — Dimension to use when downsampling the queries and keys in the cross-attention layer of the encoder.
self_attention_widening_factor (int
, optional, defaults to 1) — Dimension of the feed-forward layer in the cross-attention layer of the Transformer encoder.
cross_attention_widening_factor (int
, optional, defaults to 1) — Dimension of the feed-forward layer in the self-attention layers of the Transformer encoder.
hidden_act (str
or function
, optional, defaults to "gelu"
) — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu"
, "relu"
, "selu"
and "gelu_new"
are supported.
attention_probs_dropout_prob (float
, optional, defaults to 0.1) — The dropout ratio for the attention probabilities.
initializer_range (float
, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float
, optional, defaults to 1e-12) — The epsilon used by the layer normalization layers.
use_query_residual (float
, optional, defaults to True
) — Whether to add a query residual in the cross-attention layer of the encoder.
vocab_size (int
, optional, defaults to 262) — Vocabulary size for the masked language modeling model.
max_position_embeddings (int
, optional, defaults to 2048) — The maximum sequence length that the masked language modeling model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
image_size (int
, optional, defaults to 56) — Size of the images after preprocessing, for PerceiverForImageClassificationLearned.
train_size (List[int]
, optional, defaults to [368, 496]) — Training size of the images for the optical flow model.
num_frames (int
, optional, defaults to 16) — Number of video frames used for the multimodal autoencoding model.
audio_samples_per_frame (int
, optional, defaults to 1920) — Number of audio samples per frame for the multimodal autoencoding model.
samples_per_patch (int
, optional, defaults to 16) — Number of audio samples per patch when preprocessing the audio for the multimodal autoencoding model.
output_num_channels (int
, optional, defaults to 512) — Number of output channels for each modalitiy decoder.
output_shape (List[int]
, optional, defaults to [1, 16, 224, 224]
) — Shape of the output (batch_size, num_frames, height, width) for the video decoder queries of the multimodal autoencoding model. This excludes the channel dimension.
This is the configuration class to store the configuration of a PerceiverModel. It is used to instantiate an Perceiver model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Perceiver deepmind/language-perceiver architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
Copied
( pad_token = '[PAD]'bos_token = '[BOS]'eos_token = '[EOS]'mask_token = '[MASK]'cls_token = '[CLS]'sep_token = '[SEP]'model_max_length = 2048**kwargs )
Parameters
pad_token (str
, optional, defaults to "[PAD]"
) — The token used for padding, for example when batching sequences of different lengths.
bos_token (str
, optional, defaults to "[BOS]"
) — The BOS token (reserved in the vocab, but not actually used).
eos_token (str
, optional, defaults to "[EOS]"
) — The end of sequence token (reserved in the vocab, but not actually used).
When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token
.
mask_token (str
, optional, defaults to "[MASK]"
) — The MASK token, useful for masked language modeling.
cls_token (str
, optional, defaults to "[CLS]"
) — The CLS token (reserved in the vocab, but not actually used).
sep_token (str
, optional, defaults to "[SEP]"
) — The separator token, which is used when building a sequence from two sequences.
Construct a Perceiver tokenizer. The Perceiver simply uses raw bytes utf-8 encoding.
This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
__call__
( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = Nonetext_pair: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = Nonetext_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = Nonetext_pair_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = Noneadd_special_tokens: bool = Truepadding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = Falsetruncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = Nonemax_length: typing.Optional[int] = Nonestride: int = 0is_split_into_words: bool = Falsepad_to_multiple_of: typing.Optional[int] = Nonereturn_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = Nonereturn_token_type_ids: typing.Optional[bool] = Nonereturn_attention_mask: typing.Optional[bool] = Nonereturn_overflowing_tokens: bool = Falsereturn_special_tokens_mask: bool = Falsereturn_offsets_mapping: bool = Falsereturn_length: bool = Falseverbose: bool = True**kwargs ) → BatchEncoding
Parameters
text (str
, List[str]
, List[List[str]]
, optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True
(to lift the ambiguity with a batch of sequences).
text_pair (str
, List[str]
, List[List[str]]
, optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True
(to lift the ambiguity with a batch of sequences).
text_target (str
, List[str]
, List[List[str]]
, optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True
(to lift the ambiguity with a batch of sequences).
text_pair_target (str
, List[str]
, List[List[str]]
, optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True
(to lift the ambiguity with a batch of sequences).
add_special_tokens (bool
, optional, defaults to True
) — Whether or not to add special tokens when encoding the sequences. This will use the underlying PretrainedTokenizerBase.build_inputs_with_special_tokens
function, which defines which tokens are automatically added to the input ids. This is usefull if you want to add bos
or eos
tokens automatically.
padding (bool
, str
or PaddingStrategy, optional, defaults to False
) — Activates and controls padding. Accepts the following values:
True
or 'longest'
: Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).
'max_length'
: Pad to a maximum length specified with the argument max_length
or to the maximum acceptable input length for the model if that argument is not provided.
False
or 'do_not_pad'
(default): No padding (i.e., can output a batch with sequences of different lengths).
truncation (bool
, str
or TruncationStrategy, optional, defaults to False
) — Activates and controls truncation. Accepts the following values:
True
or 'longest_first'
: Truncate to a maximum length specified with the argument max_length
or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
'only_first'
: Truncate to a maximum length specified with the argument max_length
or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
'only_second'
: Truncate to a maximum length specified with the argument max_length
or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False
or 'do_not_truncate'
(default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
max_length (int
, optional) — Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to None
, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
stride (int
, optional, defaults to 0) — If set to a number along with max_length
, the overflowing tokens returned when return_overflowing_tokens=True
will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
is_split_into_words (bool
, optional, defaults to False
) — Whether or not the input is already pre-tokenized (e.g., split into words). If set to True
, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize. This is useful for NER or token classification.
pad_to_multiple_of (int
, optional) — If set will pad the sequence to a multiple of the provided value. Requires padding
to be activated. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5
(Volta).
return_tensors (str
or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:
'tf'
: Return TensorFlow tf.constant
objects.
'pt'
: Return PyTorch torch.Tensor
objects.
'np'
: Return Numpy np.ndarray
objects.
return_token_type_ids (bool
, optional) — Whether to return token type IDs. If left to the default, will return the token type IDs according to the specific tokenizer’s default, defined by the return_outputs
attribute.
return_attention_mask (bool
, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs
attribute.
return_overflowing_tokens (bool
, optional, defaults to False
) — Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch of pairs) is provided with truncation_strategy = longest_first
or True
, an error is raised instead of returning overflowing tokens.
return_special_tokens_mask (bool
, optional, defaults to False
) — Whether or not to return special tokens mask information.
return_offsets_mapping (bool
, optional, defaults to False
) — Whether or not to return (char_start, char_end)
for each token.
This is only available on fast tokenizers inheriting from PreTrainedTokenizerFast, if using Python’s tokenizer, this method will raise NotImplementedError
.
return_length (bool
, optional, defaults to False
) — Whether or not to return the lengths of the encoded inputs.
verbose (bool
, optional, defaults to True
) — Whether or not to print more information and warnings. **kwargs — passed to the self.tokenize()
method
Returns
A BatchEncoding with the following fields:
input_ids — List of token ids to be fed to a model.
token_type_ids — List of token type ids to be fed to a model (when return_token_type_ids=True
or if “token_type_ids” is in self.model_input_names
).
attention_mask — List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True
or if “attention_mask” is in self.model_input_names
).
overflowing_tokens — List of overflowing tokens sequences (when a max_length
is specified and return_overflowing_tokens=True
).
num_truncated_tokens — Number of tokens truncated (when a max_length
is specified and return_overflowing_tokens=True
).
special_tokens_mask — List of 0s and 1s, with 1 specifying added special tokens and 0 specifying regular sequence tokens (when add_special_tokens=True
and return_special_tokens_mask=True
).
length — The length of the inputs (when return_length=True
)
Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.
( *args**kwargs )
__call__
( images**kwargs )
Preprocess an image or a batch of images.
( do_center_crop: bool = Truecrop_size: typing.Dict[str, int] = Nonedo_resize: bool = Truesize: typing.Dict[str, int] = Noneresample: Resampling = <Resampling.BICUBIC: 3>do_rescale: bool = Truerescale_factor: typing.Union[int, float] = 0.00392156862745098do_normalize: bool = Trueimage_mean: typing.Union[float, typing.List[float], NoneType] = Noneimage_std: typing.Union[float, typing.List[float], NoneType] = None**kwargs )
Parameters
do_center_crop (bool
, optional
, defaults to True
) — Whether or not to center crop the image. If the input size if smaller than crop_size
along any edge, the image will be padded with zeros and then center cropped. Can be overridden by the do_center_crop
parameter in the preprocess
method.
crop_size (Dict[str, int]
, optional, defaults to {"height" -- 256, "width": 256}
): Desired output size when applying center-cropping. Can be overridden by the crop_size
parameter in the preprocess
method.
do_resize (bool
, optional, defaults to True
) — Whether to resize the image to (size["height"], size["width"])
. Can be overridden by the do_resize
parameter in the preprocess
method.
size (Dict[str, int]
optional, defaults to {"height" -- 224, "width": 224}
): Size of the image after resizing. Can be overridden by the size
parameter in the preprocess
method.
resample (PILImageResampling
, optional, defaults to PILImageResampling.BICUBIC
) — Defines the resampling filter to use if resizing the image. Can be overridden by the resample
parameter in the preprocess
method.
do_rescale (bool
, optional, defaults to True
) — Whether to rescale the image by the specified scale rescale_factor
. Can be overridden by the do_rescale
parameter in the preprocess
method.
rescale_factor (int
or float
, optional, defaults to 1/255
) — Defines the scale factor to use if rescaling the image. Can be overridden by the rescale_factor
parameter in the preprocess
method. do_normalize — Whether to normalize the image. Can be overridden by the do_normalize
parameter in the preprocess
method.
image_mean (float
or List[float]
, optional, defaults to IMAGENET_STANDARD_MEAN
) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean
parameter in the preprocess
method.
image_std (float
or List[float]
, optional, defaults to IMAGENET_STANDARD_STD
) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std
parameter in the preprocess
method.
Constructs a Perceiver image processor.
preprocess
( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]]do_center_crop: typing.Optional[bool] = Nonecrop_size: typing.Union[typing.Dict[str, int], NoneType] = Nonedo_resize: typing.Optional[bool] = Nonesize: typing.Union[typing.Dict[str, int], NoneType] = Noneresample: Resampling = Nonedo_rescale: typing.Optional[bool] = Nonerescale_factor: typing.Optional[float] = Nonedo_normalize: typing.Optional[bool] = Noneimage_mean: typing.Union[float, typing.List[float], NoneType] = Noneimage_std: typing.Union[float, typing.List[float], NoneType] = Nonereturn_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = Nonedata_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'>input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None**kwargs )
Parameters
images (ImageInput
) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False
.
do_center_crop (bool
, optional, defaults to self.do_center_crop
) — Whether to center crop the image to crop_size
.
crop_size (Dict[str, int]
, optional, defaults to self.crop_size
) — Desired output size after applying the center crop.
do_resize (bool
, optional, defaults to self.do_resize
) — Whether to resize the image.
size (Dict[str, int]
, optional, defaults to self.size
) — Size of the image after resizing.
resample (int
, optional, defaults to self.resample
) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling
, Only has an effect if do_resize
is set to True
.
do_rescale (bool
, optional, defaults to self.do_rescale
) — Whether to rescale the image.
rescale_factor (float
, optional, defaults to self.rescale_factor
) — Rescale factor to rescale the image by if do_rescale
is set to True
.
do_normalize (bool
, optional, defaults to self.do_normalize
) — Whether to normalize the image.
image_mean (float
or List[float]
, optional, defaults to self.image_mean
) — Image mean.
image_std (float
or List[float]
, optional, defaults to self.image_std
) — Image standard deviation.
return_tensors (str
or TensorType
, optional) — The type of tensors to return. Can be one of:
Unset: Return a list of np.ndarray
.
TensorType.TENSORFLOW
or 'tf'
: Return a batch of type tf.Tensor
.
TensorType.PYTORCH
or 'pt'
: Return a batch of type torch.Tensor
.
TensorType.NUMPY
or 'np'
: Return a batch of type np.ndarray
.
TensorType.JAX
or 'jax'
: Return a batch of type jax.numpy.ndarray
.
data_format (ChannelDimension
or str
, optional, defaults to ChannelDimension.FIRST
) — The channel dimension format for the output image. Can be one of:
ChannelDimension.FIRST
: image in (num_channels, height, width) format.
ChannelDimension.LAST
: image in (height, width, num_channels) format.
input_data_format (ChannelDimension
or str
, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
"channels_first"
or ChannelDimension.FIRST
: image in (num_channels, height, width) format.
"channels_last"
or ChannelDimension.LAST
: image in (height, width, num_channels) format.
"none"
or ChannelDimension.NONE
: image in (height, width) format.
Preprocess an image or batch of images.
( config: PerceiverConfig )
Parameters
config (PerceiverConfig) — Model configuration.
Text preprocessing for Perceiver Encoder. Can be used to embed inputs
and add positional encodings.
The dimensionality of the embeddings is determined by the d_model
attribute of the configuration.
( configprep_type = 'conv'spatial_downsample: int = 4temporal_downsample: int = 1position_encoding_type: str = 'fourier'in_channels: int = 3out_channels: int = 64conv_after_patching: bool = Falseconv_after_patching_in_channels: int = 54conv2d_use_batchnorm: bool = Trueconcat_or_add_pos: str = 'concat'project_pos_dim: int = -1**position_encoding_kwargs )
Parameters
config ([PerceiverConfig]) — Model configuration.
prep_type (str
, optional, defaults to "conv"
) — Preprocessing type. Can be “conv1x1”, “conv”, “patches”, “pixels”.
spatial_downsample (int
, optional, defaults to 4) — Spatial downsampling factor.
temporal_downsample (int
, optional, defaults to 1) — Temporal downsampling factor (only relevant in case a time dimension is present).
position_encoding_type (str
, optional, defaults to "fourier"
) — Position encoding type. Can be “fourier” or “trainable”.
in_channels (int
, optional, defaults to 3) — Number of channels in the input.
out_channels (int
, optional, defaults to 64) — Number of channels in the output.
conv_after_patching (bool
, optional, defaults to False
) — Whether to apply a convolutional layer after patching.
conv_after_patching_in_channels (int
, optional, defaults to 54) — Number of channels in the input of the convolutional layer after patching.
conv2d_use_batchnorm (bool
, optional, defaults to True
) — Whether to use batch normalization in the convolutional layer.
concat_or_add_pos (str
, optional, defaults to "concat"
) — How to concatenate the position encoding to the input. Can be “concat” or “add”.
project_pos_dim (int
, optional, defaults to -1) — Dimension of the position encoding to project to. If -1, no projection is applied.
**position_encoding_kwargs (Dict
, optional) — Keyword arguments for the position encoding.
Image preprocessing for Perceiver Encoder.
Note: the out_channels argument refers to the output channels of a convolutional layer, if prep_type is set to “conv1x1” or “conv”. If one adds absolute position embeddings, one must make sure the num_channels of the position encoding kwargs are set equal to the out_channels.
( config: PerceiverConfig )
Parameters
config (PerceiverConfig) — Model configuration.
One-hot preprocessor for Perceiver Encoder. Can be used to add a dummy index dimension to the input.
( configprep_type: str = 'patches'samples_per_patch: int = 96position_encoding_type: str = 'fourier'concat_or_add_pos: str = 'concat'out_channels = 64project_pos_dim = -1**position_encoding_kwargs )
Parameters
config ([PerceiverConfig]) — Model configuration.
prep_type (str
, optional, defaults to "patches"
) — Preprocessor type to use. Only “patches” is supported.
samples_per_patch (int
, optional, defaults to 96) — Number of samples per patch.
position_encoding_type (str
, optional, defaults to "fourier"
) — Type of position encoding to use. Can be “trainable” or “fourier”.
concat_or_add_pos (str
, optional, defaults to "concat"
) — How to concatenate the position encoding to the input. Can be “concat” or “add”.
out_channels (int
, optional, defaults to 64) — Number of channels in the output.
project_pos_dim (int
, optional, defaults to -1) — Dimension of the position encoding to project to. If -1, no projection is applied.
**position_encoding_kwargs (Dict
, optional) — Keyword arguments for the position encoding.
Audio preprocessing for Perceiver Encoder.
( modalities: typing.Mapping[str, typing.Callable[..., typing.Tuple[torch.Tensor, typing.Optional[torch.Tensor], torch.Tensor]]]mask_probs: typing.Union[typing.Mapping[str, float], NoneType] = Nonemin_padding_size: int = 2 )
Parameters
modalities (Mapping[str, PreprocessorType]
) — Dict mapping modality name to preprocessor.
mask_probs (Dict[str, float]
) — Dict mapping modality name to masking probability of that modality.
min_padding_size (int
, optional, defaults to 2) — The minimum padding size for all modalities. The final output will have num_channels equal to the maximum channels across all modalities plus min_padding_size.
Multimodal preprocessing for Perceiver Encoder.
Inputs for each modality are preprocessed, then padded with trainable position embeddings to have the same number of channels.
( config )
Parameters
config (PerceiverConfig) — Model configuration.
Baseline projection decoder (no cross-attention).
( config: PerceiverConfigoutput_num_channels: intposition_encoding_type: typing.Optional[str] = 'trainable'output_index_dims: typing.Optional[int] = Nonenum_channels: typing.Optional[int] = 128subsampled_index_dims: typing.Optional[int] = Noneqk_channels: typing.Optional[int] = Nonev_channels: typing.Optional[int] = Nonenum_heads: typing.Optional[int] = 1widening_factor: typing.Optional[int] = 1use_query_residual: typing.Optional[bool] = Falseconcat_preprocessed_input: typing.Optional[bool] = Falsefinal_project: typing.Optional[bool] = Trueposition_encoding_only: typing.Optional[bool] = False**position_encoding_kwargs )
Parameters
config ([PerceiverConfig]) — Model configuration.
output_num_channels (int
, optional) — The number of channels in the output. Will only be used in case final_project is set to True
.
position_encoding_type (str
, optional, defaults to “trainable”) — The type of position encoding to use. Can be either “trainable”, “fourier”, or “none”.
output_index_dims (int
, optional) — The number of dimensions of the output queries. Ignored if ‘position_encoding_type’ == ‘none’.
num_channels (int
, optional, defaults to 128) — The number of channels of the decoder queries. Ignored if ‘position_encoding_type’ == ‘none’.
qk_channels (int
, optional) — The number of channels of the queries and keys in the cross-attention layer.
v_channels (int
, optional) — The number of channels of the values in the cross-attention layer.
num_heads (int
, optional, defaults to 1) — The number of attention heads in the cross-attention layer.
widening_factor (int
, optional, defaults to 1) — The widening factor of the cross-attention layer.
use_query_residual (bool
, optional, defaults to False
) — Whether to use a residual connection between the query and the output of the cross-attention layer.
concat_preprocessed_input (bool
, optional, defaults to False
) — Whether to concatenate the preprocessed input to the query.
final_project (bool
, optional, defaults to True
) — Whether to project the output of the cross-attention layer to a target dimension.
position_encoding_only (bool
, optional, defaults to False
) — Whether to only use this class to define output queries.
Cross-attention-based decoder. This class can be used to decode the final hidden states of the latents using a cross-attention operation, in which the latents produce keys and values.
The shape of the output of this class depends on how one defines the output queries (also called decoder queries).
( config**decoder_kwargs )
Parameters
config (PerceiverConfig) — Model configuration.
Cross-attention based classification decoder. Light-weight wrapper of PerceiverBasicDecoder
for logit output. Will turn the output of the Perceiver encoder which is of shape (batch_size, num_latents, d_latents) to a tensor of shape (batch_size, num_labels). The queries are of shape (batch_size, 1, num_labels).
( configoutput_image_shapeoutput_num_channels = 2rescale_factor = 100.0**decoder_kwargs )
Cross-attention based optical flow decoder.
( config: PerceiverConfigoutput_shape: typing.List[int]position_encoding_type: str**decoder_kwargs )
Parameters
config ([PerceiverConfig]) — Model configuration.
output_shape (List[int]
) — Shape of the output as (batch_size, num_frames, height, width), excluding the channel dimension.
position_encoding_type (str
) — The type of position encoding to use. Can be either “trainable”, “fourier”, or “none”.
Cross-attention based video-autoencoding decoder. Light-weight wrapper of [PerceiverBasicDecoder] with video reshaping logic.
( config: PerceiverConfigmodalities: typing.Dict[str, transformers.models.perceiver.modeling_perceiver.PerceiverAbstractDecoder]num_outputs: intoutput_num_channels: intmin_padding_size: typing.Optional[int] = 2subsampled_index_dims: typing.Union[typing.Dict[str, transformers.models.perceiver.modeling_perceiver.PerceiverAbstractDecoder], NoneType] = None**decoder_kwargs )
Parameters
config ([PerceiverConfig]) — Model configuration.
modalities (Dict[str, PerceiverAbstractDecoder]
) — Dictionary mapping modality name to the decoder of that modality.
num_outputs (int
) — The number of outputs of the decoder.
output_num_channels (int
) — The number of channels in the output.
min_padding_size (int
, optional, defaults to 2) — The minimum padding size for all modalities. The final output will have num_channels equal to the maximum channels across all modalities plus min_padding_size.
subsampled_index_dims (Dict[str, PerceiverAbstractDecoder]
, optional) — Dictionary mapping modality name to the subsampled index dimensions to use for the decoder query of that modality.
Multimodal decoding by composing uni-modal decoders. The modalities argument of the constructor is a dictionary mapping modality name to the decoder of that modality. That decoder will be used to construct queries for that modality. Modality-specific queries are padded with trainable modality-specific parameters, after which they are concatenated along the time dimension.
Next, there is a shared cross attention operation across all modalities.
( in_channels: intout_channels: int )
Parameters
in_channels (int
) — Number of channels in the input.
out_channels (int
) — Number of channels in the output.
Projection postprocessing for Perceiver. Can be used to project the channels of the decoder output to a lower dimension.
( config: PerceiverConfigin_channels: intpostproc_type: str = 'patches' )
Parameters
config ([PerceiverConfig]) — Model configuration.
in_channels (int
) — Number of channels in the input.
postproc_type (str
, optional, defaults to "patches"
) — Postprocessor type to use. Currently, only “patches” is supported.
Audio postprocessing for Perceiver. Can be used to convert the decoder output to audio features.
( config: PerceiverConfigin_channels: int )
Parameters
config ([PerceiverConfig]) — Model configuration.
in_channels (int
) — Number of channels in the input.
Classification postprocessing for Perceiver. Can be used to convert the decoder output to classification logits.
( modalities: typing.Mapping[str, typing.Callable[..., typing.Any]]input_is_dict: bool = False )
Parameters
modalities (Mapping[str, PostprocessorType]
) — Dictionary mapping modality name to postprocessor class for that modality.
input_is_dict (bool
, optional, defaults to False
) — If True, input is assumed to be dictionary structured, and outputs keep the same dictionary shape. If False, input is a tensor which is sliced up during postprocessing by modality_sizes.
Multimodal postprocessing for Perceiver. Can be used to combine modality-specific postprocessors into a single postprocessor.
( configdecoder = Noneinput_preprocessor: typing.Callable[..., typing.Tuple[torch.Tensor, typing.Optional[torch.Tensor], torch.Tensor]] = Noneoutput_postprocessor: typing.Callable[..., typing.Any] = None )
Parameters
config (PerceiverConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
decoder (DecoderType, optional) — Optional decoder to use to decode the latent representation of the encoder. Examples include transformers.models.perceiver.modeling_perceiver.PerceiverBasicDecoder, transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder, transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder.
input_preprocessor (PreprocessorType, optional) — Optional input preprocessor to use. Examples include transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor, transformers.models.perceiver.modeling_perceiver.PerceiverAudioPreprocessor, transformers.models.perceiver.modeling_perceiver.PerceiverTextPreprocessor, transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor.
output_postprocessor (PostprocessorType, optional) — Optional output postprocessor to use. Examples include transformers.models.perceiver.modeling_perceiver.PerceiverImagePostprocessor, transformers.models.perceiver.modeling_perceiver.PerceiverAudioPostprocessor, transformers.models.perceiver.modeling_perceiver.PerceiverClassificationPostprocessor, transformers.models.perceiver.modeling_perceiver.PerceiverProjectionPostprocessor, transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPostprocessor.
Note that you can define your own decoders, preprocessors and/or postprocessors to fit your use-case. —
The Perceiver: a scalable, fully attentional architecture. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
( inputs: FloatTensorattention_mask: typing.Optional[torch.FloatTensor] = Nonesubsampled_output_points: typing.Union[typing.Dict[str, torch.Tensor], NoneType] = Nonehead_mask: typing.Optional[torch.FloatTensor] = Noneoutput_attentions: typing.Optional[bool] = Noneoutput_hidden_states: typing.Optional[bool] = Nonereturn_dict: typing.Optional[bool] = None ) → transformers.models.perceiver.modeling_perceiver.PerceiverModelOutput or tuple(torch.FloatTensor)
Parameters
inputs (torch.FloatTensor
) — Inputs to the perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (torch.FloatTensor
of shape (batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
head_mask (torch.FloatTensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
1 indicates the head is not masked,
0 indicates the head is masked.
output_attentions (bool
, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions
under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.models.perceiver.modeling_perceiver.PerceiverModelOutput or tuple(torch.FloatTensor)
A transformers.models.perceiver.modeling_perceiver.PerceiverModelOutput or a tuple of torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the configuration (PerceiverConfig) and inputs.
logits (torch.FloatTensor
of shape (batch_size, num_labels)
) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
last_hidden_state (torch.FloatTensor
of shape (batch_size, sequence_length, hidden_size)
) — Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
. Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
The PerceiverModel forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
Copied
( config: PerceiverConfig )
Parameters
config (PerceiverConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Example use of Perceiver for masked language modeling. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
( inputs: typing.Optional[torch.Tensor] = Noneattention_mask: typing.Optional[torch.Tensor] = Nonehead_mask: typing.Optional[torch.Tensor] = Noneoutput_attentions: typing.Optional[bool] = Noneoutput_hidden_states: typing.Optional[bool] = Nonelabels: typing.Optional[torch.Tensor] = Nonereturn_dict: typing.Optional[bool] = Noneinput_ids: typing.Optional[torch.Tensor] = None ) → transformers.models.perceiver.modeling_perceiver.PerceiverMaskedLMOutput or tuple(torch.FloatTensor)
Parameters
inputs (torch.FloatTensor
) — Inputs to the perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (torch.FloatTensor
of shape batch_size, sequence_length
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
head_mask (torch.FloatTensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
1 indicates the head is not masked,
0 indicates the head is masked.
output_attentions (bool
, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions
under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor
of shape (batch_size, sequence_length)
, optional) — Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size]
(see input_ids
docstring) Tokens with indices set to -100
are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]
Returns
transformers.models.perceiver.modeling_perceiver.PerceiverMaskedLMOutput or tuple(torch.FloatTensor)
A transformers.models.perceiver.modeling_perceiver.PerceiverMaskedLMOutput or a tuple of torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the configuration (PerceiverConfig) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Masked language modeling (MLM) loss.
logits (torch.FloatTensor
of shape (batch_size, sequence_length, config.vocab_size)
) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
. Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, num_latents, num_latents)
. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
The PerceiverForMaskedLM forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
Copied
( config )
Parameters
config (PerceiverConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Example use of Perceiver for text classification. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
( inputs: typing.Optional[torch.Tensor] = Noneattention_mask: typing.Optional[torch.Tensor] = Nonehead_mask: typing.Optional[torch.Tensor] = Noneoutput_attentions: typing.Optional[bool] = Noneoutput_hidden_states: typing.Optional[bool] = Nonelabels: typing.Optional[torch.Tensor] = Nonereturn_dict: typing.Optional[bool] = Noneinput_ids: typing.Optional[torch.Tensor] = None ) → transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)
Parameters
inputs (torch.FloatTensor
) — Inputs to the perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (torch.FloatTensor
of shape batch_size, sequence_length
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
head_mask (torch.FloatTensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
1 indicates the head is not masked,
0 indicates the head is masked.
output_attentions (bool
, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions
under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor
of shape (batch_size,)
, optional) — Labels for computing the classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]
. If config.num_labels == 1
a regression loss is computed (Mean-Square loss), If config.num_labels > 1
a classification loss is computed (Cross-Entropy).
Returns
transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)
A transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or a tuple of torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the configuration (PerceiverConfig) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor
of shape (batch_size, config.num_labels)
) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
. Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
The PerceiverForSequenceClassification forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
Copied
( config )
Parameters
config (PerceiverConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Example use of Perceiver for image classification, for tasks such as ImageNet.
This model uses learned position embeddings. In other words, this model is not given any privileged information about the structure of images. As shown in the paper, this model can achieve a top-1 accuracy of 72.7 on ImageNet.
PerceiverForImageClassificationLearned uses PerceiverImagePreprocessor (with prep_type="conv1x1"
) to preprocess the input images, and PerceiverClassificationDecoder to decode the latent representation of PerceiverModel into classification logits.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
( inputs: typing.Optional[torch.Tensor] = Noneattention_mask: typing.Optional[torch.Tensor] = Nonehead_mask: typing.Optional[torch.Tensor] = Noneoutput_attentions: typing.Optional[bool] = Noneoutput_hidden_states: typing.Optional[bool] = Nonelabels: typing.Optional[torch.Tensor] = Nonereturn_dict: typing.Optional[bool] = Nonepixel_values: typing.Optional[torch.Tensor] = None ) → transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)
Parameters
inputs (torch.FloatTensor
) — Inputs to the perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (torch.FloatTensor
of shape batch_size, sequence_length
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
head_mask (torch.FloatTensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
1 indicates the head is not masked,
0 indicates the head is masked.
output_attentions (bool
, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions
under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor
of shape (batch_size,)
, optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]
. If config.num_labels == 1
a regression loss is computed (Mean-Square loss), If config.num_labels > 1
a classification loss is computed (Cross-Entropy).
Returns
transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)
A transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or a tuple of torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the configuration (PerceiverConfig) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor
of shape (batch_size, config.num_labels)
) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
. Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
The PerceiverForImageClassificationLearned forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
Copied
( config )
Parameters
config (PerceiverConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Example use of Perceiver for image classification, for tasks such as ImageNet.
This model uses fixed 2D Fourier position embeddings. As shown in the paper, this model can achieve a top-1 accuracy of 79.0 on ImageNet, and 84.5 when pre-trained on a large-scale dataset (i.e. JFT).
PerceiverForImageClassificationLearned uses PerceiverImagePreprocessor (with prep_type="pixels"
) to preprocess the input images, and PerceiverClassificationDecoder to decode the latent representation of PerceiverModel into classification logits.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
( inputs: typing.Optional[torch.Tensor] = Noneattention_mask: typing.Optional[torch.Tensor] = Nonehead_mask: typing.Optional[torch.Tensor] = Noneoutput_attentions: typing.Optional[bool] = Noneoutput_hidden_states: typing.Optional[bool] = Nonelabels: typing.Optional[torch.Tensor] = Nonereturn_dict: typing.Optional[bool] = Nonepixel_values: typing.Optional[torch.Tensor] = None ) → transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)
Parameters
inputs (torch.FloatTensor
) — Inputs to the perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (torch.FloatTensor
of shape batch_size, sequence_length
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
head_mask (torch.FloatTensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
1 indicates the head is not masked,
0 indicates the head is masked.
output_attentions (bool
, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions
under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor
of shape (batch_size,)
, optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]
. If config.num_labels == 1
a regression loss is computed (Mean-Square loss), If config.num_labels > 1
a classification loss is computed (Cross-Entropy).
Returns
transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)
A transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or a tuple of torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the configuration (PerceiverConfig) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor
of shape (batch_size, config.num_labels)
) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
. Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
The PerceiverForImageClassificationFourier forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
Copied
( config )
Parameters
config (PerceiverConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Example use of Perceiver for image classification, for tasks such as ImageNet.
This model uses a 2D conv+maxpool preprocessing network. As shown in the paper, this model can achieve a top-1 accuracy of 82.1 on ImageNet.
PerceiverForImageClassificationLearned uses PerceiverImagePreprocessor (with prep_type="conv"
) to preprocess the input images, and PerceiverClassificationDecoder to decode the latent representation of PerceiverModel into classification logits.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
( inputs: typing.Optional[torch.Tensor] = Noneattention_mask: typing.Optional[torch.Tensor] = Nonehead_mask: typing.Optional[torch.Tensor] = Noneoutput_attentions: typing.Optional[bool] = Noneoutput_hidden_states: typing.Optional[bool] = Nonelabels: typing.Optional[torch.Tensor] = Nonereturn_dict: typing.Optional[bool] = Nonepixel_values: typing.Optional[torch.Tensor] = None ) → transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)
Parameters
inputs (torch.FloatTensor
) — Inputs to the perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (torch.FloatTensor
of shape batch_size, sequence_length
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
head_mask (torch.FloatTensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
1 indicates the head is not masked,
0 indicates the head is masked.
output_attentions (bool
, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions
under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor
of shape (batch_size,)
, optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]
. If config.num_labels == 1
a regression loss is computed (Mean-Square loss), If config.num_labels > 1
a classification loss is computed (Cross-Entropy).
Returns
transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)
A transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or a tuple of torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the configuration (PerceiverConfig) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor
of shape (batch_size, config.num_labels)
) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
. Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
The PerceiverForImageClassificationConvProcessing forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
Copied
( config )
Parameters
config (PerceiverConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Example use of Perceiver for optical flow, for tasks such as Sintel and KITTI. PerceiverForOpticalFlow uses PerceiverImagePreprocessor (with prep_type=“patches”) to preprocess the input images, and PerceiverOpticalFlowDecoder to decode the latent representation of PerceiverModel.
As input, one concatenates 2 subsequent frames along the channel dimension and extract a 3 x 3 patch around each pixel (leading to 3 x 3 x 3 x 2 = 54 values for each pixel). Fixed Fourier position encodings are used to encode the position of each pixel in the patch. Next, one applies the Perceiver encoder. To decode, one queries the latent representation using the same encoding used for the input.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
( inputs: typing.Optional[torch.Tensor] = Noneattention_mask: typing.Optional[torch.Tensor] = Nonehead_mask: typing.Optional[torch.Tensor] = Noneoutput_attentions: typing.Optional[bool] = Noneoutput_hidden_states: typing.Optional[bool] = Nonelabels: typing.Optional[torch.Tensor] = Nonereturn_dict: typing.Optional[bool] = None ) → transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)
Parameters
inputs (torch.FloatTensor
) — Inputs to the perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (torch.FloatTensor
of shape batch_size, sequence_length
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
head_mask (torch.FloatTensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
1 indicates the head is not masked,
0 indicates the head is masked.
output_attentions (bool
, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions
under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor
of shape (batch_size,)
, optional) — Labels for computing the optical flow loss. Indices should be in [0, ..., config.num_labels - 1]
.
Returns
transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)
A transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or a tuple of torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the configuration (PerceiverConfig) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor
of shape (batch_size, config.num_labels)
) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
. Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
The PerceiverForOpticalFlow forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
Copied
( config: PerceiverConfig )
Parameters
config (PerceiverConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Example use of Perceiver for multimodal (video) autoencoding, for tasks such as Kinetics-700.
PerceiverForMultimodalAutoencoding uses PerceiverMultimodalPreprocessor to preprocess the 3 modalities: images, audio and class labels. This preprocessor uses modality-specific preprocessors to preprocess every modality separately, after which they are concatenated. Trainable position embeddings are used to pad each modality to the same number of channels to make concatenation along the time dimension possible. Next, one applies the Perceiver encoder.
PerceiverMultimodalDecoder is used to decode the latent representation of PerceiverModel. This decoder uses each modality-specific decoder to construct queries. The decoder queries are created based on the inputs after preprocessing. However, autoencoding an entire video in a single forward pass is computationally infeasible, hence one only uses parts of the decoder queries to do cross-attention with the latent representation. This is determined by the subsampled indices for each modality, which can be provided as additional input to the forward pass of PerceiverForMultimodalAutoencoding.
PerceiverMultimodalDecoder also pads the decoder queries of the different modalities to the same number of channels, in order to concatenate them along the time dimension. Next, cross-attention is performed with the latent representation of PerceiverModel.
Finally, ~models.perceiver.modeling_perceiver.PerceiverMultiModalPostprocessor
is used to turn this tensor into an actual video. It first splits up the output into the different modalities, and then applies the respective postprocessor for each modality.
Note that, by masking the classification label during evaluation (i.e. simply providing a tensor of zeros for the “label” modality), this auto-encoding model becomes a Kinetics 700 video classifier.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
( inputs: typing.Optional[torch.Tensor] = Noneattention_mask: typing.Optional[torch.Tensor] = Nonesubsampled_output_points: typing.Union[typing.Dict[str, torch.Tensor], NoneType] = Nonehead_mask: typing.Optional[torch.Tensor] = Noneoutput_attentions: typing.Optional[bool] = Noneoutput_hidden_states: typing.Optional[bool] = Nonelabels: typing.Optional[torch.Tensor] = Nonereturn_dict: typing.Optional[bool] = None ) → transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)
Parameters
inputs (torch.FloatTensor
) — Inputs to the perceiver. Can be anything: images, text, audio, video, etc.
attention_mask (torch.FloatTensor
of shape batch_size, sequence_length
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
head_mask (torch.FloatTensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
1 indicates the head is not masked,
0 indicates the head is masked.
output_attentions (bool
, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions
under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor
of shape (batch_size,)
, optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]
. If config.num_labels == 1
a regression loss is computed (Mean-Square loss), If config.num_labels > 1
a classification loss is computed (Cross-Entropy).
Returns
transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)
A transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or a tuple of torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the configuration (PerceiverConfig) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor
of shape (batch_size, config.num_labels)
) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
. Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
. Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
The PerceiverForMultimodalAutoencoding forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
Copied