LayoutLMv3
The LayoutLMv3 model was proposed in LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. LayoutLMv3 simplifies LayoutLMv2 by using patch embeddings (as in ViT) instead of leveraging a CNN backbone, and pre-trains the model on 3 objectives: masked language modeling (MLM), masked image modeling (MIM) and word-patch alignment (WPA).
The abstract from the paper is the following:
Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.
Tips:
In terms of data processing, LayoutLMv3 is identical to its predecessor LayoutLMv2, except that:
images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format.
text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.
Due to these differences in data preprocessing, one can use LayoutLMv3Processor, which internally combines a LayoutLMv3ImageProcessor (for the image modality) and a LayoutLMv3Tokenizer/LayoutLMv3TokenizerFast (for the text modality) to prepare all data for the model.
Regarding usage of LayoutLMv3Processor, we refer to the usage guide of its predecessor.
Demo notebooks for LayoutLMv3 can be found here.
Demo scripts can be found here.
LayoutLMv3 architecture. Taken from the original paper.
This model was contributed by nielsr. The TensorFlow version of this model was added by chriskoo, tokec, and lre. The original code can be found here.
A list of official BOINC AI and community (indicated by 🌎) resources to help you get started with LayoutLMv3. If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
LayoutLMv3 is nearly identical to LayoutLMv2, so we’ve also included LayoutLMv2 resources you can adapt for LayoutLMv3 tasks. For these notebooks, take care to use LayoutLMv2Processor instead when preparing data for the model!
Text Classification
LayoutLMv2ForSequenceClassification is supported by this notebook.
Token Classification
LayoutLMv3ForTokenClassification is supported by this example script and notebook.
A notebook for how to perform inference with LayoutLMv2ForTokenClassification and a notebook for how to perform inference when no labels are available with LayoutLMv2ForTokenClassification.
A notebook for how to finetune LayoutLMv2ForTokenClassification with the 🌍 Trainer.
Question Answering
LayoutLMv2ForQuestionAnswering is supported by this notebook.
Document question answering
LayoutLMv3Config( vocab_size = 50265, hidden_size = 768, num_hidden_layers = 12, num_attention_heads = 12, intermediate_size = 3072, hidden_act = 'gelu', hidden_dropout_prob = 0.1, attention_probs_dropout_prob = 0.1, max_position_embeddings = 512, type_vocab_size = 2, initializer_range = 0.02, layer_norm_eps = 1e-05, pad_token_id = 1, bos_token_id = 0, eos_token_id = 2, max_2d_position_embeddings = 1024, coordinate_size = 128, shape_size = 128, has_relative_attention_bias = True, rel_pos_bins = 32, max_rel_pos = 128, rel_2d_pos_bins = 64, max_rel_2d_pos = 256, has_spatial_attention_bias = True, text_embed = True, visual_embed = True, input_size = 224, num_channels = 3, patch_size = 16, classifier_dropout = None, **kwargs )
Parameters
vocab_size (int, optional, defaults to 50265) — Vocabulary size of the LayoutLMv3 model. Defines the number of different tokens that can be represented by the input_ids passed when calling LayoutLMv3Model.
hidden_size (int, optional, defaults to 768) — Dimension of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 3072) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (float, optional, defaults to 0.1) — The dropout ratio for the attention probabilities.
max_position_embeddings (int, optional, defaults to 512) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (int, optional, defaults to 2) — The vocabulary size of the token_type_ids passed when calling LayoutLMv3Model.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-5) — The epsilon used by the layer normalization layers.
max_2d_position_embeddings (int, optional, defaults to 1024) — The maximum value that the 2D position embedding might ever be used with. Typically set this to something large just in case (e.g., 1024).
coordinate_size (int, optional, defaults to 128) — Dimension of the coordinate embeddings.
shape_size (int, optional, defaults to 128) — Dimension of the width and height embeddings.
has_relative_attention_bias (bool, optional, defaults to True) — Whether or not to use a relative attention bias in the self-attention mechanism.
rel_pos_bins (int, optional, defaults to 32) — The number of relative position bins to be used in the self-attention mechanism.
max_rel_pos (int, optional, defaults to 128) — The maximum number of relative positions to be used in the self-attention mechanism.
max_rel_2d_pos (int, optional, defaults to 256) — The maximum number of relative 2D positions in the self-attention mechanism.
rel_2d_pos_bins (int, optional, defaults to 64) — The number of 2D relative position bins in the self-attention mechanism.
has_spatial_attention_bias (bool, optional, defaults to True) — Whether or not to use a spatial attention bias in the self-attention mechanism.
visual_embed (bool, optional, defaults to True) — Whether or not to add patch embeddings.
input_size (int, optional, defaults to 224) — The size (resolution) of the images.
num_channels (int, optional, defaults to 3) — The number of channels of the images.
patch_size (int, optional, defaults to 16) — The size (resolution) of the patches.
classifier_dropout (float, optional) — The dropout ratio for the classification head.
This is the configuration class to store the configuration of a LayoutLMv3Model. It is used to instantiate a LayoutLMv3 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the LayoutLMv3 microsoft/layoutlmv3-base architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
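As a rough illustration of how a few of the defaults listed above relate to each other, the sketch below mirrors a handful of fields in a plain dataclass. ToyLayoutLMv3Config is a hypothetical stand-in written for this example, not the library class; only the field names and default values are taken from the parameter list.

```python
from dataclasses import dataclass


@dataclass
class ToyLayoutLMv3Config:
    """Hypothetical stand-in holding a few documented defaults (not the real class)."""

    hidden_size: int = 768
    num_attention_heads: int = 12
    input_size: int = 224
    patch_size: int = 16

    def num_patches(self) -> int:
        # A square image is cut into non-overlapping square patches.
        per_side = self.input_size // self.patch_size
        return per_side * per_side

    def head_dim(self) -> int:
        # Each attention head works on an equal slice of the hidden dimension.
        return self.hidden_size // self.num_attention_heads


config = ToyLayoutLMv3Config()
print(config.num_patches())  # 196
print(config.head_dim())     # 64
```

With the defaults, a 224 x 224 image and 16 x 16 patches give 14 patches per side, which is where the 196 comes from.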
( *args, **kwargs )
__call__
( images, **kwargs )
Preprocess an image or a batch of images.
LayoutLMv3ImageProcessor( do_resize: bool = True, size: typing.Dict[str, int] = None, resample: Resampling = <Resampling.BILINEAR: 2>, do_rescale: bool = True, rescale_value: float = 0.00392156862745098, do_normalize: bool = True, image_mean: typing.Union[float, typing.Iterable[float]] = None, image_std: typing.Union[float, typing.Iterable[float]] = None, apply_ocr: bool = True, ocr_lang: typing.Optional[str] = None, tesseract_config: typing.Optional[str] = '', **kwargs )
Parameters
do_resize (bool, optional, defaults to True) — Whether to resize the image’s (height, width) dimensions to (size["height"], size["width"]). Can be overridden by do_resize in preprocess.
size (Dict[str, int], optional, defaults to {"height": 224, "width": 224}) — Size of the image after resizing. Can be overridden by size in preprocess.
resample (PILImageResampling, optional, defaults to PILImageResampling.BILINEAR) — Resampling filter to use if resizing the image. Can be overridden by resample in preprocess.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image’s pixel values by the specified rescale_value. Can be overridden by do_rescale in preprocess.
rescale_factor (float, optional, defaults to 1 / 255) — Value by which the image’s pixel values are rescaled. Can be overridden by rescale_factor in preprocess.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (Iterable[float] or float, optional, defaults to IMAGENET_STANDARD_MEAN) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (Iterable[float] or float, optional, defaults to IMAGENET_STANDARD_STD) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
apply_ocr (bool, optional, defaults to True) — Whether to apply the Tesseract OCR engine to get words + normalized bounding boxes. Can be overridden by the apply_ocr parameter in the preprocess method.
ocr_lang (str, optional) — The language, specified by its ISO code, to be used by the Tesseract OCR engine. By default, English is used. Can be overridden by the ocr_lang parameter in the preprocess method.
tesseract_config (str, optional) — Any additional custom configuration flags that are forwarded to the config parameter when calling Tesseract. For example: '--psm 6'. Can be overridden by the tesseract_config parameter in the preprocess method.
Constructs a LayoutLMv3 image processor.
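The rescaling and normalization steps described by do_rescale and do_normalize amount to simple per-channel arithmetic. The sketch below shows that arithmetic on a single RGB pixel in plain Python. It is an illustration only, not the processor's implementation: the helper name is made up, and the mean and standard deviation of 0.5 per channel are assumed values chosen for the example.

```python
from typing import List, Sequence

# Assumed values for illustration: the documented default rescale factor
# (1 / 255) and a per-channel mean/std of 0.5.
RESCALE_FACTOR = 1 / 255
ASSUMED_MEAN = (0.5, 0.5, 0.5)
ASSUMED_STD = (0.5, 0.5, 0.5)


def rescale_and_normalize(
    pixel: Sequence[int],
    mean: Sequence[float] = ASSUMED_MEAN,
    std: Sequence[float] = ASSUMED_STD,
) -> List[float]:
    """Map one RGB pixel from 0..255 integers to normalized floats."""
    # Step 1 (do_rescale): bring values into the range [0, 1].
    rescaled = [value * RESCALE_FACTOR for value in pixel]
    # Step 2 (do_normalize): subtract the mean and divide by the std per channel.
    return [(value - m) / s for value, m, s in zip(rescaled, mean, std)]


print(rescale_and_normalize((0, 128, 255)))
```

Under these assumed statistics, a pixel value of 0 maps to -1 and a value of 255 maps to approximately 1.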
preprocess
( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]], do_resize: bool = None, size: typing.Dict[str, int] = None, resample = None, do_rescale: bool = None, rescale_factor: float = None, do_normalize: bool = None, image_mean: typing.Union[float, typing.Iterable[float]] = None, image_std: typing.Union[float, typing.Iterable[float]] = None, apply_ocr: bool = None, ocr_lang: typing.Optional[str] = None, tesseract_config: typing.Optional[str] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'>, input_data_format: typing.Union[transformers.image_utils.ChannelDimension, str, NoneType] = None, **kwargs )
Parameters
images (ImageInput) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
size (Dict[str, int], optional, defaults to self.size) — Desired size of the output image after applying resize.
resample (int, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the PILImageResampling filters. Only has an effect if do_resize is set to True.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image pixel values between [0, 1].
rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to apply to the image pixel values. Only has an effect if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
image_mean (float or Iterable[float], optional, defaults to self.image_mean) — Mean values to be used for normalization. Only has an effect if do_normalize is set to True.
image_std (float or Iterable[float], optional, defaults to self.image_std) — Standard deviation values to be used for normalization. Only has an effect if do_normalize is set to True.
apply_ocr (bool, optional, defaults to self.apply_ocr) — Whether to apply the Tesseract OCR engine to get words + normalized bounding boxes.
ocr_lang (str, optional, defaults to self.ocr_lang) — The language, specified by its ISO code, to be used by the Tesseract OCR engine. By default, English is used.
tesseract_config (str, optional, defaults to self.tesseract_config) — Any additional custom configuration flags that are forwarded to the config parameter when calling Tesseract.
return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
Unset: Return a list of np.ndarray.
TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
ChannelDimension.FIRST: image in (num_channels, height, width) format.
ChannelDimension.LAST: image in (height, width, num_channels) format.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
"channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
"channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
"none" or ChannelDimension.NONE: image in (height, width) format.
Preprocess an image or batch of images.
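The difference between the "channels_first" and "channels_last" layouts mentioned for data_format and input_data_format can be shown on a tiny nested-list image. The helper below is a hypothetical pure-Python sketch of the layout change, not the routine the image processor uses.

```python
from typing import List

Image3D = List[List[List[int]]]


def channels_last_to_first(image: Image3D) -> Image3D:
    """Convert (height, width, num_channels) nested lists to (num_channels, height, width)."""
    height = len(image)
    width = len(image[0])
    num_channels = len(image[0][0])
    return [
        [[image[row][col][channel] for col in range(width)] for row in range(height)]
        for channel in range(num_channels)
    ]


# A 2 x 2 RGB image in channels-last layout: each innermost list is one pixel.
hwc = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
chw = channels_last_to_first(hwc)
print(chw[0])  # first (red) channel: [[1, 4], [7, 10]]
```

After the conversion, the outermost index selects a channel, and each channel is a height x width grid.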
LayoutLMv3Tokenizer( vocab_file, merges_file, errors = 'replace', bos_token = '<s>', eos_token = '</s>', sep_token = '</s>', cls_token = '<s>', unk_token = '<unk>', pad_token = '<pad>', mask_token = '<mask>', add_prefix_space = True, cls_token_box = [0, 0, 0, 0], sep_token_box = [0, 0, 0, 0], pad_token_box = [0, 0, 0, 0], pad_token_label = -100, only_label_first_subword = True, **kwargs )
Parameters
vocab_file (str) — Path to the vocabulary file.
merges_file (str) — Path to the merges file.
errors (str, optional, defaults to "replace") — Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.
bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the cls_token.
eos_token (str, optional, defaults to "</s>") — The end of sequence token.
When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.
sep_token (str, optional, defaults to "</s>") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
cls_token (str, optional, defaults to "<s>") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
mask_token (str, optional, defaults to "<mask>") — The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
add_prefix_space (bool, optional, defaults to True) — Whether or not to add an initial space to the input. This allows the leading word to be treated just like any other word. (The RoBERTa tokenizer detects the beginning of words by the preceding space.)
cls_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [CLS] token.
sep_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [SEP] token.
pad_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [PAD] token.
pad_token_label (int, optional, defaults to -100) — The label to use for padding tokens. Defaults to -100, which is the ignore_index of PyTorch’s CrossEntropyLoss.
only_label_first_subword (bool, optional, defaults to True) — Whether or not to only label the first subword, in case word labels are provided.
Construct a LayoutLMv3 tokenizer. Based on RobertaTokenizer (Byte Pair Encoding or BPE). LayoutLMv3Tokenizer can be used to turn words, word-level bounding boxes and optional word labels into token-level input_ids, attention_mask, token_type_ids, bbox, and optional labels (for token classification).
This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
LayoutLMv3Tokenizer runs end-to-end tokenization with byte-pair encoding (BPE). It also turns the word-level bounding boxes into token-level bounding boxes.
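The way word-level boxes and labels become token-level ones, including the effect of only_label_first_subword and pad_token_label, can be sketched in a few lines. The code below is a simplified illustration under stated assumptions: toy_subwords is a hypothetical splitter that cuts words into fixed-size chunks and stands in for the real BPE step, and align_to_tokens is a made-up helper name rather than a library function.

```python
from typing import List, Tuple

PAD_TOKEN_LABEL = -100  # documented default for pad_token_label


def toy_subwords(word: str, size: int = 3) -> List[str]:
    """Hypothetical splitter: cut a word into fixed-size chunks (not real BPE)."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def align_to_tokens(
    words: List[str],
    boxes: List[List[int]],
    word_labels: List[int],
    only_label_first_subword: bool = True,
) -> Tuple[List[str], List[List[int]], List[int]]:
    """Expand word-level boxes and labels to one entry per subword token."""
    tokens, token_boxes, token_labels = [], [], []
    for word, box, label in zip(words, boxes, word_labels):
        for index, piece in enumerate(toy_subwords(word)):
            tokens.append(piece)
            token_boxes.append(box)  # every subword inherits its word's box
            if index == 0 or not only_label_first_subword:
                token_labels.append(label)
            else:
                # Later subwords get the ignore label so the loss skips them.
                token_labels.append(PAD_TOKEN_LABEL)
    return tokens, token_boxes, token_labels


tokens, token_boxes, token_labels = align_to_tokens(
    words=["Invoice", "total"],
    boxes=[[10, 10, 200, 40], [210, 10, 300, 40]],
    word_labels=[1, 2],
)
print(tokens)        # ['Inv', 'oic', 'e', 'tot', 'al']
print(token_labels)  # [1, -100, -100, 2, -100]
```

Special tokens are left out of this sketch; in the real tokenizer they receive the boxes given by cls_token_box, sep_token_box and pad_token_box.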
__call__
( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]], text_pair: typing.Union[typing.List[str], typing.List[typing.List[str]], NoneType] = None, boxes: typing.Union[typing.List[typing.List[int]], typing.List[typing.List[typing.List[int]]]] = None, word_labels: typing.Union[typing.List[int], typing.List[typing.List[int]], NoneType] = None, add_special_tokens: bool = True, padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False, truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = None, max_length: typing.Optional[int] = None, stride: int = 0, pad_to_multiple_of: typing.Optional[int] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, return_token_type_ids: typing.Optional[bool] = None, return_attention_mask: typing.Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, **kwargs )
Parameters
text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string, a list of strings (words of a single example or questions of a batch of examples) or a list of list of strings (batch of words).
text_pair (List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence should be a list of strings (pretokenized string).
boxes (List[List[int]], List[List[List[int]]]) — Word-level bounding boxes. Each bounding box should be normalized to be on a 0-1000 scale.
word_labels (List[int], List[List[int]], optional) — Word-level integer labels (for token classification tasks such as FUNSD, CORD).
add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:
True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:
True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
pad_to_multiple_of (int, optional) — If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).
return_tensors (str or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:
'tf': Return TensorFlow tf.constant objects.
'pt': Return PyTorch torch.Tensor objects.
'np': Return Numpy np.ndarray objects.
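To make the interaction between max_length, stride and padding more concrete, the sketch below splits a long list of token ids into overlapping windows and right-pads the last one. It is a simplified illustration, not the tokenizer's own algorithm: the helper names are hypothetical, special tokens are ignored, and right-side padding with id 1 (the pad_token_id default from the configuration signature) is an assumption made for the example.

```python
from typing import List

PAD_TOKEN_ID = 1  # assumed pad id, taken from the pad_token_id default in the configuration


def split_with_stride(token_ids: List[int], max_length: int, stride: int = 0) -> List[List[int]]:
    """Cut a long sequence into windows of at most max_length tokens.

    Consecutive windows share `stride` tokens, so context at a window
    boundary is not lost. Special tokens are ignored in this toy version.
    """
    if max_length <= 0 or not 0 <= stride < max_length:
        raise ValueError("need max_length > 0 and 0 <= stride < max_length")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
        start += max_length - stride
    return windows


def pad_to_max_length(window: List[int], max_length: int, pad_id: int = PAD_TOKEN_ID) -> List[int]:
    """Right-pad a window so that every window has the same length."""
    return window + [pad_id] * (max_length - len(window))


windows = split_with_stride(list(range(10, 19)), max_length=4, stride=1)
padded = [pad_to_max_length(window, 4) for window in windows]
print(padded)  # [[10, 11, 12, 13], [13, 14, 15, 16], [16, 17, 18, 1]]
```

Each window after the first starts with the last token of the previous one, which is the overlap that stride=1 asks for.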
Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences with word-level normalized bounding boxes and optional labels.
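Since the boxes argument expects coordinates on a 0-1000 scale, pixel-space boxes from an OCR engine need to be rescaled by the page dimensions first. A minimal sketch of that conversion, assuming boxes are given as (x0, y0, x1, y1) in pixels; normalize_box is a helper written for this example, not part of the tokenizer:

```python
from typing import List, Sequence


def normalize_box(box: Sequence[float], width: float, height: float) -> List[int]:
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


# A box on a page that is 500 pixels wide and 1000 pixels tall.
print(normalize_box([50, 100, 250, 150], width=500, height=1000))  # [100, 100, 500, 150]
```

Horizontal coordinates are divided by the page width and vertical ones by the page height, so the result does not depend on the scan resolution.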
save_vocabulary
( save_directory: str, filename_prefix: typing.Optional[str] = None )
LayoutLMv3TokenizerFast( vocab_file = None, merges_file = None, tokenizer_file = None, errors = 'replace', bos_token = '<s>', eos_token = '</s>', sep_token = '</s>', cls_token = '<s>', unk_token = '<unk>', pad_token = '<pad>', mask_token = '<mask>', add_prefix_space = True, trim_offsets = True, cls_token_box = [0, 0, 0, 0], sep_token_box = [0, 0, 0, 0], pad_token_box = [0, 0, 0, 0], pad_token_label = -100, only_label_first_subword = True, **kwargs )
Parameters
vocab_file (str) — Path to the vocabulary file.
merges_file (str) — Path to the merges file.
errors (str, optional, defaults to "replace") — Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.
bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the cls_token.
eos_token (str, optional, defaults to "</s>") — The end of sequence token.
When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.
sep_token (str, optional, defaults to "</s>") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
cls_token (str, optional, defaults to "<s>") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
mask_token (str, optional, defaults to "<mask>") — The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
add_prefix_space (bool, optional, defaults to True) — Whether or not to add an initial space to the input. This allows the leading word to be treated just like any other word. (The RoBERTa tokenizer detects the beginning of words by the preceding space.)
trim_offsets (bool, optional, defaults to True) — Whether the post processing step should trim offsets to avoid including whitespaces.
cls_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [CLS] token.
sep_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [SEP] token.
pad_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [PAD] token.
pad_token_label (int, optional, defaults to -100) — The label to use for padding tokens. Defaults to -100, which is the ignore_index of PyTorch’s CrossEntropyLoss.
only_label_first_subword (bool, optional, defaults to True) — Whether or not to only label the first subword, in case word labels are provided.
Construct a “fast” LayoutLMv3 tokenizer (backed by BOINC AI’s tokenizers library). Based on BPE.
This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
__call__
( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]], text_pair: typing.Union[typing.List[str], typing.List[typing.List[str]], NoneType] = None, boxes: typing.Union[typing.List[typing.List[int]], typing.List[typing.List[typing.List[int]]]] = None, word_labels: typing.Union[typing.List[int], typing.List[typing.List[int]], NoneType] = None, add_special_tokens: bool = True, padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False, truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = None, max_length: typing.Optional[int] = None, stride: int = 0, pad_to_multiple_of: typing.Optional[int] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, return_token_type_ids: typing.Optional[bool] = None, return_attention_mask: typing.Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, **kwargs )
Parameters
text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string, a list of strings (words of a single example or questions of a batch of examples) or a list of list of strings (batch of words).
text_pair (List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence should be a list of strings (pretokenized string).
boxes (List[List[int]], List[List[List[int]]]) — Word-level bounding boxes. Each bounding box should be normalized to be on a 0-1000 scale.
word_labels (List[int], List[List[int]], optional) — Word-level integer labels (for token classification tasks such as FUNSD, CORD).
add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
padding (bool
, str
or PaddingStrategy, optional, defaults to False
) — Activates and controls padding. Accepts the following values:
True
or 'longest'
: Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
'max_length'
: Pad to a maximum length specified with the argument max_length
or to the maximum acceptable input length for the model if that argument is not provided.
False
or 'do_not_pad'
(default): No padding (i.e., can output a batch with sequences of different lengths).
truncation (bool
, str
or TruncationStrategy, optional, defaults to False
) — Activates and controls truncation. Accepts the following values:
True
or 'longest_first'
: Truncate to a maximum length specified with the argument max_length
or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
'only_first'
: Truncate to a maximum length specified with the argument max_length
or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
'only_second'
: Truncate to a maximum length specified with the argument max_length
or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False
or 'do_not_truncate'
(default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
max_length (int
, optional) — Controls the maximum length to use by one of the truncation/padding parameters.
If left unset or set to None
, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
stride (int
, optional, defaults to 0) — If set to a number along with max_length
, the overflowing tokens returned when return_overflowing_tokens=True
will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
pad_to_multiple_of (int
, optional) — If set, pads the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0
(Volta).
return_tensors (str
or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:
'tf'
: Return TensorFlow tf.constant
objects.
'pt'
: Return PyTorch torch.Tensor
objects.
'np'
: Return Numpy np.ndarray
objects.
Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences with word-level normalized bounding boxes and optional labels.
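The boxes argument expects coordinates on a 0-1000 scale rather than raw pixels. A minimal sketch of that normalization follows, assuming boxes are given as (x0, y0, x1, y1) in pixels and that the page width and height are known; normalize_box and the sample values are illustrative, not part of the library.

```python
from typing import List, Sequence


def normalize_box(box: Sequence[float], width: float, height: float) -> List[int]:
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range (hypothetical helper)."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


# Illustrative page size and word boxes in pixels.
page_width, page_height = 800, 1000
pixel_boxes = [(80, 100, 240, 130), (400, 500, 800, 1000)]
normalized = [normalize_box(b, page_width, page_height) for b in pixel_boxes]
```

Each word keeps one box; the tokenizer is then responsible for repeating that box across the subword tokens of the word.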
( image_processor = None, tokenizer = None, **kwargs )
Parameters
image_processor (LayoutLMv3ImageProcessor
) — An instance of LayoutLMv3ImageProcessor. The image processor is a required input.
tokenizer (LayoutLMv3Tokenizer
or LayoutLMv3TokenizerFast
) — An instance of LayoutLMv3Tokenizer or LayoutLMv3TokenizerFast. The tokenizer is a required input.
Constructs a LayoutLMv3 processor which combines a LayoutLMv3 image processor and a LayoutLMv3 tokenizer into a single processor.
LayoutLMv3Processor offers all the functionalities you need to prepare data for the model.
It first uses LayoutLMv3ImageProcessor to resize and normalize document images, and optionally applies OCR to get words and normalized bounding boxes. These are then provided to LayoutLMv3Tokenizer or LayoutLMv3TokenizerFast, which turns the words and bounding boxes into token-level input_ids
, attention_mask
, token_type_ids
, bbox
. Optionally, one can provide integer word_labels
, which are turned into token-level labels
for token classification tasks (such as FUNSD, CORD).
__call__
( images, text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None, text_pair: typing.Union[typing.List[str], typing.List[typing.List[str]], NoneType] = None, boxes: typing.Union[typing.List[typing.List[int]], typing.List[typing.List[typing.List[int]]]] = None, word_labels: typing.Union[typing.List[int], typing.List[typing.List[int]], NoneType] = None, add_special_tokens: bool = True, padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False, truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = None, max_length: typing.Optional[int] = None, stride: int = 0, pad_to_multiple_of: typing.Optional[int] = None, return_token_type_ids: typing.Optional[bool] = None, return_attention_mask: typing.Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, **kwargs )
This method first forwards the images argument to LayoutLMv3ImageProcessor.__call__(). If LayoutLMv3ImageProcessor was initialized with apply_ocr set to True, it passes the words and bounding boxes obtained from OCR, along with the additional arguments, to LayoutLMv3Tokenizer.__call__() and returns the output together with the resized and normalized pixel_values. If LayoutLMv3ImageProcessor was initialized with apply_ocr set to False, it passes the words (text/text_pair) and boxes specified by the user, along with the additional arguments, to LayoutLMv3Tokenizer.__call__() and returns the output together with the resized and normalized pixel_values.
Please refer to the docstring of the above two methods for more information.
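The two modes described above can be sketched as a small dispatch routine. This is a simplified illustration under stated assumptions, not the processor's actual code: fake_image_processor, fake_tokenizer and prepare are hypothetical stand-ins, the OCR output is hard-coded, and the question-plus-OCR case (text together with text_pair) is not modeled.

```python
from typing import Any, Dict, List, Optional


def fake_image_processor(image: Any, apply_ocr: bool) -> Dict[str, Any]:
    """Hypothetical stand-in: returns pixel values and, with OCR on, words and boxes."""
    features: Dict[str, Any] = {"pixel_values": "<resized and normalized image>"}
    if apply_ocr:
        features["words"] = ["Total", "42.00"]
        features["boxes"] = [[100, 900, 200, 920], [700, 900, 800, 920]]
    return features


def fake_tokenizer(text: List[str], boxes: List[List[int]]) -> Dict[str, Any]:
    """Hypothetical stand-in: one 'token' per word, boxes passed through."""
    return {"input_ids": list(range(len(text))), "bbox": boxes}


def prepare(
    image: Any,
    apply_ocr: bool,
    text: Optional[List[str]] = None,
    boxes: Optional[List[List[int]]] = None,
) -> Dict[str, Any]:
    """Sketch of the dispatch: OCR supplies words and boxes, or the caller does."""
    features = fake_image_processor(image, apply_ocr)
    if apply_ocr:
        if boxes is not None:
            raise ValueError("boxes come from OCR when apply_ocr=True")
        words, word_boxes = features["words"], features["boxes"]
    else:
        if text is None or boxes is None:
            raise ValueError("text and boxes are required when apply_ocr=False")
        words, word_boxes = text, boxes
    encoding = fake_tokenizer(words, word_boxes)
    # In both modes the image features are attached to the tokenizer output.
    encoding["pixel_values"] = features["pixel_values"]
    return encoding


with_ocr = prepare(image=None, apply_ocr=True)
without_ocr = prepare(image=None, apply_ocr=False, text=["Hi"], boxes=[[0, 0, 10, 10]])
```

In either mode the result carries both the token-level fields and pixel_values, which is what lets a single call prepare everything the model needs.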
( config )
Parameters
config (LayoutLMv3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare LayoutLMv3 Model transformer outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
( input_ids: typing.Optional[torch.LongTensor] = None, bbox: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.FloatTensor] = None, token_type_ids: typing.Optional[torch.LongTensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, head_mask: typing.Optional[torch.FloatTensor] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, pixel_values: typing.Optional[torch.FloatTensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor
of shape (batch_size, token_sequence_length)
) — Indices of input sequence tokens in the vocabulary.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
bbox (torch.LongTensor
of shape (batch_size, token_sequence_length, 4)
, optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings - 1]
. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
pixel_values (torch.FloatTensor
of shape (batch_size, num_channels, height, width)
) — Batch of document images. Each image is divided into patches of shape (num_channels, config.patch_size, config.patch_size)
and the total number of patches (=patch_sequence_length
) equals ((height / config.patch_size) * (width / config.patch_size))
.
attention_mask (torch.FloatTensor
of shape (batch_size, token_sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
token_type_ids (torch.LongTensor
of shape (batch_size, token_sequence_length)
, optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
:
0 corresponds to a sentence A token,
1 corresponds to a sentence B token.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
position_ids (torch.LongTensor
of shape (batch_size, token_sequence_length)
, optional) — Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]
.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
head_mask (torch.FloatTensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
1 indicates the head is not masked,
0 indicates the head is masked.
inputs_embeds (torch.FloatTensor
of shape (batch_size, token_sequence_length, hidden_size)
, optional) — Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
output_attentions (bool
, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions
under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutput or a tuple of torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
last_hidden_state (torch.FloatTensor
of shape (batch_size, sequence_length, hidden_size)
) — Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The LayoutLMv3Model forward method overrides the __call__
special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of calling this function directly, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
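The relation sequence_length = token_sequence_length + patch_sequence_length + 1 stated in the parameter notes can be checked with a few lines of arithmetic. The sketch below uses illustrative values (a 224x224 image, patch size 16, 512 text tokens) and assumes the image sides are exact multiples of the patch size; the helper names are hypothetical.

```python
def patch_sequence_length(height: int, width: int, patch_size: int) -> int:
    """Number of image patches for an image whose sides divide evenly by patch_size."""
    if height % patch_size or width % patch_size:
        raise ValueError("this sketch assumes image sides are multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


def total_sequence_length(
    token_sequence_length: int, height: int, width: int, patch_size: int
) -> int:
    """Text tokens + image patches + 1 extra position for the [CLS] token."""
    return token_sequence_length + patch_sequence_length(height, width, patch_size) + 1


# Illustrative values, chosen for the example only.
patches = patch_sequence_length(224, 224, 16)      # 14 * 14 patches
total = total_sequence_length(512, 224, 224, 16)   # 512 + 196 + 1
```

Under these assumptions the last dimension pair of last_hidden_state would be (709, hidden_size), while input_ids, bbox and attention_mask only cover the 512 text positions.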
( config )
Parameters
config (LayoutLMv3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
LayoutLMv3 Model with a sequence classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for document image classification tasks such as the RVL-CDIP dataset.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
( input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.FloatTensor] = None, token_type_ids: typing.Optional[torch.LongTensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, head_mask: typing.Optional[torch.FloatTensor] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, bbox: typing.Optional[torch.LongTensor] = None, pixel_values: typing.Optional[torch.LongTensor] = None ) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor
of shape (batch_size, sequence_length)
) — Indices of input sequence tokens in the vocabulary.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
bbox (torch.LongTensor
of shape (batch_size, sequence_length, 4)
, optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings - 1]
. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
pixel_values (torch.FloatTensor
of shape (batch_size, num_channels, height, width)
) — Batch of document images. Each image is divided into patches of shape (num_channels, config.patch_size, config.patch_size)
and the total number of patches (=patch_sequence_length
) equals ((height / config.patch_size) * (width / config.patch_size))
.
attention_mask (torch.FloatTensor
of shape (batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
token_type_ids (torch.LongTensor
of shape (batch_size, sequence_length)
, optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
:
0 corresponds to a sentence A token,
1 corresponds to a sentence B token.
position_ids (torch.LongTensor
of shape (batch_size, sequence_length)
, optional) — Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]
.
head_mask (torch.FloatTensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
1 indicates the head is not masked,
0 indicates the head is masked.
inputs_embeds (torch.FloatTensor
of shape (batch_size, sequence_length, hidden_size)
, optional) — Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
output_attentions (bool
, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions
under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.SequenceClassifierOutput or a tuple of torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor
of shape (batch_size, config.num_labels)
) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The LayoutLMv3ForSequenceClassification forward method overrides the __call__
special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of calling this function directly, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
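Because the returned logits are scores before SoftMax, turning them into a document class takes one more step. The following is a minimal pure-Python sketch with made-up logits for a batch of two documents and three classes; softmax and predict_classes are hypothetical helpers, and a real pipeline would normally do this on tensors.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_classes(batch_logits: List[List[float]]) -> List[int]:
    """Index of the highest-scoring class for each row of (batch_size, num_labels)."""
    return [max(range(len(row)), key=row.__getitem__) for row in batch_logits]


# Made-up logits of shape (batch_size=2, config.num_labels=3).
batch_logits = [[0.1, 2.3, -1.0], [1.5, 0.2, 0.3]]
probs = [softmax(row) for row in batch_logits]
predicted = predict_classes(batch_logits)
```

Softmax is only needed when probabilities are wanted; the argmax of the raw logits already gives the predicted class.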
( config )
Parameters
config (LayoutLMv3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
LayoutLMv3 Model with a token classification head on top (a linear layer on top of the final hidden states) e.g. for sequence labeling (information extraction) tasks such as FUNSD, SROIE, CORD and Kleister-NDA.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
( input_ids: typing.Optional[torch.LongTensor] = None, bbox: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.FloatTensor] = None, token_type_ids: typing.Optional[torch.LongTensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, head_mask: typing.Optional[torch.FloatTensor] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, pixel_values: typing.Optional[torch.LongTensor] = None ) → transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor
of shape (batch_size, sequence_length)
) — Indices of input sequence tokens in the vocabulary.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
bbox (torch.LongTensor
of shape (batch_size, sequence_length, 4)
, optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings - 1]
. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
pixel_values (torch.FloatTensor
of shape (batch_size, num_channels, height, width)
) — Batch of document images. Each image is divided into patches of shape (num_channels, config.patch_size, config.patch_size)
and the total number of patches (=patch_sequence_length
) equals ((height / config.patch_size) * (width / config.patch_size))
.
attention_mask (torch.FloatTensor
of shape (batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
token_type_ids (torch.LongTensor
of shape (batch_size, sequence_length)
, optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
:
0 corresponds to a sentence A token,
1 corresponds to a sentence B token.
position_ids (torch.LongTensor
of shape (batch_size, sequence_length)
, optional) — Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]
.
head_mask (torch.FloatTensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
1 indicates the head is not masked,
0 indicates the head is masked.
inputs_embeds (torch.FloatTensor
of shape (batch_size, sequence_length, hidden_size)
, optional) — Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
output_attentions (bool
, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions
under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor
of shape (batch_size, sequence_length)
, optional) — Labels for computing the token classification loss. Indices should be in [0, ..., config.num_labels - 1]
.
Returns
transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.TokenClassifierOutput or a tuple of torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Classification loss.
logits (torch.FloatTensor
of shape (batch_size, sequence_length, config.num_labels)
) — Classification scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The LayoutLMv3ForTokenClassification forward method overrides the __call__
special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of calling this function directly, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
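Token-level logits usually have to be reduced back to one prediction per word. A minimal sketch of that post-processing follows, assuming the token labels were produced with only the first subword labeled and -100 everywhere else; the logits are made up and word_level_predictions is a hypothetical helper.

```python
from typing import List

IGNORE_INDEX = -100  # label given to special tokens and non-first subwords


def word_level_predictions(
    token_logits: List[List[float]], token_labels: List[int]
) -> List[int]:
    """Keep the argmax prediction only at positions that carry a real label."""
    predictions: List[int] = []
    for logits, label in zip(token_logits, token_labels):
        if label == IGNORE_INDEX:
            continue  # special token or continuation subword
        predictions.append(max(range(len(logits)), key=logits.__getitem__))
    return predictions


# Made-up logits of shape (sequence_length=4, config.num_labels=3) for one example.
token_logits = [
    [2.0, 0.1, 0.1],  # special token
    [0.2, 3.0, 0.1],  # first subword of word 0
    [0.5, 0.4, 0.3],  # continuation of word 0
    [0.1, 0.2, 4.0],  # first subword of word 1
]
token_labels = [-100, 1, -100, 2]
word_preds = word_level_predictions(token_logits, token_labels)
```

At inference time, when no labels exist, the same filtering can be driven by any mask that marks the first subword of each word.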
( config )
Parameters
config (LayoutLMv3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
LayoutLMv3 Model with a span classification head on top for extractive question-answering tasks such as DocVQA (a linear layer on top of the text part of the hidden-states output to compute span start logits
and span end logits
).
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
( input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.FloatTensor] = None, token_type_ids: typing.Optional[torch.LongTensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, head_mask: typing.Optional[torch.FloatTensor] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, start_positions: typing.Optional[torch.LongTensor] = None, end_positions: typing.Optional[torch.LongTensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, bbox: typing.Optional[torch.LongTensor] = None, pixel_values: typing.Optional[torch.LongTensor] = None ) → transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor
of shape (batch_size, sequence_length)
) — Indices of input sequence tokens in the vocabulary.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
bbox (torch.LongTensor
of shape (batch_size, sequence_length, 4)
, optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings - 1]
. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
pixel_values (torch.FloatTensor
of shape (batch_size, num_channels, height, width)
) — Batch of document images. Each image is divided into patches of shape (num_channels, config.patch_size, config.patch_size)
and the total number of patches (=patch_sequence_length
) equals ((height / config.patch_size) * (width / config.patch_size))
.
attention_mask (torch.FloatTensor
of shape (batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
token_type_ids (torch.LongTensor
of shape (batch_size, sequence_length)
, optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
:
0 corresponds to a sentence A token,
1 corresponds to a sentence B token.
position_ids (torch.LongTensor
of shape (batch_size, sequence_length)
, optional) — Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]
.
head_mask (torch.FloatTensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
1 indicates the head is not masked,
0 indicates the head is masked.
inputs_embeds (torch.FloatTensor
of shape (batch_size, sequence_length, hidden_size)
, optional) — Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
output_attentions (bool
, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions
under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
start_positions (torch.LongTensor
of shape (batch_size,)
, optional) — Labels for position (index) of the start of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length
). Positions outside of the sequence are not taken into account for computing the loss.
end_positions (torch.LongTensor
of shape (batch_size,)
, optional) — Labels for position (index) of the end of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length
). Positions outside of the sequence are not taken into account for computing the loss.
Returns
transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.QuestionAnsweringModelOutput or a tuple of torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Total span extraction loss, the sum of the cross-entropy losses for the start and end positions.
start_logits (torch.FloatTensor
of shape (batch_size, sequence_length)
) — Span-start scores (before SoftMax).
end_logits (torch.FloatTensor
of shape (batch_size, sequence_length)
) — Span-end scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The LayoutLMv3ForQuestionAnswering forward method overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
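As a minimal sketch, independent of the library, the start_logits and end_logits described above are commonly turned into an answer span by picking the pair (start, end) with start <= end that maximizes the summed score. The helper name best_span, the max_answer_len cap, and the toy logit values are assumptions made for illustration; they are not part of the LayoutLMv3 API.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e], s <= e."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length cap.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Toy logits for a single sequence of 6 positions (hypothetical values).
start_logits = [0.1, 2.5, 0.3, -1.0, 0.2, 0.0]
end_logits = [0.0, 0.4, 0.1, 3.0, 0.5, -0.2]

start, end, score = best_span(start_logits, end_logits)
print(start, end, score)
```

With real model outputs, the selected indices would then be mapped back to tokens with the tokenizer that produced input_ids.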
( *args, **kwargs )
Parameters
config (LayoutLMv3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare LayoutLMv3 Model transformer outputting raw hidden-states without any specific head on top. This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, and pruning heads).
This model is also a tf.keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
TensorFlow models and layers in transformers
accept two formats as input:
having all inputs as keyword arguments (like PyTorch models), or
having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit()
things should “just work” for you - just pass your inputs and labels in any format that model.fit()
supports! If, however, you want to use the second format outside of Keras methods like fit()
and predict()
, such as when creating your own layers or models with the Keras Functional
API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:
a single Tensor with input_ids
only and nothing else: model(input_ids)
a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask])
or model([input_ids, attention_mask, token_type_ids])
a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!
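The three ways of packing inputs into the first positional argument can be sketched in plain Python as follows. This is an illustration of the convention only, not the library's implementation: gather_inputs is a hypothetical helper, strings stand in for tensors, and the input order used here simply mirrors the example in the text (the real order is whatever the call signature of the model lists).

```python
def gather_inputs(first_arg, order=("input_ids", "attention_mask", "token_type_ids")):
    """Map a single tensor, a list/tuple, or a dict onto named inputs."""
    if isinstance(first_arg, dict):
        # Dictionary: inputs are matched by name.
        return dict(first_arg)
    if isinstance(first_arg, (list, tuple)):
        # List or tuple: inputs are matched by position, in signature order.
        if len(first_arg) > len(order):
            raise ValueError("more inputs given than known input names")
        return dict(zip(order, first_arg))
    # Anything else is treated as input_ids alone.
    return {order[0]: first_arg}


# Placeholders standing in for tensors.
ids, mask, types = "ids", "mask", "types"

single = gather_inputs(ids)
as_list = gather_inputs([ids, mask])
as_dict = gather_inputs({"input_ids": ids, "token_type_ids": types})
print(single, as_list, as_dict)
```

The positional form is the fragile one: skipping an input in the middle of a list shifts every later tensor onto the wrong name, which is why the dictionary form is usually the safer choice.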
call
( input_ids: tf.Tensor | None = None, bbox: tf.Tensor | None = None, attention_mask: tf.Tensor | None = None, token_type_ids: tf.Tensor | None = None, position_ids: tf.Tensor | None = None, head_mask: tf.Tensor | None = None, inputs_embeds: tf.Tensor | None = None, pixel_values: tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False ) → transformers.modeling_tf_outputs.TFBaseModelOutput or tuple(tf.Tensor)
Parameters
input_ids (Numpy array
or tf.Tensor
of shape (batch_size, sequence_length)
) — Indices of input sequence tokens in the vocabulary.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
bbox (Numpy array
or tf.Tensor
of shape (batch_size, sequence_length, 4)
, optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings-1]
. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
pixel_values (tf.Tensor
of shape (batch_size, num_channels, height, width)
) — Batch of document images. Each image is divided into patches of shape (num_channels, config.patch_size, config.patch_size)
and the total number of patches (=patch_sequence_length
) equals ((height / config.patch_size) * (width / config.patch_size))
.
attention_mask (tf.Tensor
of shape (batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
token_type_ids (Numpy array
or tf.Tensor
of shape (batch_size, sequence_length)
, optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
:
0 corresponds to a sentence A token,
1 corresponds to a sentence B token.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
position_ids (Numpy array
or tf.Tensor
of shape (batch_size, sequence_length)
, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]
.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
head_mask (tf.Tensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
1 indicates the head is not masked,
0 indicates the head is masked.
inputs_embeds (tf.Tensor
of shape (batch_size, sequence_length, hidden_size)
, optional) — Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
output_attentions (bool
, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions
under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_tf_outputs.TFBaseModelOutput or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFBaseModelOutput or a tuple of tf.Tensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
last_hidden_state (tf.Tensor
of shape (batch_size, sequence_length, hidden_size)
) — Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(tf.Tensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of tf.Tensor
(one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of tf.Tensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The TFLayoutLMv3Model forward method overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
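The relation sequence_length = token_sequence_length + patch_sequence_length + 1 stated in the parameter descriptions can be checked with a few lines of arithmetic. This is a sketch only: the function names are hypothetical, and the image size (224 x 224), patch size (16) and token length (512) are example values chosen for illustration rather than values read from a particular checkpoint.

```python
def patch_sequence_length(height, width, patch_size):
    """Number of patches: (height / patch_size) * (width / patch_size)."""
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def total_sequence_length(token_sequence_length, height, width, patch_size):
    """Text tokens + image patches + 1 for the [CLS] token."""
    return token_sequence_length + patch_sequence_length(height, width, patch_size) + 1


n_patches = patch_sequence_length(224, 224, 16)
seq_len = total_sequence_length(512, 224, 224, 16)
print(n_patches, seq_len)  # 196 709
```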
( *args, **kwargs )
Parameters
config (LayoutLMv3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
LayoutLMv3 Model with a sequence classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for document image classification tasks such as the RVL-CDIP dataset.
This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, and pruning heads).
This model is also a tf.keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
TensorFlow models and layers in transformers
accept two formats as input:
having all inputs as keyword arguments (like PyTorch models), or
having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit()
things should “just work” for you - just pass your inputs and labels in any format that model.fit()
supports! If, however, you want to use the second format outside of Keras methods like fit()
and predict()
, such as when creating your own layers or models with the Keras Functional
API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:
a single Tensor with input_ids
only and nothing else: model(input_ids)
a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask])
or model([input_ids, attention_mask, token_type_ids])
a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!
call
( input_ids: tf.Tensor | None = None, attention_mask: tf.Tensor | None = None, token_type_ids: tf.Tensor | None = None, position_ids: tf.Tensor | None = None, head_mask: tf.Tensor | None = None, inputs_embeds: tf.Tensor | None = None, labels: tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, bbox: tf.Tensor | None = None, pixel_values: tf.Tensor | None = None, training: Optional[bool] = False ) → transformers.modeling_tf_outputs.TFSequenceClassifierOutput or tuple(tf.Tensor)
Parameters
input_ids (Numpy array
or tf.Tensor
of shape (batch_size, sequence_length)
) — Indices of input sequence tokens in the vocabulary.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
bbox (Numpy array
or tf.Tensor
of shape (batch_size, sequence_length, 4)
, optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings-1]
. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
pixel_values (tf.Tensor
of shape (batch_size, num_channels, height, width)
) — Batch of document images. Each image is divided into patches of shape (num_channels, config.patch_size, config.patch_size)
and the total number of patches (=patch_sequence_length
) equals ((height / config.patch_size) * (width / config.patch_size))
.
attention_mask (tf.Tensor
of shape (batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
token_type_ids (Numpy array
or tf.Tensor
of shape (batch_size, sequence_length)
, optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
:
0 corresponds to a sentence A token,
1 corresponds to a sentence B token.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
position_ids (Numpy array
or tf.Tensor
of shape (batch_size, sequence_length)
, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]
.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
head_mask (tf.Tensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
1 indicates the head is not masked,
0 indicates the head is masked.
inputs_embeds (tf.Tensor
of shape (batch_size, sequence_length, hidden_size)
, optional) — Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
output_attentions (bool
, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions
under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_tf_outputs.TFSequenceClassifierOutput or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFSequenceClassifierOutput or a tuple of tf.Tensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
loss (tf.Tensor
of shape (batch_size, )
, optional, returned when labels
is provided) — Classification (or regression if config.num_labels==1) loss.
logits (tf.Tensor
of shape (batch_size, config.num_labels)
) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(tf.Tensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of tf.Tensor
(one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of tf.Tensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The TFLayoutLMv3ForSequenceClassification forward method overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
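Because the returned logits are scores before SoftMax, a predicted class is usually obtained by applying a softmax over the config.num_labels dimension and taking the argmax. The sketch below does this in plain Python on toy values; softmax and predict are hypothetical helper names, and the logits are made up for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """logits has shape (batch_size, num_labels); returns (label, probability) per row."""
    results = []
    for row in logits:
        probs = softmax(row)
        label = max(range(len(probs)), key=probs.__getitem__)
        results.append((label, probs[label]))
    return results


# Toy logits for a batch of 2 documents and 3 classes (hypothetical values).
logits = [[0.2, 1.7, -0.5], [3.0, 0.1, 0.1]]
predictions = predict(logits)
print(predictions)
```

The predicted index can then be mapped to a class name through the label mapping stored in the model configuration, if one was set during fine-tuning.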
( *args, **kwargs )
Parameters
config (LayoutLMv3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
LayoutLMv3 Model with a token classification head on top (a linear layer on top of the final hidden states) e.g. for sequence labeling (information extraction) tasks such as FUNSD, SROIE, CORD and Kleister-NDA.
This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, and pruning heads).
This model is also a tf.keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
TensorFlow models and layers in transformers
accept two formats as input:
having all inputs as keyword arguments (like PyTorch models), or
having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit()
things should “just work” for you - just pass your inputs and labels in any format that model.fit()
supports! If, however, you want to use the second format outside of Keras methods like fit()
and predict()
, such as when creating your own layers or models with the Keras Functional
API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:
a single Tensor with input_ids
only and nothing else: model(input_ids)
a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask])
or model([input_ids, attention_mask, token_type_ids])
a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!
call
( input_ids: tf.Tensor | None = None, bbox: tf.Tensor | None = None, attention_mask: tf.Tensor | None = None, token_type_ids: tf.Tensor | None = None, position_ids: tf.Tensor | None = None, head_mask: tf.Tensor | None = None, inputs_embeds: tf.Tensor | None = None, labels: tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, pixel_values: tf.Tensor | None = None, training: Optional[bool] = False ) → transformers.modeling_tf_outputs.TFTokenClassifierOutput or tuple(tf.Tensor)
Parameters
input_ids (Numpy array
or tf.Tensor
of shape (batch_size, sequence_length)
) — Indices of input sequence tokens in the vocabulary.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
bbox (Numpy array
or tf.Tensor
of shape (batch_size, sequence_length, 4)
, optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings-1]
. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
pixel_values (tf.Tensor
of shape (batch_size, num_channels, height, width)
) — Batch of document images. Each image is divided into patches of shape (num_channels, config.patch_size, config.patch_size)
and the total number of patches (=patch_sequence_length
) equals ((height / config.patch_size) * (width / config.patch_size))
.
attention_mask (tf.Tensor
of shape (batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
token_type_ids (Numpy array
or tf.Tensor
of shape (batch_size, sequence_length)
, optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
:
0 corresponds to a sentence A token,
1 corresponds to a sentence B token.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
position_ids (Numpy array
or tf.Tensor
of shape (batch_size, sequence_length)
, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]
.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
head_mask (tf.Tensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
1 indicates the head is not masked,
0 indicates the head is masked.
inputs_embeds (tf.Tensor
of shape (batch_size, sequence_length, hidden_size)
, optional) — Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
output_attentions (bool
, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions
under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (tf.Tensor
of shape (batch_size, sequence_length)
, optional) — Labels for computing the token classification loss. Indices should be in [0, ..., config.num_labels - 1]
.
Returns
transformers.modeling_tf_outputs.TFTokenClassifierOutput or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFTokenClassifierOutput or a tuple of tf.Tensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
loss (tf.Tensor
of shape (n,)
, optional, where n is the number of unmasked labels, returned when labels
is provided) — Classification loss.
logits (tf.Tensor
of shape (batch_size, sequence_length, config.num_labels)
) — Classification scores (before SoftMax).
hidden_states (tuple(tf.Tensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of tf.Tensor
(one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of tf.Tensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The TFLayoutLMv3ForTokenClassification forward method overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
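The bbox parameter expects normalized (x0, y0, x1, y1) boxes, so pixel coordinates from an OCR engine have to be rescaled before they are passed to the model. A minimal sketch of that step follows, assuming the 0 to 1000 scale commonly used with LayoutLM-family models; the scale, the helper name normalize_bbox, and the sample page size are assumptions, and the only requirement stated in the parameter description is that values stay within [0, config.max_2d_position_embeddings-1].

```python
def normalize_bbox(bbox, width, height, scale=1000):
    """Rescale a pixel-space (x0, y0, x1, y1) box to integers in [0, scale]."""
    x0, y0, x1, y1 = bbox
    if not (0 <= x0 <= x1 <= width and 0 <= y0 <= y1 <= height):
        raise ValueError("box must lie inside the page with x0 <= x1 and y0 <= y1")
    return [
        int(scale * x0 / width),   # left edge, relative to page width
        int(scale * y0 / height),  # top edge, relative to page height
        int(scale * x1 / width),   # right edge
        int(scale * y1 / height),  # bottom edge
    ]


# A word box on a hypothetical 500 x 1000 pixel page.
box = normalize_bbox((50, 100, 250, 140), width=500, height=1000)
print(box)  # [100, 100, 500, 140]
```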
( *args, **kwargs )
Parameters
config (LayoutLMv3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
LayoutLMv3 Model with a span classification head on top for extractive question-answering tasks such as DocVQA (a linear layer on top of the text part of the hidden-states output to compute span start logits
and span end logits
).
This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, and pruning heads).
This model is also a tf.keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
TensorFlow models and layers in transformers
accept two formats as input:
having all inputs as keyword arguments (like PyTorch models), or
having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit()
things should “just work” for you - just pass your inputs and labels in any format that model.fit()
supports! If, however, you want to use the second format outside of Keras methods like fit()
and predict()
, such as when creating your own layers or models with the Keras Functional
API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:
a single Tensor with input_ids
only and nothing else: model(input_ids)
a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_ids, attention_mask])
or model([input_ids, attention_mask, token_type_ids])
a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!
call
( input_ids: tf.Tensor | None = None, attention_mask: tf.Tensor | None = None, token_type_ids: tf.Tensor | None = None, position_ids: tf.Tensor | None = None, head_mask: tf.Tensor | None = None, inputs_embeds: tf.Tensor | None = None, start_positions: tf.Tensor | None = None, end_positions: tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, bbox: tf.Tensor | None = None, pixel_values: tf.Tensor | None = None, return_dict: Optional[bool] = None, training: bool = False ) → transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput or tuple(tf.Tensor)
Parameters
input_ids (Numpy array
or tf.Tensor
of shape (batch_size, sequence_length)
) — Indices of input sequence tokens in the vocabulary.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
bbox (Numpy array
or tf.Tensor
of shape (batch_size, sequence_length, 4)
, optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings-1]
. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
pixel_values (tf.Tensor
of shape (batch_size, num_channels, height, width)
) — Batch of document images. Each image is divided into patches of shape (num_channels, config.patch_size, config.patch_size)
and the total number of patches (=patch_sequence_length
) equals ((height / config.patch_size) * (width / config.patch_size))
.
attention_mask (tf.Tensor
of shape (batch_size, sequence_length)
, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
:
1 for tokens that are not masked,
0 for tokens that are masked.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
token_type_ids (Numpy array
or tf.Tensor
of shape (batch_size, sequence_length)
, optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
:
0 corresponds to a sentence A token,
1 corresponds to a sentence B token.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
position_ids (Numpy array
or tf.Tensor
of shape (batch_size, sequence_length)
, optional) — Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]
.
Note that sequence_length = token_sequence_length + patch_sequence_length + 1
where 1
is for [CLS] token. See pixel_values
for patch_sequence_length
.
head_mask (tf.Tensor
of shape (num_heads,)
or (num_layers, num_heads)
, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]
:
1 indicates the head is not masked,
0 indicates the head is masked.
inputs_embeds (tf.Tensor
of shape (batch_size, sequence_length, hidden_size)
, optional) — Optionally, instead of passing input_ids
you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
output_attentions (bool
, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions
under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
start_positions (tf.Tensor
of shape (batch_size,)
, optional) — Labels for position (index) of the start of the labelled span for computing the span extraction loss. Positions are clamped to the length of the sequence (sequence_length
). Positions outside of the sequence are not taken into account for computing the loss.
end_positions (tf.Tensor
of shape (batch_size,)
, optional) — Labels for position (index) of the end of the labelled span for computing the span extraction loss. Positions are clamped to the length of the sequence (sequence_length
). Positions outside of the sequence are not taken into account for computing the loss.
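The shape arithmetic in the parameter list above can be sketched in a few lines of plain Python. This is only an illustration of the formulas stated for pixel_values and sequence_length, plus one common way to bring pixel-space boxes into the required bbox range. The helper names, the example sizes (a 224x224 image, 16x16 patches, 512 text tokens) and the 0-1000 scale in normalize_bbox are assumptions made for the example; the text above only requires bbox values to lie in [0, config.max_2d_position_embeddings - 1].

```python
def patch_sequence_length(height, width, patch_size):
    """Number of image patches: (height / patch_size) * (width / patch_size)."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


def total_sequence_length(token_sequence_length, height, width, patch_size):
    """sequence_length = token_sequence_length + patch_sequence_length + 1 ([CLS])."""
    return token_sequence_length + patch_sequence_length(height, width, patch_size) + 1


def normalize_bbox(box, page_width, page_height, scale=1000):
    """Map a pixel-space (x0, y0, x1, y1) box onto a 0..scale grid.

    The scale of 1000 is an assumed convention for this sketch, not a value
    taken from the parameter descriptions above.
    """
    x0, y0, x1, y1 = box
    return [
        int(scale * x0 / page_width),
        int(scale * y0 / page_height),
        int(scale * x1 / page_width),
        int(scale * y1 / page_height),
    ]


# Illustrative values only (assumed, not read from any config).
num_patches = patch_sequence_length(height=224, width=224, patch_size=16)
seq_len = total_sequence_length(512, height=224, width=224, patch_size=16)
box = normalize_bbox((50, 100, 200, 150), page_width=1000, page_height=2000)
print(num_patches, seq_len, box)
```

With these assumed sizes the image contributes 196 patches, so 512 text tokens give a total sequence length of 709.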
Returns
transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput or tuple(tf.Tensor)
A transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput or a tuple of tf.Tensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
loss (tf.Tensor
of shape (batch_size, )
, optional, returned when start_positions
and end_positions
are provided) — Total span extraction loss, computed as the sum of the cross-entropy losses for the start and end positions.
start_logits (tf.Tensor
of shape (batch_size, sequence_length)
) — Span-start scores (before SoftMax).
end_logits (tf.Tensor
of shape (batch_size, sequence_length)
) — Span-end scores (before SoftMax).
hidden_states (tuple(tf.Tensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of tf.Tensor
(one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of tf.Tensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
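To make the role of start_logits and end_logits concrete, the sketch below picks an answer span from one example's scores. It is a common post-processing heuristic rather than something this page prescribes: it searches for the (start, end) pair with the highest summed score, subject to start <= end and a maximum span length. The function name best_span and the max_answer_length parameter are hypothetical, and the inputs are plain Python lists standing in for a single row of each logits tensor.

```python
def best_span(start_logits, end_logits, max_answer_length=30):
    """Return the (start, end) index pair with the highest combined score.

    Only spans with start <= end and at most max_answer_length tokens
    are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        # Last candidate end index (exclusive) for this start position.
        stop = min(start + max_answer_length, len(end_logits))
        for end in range(start, stop):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy scores for a 4-position sequence (made-up numbers).
start_scores = [0.1, 2.0, 0.3, 0.0]
end_scores = [0.0, 0.5, 3.0, 0.2]
print(best_span(start_scores, end_scores))
```

Taking the argmax of each tensor independently can yield an end index that precedes the start index; searching over valid pairs avoids that case.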
The TFLayoutLMv3ForQuestionAnswering forward method overrides the __call__
special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples: