MGP-STR

Overview

The MGP-STR model was proposed in Multi-Granularity Prediction for Scene Text Recognition by Peng Wang, Cheng Da, and Cong Yao. MGP-STR is a conceptually simple yet powerful vision Scene Text Recognition (STR) model, which is built upon the Vision Transformer (ViT). To integrate linguistic knowledge, a Multi-Granularity Prediction (MGP) strategy is proposed to inject information from the language modality into the model in an implicit way.

The abstract from the paper is the following:

Scene text recognition (STR) has been an active research topic in computer vision for years. To tackle this challenging problem, numerous innovative methods have been successively proposed and incorporating linguistic knowledge into STR models has recently become a prominent trend. In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet powerful vision STR model, which is built upon ViT and outperforms previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods. To integrate linguistic knowledge, we further propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way, i.e. , subword representations (BPE and WordPiece) widely-used in NLP are introduced into the output space, in addition to the conventional character level representation, while no independent language model (LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push the performance envelop of STR to an even higher level. Specifically, it achieves an average recognition accuracy of 93.35% on standard benchmarks.

MGP-STR architecture. Taken from the original paper.

Tips:

  • MGP-STR is trained on two synthetic datasets, MJSynth (MJ) and SynthText (ST, http://www.robots.ox.ac.uk/~vgg/data/scenetext/), without fine-tuning on other datasets. It achieves state-of-the-art results on six standard Latin scene text benchmarks, including 3 regular text datasets (IC13, SVT, IIIT) and 3 irregular ones (IC15, SVTP, CUTE).

  • This model was contributed by yuekun. The original code can be found here.

Inference

MgpstrModel accepts images as input and generates three types of predictions, which represent textual information at different granularities. The three types of predictions are fused to give the final prediction result.

The MgpstrProcessor class is responsible for preprocessing the input image and decoding the generated character tokens to the target string. MgpstrProcessor wraps ViTImageProcessor and MgpstrTokenizer into a single instance to both extract the input features and decode the predicted token ids.

  • Step-by-step Optical Character Recognition (OCR)


>>> from transformers import MgpstrProcessor, MgpstrForSceneTextRecognition
>>> import requests
>>> from PIL import Image

>>> processor = MgpstrProcessor.from_pretrained('alibaba-damo/mgp-str-base')
>>> model = MgpstrForSceneTextRecognition.from_pretrained('alibaba-damo/mgp-str-base')

>>> # load image from the IIIT-5k dataset
>>> url = "https://i.postimg.cc/ZKwLg2Gw/367-14.png"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

>>> pixel_values = processor(images=image, return_tensors="pt").pixel_values
>>> outputs = model(pixel_values)

>>> generated_text = processor.batch_decode(outputs.logits)['generated_text']

MgpstrConfig

class transformers.MgpstrConfig

( image_size = [32, 128], patch_size = 4, num_channels = 3, max_token_length = 27, num_character_labels = 38, num_bpe_labels = 50257, num_wordpiece_labels = 30522, hidden_size = 768, num_hidden_layers = 12, num_attention_heads = 12, mlp_ratio = 4.0, qkv_bias = True, distilled = False, layer_norm_eps = 1e-05, drop_rate = 0.0, attn_drop_rate = 0.0, drop_path_rate = 0.0, output_a3_attentions = False, initializer_range = 0.02, **kwargs )

Parameters

  • image_size (List[int], optional, defaults to [32, 128]) — The size (resolution) of each image.

  • patch_size (int, optional, defaults to 4) — The size (resolution) of each patch.

  • num_channels (int, optional, defaults to 3) — The number of input channels.

  • max_token_length (int, optional, defaults to 27) — The max number of output tokens.

  • num_character_labels (int, optional, defaults to 38) — The number of classes for the character head.

  • num_bpe_labels (int, optional, defaults to 50257) — The number of classes for the BPE head.

  • num_wordpiece_labels (int, optional, defaults to 30522) — The number of classes for the wordpiece head.

  • hidden_size (int, optional, defaults to 768) — The embedding dimension.

  • num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.

  • num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.

  • mlp_ratio (float, optional, defaults to 4.0) — The ratio of mlp hidden dim to embedding dim.

  • qkv_bias (bool, optional, defaults to True) — Whether to add a bias to the queries, keys and values.

  • distilled (bool, optional, defaults to False) — Model includes a distillation token and head as in DeiT models.

  • layer_norm_eps (float, optional, defaults to 1e-5) — The epsilon used by the layer normalization layers.

  • drop_rate (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings and encoder.

  • attn_drop_rate (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.

  • drop_path_rate (float, optional, defaults to 0.0) — The stochastic depth rate.

  • output_a3_attentions (bool, optional, defaults to False) — Whether or not the model should return A^3 module attentions.

  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

This is the configuration class to store the configuration of an MgpstrModel. It is used to instantiate an MGP-STR model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the MGP-STR alibaba-damo/mgp-str-base architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:


>>> from transformers import MgpstrConfig, MgpstrForSceneTextRecognition

>>> # Initializing a Mgpstr mgp-str-base style configuration
>>> configuration = MgpstrConfig()

>>> # Initializing a model (with random weights) from the mgp-str-base style configuration
>>> model = MgpstrForSceneTextRecognition(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
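
Individual configuration arguments from the list above can also be overridden at construction time. A small sketch (the particular values below are purely illustrative):

>>> from transformers import MgpstrConfig

>>> # a custom configuration with a longer maximum output length and A^3 attentions returned by default
>>> custom_configuration = MgpstrConfig(max_token_length=32, output_a3_attentions=True)
>>> custom_configuration.max_token_length
32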

MgpstrTokenizer

class transformers.MgpstrTokenizer

( vocab_file, unk_token = '[GO]', bos_token = '[GO]', eos_token = '[s]', pad_token = '[GO]', **kwargs )

Parameters

  • vocab_file (str) — Path to the vocabulary file.

  • unk_token (str, optional, defaults to "[GO]") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • bos_token (str, optional, defaults to "[GO]") — The beginning of sequence token.

  • eos_token (str, optional, defaults to "[s]") — The end of sequence token.

  • pad_token (str or tokenizers.AddedToken, optional, defaults to "[GO]") — A special token used to make arrays of tokens the same size for batching purposes. Will then be ignored by attention mechanisms or loss computation.

Construct a MGP-STR char tokenizer.

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

save_vocabulary

( save_directory: str, filename_prefix: typing.Optional[str] = None )
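
A minimal usage sketch for the character tokenizer, assuming the alibaba-damo/mgp-str-base checkpoint ships the tokenizer vocabulary (it is the same tokenizer wrapped by MgpstrProcessor):

>>> from transformers import MgpstrTokenizer

>>> tokenizer = MgpstrTokenizer.from_pretrained("alibaba-damo/mgp-str-base")
>>> encoding = tokenizer("ticket")
>>> tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])  # one token per character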

MgpstrProcessor

class transformers.MgpstrProcessor

( image_processor = None, tokenizer = None, **kwargs )

Parameters

  • image_processor (ViTImageProcessor) — An instance of ViTImageProcessor. The image processor is a required input.

  • tokenizer (MgpstrTokenizer) — The tokenizer is a required input.

Constructs a MGP-STR processor which wraps an image processor and MGP-STR tokenizer into a single MgpstrProcessor.

MgpstrProcessor offers all the functionalities of ViTImageProcessor and MgpstrTokenizer. See the __call__() and batch_decode() for more information.

__call__

( text = None, images = None, return_tensors = None, **kwargs )

When used in normal mode, this method forwards all its arguments to ViTImageProcessor's __call__() and returns its output. This method also forwards the text and kwargs arguments to MgpstrTokenizer's __call__() if text is not None to encode the text. Please refer to the docstring of the above methods for more information.

batch_decode

( sequences ) → Dict[str, any]

Parameters

  • sequences (torch.Tensor) — List of tokenized input ids.

Returns

Dict[str, any]

Dictionary of all the outputs of the decoded results.

  • generated_text (List[str]): The final results after fusion of char, bpe, and wp.

  • scores (List[float]): The final scores after fusion of char, bpe, and wp.

  • char_preds (List[str]): The list of character decoded sentences.

  • bpe_preds (List[str]): The list of bpe decoded sentences.

  • wp_preds (List[str]): The list of wp decoded sentences.

Convert a list of lists of token ids into a list of strings by calling decode.

This method forwards all its arguments to PreTrainedTokenizer's batch_decode(). Please refer to the docstring of this method for more information.
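
To make the returned dictionary concrete, the following sketch reuses the checkpoint and sample image from the OCR example above and simply lists the available keys, which mirror the return description:

>>> import requests
>>> from PIL import Image
>>> from transformers import MgpstrProcessor, MgpstrForSceneTextRecognition

>>> processor = MgpstrProcessor.from_pretrained("alibaba-damo/mgp-str-base")
>>> model = MgpstrForSceneTextRecognition.from_pretrained("alibaba-damo/mgp-str-base")

>>> url = "https://i.postimg.cc/ZKwLg2Gw/367-14.png"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
>>> pixel_values = processor(images=image, return_tensors="pt").pixel_values

>>> outputs = model(pixel_values)
>>> decoded = processor.batch_decode(outputs.logits)
>>> sorted(decoded.keys())
['bpe_preds', 'char_preds', 'generated_text', 'scores', 'wp_preds']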

MgpstrModel

class transformers.MgpstrModel

( config: MgpstrConfig )

Parameters

  • config (MgpstrConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare MGP-STR Model transformer outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

( pixel_values: FloatTensor, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None )

Parameters

  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.__call__() for details.

  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

The MgpstrModel forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
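
A minimal sketch of running the bare model on a dummy batch, assuming the default return_dict behaviour so the output exposes a last_hidden_state attribute (loading the scene-text checkpoint into the bare model simply discards the recognition heads):

>>> import torch
>>> from transformers import MgpstrModel

>>> model = MgpstrModel.from_pretrained("alibaba-damo/mgp-str-base")

>>> # dummy input matching the default image_size of [32, 128]; use MgpstrProcessor for real images
>>> pixel_values = torch.randn(1, 3, 32, 128)
>>> with torch.no_grad():
...     outputs = model(pixel_values)
>>> last_hidden_state = outputs.last_hidden_state  # (batch_size, sequence_length, hidden_size)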

MgpstrForSceneTextRecognition

class transformers.MgpstrForSceneTextRecognition

( config: MgpstrConfig )

Parameters

  • config (MgpstrConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

MGP-STR Model transformer with three classification heads on top (three A^3 modules and three linear layers on top of the transformer encoder output) for scene text recognition (STR). This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

( pixel_values: FloatTensor, output_attentions: typing.Optional[bool] = None, output_a3_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.models.mgp_str.modeling_mgp_str.MgpstrModelOutput or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.__call__() for details.

  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

  • output_a3_attentions (bool, optional) — Whether or not to return the attentions tensors of the A^3 modules. See a3_attentions under returned tensors for more detail.

  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.models.mgp_str.modeling_mgp_str.MgpstrModelOutput or tuple(torch.FloatTensor)

A transformers.models.mgp_str.modeling_mgp_str.MgpstrModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (<class 'transformers.models.mgp_str.configuration_mgp_str.MgpstrConfig'>) and inputs.

  • logits (tuple(torch.FloatTensor) of shape (batch_size, config.num_character_labels)) — Tuple of torch.FloatTensor (one for the output of character of shape (batch_size, config.max_token_length, config.num_character_labels), + one for the output of bpe of shape (batch_size, config.max_token_length, config.num_bpe_labels), + one for the output of wordpiece of shape (batch_size, config.max_token_length, config.num_wordpiece_labels)).

    Classification scores (before SoftMax) of character, bpe and wordpiece.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, config.max_token_length, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

  • a3_attentions (tuple(torch.FloatTensor), optional, returned when output_a3_attentions=True is passed or when config.output_a3_attentions=True) — Tuple of torch.FloatTensor (one for the attention of character, + one for the attention of bpe, + one for the attention of wordpiece) of shape (batch_size, config.max_token_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The MgpstrForSceneTextRecognition forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:


>>> from transformers import (
...     MgpstrProcessor,
...     MgpstrForSceneTextRecognition,
... )
>>> import requests
>>> from PIL import Image

>>> # load image from the IIIT-5k dataset
>>> url = "https://i.postimg.cc/ZKwLg2Gw/367-14.png"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

>>> processor = MgpstrProcessor.from_pretrained("alibaba-damo/mgp-str-base")
>>> pixel_values = processor(images=image, return_tensors="pt").pixel_values

>>> model = MgpstrForSceneTextRecognition.from_pretrained("alibaba-damo/mgp-str-base")

>>> # inference
>>> outputs = model(pixel_values)
>>> out_strs = processor.batch_decode(outputs.logits)
>>> out_strs["generated_text"]
'["ticket"]'
