Transformers
  • 🌍GET STARTED
    • Transformers
    • Quick tour
    • Installation
  • 🌍TUTORIALS
    • Run inference with pipelines
    • Write portable code with AutoClass
    • Preprocess data
    • Fine-tune a pretrained model
    • Train with a script
    • Set up distributed training with BOINC AI Accelerate
    • Load and train adapters with BOINC AI PEFT
    • Share your model
    • Agents
    • Generation with LLMs
  • 🌍TASK GUIDES
    • 🌍NATURAL LANGUAGE PROCESSING
      • Text classification
      • Token classification
      • Question answering
      • Causal language modeling
      • Masked language modeling
      • Translation
      • Summarization
      • Multiple choice
    • 🌍AUDIO
      • Audio classification
      • Automatic speech recognition
    • 🌍COMPUTER VISION
      • Image classification
      • Semantic segmentation
      • Video classification
      • Object detection
      • Zero-shot object detection
      • Zero-shot image classification
      • Depth estimation
    • 🌍MULTIMODAL
      • Image captioning
      • Document Question Answering
      • Visual Question Answering
      • Text to speech
    • 🌍GENERATION
      • Customize the generation strategy
    • 🌍PROMPTING
      • Image tasks with IDEFICS
  • 🌍DEVELOPER GUIDES
    • Use fast tokenizers from BOINC AI Tokenizers
    • Run inference with multilingual models
    • Use model-specific APIs
    • Share a custom model
    • Templates for chat models
    • Run training on Amazon SageMaker
    • Export to ONNX
    • Export to TFLite
    • Export to TorchScript
    • Benchmarks
    • Notebooks with examples
    • Community resources
    • Custom Tools and Prompts
    • Troubleshoot
  • 🌍PERFORMANCE AND SCALABILITY
    • Overview
    • 🌍EFFICIENT TRAINING TECHNIQUES
      • Methods and tools for efficient training on a single GPU
      • Multiple GPUs and parallelism
      • Efficient training on CPU
      • Distributed CPU training
      • Training on TPUs
      • Training on TPU with TensorFlow
      • Training on Specialized Hardware
      • Custom hardware for training
      • Hyperparameter Search using Trainer API
    • 🌍OPTIMIZING INFERENCE
      • Inference on CPU
      • Inference on one GPU
      • Inference on many GPUs
      • Inference on Specialized Hardware
    • Instantiating a big model
    • Troubleshooting
    • XLA Integration for TensorFlow Models
    • Optimize inference using `torch.compile()`
  • 🌍CONTRIBUTE
    • How to contribute to transformers?
    • How to add a model to BOINC AI Transformers?
    • How to convert a BOINC AI Transformers model to TensorFlow?
    • How to add a pipeline to BOINC AI Transformers?
    • Testing
    • Checks on a Pull Request
  • 🌍CONCEPTUAL GUIDES
    • Philosophy
    • Glossary
    • What BOINC AI Transformers can do
    • How BOINC AI Transformers solve tasks
    • The Transformer model family
    • Summary of the tokenizers
    • Attention mechanisms
    • Padding and truncation
    • BERTology
    • Perplexity of fixed-length models
    • Pipelines for webserver inference
    • Model training anatomy
  • 🌍API
    • 🌍MAIN CLASSES
      • Agents and Tools
      • 🌍Auto Classes
        • Extending the Auto Classes
        • AutoConfig
        • AutoTokenizer
        • AutoFeatureExtractor
        • AutoImageProcessor
        • AutoProcessor
        • Generic model classes
          • AutoModel
          • TFAutoModel
          • FlaxAutoModel
        • Generic pretraining classes
          • AutoModelForPreTraining
          • TFAutoModelForPreTraining
          • FlaxAutoModelForPreTraining
        • Natural Language Processing
          • AutoModelForCausalLM
          • TFAutoModelForCausalLM
          • FlaxAutoModelForCausalLM
          • AutoModelForMaskedLM
          • TFAutoModelForMaskedLM
          • FlaxAutoModelForMaskedLM
          • AutoModelForMaskGenerationge
          • TFAutoModelForMaskGeneration
          • AutoModelForSeq2SeqLM
          • TFAutoModelForSeq2SeqLM
          • FlaxAutoModelForSeq2SeqLM
          • AutoModelForSequenceClassification
          • TFAutoModelForSequenceClassification
          • FlaxAutoModelForSequenceClassification
          • AutoModelForMultipleChoice
          • TFAutoModelForMultipleChoice
          • FlaxAutoModelForMultipleChoice
          • AutoModelForNextSentencePrediction
          • TFAutoModelForNextSentencePrediction
          • FlaxAutoModelForNextSentencePrediction
          • AutoModelForTokenClassification
          • TFAutoModelForTokenClassification
          • FlaxAutoModelForTokenClassification
          • AutoModelForQuestionAnswering
          • TFAutoModelForQuestionAnswering
          • FlaxAutoModelForQuestionAnswering
          • AutoModelForTextEncoding
          • TFAutoModelForTextEncoding
        • Computer vision
          • AutoModelForDepthEstimation
          • AutoModelForImageClassification
          • TFAutoModelForImageClassification
          • FlaxAutoModelForImageClassification
          • AutoModelForVideoClassification
          • AutoModelForMaskedImageModeling
          • TFAutoModelForMaskedImageModeling
          • AutoModelForObjectDetection
          • AutoModelForImageSegmentation
          • AutoModelForImageToImage
          • AutoModelForSemanticSegmentation
          • TFAutoModelForSemanticSegmentation
          • AutoModelForInstanceSegmentation
          • AutoModelForUniversalSegmentation
          • AutoModelForZeroShotImageClassification
          • TFAutoModelForZeroShotImageClassification
          • AutoModelForZeroShotObjectDetection
        • Audio
          • AutoModelForAudioClassification
          • AutoModelForAudioFrameClassification
          • TFAutoModelForAudioFrameClassification
          • AutoModelForCTC
          • AutoModelForSpeechSeq2Seq
          • TFAutoModelForSpeechSeq2Seq
          • FlaxAutoModelForSpeechSeq2Seq
          • AutoModelForAudioXVector
          • AutoModelForTextToSpectrogram
          • AutoModelForTextToWaveform
        • Multimodal
          • AutoModelForTableQuestionAnswering
          • TFAutoModelForTableQuestionAnswering
          • AutoModelForDocumentQuestionAnswering
          • TFAutoModelForDocumentQuestionAnswering
          • AutoModelForVisualQuestionAnswering
          • AutoModelForVision2Seq
          • TFAutoModelForVision2Seq
          • FlaxAutoModelForVision2Seq
      • Callbacks
      • Configuration
      • Data Collator
      • Keras callbacks
      • Logging
      • Models
      • Text Generation
      • ONNX
      • Optimization
      • Model outputs
      • Pipelines
      • Processors
      • Quantization
      • Tokenizer
      • Trainer
      • DeepSpeed Integration
      • Feature Extractor
      • Image Processor
    • 🌍MODELS
      • 🌍TEXT MODELS
        • ALBERT
        • BART
        • BARThez
        • BARTpho
        • BERT
        • BertGeneration
        • BertJapanese
        • Bertweet
        • BigBird
        • BigBirdPegasus
        • BioGpt
        • Blenderbot
        • Blenderbot Small
        • BLOOM
        • BORT
        • ByT5
        • CamemBERT
        • CANINE
        • CodeGen
        • CodeLlama
        • ConvBERT
        • CPM
        • CPMANT
        • CTRL
        • DeBERTa
        • DeBERTa-v2
        • DialoGPT
        • DistilBERT
        • DPR
        • ELECTRA
        • Encoder Decoder Models
        • ERNIE
        • ErnieM
        • ESM
        • Falcon
        • FLAN-T5
        • FLAN-UL2
        • FlauBERT
        • FNet
        • FSMT
        • Funnel Transformer
        • GPT
        • GPT Neo
        • GPT NeoX
        • GPT NeoX Japanese
        • GPT-J
        • GPT2
        • GPTBigCode
        • GPTSAN Japanese
        • GPTSw3
        • HerBERT
        • I-BERT
        • Jukebox
        • LED
        • LLaMA
        • LLama2
        • Longformer
        • LongT5
        • LUKE
        • M2M100
        • MarianMT
        • MarkupLM
        • MBart and MBart-50
        • MEGA
        • MegatronBERT
        • MegatronGPT2
        • Mistral
        • mLUKE
        • MobileBERT
        • MPNet
        • MPT
        • MRA
        • MT5
        • MVP
        • NEZHA
        • NLLB
        • NLLB-MoE
        • Nyströmformer
        • Open-Llama
        • OPT
        • Pegasus
        • PEGASUS-X
        • Persimmon
        • PhoBERT
        • PLBart
        • ProphetNet
        • QDQBert
        • RAG
        • REALM
        • Reformer
        • RemBERT
        • RetriBERT
        • RoBERTa
        • RoBERTa-PreLayerNorm
        • RoCBert
        • RoFormer
        • RWKV
        • Splinter
        • SqueezeBERT
        • SwitchTransformers
        • T5
        • T5v1.1
        • TAPEX
        • Transformer XL
        • UL2
        • UMT5
        • X-MOD
        • XGLM
        • XLM
        • XLM-ProphetNet
        • XLM-RoBERTa
        • XLM-RoBERTa-XL
        • XLM-V
        • XLNet
        • YOSO
      • 🌍VISION MODELS
        • BEiT
        • BiT
        • Conditional DETR
        • ConvNeXT
        • ConvNeXTV2
        • CvT
        • Deformable DETR
        • DeiT
        • DETA
        • DETR
        • DiNAT
        • DINO V2
        • DiT
        • DPT
        • EfficientFormer
        • EfficientNet
        • FocalNet
        • GLPN
        • ImageGPT
        • LeViT
        • Mask2Former
        • MaskFormer
        • MobileNetV1
        • MobileNetV2
        • MobileViT
        • MobileViTV2
        • NAT
        • PoolFormer
        • Pyramid Vision Transformer (PVT)
        • RegNet
        • ResNet
        • SegFormer
        • SwiftFormer
        • Swin Transformer
        • Swin Transformer V2
        • Swin2SR
        • Table Transformer
        • TimeSformer
        • UperNet
        • VAN
        • VideoMAE
        • Vision Transformer (ViT)
        • ViT Hybrid
        • ViTDet
        • ViTMAE
        • ViTMatte
        • ViTMSN
        • ViViT
        • YOLOS
      • 🌍AUDIO MODELS
        • Audio Spectrogram Transformer
        • Bark
        • CLAP
        • EnCodec
        • Hubert
        • MCTCT
        • MMS
        • MusicGen
        • Pop2Piano
        • SEW
        • SEW-D
        • Speech2Text
        • Speech2Text2
        • SpeechT5
        • UniSpeech
        • UniSpeech-SAT
        • VITS
        • Wav2Vec2
        • Wav2Vec2-Conformer
        • Wav2Vec2Phoneme
        • WavLM
        • Whisper
        • XLS-R
        • XLSR-Wav2Vec2
      • 🌍MULTIMODAL MODELS
        • ALIGN
        • AltCLIP
        • BLIP
        • BLIP-2
        • BridgeTower
        • BROS
        • Chinese-CLIP
        • CLIP
        • CLIPSeg
        • Data2Vec
        • DePlot
        • Donut
        • FLAVA
        • GIT
        • GroupViT
        • IDEFICS
        • InstructBLIP
        • LayoutLM
        • LayoutLMV2
        • LayoutLMV3
        • LayoutXLM
        • LiLT
        • LXMERT
        • MatCha
        • MGP-STR
        • Nougat
        • OneFormer
        • OWL-ViT
        • Perceiver
        • Pix2Struct
        • Segment Anything
        • Speech Encoder Decoder Models
        • TAPAS
        • TrOCR
        • TVLT
        • ViLT
        • Vision Encoder Decoder Models
        • Vision Text Dual Encoder
        • VisualBERT
        • X-CLIP
      • 🌍REINFORCEMENT LEARNING MODELS
        • Decision Transformer
        • Trajectory Transformer
      • 🌍TIME SERIES MODELS
        • Autoformer
        • Informer
        • Time Series Transformer
      • 🌍GRAPH MODELS
        • Graphormer
  • 🌍INTERNAL HELPERS
    • Custom Layers and Utilities
    • Utilities for pipelines
    • Utilities for Tokenizers
    • Utilities for Trainer
    • Utilities for Generation
    • Utilities for Image Processors
    • Utilities for Audio processing
    • General Utilities
    • Utilities for Time Series
Powered by GitBook
On this page
  • Nougat
  • Overview
  • Inference
  • NougatImageProcessor
  • NougatTokenizerFast
  • NougatProcessor
  1. API
  2. MODELS
  3. MULTIMODAL MODELS

Nougat

PreviousMGP-STRNextOneFormer

Last updated 1 year ago

Nougat

Overview

The Nougat model was proposed in by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. Nougat uses the same architecture as , meaning an image Transformer encoder and an autoregressive text Transformer decoder to translate scientific PDFs to markdown, enabling easier access to them.

The abstract from the paper is the following:

Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.

Nougat high-level overview. Taken from the .

This model was contributed by . The original code can be found .

Tips:

  • The quickest way to get started with Nougat is by checking the , which show how to use the model at inference time as well as fine-tuning on custom data.

  • Nougat is always used within the framework. The model is identical to in terms of architecture.

Inference

Nougat’s VisionEncoderDecoder model accepts images as input and makes use of to autoregressively generate text given the input image.

The class is responsible for preprocessing the input image and decodes the generated target tokens to the target string. The wraps and classes into a single instance to both extract the input features and decode the predicted token ids.

  • Step-by-step PDF transcription

Copied

>>> from boincai_hub import hf_hub_download
>>> import re
>>> from PIL import Image

>>> from transformers import NougatProcessor, VisionEncoderDecoderModel
>>> from datasets import load_dataset
>>> import torch

>>> processor = NougatProcessor.from_pretrained("facebook/nougat-base")
>>> model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")

>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model.to(device)
>>> # prepare PDF image for the model
>>> filepath = hf_hub_download(repo_id="hf-internal-testing/fixtures_docvqa", filename="nougat_paper.png", repo_type="dataset")
>>> image = Image.open(filepath)
>>> pixel_values = processor(image, return_tensors="pt").pixel_values

>>> # generate transcription (here we only generate 30 tokens)
>>> outputs = model.generate(
...     pixel_values.to(device),
...     min_length=1,
...     max_new_tokens=30,
...     bad_words_ids=[[processor.tokenizer.unk_token_id]],
... )

>>> sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
>>> sequence = processor.post_process_generation(sequence, fix_markdown=False)
>>> # note: we're using repr here such for the sake of printing the \n characters, feel free to just print the sequence
>>> print(repr(sequence))
'\n\n# Nougat: Neural Optical Understanding for Academic Documents\n\n Lukas Blecher\n\nCorrespondence to: lblecher@'

NougatImageProcessor

class transformers.NougatImageProcessor

( do_crop_margin: bool = Truedo_resize: bool = Truesize: typing.Dict[str, int] = Noneresample: Resampling = <Resampling.BILINEAR: 2>do_thumbnail: bool = Truedo_align_long_axis: bool = Falsedo_pad: bool = Truedo_rescale: bool = Truerescale_factor: typing.Union[int, float] = 0.00392156862745098do_normalize: bool = Trueimage_mean: typing.Union[float, typing.List[float], NoneType] = Noneimage_std: typing.Union[float, typing.List[float], NoneType] = None**kwargs )

Parameters

  • do_crop_margin (bool, optional, defaults to True) — Whether to crop the image margins.

  • do_resize (bool, optional, defaults to True) — Whether to resize the image’s (height, width) dimensions to the specified size. Can be overridden by do_resize in the preprocess method.

  • size (Dict[str, int] optional, defaults to {"height" -- 896, "width": 672}): Size of the image after resizing. Can be overridden by size in the preprocess method.

  • resample (PILImageResampling, optional, defaults to PILImageResampling.BILINEAR) — Resampling filter to use if resizing the image. Can be overridden by resample in the preprocess method.

  • do_thumbnail (bool, optional, defaults to True) — Whether to resize the image using thumbnail method.

  • do_align_long_axis (bool, optional, defaults to False) — Whether to align the long axis of the image with the long axis of size by rotating by 90 degrees.

  • do_pad (bool, optional, defaults to True) — Whether to pad the images to the largest image size in the batch.

  • do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.

  • rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Can be overridden by the rescale_factor parameter in the preprocess method.

  • do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by do_normalize in the preprocess method.

  • image_mean (float or List[float], optional, defaults to IMAGENET_DEFAULT_MEAN) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.

  • image_std (float or List[float], optional, defaults to IMAGENET_DEFAULT_STD) — Image standard deviation.

Constructs a Nougat image processor.

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]]do_crop_margin: bool = Nonedo_resize: bool = Nonesize: typing.Dict[str, int] = Noneresample: Resampling = Nonedo_thumbnail: bool = Nonedo_align_long_axis: bool = Nonedo_pad: bool = Nonedo_rescale: bool = Nonerescale_factor: typing.Union[int, float] = Nonedo_normalize: bool = Noneimage_mean: typing.Union[float, typing.List[float], NoneType] = Noneimage_std: typing.Union[float, typing.List[float], NoneType] = Nonereturn_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = Nonedata_format: typing.Optional[transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'>input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None**kwargs )

Parameters

  • images (ImageInput) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255.

  • do_crop_margin (bool, optional, defaults to self.do_crop_margin) — Whether to crop the image margins.

  • do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.

  • size (Dict[str, int], optional, defaults to self.size) — Size of the image after resizing. Shortest edge of the image is resized to min(size[“height”], size[“width”]) with the longest edge resized to keep the input aspect ratio.

  • resample (int, optional, defaults to self.resample) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.

  • do_thumbnail (bool, optional, defaults to self.do_thumbnail) — Whether to resize the image using thumbnail method.

  • do_align_long_axis (bool, optional, defaults to self.do_align_long_axis) — Whether to align the long axis of the image with the long axis of size by rotating by 90 degrees.

  • do_pad (bool, optional, defaults to self.do_pad) — Whether to pad the images to the largest image size in the batch.

  • do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image by the specified scale rescale_factor.

  • rescale_factor (int or float, optional, defaults to self.rescale_factor) — Scale factor to use if rescaling the image.

  • do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.

  • image_mean (float or List[float], optional, defaults to self.image_mean) — Image mean to use for normalization.

  • image_std (float or List[float], optional, defaults to self.image_std) — Image standard deviation to use for normalization.

  • return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:

    • Unset: Return a list of np.ndarray.

    • TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.

    • TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.

    • TensorType.NUMPY or 'np': Return a batch of type np.ndarray.

    • TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.

  • data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:

    • ChannelDimension.FIRST: image in (num_channels, height, width) format.

    • ChannelDimension.LAST: image in (height, width, num_channels) format.

    • Unset: defaults to the channel dimension format of the input image.

  • input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:

    • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.

    • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.

    • "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocess an image or batch of images.

NougatTokenizerFast

class transformers.NougatTokenizerFast

( vocab_file = Nonetokenizer_file = Noneclean_up_tokenization_spaces = Falseunk_token = '<unk>'bos_token = '<s>'eos_token = '</s>'pad_token = '<pad>'**kwargs )

Parameters

  • clean_up_tokenization_spaces (str, optional, defaults to False) — Wether to cleanup spaces after decoding, cleanup consists in removing potential artifacts like extra spaces.

  • bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.

  • eos_token (str, optional, defaults to "</s>") — The end of sequence token.

  • unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.

  • padding_side (str, optional) — The side on which the model should have padding applied. Should be selected between [‘right’, ‘left’]. Default value is picked from the class attribute of the same name.

  • truncation_side (str, optional) — The side on which the model should have truncation applied. Should be selected between [‘right’, ‘left’]. Default value is picked from the class attribute of the same name.

  • model_input_names (List[string], optional) — The list of inputs accepted by the forward pass of the model (like "token_type_ids" or "attention_mask"). Default value is picked from the class attribute of the same name.

  • bos_token (str or tokenizers.AddedToken, optional) — A special token representing the beginning of a sentence. Will be associated to self.bos_token and self.bos_token_id.

  • eos_token (str or tokenizers.AddedToken, optional) — A special token representing the end of a sentence. Will be associated to self.eos_token and self.eos_token_id.

  • unk_token (str or tokenizers.AddedToken, optional) — A special token representing an out-of-vocabulary token. Will be associated to self.unk_token and self.unk_token_id.

  • sep_token (str or tokenizers.AddedToken, optional) — A special token separating two different sentences in the same input (used by BERT for instance). Will be associated to self.sep_token and self.sep_token_id.

  • pad_token (str or tokenizers.AddedToken, optional) — A special token used to make arrays of tokens the same size for batching purpose. Will then be ignored by attention mechanisms or loss computation. Will be associated to self.pad_token and self.pad_token_id.

  • cls_token (str or tokenizers.AddedToken, optional) — A special token representing the class of the input (used by BERT for instance). Will be associated to self.cls_token and self.cls_token_id.

  • mask_token (str or tokenizers.AddedToken, optional) — A special token representing a masked token (used by masked-language modeling pretraining objectives, like BERT). Will be associated to self.mask_token and self.mask_token_id.

  • additional_special_tokens (tuple or list of str or tokenizers.AddedToken, optional) — A tuple or a list of additional special tokens. Add them here to ensure they are skipped when decoding with skip_special_tokens is set to True. If they are not part of the vocabulary, they will be added at the end of the vocabulary.

  • clean_up_tokenization_spaces (bool, optional, defaults to True) — Whether or not the model should cleanup the spaces that were added when splitting the input text during the tokenization process.

  • split_special_tokens (bool, optional, defaults to False) — Whether or not the special tokens should be split during the tokenization process. The default behavior is to not split special tokens. This means that if <s> is the bos_token, then tokenizer.tokenize("<s>") = ['<s>]. Otherwise, if split_special_tokens=True, then tokenizer.tokenize("<s>") will be give ['<', 's', '>']. This argument is only supported for slow tokenizers for the moment.

  • tokenizer_file (str) — A path to a local JSON file representing a previously serialized tokenizers.Tokenizer object from 🌍 tokenizers.

Fast tokenizer for Nougat (backed by BOINC AI tokenizers library).

Class attributes (overridden by derived classes)

  • vocab_files_names (Dict[str, str]) — A dictionary with, as keys, the __init__ keyword name of each vocabulary file required by the model, and as associated values, the filename for saving the associated file (string).

  • pretrained_vocab_files_map (Dict[str, Dict[str, str]]) — A dictionary of dictionaries, with the high-level keys being the __init__ keyword name of each vocabulary file required by the model, the low-level being the short-cut-names of the pretrained models with, as associated values, the url to the associated pretrained vocabulary file.

  • max_model_input_sizes (Dict[str, Optional[int]]) — A dictionary with, as keys, the short-cut-names of the pretrained models, and as associated values, the maximum length of the sequence inputs of this model, or None if the model has no maximum input size.

  • model_input_names (List[str]) — A list of inputs expected in the forward pass of the model.

  • padding_side (str) — The default value for the side on which the model should have padding applied. Should be 'right' or 'left'.

  • truncation_side (str) — The default value for the side on which the model should have truncation applied. Should be 'right' or 'left'.

correct_tables

( generation: str ) → str

Parameters

  • generation (str) — The generated text to be postprocessed.

Returns

str

The postprocessed text.

Takes a generated string and fixes tables/tabulars to make them match the markdown format needed.

Example:

Copied

correct_tables("\begin{table} \begin{tabular}{l l} & \ \end{tabular} \end{table}")
"\begin{table}
abular}{l l} & \ \end{tabular}
le}"

post_process_generation

( generation: typing.Union[str, typing.List[str]]fix_markdown: bool = Truenum_workers: int = None ) → Union[str, List[str]]

Parameters

  • generation (Union[str, List[str]]) — The generated text or a list of generated texts.

  • fix_markdown (bool, optional, defaults to True) — Whether to perform Markdown formatting fixes.

  • num_workers (int, optional) — Optional number of workers to pass to leverage multiprocessing (postprocessing several texts in parallel).

Returns

Union[str, List[str]]

The postprocessed text or list of postprocessed texts.

Postprocess a generated text or a list of generated texts.

This function can be used to perform postprocessing on generated text, such as fixing Markdown formatting.

Postprocessing is quite slow so it is recommended to use multiprocessing to speed up the process.

post_process_single

( generation: strfix_markdown: bool = True ) → str

Parameters

  • generation (str) — The generated text to be postprocessed.

  • fix_markdown (bool, optional) — Whether to perform Markdown formatting fixes. Default is True.

Returns

str

The postprocessed text.

Postprocess a single generated text. Regular expressions used here are taken directly from the Nougat article authors. These expressions are commented for clarity and tested end-to-end in most cases.

remove_hallucinated_references

( text: str ) → str

Parameters

  • text (str) — The input text containing references.

Returns

str

The text with hallucinated references removed.

Remove hallucinated or missing references from the text.

This function identifies and removes references that are marked as missing or hallucinated from the input text.

NougatProcessor

class transformers.NougatProcessor

( image_processortokenizer )

Parameters

Constructs a Nougat processor which wraps a Nougat image processor and a Nougat tokenizer into a single processor.

__call__

( images = Nonetext = Nonedo_crop_margin: bool = Nonedo_resize: bool = Nonesize: typing.Dict[str, int] = Noneresample: PILImageResampling = Nonedo_thumbnail: bool = Nonedo_align_long_axis: bool = Nonedo_pad: bool = Nonedo_rescale: bool = Nonerescale_factor: typing.Union[int, float] = Nonedo_normalize: bool = Noneimage_mean: typing.Union[float, typing.List[float], NoneType] = Noneimage_std: typing.Union[float, typing.List[float], NoneType] = Nonedata_format: typing.Optional[ForwardRef('ChannelDimension')] = 'ChannelDimension.FIRST'input_data_format: typing.Union[str, ForwardRef('ChannelDimension'), NoneType] = Nonetext_pair: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = Nonetext_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = Nonetext_pair_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = Noneadd_special_tokens: bool = Truepadding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = Falsetruncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = Nonemax_length: typing.Optional[int] = Nonestride: int = 0is_split_into_words: bool = Falsepad_to_multiple_of: typing.Optional[int] = Nonereturn_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = Nonereturn_token_type_ids: typing.Optional[bool] = Nonereturn_attention_mask: typing.Optional[bool] = Nonereturn_overflowing_tokens: bool = Falsereturn_special_tokens_mask: bool = Falsereturn_offsets_mapping: bool = Falsereturn_length: bool = Falseverbose: bool = True )

from_pretrained

( pretrained_model_name_or_path: typing.Union[str, os.PathLike]cache_dir: typing.Union[str, os.PathLike, NoneType] = Noneforce_download: bool = Falselocal_files_only: bool = Falsetoken: typing.Union[bool, str, NoneType] = Nonerevision: str = 'main'**kwargs )

Parameters

  • pretrained_model_name_or_path (str or os.PathLike) — This can be either:

    • a string, the model id of a pretrained feature_extractor hosted inside a model repo on boincai.com. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.

Instantiate a processor associated with a pretrained model.

save_pretrained

( save_directorypush_to_hub: bool = False**kwargs )

Parameters

  • save_directory (str or os.PathLike) — Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will be created if it does not exist).

  • push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the BOINC AI model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).

batch_decode

( *args**kwargs )

decode

( *args**kwargs )

post_process_generation

( *args**kwargs )

This method forwards all its arguments to NougatTokenizer’s ~PreTrainedTokenizer.post_process_generation. Please refer to the docstring of this method for more information.

See the to look for Nougat checkpoints.

vocab_file (str) — file (generally has a .model extension) that contains the vocabulary necessary to instantiate a tokenizer.

tokenizer_file (str) — file (generally has a .json extension) that contains everything needed to load the tokenizer.

model_max_length (int, optional) — The maximum length (in number of tokens) for the inputs to the transformer model. When the tokenizer is loaded with , this will be set to the value stored for the associated model in max_model_input_sizes (see above). If no value is provided, will default to VERY_LARGE_INTEGER (int(1e30)).

chat_template (str, optional) — A Jinja template string that will be used to format lists of chat messages. See for a full description.

tokenizer_object (tokenizers.Tokenizer) — A tokenizers.Tokenizer object from 🌍tokenizers to instantiate from. See 🌍 for more information.

This tokenizer inherits from which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. This class mainly adds Nougat-specific methods for postprocessing the generated text.

pretrained_init_configuration (Dict[str, Dict[str, Any]]) — A dictionary with, as keys, the short-cut-names of the pretrained models, and as associated values, a dictionary of specific arguments to pass to the __init__ method of the tokenizer class for this pretrained model when loading the tokenizer with the method.

image_processor () — An instance of . The image processor is a required input.

tokenizer () — An instance of . The tokenizer is a required input.

offers all the functionalities of and . See the and for more information.

a path to a directory containing a feature extractor file saved using the method, e.g., ./my_model_directory/.

a path or url to a saved feature extractor JSON file, e.g., ./my_model_directory/preprocessor_config.json. **kwargs — Additional keyword arguments passed along to both and ~tokenization_utils_base.PreTrainedTokenizer.from_pretrained.

This class method is simply calling the feature extractor , image processor and the tokenizer ~tokenization_utils_base.PreTrainedTokenizer.from_pretrained methods. Please refer to the docstrings of the methods above for more information.

kwargs (Dict[str, Any], optional) — Additional key word arguments passed along to the method.

Saves the attributes of this processor (feature extractor, tokenizer…) in the specified directory so that it can be reloaded using the method.

This class method is simply calling and . Please refer to the docstrings of the methods above for more information.

This method forwards all its arguments to NougatTokenizer’s . Please refer to the docstring of this method for more information.

This method forwards all its arguments to NougatTokenizer’s . Please refer to the docstring of this method for more information.

🌍
🌍
🌍
model hub
<source>
<source>
<source>
SentencePiece
tokenizers
from_pretrained()
https://boincai.com/docs/transformers/chat_templating
Using tokenizers from
tokenizers
PreTrainedTokenizerFast
from_pretrained()
<source>
<source>
<source>
<source>
<source>
NougatImageProcessor
NougatImageProcessor
NougatTokenizerFast
NougatTokenizerFast
NougatProcessor
NougatImageProcessor
NougatTokenizerFast
call()
decode()
<source>
<source>
save_pretrained()
from_pretrained()
from_pretrained()
ImageProcessingMixin
<source>
push_to_hub()
from_pretrained()
save_pretrained()
save_pretrained()
<source>
batch_decode()
<source>
decode()
<source>
Nougat: Neural Optical Understanding for Academic Documents
Donut
original paper
nielsr
here
tutorial notebooks
VisionEncoderDecoder
Donut
generate()
NougatImageProcessor
NougatTokenizerFast
NougatProcessor
NougatImageProcessor
NougatTokenizerFast