Mask2Former

Last updated 1 year ago

Overview

The Mask2Former model was proposed in Masked-attention Mask Transformer for Universal Image Segmentation by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. Mask2Former is a unified framework for panoptic, instance and semantic segmentation and features significant performance and efficiency improvements over MaskFormer.

The abstract from the paper is the following:

Image segmentation groups pixels with different semantics, e.g., category or instance membership. Each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).

Tips:

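  • Mask2Former uses the same preprocessing and postprocessing steps as MaskFormer. Use Mask2FormerImageProcessor or AutoImageProcessor to prepare images and optional targets for the model.

  • To get the final segmentation, depending on the task, you can call post_process_semantic_segmentation(), post_process_instance_segmentation() or post_process_panoptic_segmentation(). All three tasks can be solved using the Mask2FormerForUniversalSegmentation output. Panoptic segmentation accepts an optional label_ids_to_fuse argument to fuse instances of the target object(s) (e.g. sky) together.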

Resources

A list of official BOINC AI and community (indicated by 🌎) resources to help you get started with Mask2Former.

If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it. The resource should ideally demonstrate something new instead of duplicating an existing resource.

Mask2Former specific outputs

class transformers.models.mask2former.modeling_mask2former.Mask2FormerModelOutput

( encoder_last_hidden_state: FloatTensor = None, pixel_decoder_last_hidden_state: FloatTensor = None, transformer_decoder_last_hidden_state: FloatTensor = None, encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, pixel_decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, transformer_decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, transformer_decoder_intermediate_states: typing.Tuple[torch.FloatTensor] = None, masks_queries_logits: typing.Tuple[torch.FloatTensor] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Parameters

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width), optional) — Last hidden states (final feature map) of the last stage of the encoder model (backbone). Returned when output_hidden_states=True is passed.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the encoder model at the output of each stage. Returned when output_hidden_states=True is passed.

  • pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width), optional) — Last hidden states (final feature map) of the last stage of the pixel decoder model.

  • pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the pixel decoder model at the output of each stage.

  • transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Final output of the transformer decoder.

  • transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the transformer decoder at the output of each stage. Returned when output_hidden_states=True is passed.

  • transformer_decoder_intermediate_states (tuple(torch.FloatTensor) of shape (num_queries, 1, hidden_size)) — Intermediate decoder activations, i.e. the output of each decoder layer, each passed through a layer norm.

  • masks_queries_logits (tuple(torch.FloatTensor) of shape (batch_size, num_queries, height, width)) — Mask predictions from each layer in the transformer decoder.

  • attentions (tuple(tuple(torch.FloatTensor)), optional, returned when output_attentions=True is passed) — Tuple of tuple(torch.FloatTensor) (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Self-attention weights from the transformer decoder.

class transformers.models.mask2former.modeling_mask2former.Mask2FormerForUniversalSegmentationOutput

( loss: typing.Optional[torch.FloatTensor] = None, class_queries_logits: FloatTensor = None, masks_queries_logits: FloatTensor = None, auxiliary_logits: typing.Union[typing.List[typing.Dict[str, torch.FloatTensor]], NoneType] = None, encoder_last_hidden_state: FloatTensor = None, pixel_decoder_last_hidden_state: FloatTensor = None, transformer_decoder_last_hidden_state: FloatTensor = None, encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, pixel_decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, transformer_decoder_hidden_states: typing.Optional[torch.FloatTensor] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Parameters

  • loss (torch.Tensor, optional) — The computed loss, returned when labels are present.

  • class_queries_logits (torch.FloatTensor) — A tensor of shape (batch_size, num_queries, num_labels + 1) representing the proposed classes for each query. Note the + 1 is needed because we incorporate the null class.

  • masks_queries_logits (torch.FloatTensor) — A tensor of shape (batch_size, num_queries, height, width) representing the proposed masks for each query.

  • auxiliary_logits (List[Dict(str, torch.FloatTensor)], optional) — List of class and mask predictions from each layer of the transformer decoder.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the encoder model (backbone).

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the encoder model at the output of each stage.

  • pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the pixel decoder model.

  • pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the pixel decoder model at the output of each stage.

  • transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Final output of the transformer decoder.

  • transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the transformer decoder at the output of each stage.

  • attentions (tuple(tuple(torch.FloatTensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tuple(torch.FloatTensor) (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Self- and cross-attention weights from the transformer decoder.

Class for outputs of Mask2FormerForUniversalSegmentation.
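
To make the shapes of class_queries_logits and masks_queries_logits concrete, here is a minimal pure-Python sketch of how a semantic map can be derived from them for a single image: softmax over the classes (dropping the null class), sigmoid over the masks, a sum over queries, then an argmax per pixel. The function name semantic_map and the toy logits are hypothetical, and resizing to the target image size is omitted; in practice you would call the image processor's post_process_semantic_segmentation() rather than this sketch.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def semantic_map(class_logits, mask_logits):
    """Hypothetical helper for one image.

    class_logits: [num_queries][num_labels + 1], last entry is the null class.
    mask_logits:  [num_queries][height][width].
    Returns a [height][width] grid of predicted class ids.
    """
    num_queries = len(class_logits)
    num_labels = len(class_logits[0]) - 1
    height, width = len(mask_logits[0]), len(mask_logits[0][0])

    # Class probabilities per query, with the null class removed.
    class_probs = [softmax(row)[:-1] for row in class_logits]

    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class c at this pixel: sum over queries of
            # P(class c | query) * P(pixel in mask | query).
            scores = [
                sum(
                    class_probs[q][c] * sigmoid(mask_logits[q][y][x])
                    for q in range(num_queries)
                )
                for c in range(num_labels)
            ]
            row.append(max(range(num_labels), key=scores.__getitem__))
        result.append(row)
    return result


# Toy input: query 0 predicts class 0 on the left column,
# query 1 predicts class 1 on the right column.
class_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
mask_logits = [
    [[4.0, -4.0], [4.0, -4.0]],
    [[-4.0, 4.0], [-4.0, 4.0]],
]
segmentation = semantic_map(class_logits, mask_logits)
print(segmentation)  # [[0, 1], [0, 1]]
```

Each query contributes to every pixel, weighted by how confident it is in both its class and its mask, which is why a single output format can serve semantic, instance and panoptic post-processing.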

Mask2FormerConfig

class transformers.Mask2FormerConfig

( backbone_config: typing.Optional[typing.Dict] = None, feature_size: int = 256, mask_feature_size: int = 256, hidden_dim: int = 256, encoder_feedforward_dim: int = 1024, activation_function: str = 'relu', encoder_layers: int = 6, decoder_layers: int = 10, num_attention_heads: int = 8, dropout: float = 0.0, dim_feedforward: int = 2048, pre_norm: bool = False, enforce_input_projection: bool = False, common_stride: int = 4, ignore_value: int = 255, num_queries: int = 100, no_object_weight: float = 0.1, class_weight: float = 2.0, mask_weight: float = 5.0, dice_weight: float = 5.0, train_num_points: int = 12544, oversample_ratio: float = 3.0, importance_sample_ratio: float = 0.75, init_std: float = 0.02, init_xavier_std: float = 1.0, use_auxiliary_loss: bool = True, feature_strides: typing.List[int] = [4, 8, 16, 32], output_auxiliary_logits: bool = None, **kwargs )

Parameters

  • backbone_config (PretrainedConfig or dict, optional, defaults to SwinConfig()) — The configuration of the backbone model. If unset, the configuration corresponding to swin-base-patch4-window12-384 will be used.

  • feature_size (int, optional, defaults to 256) — The features (channels) of the resulting feature maps.

  • mask_feature_size (int, optional, defaults to 256) — The masks’ features size, this value will also be used to specify the Feature Pyramid Network features’ size.

  • hidden_dim (int, optional, defaults to 256) — Dimensionality of the encoder layers.

  • encoder_feedforward_dim (int, optional, defaults to 1024) — Dimension of feedforward network for deformable detr encoder used as part of pixel decoder.

  • encoder_layers (int, optional, defaults to 6) — Number of layers in the deformable detr encoder used as part of pixel decoder.

  • decoder_layers (int, optional, defaults to 10) — Number of layers in the Transformer decoder.

  • num_attention_heads (int, optional, defaults to 8) — Number of attention heads for each attention layer.

  • dropout (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings and encoder.

  • dim_feedforward (int, optional, defaults to 2048) — Feature dimension in feedforward network for transformer decoder.

  • pre_norm (bool, optional, defaults to False) — Whether to use pre-LayerNorm or not for transformer decoder.

  • enforce_input_projection (bool, optional, defaults to False) — Whether to add an input projection 1x1 convolution even if the input channels and hidden dim are identical in the Transformer decoder.

  • common_stride (int, optional, defaults to 4) — Parameter used for determining number of FPN levels used as part of pixel decoder.

  • ignore_value (int, optional, defaults to 255) — Category id to be ignored during training.

  • num_queries (int, optional, defaults to 100) — Number of queries for the decoder.

  • no_object_weight (float, optional, defaults to 0.1) — The weight to apply to the null (no object) class.

  • class_weight (float, optional, defaults to 2.0) — The weight for the cross entropy loss.

  • mask_weight (float, optional, defaults to 5.0) — The weight for the mask loss.

  • dice_weight (float, optional, defaults to 5.0) — The weight for the dice loss.

  • train_num_points (int, optional, defaults to 12544) — Number of points used for sampling during loss calculation.

  • oversample_ratio (float, optional, defaults to 3.0) — Oversampling parameter used for calculating the number of sampled points.

  • importance_sample_ratio (float, optional, defaults to 0.75) — Ratio of points that are sampled via importance sampling.

  • init_std (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • init_xavier_std (float, optional, defaults to 1.0) — The scaling factor used for the Xavier initialization gain in the HM Attention map module.

  • use_auxiliary_loss (bool, optional, defaults to True) — If True, Mask2FormerForUniversalSegmentationOutput will contain the auxiliary losses computed using the logits from each decoder’s stage.

  • feature_strides (List[int], optional, defaults to [4, 8, 16, 32]) — Feature strides corresponding to features generated from backbone network.

  • output_auxiliary_logits (bool, optional) — Should the model output its auxiliary_logits or not.
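
The loss-related defaults above are easier to read with the arithmetic spelled out. The sketch below assumes the PointRend-style sampling scheme, in which train_num_points * oversample_ratio candidate points are drawn, the importance_sample_ratio share of the final points is taken from the most uncertain candidates, and the rest are drawn uniformly at random. It also assumes the total loss is a plain weighted sum of the three terms. Both function names and the example loss values are hypothetical and are not part of the library.

```python
def point_sampling_budget(train_num_points=12544,
                          oversample_ratio=3.0,
                          importance_sample_ratio=0.75):
    """Hypothetical helper: how many points each sampling stage handles."""
    # Candidates drawn before uncertainty-based selection.
    num_candidates = int(train_num_points * oversample_ratio)
    # Points kept because they are the most uncertain candidates.
    num_uncertain = int(importance_sample_ratio * train_num_points)
    # Remaining points are drawn uniformly at random.
    num_random = train_num_points - num_uncertain
    return num_candidates, num_uncertain, num_random


def weighted_total(losses, class_weight=2.0, mask_weight=5.0, dice_weight=5.0):
    """Hypothetical helper: weighted sum of the three loss terms."""
    return (class_weight * losses["cross_entropy"]
            + mask_weight * losses["mask"]
            + dice_weight * losses["dice"])


budget = point_sampling_budget()
print(budget)  # (37632, 9408, 3136)

# Made-up per-term loss values, only to show the weighting.
total = weighted_total({"cross_entropy": 0.5, "mask": 0.2, "dice": 0.3})
print(round(total, 6))  # 3.5
```

With the defaults, 12544 points correspond to a 112 x 112 grid, so the mask and dice losses are evaluated on a fixed number of sampled locations rather than on every pixel.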

Examples:

>>> from transformers import Mask2FormerConfig, Mask2FormerModel

>>> # Initializing a Mask2Former facebook/mask2former-swin-small-coco-instance configuration
>>> configuration = Mask2FormerConfig()

>>> # Initializing a model (with random weights) from the facebook/mask2former-swin-small-coco-instance style configuration
>>> model = Mask2FormerModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

from_backbone_config

Parameters

Returns

An instance of a configuration object

Mask2FormerModel

class transformers.Mask2FormerModel

( config: Mask2FormerConfig )

Parameters

forward

Parameters

  • pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]:

    • 1 for pixels that are real (i.e. not masked),

    • 0 for pixels that are padding (i.e. masked).

  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of Detr’s decoder attention layers.

  • return_dict (bool, optional) — Whether or not to return a Mask2FormerModelOutput instead of a plain tuple.
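
The pixel_mask convention (1 for real pixels, 0 for padding) can be illustrated with a small sketch. It assumes images in a batch are padded on the bottom and right up to the largest height and width in the batch; the helper name pad_batch and the single-channel toy images are hypothetical, and in practice the image processor produces pixel_mask for you.

```python
def pad_batch(images, pad_value=0):
    """Hypothetical helper: pad 2D images to a common size and build pixel masks.

    images: list of single-channel images, each a list of rows.
    Returns (padded_images, pixel_masks) with identical [height][width] shapes.
    """
    max_h = max(len(img) for img in images)
    max_w = max(len(img[0]) for img in images)

    padded, masks = [], []
    for img in images:
        h, w = len(img), len(img[0])
        # Pad each existing row on the right, then add full padding rows below.
        padded.append([
            img[y] + [pad_value] * (max_w - w) if y < h else [pad_value] * max_w
            for y in range(max_h)
        ])
        # 1 marks a real pixel, 0 marks padding.
        masks.append([
            [1 if (y < h and x < w) else 0 for x in range(max_w)]
            for y in range(max_h)
        ])
    return padded, masks


# Two toy images of different sizes: 2x3 and 3x1.
images = [
    [[7, 7, 7], [7, 7, 7]],
    [[9], [9], [9]],
]
padded_images, pixel_masks = pad_batch(images)
print(pixel_masks[0])  # [[1, 1, 1], [1, 1, 1], [0, 0, 0]]
print(pixel_masks[1])  # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
```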

Returns

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width), optional) — Last hidden states (final feature map) of the last stage of the encoder model (backbone). Returned when output_hidden_states=True is passed.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the encoder model at the output of each stage. Returned when output_hidden_states=True is passed.

  • pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width), optional) — Last hidden states (final feature map) of the last stage of the pixel decoder model.

  • pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the pixel decoder model at the output of each stage.

  • transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Final output of the transformer decoder.

  • transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the transformer decoder at the output of each stage. Returned when output_hidden_states=True is passed.

  • transformer_decoder_intermediate_states (tuple(torch.FloatTensor) of shape (num_queries, 1, hidden_size)) — Intermediate decoder activations, i.e. the output of each decoder layer, each passed through a layer norm.

  • masks_queries_logits (tuple(torch.FloatTensor) of shape (batch_size, num_queries, height, width)) — Mask predictions from each layer in the transformer decoder.

  • attentions (tuple(tuple(torch.FloatTensor)), optional, returned when output_attentions=True is passed) — Tuple of tuple(torch.FloatTensor) (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Self-attention weights from the transformer decoder.

Mask2FormerModelOutput

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

>>> import torch
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoImageProcessor, Mask2FormerModel

>>> # load image
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> # load image preprocessor and Mask2FormerModel trained on COCO instance segmentation dataset
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-small-coco-instance")
>>> model = Mask2FormerModel.from_pretrained("facebook/mask2former-swin-small-coco-instance")
>>> inputs = image_processor(image, return_tensors="pt")

>>> # forward pass
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # model outputs last hidden states of shape (batch_size, num_queries, hidden_size)
>>> print(outputs.transformer_decoder_last_hidden_state.shape)
torch.Size([1, 100, 256])

Mask2FormerForUniversalSegmentation

class transformers.Mask2FormerForUniversalSegmentation

( config: Mask2FormerConfig )

Parameters

forward

Parameters

  • pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]:

    • 1 for pixels that are real (i.e. not masked),

    • 0 for pixels that are padding (i.e. masked).

  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of Detr’s decoder attention layers.

  • return_dict (bool, optional) — Whether or not to return a Mask2FormerModelOutput instead of a plain tuple.

  • mask_labels (List[torch.Tensor], optional) — List of mask labels of shape (num_labels, height, width) to be fed to a model.

  • class_labels (List[torch.LongTensor], optional) — List of target class labels of shape (num_labels,) to be fed to a model. They identify the labels of mask_labels, e.g. the label of mask_labels[i][j] is class_labels[i][j].
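
As an illustration of how mask_labels and class_labels relate, the sketch below turns one semantic segmentation map into a stack of binary masks plus the matching class ids, skipping pixels equal to ignore_value (255 in the default configuration). The helper name segmentation_map_to_labels is hypothetical and nested lists stand in for tensors. It covers the semantic case with one mask per class; instance or panoptic targets would use one mask per segment instead.

```python
def segmentation_map_to_labels(seg_map, ignore_value=255):
    """Hypothetical helper for one image.

    seg_map: [height][width] grid of class ids.
    Returns (mask_labels, class_labels) where mask_labels[j] is the binary
    mask for class_labels[j].
    """
    # Class ids present in the image, excluding the ignore value.
    class_labels = sorted(
        {value for row in seg_map for value in row if value != ignore_value}
    )
    # One binary (height, width) mask per class id.
    mask_labels = [
        [[1.0 if value == class_id else 0.0 for value in row] for row in seg_map]
        for class_id in class_labels
    ]
    return mask_labels, class_labels


seg_map = [
    [0, 0, 2],
    [255, 2, 2],
]
mask_labels, class_labels = segmentation_map_to_labels(seg_map)
print(class_labels)    # [0, 2]
print(mask_labels[0])  # [[1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
print(mask_labels[1])  # [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
```

The ignored pixel at position (1, 0) belongs to no mask, so it contributes to neither class.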

Returns

  • loss (torch.Tensor, optional) — The computed loss, returned when labels are present.

  • class_queries_logits (torch.FloatTensor) — A tensor of shape (batch_size, num_queries, num_labels + 1) representing the proposed classes for each query. Note the + 1 is needed because we incorporate the null class.

  • masks_queries_logits (torch.FloatTensor) — A tensor of shape (batch_size, num_queries, height, width) representing the proposed masks for each query.

  • auxiliary_logits (List[Dict(str, torch.FloatTensor)], optional) — List of class and mask predictions from each layer of the transformer decoder.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the encoder model (backbone).

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the encoder model at the output of each stage.

  • pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the pixel decoder model.

  • pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the pixel decoder model at the output of each stage.

  • transformer_decoder_last_hidden_state (tuple(torch.FloatTensor)) — Final output of the transformer decoder (batch_size, sequence_length, hidden_size).

  • transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the transformer decoder at the output of each stage.

  • attentions (tuple(tuple(torch.FloatTensor)), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tuple(torch.FloatTensor) (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Self and Cross Attentions weights from transformer decoder.

Mask2FormerUniversalSegmentationOutput

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

Instance segmentation example:

>>> from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation
>>> from PIL import Image
>>> import requests
>>> import torch

>>> # Load Mask2Former trained on COCO instance segmentation dataset
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-small-coco-instance")
>>> model = Mask2FormerForUniversalSegmentation.from_pretrained(
...     "facebook/mask2former-swin-small-coco-instance"
... )

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # Model predicts class_queries_logits of shape `(batch_size, num_queries, num_labels + 1)`
>>> # and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
>>> class_queries_logits = outputs.class_queries_logits
>>> masks_queries_logits = outputs.masks_queries_logits

>>> # Perform post-processing to get instance segmentation map
>>> pred_instance_map = image_processor.post_process_instance_segmentation(
...     outputs, target_sizes=[image.size[::-1]]
... )[0]["segmentation"]
>>> print(pred_instance_map.shape)
torch.Size([480, 640])

Semantic segmentation example:

>>> from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation
>>> from PIL import Image
>>> import requests
>>> import torch

>>> # Load Mask2Former trained on ADE20k semantic segmentation dataset
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-small-ade-semantic")
>>> model = Mask2FormerForUniversalSegmentation.from_pretrained("facebook/mask2former-swin-small-ade-semantic")

>>> url = (
...     "https://huggingface.co/datasets/hf-internal-testing/fixtures_ade20k/resolve/main/ADE_val_00000001.jpg"
... )
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # Model predicts class_queries_logits of shape `(batch_size, num_queries, num_labels + 1)`
>>> # and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
>>> class_queries_logits = outputs.class_queries_logits
>>> masks_queries_logits = outputs.masks_queries_logits

>>> # Perform post-processing to get semantic segmentation map
>>> pred_semantic_map = image_processor.post_process_semantic_segmentation(
...     outputs, target_sizes=[image.size[::-1]]
... )[0]
>>> print(pred_semantic_map.shape)
torch.Size([512, 683])

Panoptic segmentation example:

>>> from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation
>>> from PIL import Image
>>> import requests
>>> import torch

>>> # Load Mask2Former trained on CityScapes panoptic segmentation dataset
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-small-cityscapes-panoptic")
>>> model = Mask2FormerForUniversalSegmentation.from_pretrained(
...     "facebook/mask2former-swin-small-cityscapes-panoptic"
... )

>>> url = "https://cdn-media.huggingface.co/Inference-API/Sample-results-on-the-Cityscapes-dataset-The-above-images-show-how-our-method-can-handle.png"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # Model predicts class_queries_logits of shape `(batch_size, num_queries, num_labels + 1)`
>>> # and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
>>> class_queries_logits = outputs.class_queries_logits
>>> masks_queries_logits = outputs.masks_queries_logits

>>> # Perform post-processing to get panoptic segmentation map
>>> pred_panoptic_map = image_processor.post_process_panoptic_segmentation(
...     outputs, target_sizes=[image.size[::-1]]
... )[0]["segmentation"]
>>> print(pred_panoptic_map.shape)
torch.Size([338, 676])

Mask2FormerImageProcessor

class transformers.Mask2FormerImageProcessor

( do_resize: bool = True, size: typing.Dict[str, int] = None, size_divisor: int = 32, resample: Resampling = <Resampling.BILINEAR: 2>, do_rescale: bool = True, rescale_factor: float = 0.00392156862745098, do_normalize: bool = True, image_mean: typing.Union[float, typing.List[float]] = None, image_std: typing.Union[float, typing.List[float]] = None, ignore_index: typing.Optional[int] = None, reduce_labels: bool = False, **kwargs )

Parameters

  • do_resize (bool, optional, defaults to True) — Whether to resize the input to a certain size.

  • size (int, optional, defaults to 800) — Resize the input to the given size. Only has an effect if do_resize is set to True. If size is a sequence like (width, height), the output size will be matched to this. If size is an int, the smaller edge of the image will be matched to this number, i.e., if height > width, the image will be rescaled to (size * height / width, size).

  • max_size (int, optional, defaults to 1333) — The largest size an image dimension can have (otherwise it’s capped). Only has an effect if do_resize is set to True.

  • resample (int, optional, defaults to PIL.Image.Resampling.BILINEAR) — An optional resampling filter. This can be one of PIL.Image.Resampling.NEAREST, PIL.Image.Resampling.BOX, PIL.Image.Resampling.BILINEAR, PIL.Image.Resampling.HAMMING, PIL.Image.Resampling.BICUBIC or PIL.Image.Resampling.LANCZOS. Only has an effect if do_resize is set to True.

  • size_divisor (int, optional, defaults to 32) — Some backbones need images divisible by a certain number. If not passed, it defaults to the value used in Swin Transformer.

  • do_rescale (bool, optional, defaults to True) — Whether to rescale the input to a certain scale.

  • rescale_factor (float, optional, defaults to 1/255) — Rescale the input by the given factor. Only has an effect if do_rescale is set to True.

  • do_normalize (bool, optional, defaults to True) — Whether or not to normalize the input with mean and standard deviation.

  • image_mean (float or List[float], optional, defaults to [0.485, 0.456, 0.406]) — The sequence of means for each channel, to be used when normalizing images. Defaults to the ImageNet mean.

  • image_std (float or List[float], optional, defaults to [0.229, 0.224, 0.225]) — The sequence of standard deviations for each channel, to be used when normalizing images. Defaults to the ImageNet std.

  • ignore_index (int, optional) — Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels denoted with 0 (background) will be replaced with ignore_index.

  • reduce_labels (bool, optional, defaults to False) — Whether or not to decrement all label values of segmentation maps by 1. Usually used for datasets where 0 is used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k). The background label will be replaced by ignore_index.
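
The interaction between reduce_labels and ignore_index can be sketched on a tiny label map. This is a minimal illustration in plain Python, not the library's code: the helper name reduce_label_map is hypothetical, and ignore_index=255 is an assumed value that the text above does not state.

```python
def reduce_label_map(label_map, ignore_index=255):
    """Shift every label down by one and send background (0) to ignore_index.

    Hypothetical helper: plain nested lists stand in for a segmentation map.
    """
    return [
        [
            ignore_index if label in (0, ignore_index) else label - 1
            for label in row
        ]
        for row in label_map
    ]


# 0 is background; 1..3 are real classes in the raw annotation.
raw = [[0, 1, 1],
       [2, 0, 3]]
reduced = reduce_label_map(raw)
print(reduced)  # [[255, 0, 0], [1, 255, 2]]
```

After the shift, class ids start at 0 and former background pixels carry ignore_index, so they can be excluded from the targets.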

Constructs a Mask2Former image processor. The image processor can be used to prepare image(s) and optional targets for the model.
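
How size, max_size and size_divisor interact can be sketched as plain arithmetic. The function below is a hypothetical helper, not the library's implementation. It assumes integer truncation when scaling and assumes each side is rounded up to the next multiple of size_divisor; neither detail is confirmed by the parameter descriptions above.

```python
import math


def resize_output_size(height, width, size=800, max_size=1333, size_divisor=32):
    """Return (new_height, new_width) for a shorter-edge resize.

    Assumptions: truncating integer scaling, and rounding UP to a
    multiple of size_divisor after the resize.
    """
    short, long = min(height, width), max(height, width)
    # Match the shorter edge to `size`, keeping the aspect ratio.
    new_short, new_long = size, size * long // short
    if new_long > max_size:
        # Cap the longer edge and shrink the shorter edge proportionally.
        new_short, new_long = max_size * new_short // new_long, max_size
    if height <= width:
        new_h, new_w = new_short, new_long
    else:
        new_h, new_w = new_long, new_short
    if size_divisor > 0:
        new_h = math.ceil(new_h / size_divisor) * size_divisor
        new_w = math.ceil(new_w / size_divisor) * size_divisor
    return new_h, new_w


print(resize_output_size(480, 640))   # (800, 1088)
print(resize_output_size(500, 2000))  # (352, 1344)
```

In the second call the longer edge is capped at 1333 before being rounded up to 1344, so under these assumptions the final size can slightly exceed max_size.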

This image processor inherits from BaseImageProcessor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]], segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')], NoneType] = None, instance_id_to_semantic_id: typing.Union[typing.Dict[int, int], NoneType] = None, do_resize: typing.Optional[bool] = None, size: typing.Union[typing.Dict[str, int], NoneType] = None, size_divisor: typing.Optional[int] = None, resample: Resampling = None, do_rescale: typing.Optional[bool] = None, rescale_factor: typing.Optional[float] = None, do_normalize: typing.Optional[bool] = None, image_mean: typing.Union[float, typing.List[float], NoneType] = None, image_std: typing.Union[float, typing.List[float], NoneType] = None, ignore_index: typing.Optional[int] = None, reduce_labels: typing.Optional[bool] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, data_format: typing.Union[str, transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'>, input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None, **kwargs )

encode_inputs

Parameters

  • pixel_values_list (List[ImageInput]) — List of images (pixel values) to be padded. Each image should be a tensor of shape (channels, height, width).

  • segmentation_maps (ImageInput, optional) — The corresponding semantic segmentation maps with the pixel-wise annotations.

    (bool, optional, defaults to True): Whether or not to pad images up to the largest image in a batch and create a pixel mask.

    If left to the default, will return a pixel mask that is:

    • 1 for pixels that are real (i.e. not masked),

    • 0 for pixels that are padding (i.e. masked).

  • instance_id_to_semantic_id (List[Dict[int, int]] or Dict[int, int], optional) — A mapping between object instance ids and class ids. If passed, segmentation_maps is treated as an instance segmentation map where each pixel represents an instance id. Can be provided as a single dictionary with a global/dataset-level mapping or as a list of dictionaries (one per image), to map instance ids in each image separately.

  • input_data_format (ChannelDimension or str, optional) — The channel dimension format of the input image. If not provided, it will be inferred.

Returns

  • pixel_values — Pixel values to be fed to a model.

  • pixel_mask — Pixel mask to be fed to a model (when =True or if pixel_mask is in self.model_input_names).

  • mask_labels — Optional list of mask labels of shape (labels, height, width) to be fed to a model (when annotations are provided).

  • class_labels — Optional list of class labels of shape (labels) to be fed to a model (when annotations are provided). They identify the labels of mask_labels, e.g. the label of mask_labels[i][j] is class_labels[i][j].

Pad images up to the largest image in a batch and create a corresponding pixel_mask.
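
The padding step and the resulting pixel_mask can be illustrated with a small sketch. It uses nested Python lists as stand-ins for single-channel images, and it assumes padding is added at the bottom and right with zeros, which the text above does not specify. The helper pad_batch is hypothetical.

```python
def pad_batch(images, pad_value=0):
    """Pad 2D images (lists of rows) to the largest height and width in the batch.

    Returns (padded_images, pixel_masks), where each mask holds 1 for real
    pixels and 0 for padding. Padding position (bottom/right) is an assumption.
    """
    max_h = max(len(img) for img in images)
    max_w = max(len(img[0]) for img in images)
    padded, masks = [], []
    for img in images:
        h, w = len(img), len(img[0])
        padded.append(
            [row + [pad_value] * (max_w - w) for row in img]
            + [[pad_value] * max_w for _ in range(max_h - h)]
        )
        masks.append(
            [[1] * w + [0] * (max_w - w) for _ in range(h)]
            + [[0] * max_w for _ in range(max_h - h)]
        )
    return padded, masks


a = [[5, 5, 5]]   # 1 x 3 image
b = [[7], [7]]    # 2 x 1 image
padded, masks = pad_batch([a, b])
print(padded[0])  # [[5, 5, 5], [0, 0, 0]]
print(masks[1])   # [[1, 0, 0], [1, 0, 0]]
```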

Mask2Former addresses semantic segmentation with a mask classification paradigm, thus input segmentation maps will be converted to lists of binary masks and their respective labels. Let’s see an example, assuming segmentation_maps = [[2,6,7,9]], the output will contain mask_labels = [[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]] (four binary masks) and class_labels = [2,6,7,9], the labels for each mask.
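
The conversion described above can be reproduced with a short sketch. The helper to_binary_masks is hypothetical and works on nested lists rather than tensors; it only illustrates the idea of one binary mask per distinct label.

```python
def to_binary_masks(segmentation_map, ignore_index=None):
    """Split a 2D label map into one binary mask per distinct label.

    Returns (mask_labels, class_labels). Labels equal to ignore_index
    (if given) produce no mask.
    """
    labels = sorted({label for row in segmentation_map for label in row})
    if ignore_index is not None:
        labels = [label for label in labels if label != ignore_index]
    mask_labels = [
        [[1 if value == label else 0 for value in row] for row in segmentation_map]
        for label in labels
    ]
    return mask_labels, labels


mask_labels, class_labels = to_binary_masks([[2, 6, 7, 9]])
print(class_labels)  # [2, 6, 7, 9]
print(mask_labels)   # [[[1, 0, 0, 0]], [[0, 1, 0, 0]], [[0, 0, 1, 0]], [[0, 0, 0, 1]]]
```

Each mask keeps the (height, width) layout of the input, here (1, 4), so the printed result has one more level of nesting than the flattened notation used in the paragraph above.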

post_process_semantic_segmentation

( outputs, target_sizes: typing.Union[typing.List[typing.Tuple[int, int]], NoneType] = None ) → List[torch.Tensor]

Parameters

  • target_sizes (List[Tuple[int, int]], optional) — List of length (batch_size), where each list item (Tuple[int, int]) corresponds to the requested final size (height, width) of each prediction. If left to None, predictions will not be resized.

Returns

List[torch.Tensor]

A list of length batch_size, where each item is a semantic segmentation map of shape (height, width) corresponding to the target_sizes entry (if target_sizes is specified). Each entry of each torch.Tensor corresponds to a semantic class id.
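
A semantic map can be derived from class_queries_logits and masks_queries_logits as in the following sketch. It assumes the usual mask-classification rule: softmax over classes with the null class dropped, sigmoid over masks, then a per-pixel argmax of the summed products. The text above does not spell this rule out, so treat it as an assumption. The function names are hypothetical, nested lists replace tensors, and resizing to target_sizes is omitted.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def semantic_map(class_logits, mask_logits):
    """class_logits: [num_queries][num_labels + 1]; mask_logits: [num_queries][H][W]."""
    # The last class is the null class; drop it after the softmax.
    class_probs = [softmax(query)[:-1] for query in class_logits]
    num_queries, num_labels = len(class_probs), len(class_probs[0])
    height, width = len(mask_logits[0]), len(mask_logits[0][0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of label c at this pixel: sum over queries of
            # P(query has label c) * P(query's mask covers the pixel).
            scores = [
                sum(
                    class_probs[q][c] * sigmoid(mask_logits[q][y][x])
                    for q in range(num_queries)
                )
                for c in range(num_labels)
            ]
            row.append(max(range(num_labels), key=scores.__getitem__))
        result.append(row)
    return result


# Two queries, two real labels (+ null class), a 1 x 2 image.
class_logits = [[4.0, 0.0, 0.0],   # query 0 votes for label 0
                [0.0, 4.0, 0.0]]   # query 1 votes for label 1
mask_logits = [[[5.0, -5.0]],      # query 0 covers the left pixel
               [[-5.0, 5.0]]]      # query 1 covers the right pixel
print(semantic_map(class_logits, mask_logits))  # [[0, 1]]
```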

post_process_instance_segmentation

( outputs, threshold: float = 0.5, mask_threshold: float = 0.5, overlap_mask_area_threshold: float = 0.8, target_sizes: typing.Union[typing.List[typing.Tuple[int, int]], NoneType] = None, return_coco_annotation: typing.Optional[bool] = False, return_binary_maps: typing.Optional[bool] = False ) → List[Dict]

Parameters

  • threshold (float, optional, defaults to 0.5) — The probability score threshold to keep predicted instance masks.

  • mask_threshold (float, optional, defaults to 0.5) — Threshold to use when turning the predicted masks into binary values.

  • overlap_mask_area_threshold (float, optional, defaults to 0.8) — The overlap mask area threshold to merge or discard small disconnected parts within each binary instance mask.

  • target_sizes (List[Tuple], optional) — List of length (batch_size), where each list item (Tuple[int, int]) corresponds to the requested final size (height, width) of each prediction. If left to None, predictions will not be resized.

  • return_coco_annotation (bool, optional, defaults to False) — If set to True, segmentation maps are returned in COCO run-length encoding (RLE) format.

  • return_binary_maps (bool, optional, defaults to False) — If set to True, segmentation maps are returned as a concatenated tensor of binary segmentation maps (one per detected instance).

Returns

List[Dict]

A list of dictionaries, one per image, each dictionary containing two keys:

  • segmentation — A tensor of shape (height, width) where each pixel represents a segment_id, or a List[List] run-length encoding (RLE) of the segmentation map if return_coco_annotation is set to True. Set to None if no mask is found above threshold.

  • segments_info — A list of dictionaries, one per segment, each containing additional information on that segment:

    • id — An integer representing the segment_id.

    • label_id — An integer representing the label / semantic class id corresponding to segment_id.

    • score — Prediction score of segment with segment_id.

Converts the output of Mask2FormerForUniversalSegmentationOutput into instance segmentation predictions. Only supports PyTorch.
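
The returned structure can be consumed as in the sketch below. It assumes segments_info is a list with one dictionary per segment carrying the id, label_id and score keys listed above. The result dictionary is a hand-written stand-in with made-up values (a nested list replaces the tensor), and summarize_segments and the id2label mapping are hypothetical.

```python
def summarize_segments(result, id2label, min_score=0.0):
    """Return (segment_id, label_name, score) tuples sorted by descending score."""
    rows = [
        (info["id"], id2label.get(info["label_id"], "unknown"), info["score"])
        for info in result["segments_info"]
        if info["score"] >= min_score
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Stand-in for one element of the returned list; all values are made up.
result = {
    "segmentation": [[1, 1, 2], [1, 2, 2]],
    "segments_info": [
        {"id": 1, "label_id": 15, "score": 0.97},
        {"id": 2, "label_id": 57, "score": 0.62},
    ],
}
id2label = {15: "cat", 57: "couch"}  # hypothetical label names
print(summarize_segments(result, id2label, min_score=0.7))  # [(1, 'cat', 0.97)]
```

With a real model, a mapping of this kind is usually read from the model configuration; the ids and names here are illustrative only.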

post_process_panoptic_segmentation

( outputs, threshold: float = 0.5, mask_threshold: float = 0.5, overlap_mask_area_threshold: float = 0.8, label_ids_to_fuse: typing.Optional[typing.Set[int]] = None, target_sizes: typing.Union[typing.List[typing.Tuple[int, int]], NoneType] = None ) → List[Dict]

Parameters

  • threshold (float, optional, defaults to 0.5) — The probability score threshold to keep predicted instance masks.

  • mask_threshold (float, optional, defaults to 0.5) — Threshold to use when turning the predicted masks into binary values.

  • overlap_mask_area_threshold (float, optional, defaults to 0.8) — The overlap mask area threshold to merge or discard small disconnected parts within each binary instance mask.

  • label_ids_to_fuse (Set[int], optional) — The labels in this set will have all their instances fused together. For instance, we could say there can only be one sky in an image, but several persons, so the label ID for sky would be in that set, but not the one for person.

  • target_sizes (List[Tuple], optional) — List of length (batch_size), where each list item (Tuple[int, int]) corresponds to the requested final size (height, width) of each prediction in batch. If left to None, predictions will not be resized.

Returns

List[Dict]

A list of dictionaries, one per image, each dictionary containing two keys:

  • segmentation — A tensor of shape (height, width) where each pixel represents a segment_id, set to None if no mask is found above threshold. If target_sizes is specified, segmentation is resized to the corresponding target_sizes entry.

  • segments_info — A list of dictionaries, one per segment, each containing additional information on that segment:

    • id — An integer representing the segment_id.

    • label_id — An integer representing the label / semantic class id corresponding to segment_id.

    • was_fused — a boolean, True if label_id was in label_ids_to_fuse, False otherwise. Multiple instances of the same class / label were fused and assigned a single segment_id.

    • score — Prediction score of segment with segment_id.

Converts the output of Mask2FormerForUniversalSegmentationOutput into image panoptic segmentation predictions. Only supports PyTorch.
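
The effect of label_ids_to_fuse on segment ids can be sketched as follows. This is a simplified illustration, not the library's code: it ignores scores, mask thresholds and overlap checks, and only shows how instances of a fused label end up sharing one segment_id. The helper assign_segment_ids and the label ids are hypothetical.

```python
def assign_segment_ids(query_labels, label_ids_to_fuse=frozenset()):
    """Assign a segment id to each kept query, sharing ids for fused labels.

    Returns (per_query_ids, segments_info).
    """
    fused_ids = {}  # label_id -> segment id shared by all of its instances
    per_query_ids, segments_info = [], []
    next_id = 0
    for label_id in query_labels:
        should_fuse = label_id in label_ids_to_fuse
        if should_fuse and label_id in fused_ids:
            # Reuse the id of the earlier instance instead of opening a new segment.
            per_query_ids.append(fused_ids[label_id])
            continue
        next_id += 1
        per_query_ids.append(next_id)
        segments_info.append(
            {"id": next_id, "label_id": label_id, "was_fused": should_fuse}
        )
        if should_fuse:
            fused_ids[label_id] = next_id
    return per_query_ids, segments_info


SKY, PERSON = 10, 0  # hypothetical label ids
ids, info = assign_segment_ids([SKY, PERSON, SKY, PERSON], label_ids_to_fuse={SKY})
print(ids)        # [1, 2, 1, 3]
print(len(info))  # 3
```

Both sky queries map to segment 1, while each person keeps its own segment.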

Mask2Former architecture. Taken from the original paper.

This model was contributed by Shivalika Singh and Alara Dirik. The original code can be found here.

Demo notebooks regarding inference + fine-tuning Mask2Former on custom data can be found here.

Class for outputs of Mask2FormerModel. This class returns all the needed hidden states to compute the logits.

This output can be directly passed to post_process_semantic_segmentation(), post_process_instance_segmentation() or post_process_panoptic_segmentation() to compute final segmentation maps. Please see Mask2FormerImageProcessor for details regarding usage.

This is the configuration class to store the configuration of a Mask2FormerModel. It is used to instantiate a Mask2Former model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Mask2Former facebook/mask2former-swin-small-coco-instance architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Currently, Mask2Former only supports the Swin Transformer as backbone.

( backbone_config: PretrainedConfig, **kwargs ) → Mask2FormerConfig

backbone_config (PretrainedConfig) — The backbone configuration.

Instantiate a Mask2FormerConfig (or a derived class) from a pre-trained backbone model configuration.

config (Mask2FormerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Mask2Former Model outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

( pixel_values: Tensor, pixel_mask: typing.Optional[torch.Tensor] = None, output_hidden_states: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.models.mask2former.modeling_mask2former.Mask2FormerModelOutput or tuple(torch.FloatTensor)

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See AutoImageProcessor.preprocess for details.

transformers.models.mask2former.modeling_mask2former.Mask2FormerModelOutput or tuple(torch.FloatTensor)

A transformers.models.mask2former.modeling_mask2former.Mask2FormerModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Mask2FormerConfig) and inputs.

The Mask2FormerModel forward method overrides the __call__ special method.

config (Mask2FormerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Mask2Former Model with heads on top for instance/semantic/panoptic segmentation. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

( pixel_values: Tensor, mask_labels: typing.Optional[typing.List[torch.Tensor]] = None, class_labels: typing.Optional[typing.List[torch.Tensor]] = None, pixel_mask: typing.Optional[torch.Tensor] = None, output_hidden_states: typing.Optional[bool] = None, output_auxiliary_logits: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.models.mask2former.modeling_mask2former.Mask2FormerForUniversalSegmentationOutput or tuple(torch.FloatTensor)

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See AutoImageProcessor.preprocess for details.

transformers.models.mask2former.modeling_mask2former.Mask2FormerForUniversalSegmentationOutput or tuple(torch.FloatTensor)

A transformers.models.mask2former.modeling_mask2former.Mask2FormerForUniversalSegmentationOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Mask2FormerConfig) and inputs.

The Mask2FormerForUniversalSegmentation forward method overrides the __call__ special method.

( pixel_values_list: typing.List[typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]]], segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]] = None, instance_id_to_semantic_id: typing.Union[typing.List[typing.Dict[int, int]], typing.Dict[int, int], NoneType] = None, ignore_index: typing.Optional[int] = None, reduce_labels: bool = False, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None ) → BatchFeature

return_tensors (str or TensorType, optional) — If set, will return tensors instead of NumPy arrays. If set to 'pt', return PyTorch torch.Tensor objects.

A BatchFeature with the following fields:

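outputs (Mask2FormerForUniversalSegmentation) — Raw outputs of the model.

Converts the output of Mask2FormerForUniversalSegmentation into semantic segmentation maps. Only supports PyTorch.

outputs (Mask2FormerForUniversalSegmentation) — Raw outputs of the model.

outputs (Mask2FormerForUniversalSegmentationOutput) — The outputs from Mask2FormerForUniversalSegmentation.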
