MaskFormer

This is a recently introduced model, so the API hasn't been tested extensively. There may be some bugs or slight breaking changes to fix in the future. If you see something strange, file a GitHub Issue.

Overview

The MaskFormer model was proposed in Per-Pixel Classification is Not All You Need for Semantic Segmentation by Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. MaskFormer addresses semantic segmentation with a mask classification paradigm instead of performing classic pixel-level classification.

The abstract from the paper is the following:

Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.

Tips:

  • MaskFormer's Transformer decoder is identical to the decoder of DETR. During training, the authors of DETR found it helpful to use auxiliary losses in the decoder, especially to help the model output the correct number of objects of each class. If you set the parameter use_auxiliary_loss of MaskFormerConfig to True, then prediction feedforward neural networks (FFNs) and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters).

  • If you want to train the model in a distributed environment across multiple nodes, you should update the get_num_masks function inside the MaskFormerLoss class of modeling_maskformer.py. When training on multiple nodes, this should be set to the average number of target masks across all nodes, as can be seen in the original implementation.

  • You can use MaskFormerImageProcessor to prepare images, and optional targets, for the model.

  • To get the final segmentation, depending on the task, you can call post_process_semantic_segmentation() or post_process_panoptic_segmentation(). Both tasks can be solved using the output of MaskFormerForInstanceSegmentation. Panoptic segmentation additionally accepts an optional label_ids_to_fuse argument to fuse instances of the target object(s) (e.g. sky) together.
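The multi-node tip above asks for the average number of target masks across nodes. The sketch below only illustrates that arithmetic in plain Python. The helper name average_num_masks and its list argument are hypothetical, the sum stands in for a distributed all-reduce, and the lower bound of 1 is an assumption meant to avoid dividing a loss by zero.

```python
def average_num_masks(masks_per_node):
    """Average the target-mask count over all nodes, never going below 1.

    `masks_per_node` is a hypothetical list holding, for each node, the
    number of target masks found in that node's local batch.
    """
    world_size = len(masks_per_node)
    total = sum(masks_per_node)  # stands in for an all-reduce (sum) across nodes
    return max(total / world_size, 1.0)


# Three nodes whose local batches hold 12, 7 and 5 target masks.
normalizer = average_num_masks([12, 7, 5])
print(normalizer)
```

In a real distributed run the per-node counts would not be available as a list on one process, so the sum would have to come from a collective operation instead.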

The figure below illustrates the architecture of MaskFormer. Taken from the original paper.

This model was contributed by francesco. The original code can be found here.

Resources

Image Segmentation

  • All notebooks that illustrate inference as well as fine-tuning on custom data with MaskFormer can be found here.

MaskFormer specific outputs

class transformers.models.maskformer.modeling_maskformer.MaskFormerModelOutput

( encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None, pixel_decoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None, transformer_decoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None, encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, pixel_decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, transformer_decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Parameters

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Last hidden states (final feature map) of the last stage of the encoder model (backbone).

  • pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).

  • transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Last hidden states (final feature map) of the last stage of the transformer decoder model.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the encoder model at the output of each stage.

  • pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the pixel decoder model at the output of each stage.

  • transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the transformer decoder at the output of each stage.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor containing encoder_hidden_states, pixel_decoder_hidden_states and transformer_decoder_hidden_states.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights from DETR's decoder after the attention softmax, used to compute the weighted average in the self-attention heads.

Class for outputs of MaskFormerModel. This class returns all the needed hidden states to compute the logits.

class transformers.models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput

( loss: typing.Optional[torch.FloatTensor] = None, class_queries_logits: FloatTensor = None, masks_queries_logits: FloatTensor = None, auxiliary_logits: FloatTensor = None, encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None, pixel_decoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None, transformer_decoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None, encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, pixel_decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, transformer_decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Parameters

  • loss (torch.Tensor, optional): The computed loss, returned when labels are present.

  • class_queries_logits (torch.FloatTensor): A tensor of shape (batch_size, num_queries, num_labels + 1) representing the proposed classes for each query. Note that the + 1 is needed because we incorporate the null class.

  • masks_queries_logits (torch.FloatTensor): A tensor of shape (batch_size, num_queries, height, width) representing the proposed masks for each query.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Last hidden states (final feature map) of the last stage of the encoder model (backbone).

  • pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).

  • transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Last hidden states (final feature map) of the last stage of the transformer decoder model.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the encoder model at the output of each stage.

  • pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the pixel decoder model at the output of each stage.

  • transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the transformer decoder at the output of each stage.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor containing encoder_hidden_states, pixel_decoder_hidden_states and transformer_decoder_hidden_states.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights from DETR's decoder after the attention softmax, used to compute the weighted average in the self-attention heads.

Class for outputs of MaskFormerForInstanceSegmentation.

This output can be directly passed to post_process_semantic_segmentation(), post_process_instance_segmentation() or post_process_panoptic_segmentation(), depending on the task. Please see MaskFormerImageProcessor for details regarding usage.

MaskFormerConfig

class transformers.MaskFormerConfig

( fpn_feature_size: int = 256, mask_feature_size: int = 256, no_object_weight: float = 0.1, use_auxiliary_loss: bool = False, backbone_config: typing.Optional[typing.Dict] = None, decoder_config: typing.Optional[typing.Dict] = None, init_std: float = 0.02, init_xavier_std: float = 1.0, dice_weight: float = 1.0, cross_entropy_weight: float = 1.0, mask_weight: float = 20.0, output_auxiliary_logits: typing.Optional[bool] = None, **kwargs )

Parameters

  • mask_feature_size (int, optional, defaults to 256): The masks' features size. This value will also be used to specify the Feature Pyramid Network features' size.

  • no_object_weight (float, optional, defaults to 0.1): Weight to apply to the null (no object) class.

  • use_auxiliary_loss (bool, optional, defaults to False): If True, MaskFormerForInstanceSegmentationOutput will contain the auxiliary losses computed using the logits from each decoder's stage.

  • backbone_config (Dict, optional): The configuration passed to the backbone. If unset, the configuration corresponding to swin-base-patch4-window12-384 will be used.

  • decoder_config (Dict, optional): The configuration passed to the transformer decoder model. If unset, the base config for detr-resnet-50 will be used.

  • init_std (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • init_xavier_std (float, optional, defaults to 1): The scaling factor used for the Xavier initialization gain in the HM Attention map module.

  • dice_weight (float, optional, defaults to 1.0): The weight for the dice loss.

  • cross_entropy_weight (float, optional, defaults to 1.0): The weight for the cross entropy loss.

  • mask_weight (float, optional, defaults to 20.0): The weight for the mask loss.

  • output_auxiliary_logits (bool, optional): Whether the model should output its auxiliary_logits.

Raises

ValueError

  • ValueError: Raised if the backbone model type selected is not in ["swin"] or the decoder model type selected is not in ["detr"].

This is the configuration class to store the configuration of a MaskFormerModel. It is used to instantiate a MaskFormer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the MaskFormer facebook/maskformer-swin-base-ade architecture trained on ADE20k-150.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Currently, MaskFormer only supports the Swin Transformer as backbone.

Examples:
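As a minimal sketch, assuming the transformers package is installed, a configuration can be built from the defaults listed in the signature above and individual options overridden by keyword. Only the configuration object is created here; no model weights are loaded.

```python
from transformers import MaskFormerConfig

# Default configuration, using the defaults listed in the signature above.
configuration = MaskFormerConfig()
print(configuration.mask_feature_size, configuration.no_object_weight)

# Same defaults, but with the auxiliary decoder losses switched on.
aux_configuration = MaskFormerConfig(use_auxiliary_loss=True)
print(aux_configuration.use_auxiliary_loss)
```

A model with randomly initialized weights could then be created by passing the configuration to MaskFormerModel, at the cost of allocating the full backbone.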

from_backbone_and_decoder_configs

( backbone_config: PretrainedConfig, decoder_config: PretrainedConfig, **kwargs ) -> MaskFormerConfig

Parameters

  • backbone_config (PretrainedConfig): The backbone configuration.

  • decoder_config (PretrainedConfig): The transformer decoder (DETR) configuration.

Returns

MaskFormerConfig

An instance of a configuration object

Instantiate a MaskFormerConfig (or a derived class) from a pre-trained backbone model configuration and DETR model configuration.

MaskFormerImageProcessor

class transformers.MaskFormerImageProcessor

( do_resize: bool = True, size: typing.Dict[str, int] = None, size_divisor: int = 32, resample: Resampling = <Resampling.BILINEAR: 2>, do_rescale: bool = True, rescale_factor: float = 0.00392156862745098, do_normalize: bool = True, image_mean: typing.Union[float, typing.List[float]] = None, image_std: typing.Union[float, typing.List[float]] = None, ignore_index: typing.Optional[int] = None, do_reduce_labels: bool = False, **kwargs )

Parameters

  • do_resize (bool, optional, defaults to True): Whether to resize the input to a certain size.

  • size (int, optional, defaults to 800): Resize the input to the given size. Only has an effect if do_resize is set to True. If size is a sequence like (width, height), output size will be matched to this. If size is an int, the smaller edge of the image will be matched to this number, i.e., if height > width, then the image will be rescaled to (size * height / width, size).

  • max_size (int, optional, defaults to 1333): The largest size an image dimension can have (otherwise it's capped). Only has an effect if do_resize is set to True.

  • resample (int, optional, defaults to PIL.Image.Resampling.BILINEAR): An optional resampling filter. This can be one of PIL.Image.Resampling.NEAREST, PIL.Image.Resampling.BOX, PIL.Image.Resampling.BILINEAR, PIL.Image.Resampling.HAMMING, PIL.Image.Resampling.BICUBIC or PIL.Image.Resampling.LANCZOS. Only has an effect if do_resize is set to True.

  • size_divisor (int, optional, defaults to 32): Some backbones need images divisible by a certain number. If not passed, it defaults to the value used in Swin Transformer.

  • do_rescale (bool, optional, defaults to True): Whether to rescale the input to a certain scale.

  • rescale_factor (float, optional, defaults to 1/255): Rescale the input by the given factor. Only has an effect if do_rescale is set to True.

  • do_normalize (bool, optional, defaults to True): Whether or not to normalize the input with mean and standard deviation.

  • image_mean (float or List[float], optional, defaults to [0.485, 0.456, 0.406]): The sequence of means for each channel, to be used when normalizing images. Defaults to the ImageNet mean.

  • image_std (float or List[float], optional, defaults to [0.229, 0.224, 0.225]): The sequence of standard deviations for each channel, to be used when normalizing images. Defaults to the ImageNet std.

  • ignore_index (int, optional): Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels denoted with 0 (background) will be replaced with ignore_index.

  • do_reduce_labels (bool, optional, defaults to False): Whether or not to decrement all label values of segmentation maps by 1. Usually used for datasets where 0 is used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k). The background label will be replaced by ignore_index.
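To make the interplay of size, max_size and size_divisor concrete, here is a minimal sketch of the output-size arithmetic described by the parameters above. The function name resize_output_size is hypothetical, and the rounding choices (round to nearest, then round each edge up to a multiple of size_divisor) are assumptions rather than a statement of the library's exact behaviour.

```python
import math


def resize_output_size(height, width, size=800, max_size=1333, size_divisor=32):
    """Return (new_height, new_width) for a shortest-edge resize.

    The shorter edge is scaled to `size`; if that would push the longer edge
    past `max_size`, the scale is reduced so the longer edge equals `max_size`.
    Each edge is finally rounded up to a multiple of `size_divisor`.
    """
    scale = size / min(height, width)
    if max(height, width) * scale > max_size:
        scale = max_size / max(height, width)
    new_height = round(height * scale)
    new_width = round(width * scale)
    new_height = math.ceil(new_height / size_divisor) * size_divisor
    new_width = math.ceil(new_width / size_divisor) * size_divisor
    return new_height, new_width


print(resize_output_size(480, 640))   # landscape image, shorter edge is the height
print(resize_output_size(500, 2000))  # very wide image, capped by max_size
```

Note that under these assumptions, rounding up to the divisor can push an edge slightly past max_size.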

Constructs a MaskFormer image processor. The image processor can be used to prepare image(s) and optional targets for the model.
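When targets are prepared, the do_reduce_labels and ignore_index options described above change the label values in a segmentation map. The following plain-Python sketch illustrates that described behaviour on nested lists. The helper reduce_labels is hypothetical, and the value 255 for ignore_index is only an example.

```python
def reduce_labels(segmentation_map, ignore_index):
    """Shift every label down by 1 and map background (0) to `ignore_index`."""
    return [
        [ignore_index if label == 0 else label - 1 for label in row]
        for row in segmentation_map
    ]


# ADE20k-style map: 0 is background, real classes start at 1.
segmentation_map = [[0, 1, 1], [2, 150, 0]]
reduced = reduce_labels(segmentation_map, ignore_index=255)
print(reduced)
```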

This image processor inherits from BaseImageProcessor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]], segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')], NoneType] = None, instance_id_to_semantic_id: typing.Union[typing.Dict[int, int], NoneType] = None, do_resize: typing.Optional[bool] = None, size: typing.Union[typing.Dict[str, int], NoneType] = None, size_divisor: typing.Optional[int] = None, resample: Resampling = None, do_rescale: typing.Optional[bool] = None, rescale_factor: typing.Optional[float] = None, do_normalize: typing.Optional[bool] = None, image_mean: typing.Union[float, typing.List[float], NoneType] = None, image_std: typing.Union[float, typing.List[float], NoneType] = None, ignore_index: typing.Optional[int] = None, do_reduce_labels: typing.Optional[bool] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, data_format: typing.Union[str, transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'>, input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None, **kwargs )

encode_inputs

( pixel_values_list: typing.List[typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]]], segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]] = None, instance_id_to_semantic_id: typing.Union[typing.List[typing.Dict[int, int]], typing.Dict[int, int], NoneType] = None, ignore_index: typing.Optional[int] = None, reduce_labels: bool = False, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None ) -> BatchFeature

Parameters

  • pixel_values_list (List[ImageInput]): List of images (pixel values) to be padded. Each image should be a tensor of shape (channels, height, width).

  • segmentation_maps (ImageInput, optional): The corresponding semantic segmentation maps with the pixel-wise annotations.

    (bool, optional, defaults to True): Whether or not to pad images up to the largest image in a batch and create a pixel mask.

    If left to the default, will return a pixel mask that is:

    • 1 for pixels that are real (i.e. not masked),

    • 0 for pixels that are padding (i.e. masked).

  • instance_id_to_semantic_id (List[Dict[int, int]] or Dict[int, int], optional): A mapping between object instance ids and class ids. If passed, segmentation_maps is treated as an instance segmentation map where each pixel represents an instance id. Can be provided as a single dictionary with a global/dataset-level mapping or as a list of dictionaries (one per image), to map instance ids in each image separately.

  • return_tensors (str or TensorType, optional): If set, will return tensors instead of NumPy arrays. If set to 'pt', return PyTorch torch.Tensor objects.

Returns

BatchFeature

A BatchFeature with the following fields:

  • pixel_values: Pixel values to be fed to a model.

  • pixel_mask: Pixel mask to be fed to a model (when padding with a pixel mask is requested or if pixel_mask is in self.model_input_names).

  • mask_labels: Optional list of mask labels of shape (labels, height, width) to be fed to a model (when annotations are provided).

  • class_labels: Optional list of class labels of shape (labels) to be fed to a model (when annotations are provided). They identify the labels of mask_labels, e.g. the label of mask_labels[i][j] is class_labels[i][j].

Pad images up to the largest image in a batch and create a corresponding pixel_mask.

MaskFormer addresses semantic segmentation with a mask classification paradigm, thus input segmentation maps will be converted to lists of binary masks and their respective labels. Let's see an example, assuming segmentation_maps = [[2,6,7,9]], the output will contain mask_labels = [[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]] (four binary masks) and class_labels = [2,6,7,9], the labels for each mask.
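The conversion described in the paragraph above can be sketched in a few lines of plain Python. The helper segmentation_map_to_binary_masks is a hypothetical name used only for illustration, and it works on nested lists rather than tensors. Each mask keeps the two-dimensional (height, width) layout of the input map.

```python
def segmentation_map_to_binary_masks(segmentation_map):
    """Split a 2D label map into one binary mask per distinct label."""
    class_labels = sorted({label for row in segmentation_map for label in row})
    mask_labels = [
        [[1 if value == label else 0 for value in row] for row in segmentation_map]
        for label in class_labels
    ]
    return mask_labels, class_labels


# The 1 x 4 map from the text: four pixels, four distinct classes.
mask_labels, class_labels = segmentation_map_to_binary_masks([[2, 6, 7, 9]])
print(class_labels)
for mask in mask_labels:
    print(mask)
```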

post_process_semantic_segmentation

( outputs, target_sizes: typing.Union[typing.List[typing.Tuple[int, int]], NoneType] = None ) -> List[torch.Tensor]

Parameters

  • outputs (MaskFormerForInstanceSegmentation): Raw outputs of the model.

  • target_sizes (List[Tuple[int, int]], optional): List of length (batch_size), where each list item (Tuple[int, int]) corresponds to the requested final size (height, width) of each prediction. If left to None, predictions will not be resized.

Returns

List[torch.Tensor]

A list of length batch_size, where each item is a semantic segmentation map of shape (height, width) corresponding to the target_sizes entry (if target_sizes is specified). Each entry of each torch.Tensor corresponds to a semantic class id.

Converts the output of MaskFormerForInstanceSegmentation into semantic segmentation maps. Only supports PyTorch.
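For intuition, the mask classification outputs can be combined into a semantic map as sketched below for a single image. This is a NumPy illustration under stated assumptions, not the library routine: the function name semantic_map_from_logits is hypothetical, the null class is assumed to be the last column of class_queries_logits, and resizing to target_sizes is left out.

```python
import numpy as np


def semantic_map_from_logits(class_queries_logits, masks_queries_logits):
    """Combine per-query class and mask logits into a (height, width) class-id map.

    class_queries_logits: (num_queries, num_labels + 1), last column = null class
    masks_queries_logits: (num_queries, height, width)
    """
    # Softmax over classes, then drop the trailing null ("no object") class.
    shifted = class_queries_logits - class_queries_logits.max(axis=-1, keepdims=True)
    class_probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    class_probs = class_probs[:, :-1]
    # Sigmoid turns mask logits into per-pixel membership probabilities.
    mask_probs = 1.0 / (1.0 + np.exp(-masks_queries_logits))
    # score[c, h, w] = sum over queries q of class_probs[q, c] * mask_probs[q, h, w]
    scores = np.einsum("qc,qhw->chw", class_probs, mask_probs)
    return scores.argmax(axis=0)


# Two queries, two real labels plus the null class, and a 1 x 2 image.
class_logits = np.array([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
mask_logits = np.array([[[5.0, -5.0]], [[-5.0, 5.0]]])
segmentation = semantic_map_from_logits(class_logits, mask_logits)
print(segmentation)
```

Here query 0 votes for label 0 on the left pixel and query 1 votes for label 1 on the right pixel, so each pixel receives the label of the query that covers it.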

post_process_instance_segmentation

( outputs, threshold: float = 0.5, mask_threshold: float = 0.5, overlap_mask_area_threshold: float = 0.8, target_sizes: typing.Union[typing.List[typing.Tuple[int, int]], NoneType] = None, return_coco_annotation: typing.Optional[bool] = False, return_binary_maps: typing.Optional[bool] = False ) -> List[Dict]

Parameters

  • outputs (MaskFormerForInstanceSegmentation): Raw outputs of the model.

  • threshold (float, optional, defaults to 0.5): The probability score threshold to keep predicted instance masks.

  • mask_threshold (float, optional, defaults to 0.5): Threshold to use when turning the predicted masks into binary values.

  • overlap_mask_area_threshold (float, optional, defaults to 0.8): The overlap mask area threshold to merge or discard small disconnected parts within each binary instance mask.

  • target_sizes (List[Tuple], optional): List of length (batch_size), where each list item (Tuple[int, int]) corresponds to the requested final size (height, width) of each prediction. If left to None, predictions will not be resized.

  • return_coco_annotation (bool, optional, defaults to False): If set to True, segmentation maps are returned in COCO run-length encoding (RLE) format.

  • return_binary_maps (bool, optional, defaults to False): If set to True, segmentation maps are returned as a concatenated tensor of binary segmentation maps (one per detected instance).

Returns

List[Dict]

A list of dictionaries, one per image, each dictionary containing two keys:

  • segmentation: A tensor of shape (height, width) where each pixel represents a segment_id, or a List[List] run-length encoding (RLE) of the segmentation map if return_coco_annotation is set to True. Set to None if no mask is found above threshold.

  • segments_info: A dictionary that contains additional information on each segment.

    • id: An integer representing the segment_id.

    • label_id: An integer representing the label / semantic class id corresponding to segment_id.

    • score: Prediction score of segment with segment_id.

Converts the output of MaskFormerForInstanceSegmentationOutput into instance segmentation predictions. Only supports PyTorch.
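The role of the threshold parameter can be illustrated with a small plain-Python sketch. The helper keep_confident_queries is hypothetical, and it assumes that a query is kept only when its most likely class is a real label (not the null class, taken here to be the last entry) and its score exceeds threshold. The later mask binarization and overlap handling are not shown.

```python
def keep_confident_queries(class_probs, threshold=0.5):
    """Return (query_index, label_id, score) for queries that survive filtering.

    class_probs: one probability list per query, of length num_labels + 1,
    where the last entry is the null ("no object") class.
    """
    kept = []
    for query_index, probs in enumerate(class_probs):
        score = max(probs)
        label_id = probs.index(score)
        null_label = len(probs) - 1
        if label_id != null_label and score > threshold:
            kept.append((query_index, label_id, score))
    return kept


class_probs = [
    [0.90, 0.05, 0.05],  # confident prediction of label 0: kept
    [0.10, 0.20, 0.70],  # null class wins: dropped
    [0.30, 0.45, 0.25],  # label 1, but below the threshold: dropped
]
kept = keep_confident_queries(class_probs)
print(kept)
```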

post_process_panoptic_segmentation

( outputs, threshold: float = 0.5, mask_threshold: float = 0.5, overlap_mask_area_threshold: float = 0.8, label_ids_to_fuse: typing.Optional[typing.Set[int]] = None, target_sizes: typing.Union[typing.List[typing.Tuple[int, int]], NoneType] = None ) -> List[Dict]

Parameters

  • outputs (MaskFormerForInstanceSegmentationOutput): The outputs from MaskFormerForInstanceSegmentation.

  • threshold (float, optional, defaults to 0.5): The probability score threshold to keep predicted instance masks.

  • mask_threshold (float, optional, defaults to 0.5): Threshold to use when turning the predicted masks into binary values.

  • overlap_mask_area_threshold (float, optional, defaults to 0.8): The overlap mask area threshold to merge or discard small disconnected parts within each binary instance mask.

  • label_ids_to_fuse (Set[int], optional): The labels in this set will have all their instances fused together. For instance, we could say there can only be one sky in an image, but several persons, so the label ID for sky would be in that set, but not the one for person.

  • target_sizes (List[Tuple], optional): List of length (batch_size), where each list item (Tuple[int, int]) corresponds to the requested final size (height, width) of each prediction in batch. If left to None, predictions will not be resized.

Returns

List[Dict]

A list of dictionaries, one per image, each dictionary containing two keys:

  • segmentation: A tensor of shape (height, width) where each pixel represents a segment_id, set to None if no mask is found above threshold. If target_sizes is specified, segmentation is resized to the corresponding target_sizes entry.

  • segments_info: A dictionary that contains additional information on each segment.

    • id: An integer representing the segment_id.

    • label_id: An integer representing the label / semantic class id corresponding to segment_id.

    • was_fused: A boolean, True if label_id was in label_ids_to_fuse, False otherwise. Multiple instances of the same class / label were fused and assigned a single segment_id.

    • score: Prediction score of segment with segment_id.

Converts the output of MaskFormerForInstanceSegmentationOutput into image panoptic segmentation predictions. Only supports PyTorch.
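The effect of label_ids_to_fuse on segment ids can be sketched as follows. This is a simplified plain-Python illustration, not the library code: the helper assign_segment_ids and the label ids 3 (sky) and 7 (person) are hypothetical, and masks, scores and thresholds are left out so that only the id bookkeeping remains.

```python
def assign_segment_ids(label_ids, label_ids_to_fuse=frozenset()):
    """Give each predicted mask a segment id, sharing one id per fused label."""
    fused_ids = {}  # label_id -> segment id already assigned to that label
    segments_info = []
    next_id = 0
    for label_id in label_ids:
        was_fused = label_id in label_ids_to_fuse
        if was_fused and label_id in fused_ids:
            # Reuse the id of the earlier instance of this fused label.
            segment_id = fused_ids[label_id]
        else:
            next_id += 1
            segment_id = next_id
            if was_fused:
                fused_ids[label_id] = segment_id
        segments_info.append(
            {"id": segment_id, "label_id": label_id, "was_fused": was_fused}
        )
    return segments_info


SKY, PERSON = 3, 7  # hypothetical label ids
info = assign_segment_ids([SKY, PERSON, SKY, PERSON], label_ids_to_fuse={SKY})
segment_ids = [segment["id"] for segment in info]
print(segment_ids)
```

Both sky masks end up with the same segment id, while each person keeps its own.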

MaskFormerFeatureExtractor

class transformers.MaskFormerFeatureExtractor

( *args, **kwargs )

__call__

( images, segmentation_maps = None, **kwargs )

encode_inputs

( pixel_values_list: typing.List[typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]]], segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]] = None, instance_id_to_semantic_id: typing.Union[typing.List[typing.Dict[int, int]], typing.Dict[int, int], NoneType] = None, ignore_index: typing.Optional[int] = None, reduce_labels: bool = False, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None ) -> BatchFeature

Parameters

  • pixel_values_list (List[ImageInput]): List of images (pixel values) to be padded. Each image should be a tensor of shape (channels, height, width).

  • segmentation_maps (ImageInput, optional): The corresponding semantic segmentation maps with the pixel-wise annotations.

    (bool, optional, defaults to True): Whether or not to pad images up to the largest image in a batch and create a pixel mask.

    If left to the default, will return a pixel mask that is:

    • 1 for pixels that are real (i.e. not masked),

    • 0 for pixels that are padding (i.e. masked).

  • instance_id_to_semantic_id (List[Dict[int, int]] or Dict[int, int], optional): A mapping between object instance ids and class ids. If passed, segmentation_maps is treated as an instance segmentation map where each pixel represents an instance id. Can be provided as a single dictionary with a global/dataset-level mapping or as a list of dictionaries (one per image), to map instance ids in each image separately.

  • return_tensors (str or TensorType, optional): If set, will return tensors instead of NumPy arrays. If set to 'pt', return PyTorch torch.Tensor objects.

Returns

BatchFeature

A BatchFeature with the following fields:

  • pixel_values: Pixel values to be fed to a model.

  • pixel_mask: Pixel mask to be fed to a model (when padding with a pixel mask is requested or if pixel_mask is in self.model_input_names).

  • mask_labels: Optional list of mask labels of shape (labels, height, width) to be fed to a model (when annotations are provided).

  • class_labels: Optional list of class labels of shape (labels) to be fed to a model (when annotations are provided). They identify the labels of mask_labels, e.g. the label of mask_labels[i][j] is class_labels[i][j].

Pad images up to the largest image in a batch and create a corresponding pixel_mask.

MaskFormer addresses semantic segmentation with a mask classification paradigm, thus input segmentation maps will be converted to lists of binary masks and their respective labels. Let's see an example, assuming segmentation_maps = [[2,6,7,9]], the output will contain mask_labels = [[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]] (four binary masks) and class_labels = [2,6,7,9], the labels for each mask.

post_process_semantic_segmentation

<source>arrow-up-right

( outputstarget_sizes: typing.Union[typing.List[typing.Tuple[int, int]], NoneType] = None ) β†’ List[torch.Tensor]

Parameters

  • outputs (MaskFormerForInstanceSegmentationarrow-up-right) β€” Raw outputs of the model.

  • target_sizes (List[Tuple[int, int]], optional) β€” List of length (batch_size), where each list item (Tuple[int, int]]) corresponds to the requested final size (height, width) of each prediction. If left to None, predictions will not be resized.

Returns

List[torch.Tensor]

A list of length batch_size, where each item is a semantic segmentation map of shape (height, width) corresponding to the target_sizes entry (if target_sizes is specified). Each entry of each torch.Tensor corresponds to a semantic class id.

Converts the output of MaskFormerForInstanceSegmentation into semantic segmentation maps. Only supports PyTorch.
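As a rough illustration of how per-query predictions can be turned into a semantic map, the sketch below weights each query's sigmoid mask by that query's softmax class probabilities, sums over queries, and takes the per-pixel argmax. It is written in plain Python for a single image, assumes the null class is the last entry of each class logit vector, and is not the library's implementation (which operates on batched torch tensors and can resize to target_sizes).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def semantic_map(class_queries_logits, masks_queries_logits):
    """Combine query predictions for one image into a (height, width) map of class ids.

    class_queries_logits: [num_queries][num_labels + 1], null class assumed last.
    masks_queries_logits: [num_queries][height][width].
    """
    # Class probabilities per query, with the trailing null class dropped.
    class_probs = [softmax(query)[:-1] for query in class_queries_logits]
    num_queries = len(class_probs)
    num_labels = len(class_probs[0])
    height = len(masks_queries_logits[0])
    width = len(masks_queries_logits[0][0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class c at this pixel: sum over queries of p(c) * mask probability.
            scores = [
                sum(
                    class_probs[q][c] * sigmoid(masks_queries_logits[q][y][x])
                    for q in range(num_queries)
                )
                for c in range(num_labels)
            ]
            row.append(max(range(num_labels), key=scores.__getitem__))
        result.append(row)
    return result


# Two queries, two real classes plus the null class, and a 1 x 2 image.
class_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
mask_logits = [[[4.0, -4.0]], [[-4.0, 4.0]]]
segmentation = semantic_map(class_logits, mask_logits)
```

Here the first query favors class 0 and covers the left pixel, while the second favors class 1 and covers the right pixel, so the map assigns class 0 on the left and class 1 on the right.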

post_process_instance_segmentation

( outputs, threshold: float = 0.5, mask_threshold: float = 0.5, overlap_mask_area_threshold: float = 0.8, target_sizes: typing.Union[typing.List[typing.Tuple[int, int]], NoneType] = None, return_coco_annotation: typing.Optional[bool] = False, return_binary_maps: typing.Optional[bool] = False ) → List[Dict]

Parameters

  • outputs (MaskFormerForInstanceSegmentation): Raw outputs of the model.

  • threshold (float, optional, defaults to 0.5): The probability score threshold to keep predicted instance masks.

  • mask_threshold (float, optional, defaults to 0.5): Threshold to use when turning the predicted masks into binary values.

  • overlap_mask_area_threshold (float, optional, defaults to 0.8): The overlap mask area threshold to merge or discard small disconnected parts within each binary instance mask.

  • target_sizes (List[Tuple], optional): List of length (batch_size), where each list item (Tuple[int, int]) corresponds to the requested final size (height, width) of each prediction. If left to None, predictions will not be resized.

  • return_coco_annotation (bool, optional, defaults to False): If set to True, segmentation maps are returned in COCO run-length encoding (RLE) format.

  • return_binary_maps (bool, optional, defaults to False): If set to True, segmentation maps are returned as a concatenated tensor of binary segmentation maps (one per detected instance).

Returns

List[Dict]

A list of dictionaries, one per image, each dictionary containing two keys:

  • segmentation: A tensor of shape (height, width) where each pixel represents a segment_id, or a List[List] run-length encoding (RLE) of the segmentation map if return_coco_annotation is set to True. Set to None if no mask is found above threshold.

  • segments_info: A dictionary that contains additional information on each segment.

    • id: An integer representing the segment_id.

    • label_id: An integer representing the label / semantic class id corresponding to segment_id.

    • score: Prediction score of segment with segment_id.

Converts the output of MaskFormerForInstanceSegmentationOutput into instance segmentation predictions. Only supports PyTorch.
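To show how the returned structure might be consumed, here is a small sketch that counts the pixels of each segment in one result. It makes several assumptions for illustration only: the segmentation has already been converted to nested Python lists (for example with tensor.tolist()), segments_info is a list holding one dictionary per segment, the value -1 marks pixels that belong to no segment, and the helper name segment_areas is hypothetical.

```python
from collections import Counter


def segment_areas(result):
    """Map each segment id to (label_id, score, pixel_count) for one image."""
    if result["segmentation"] is None:
        # No mask was found above the threshold.
        return {}
    pixel_counts = Counter(value for row in result["segmentation"] for value in row)
    return {
        info["id"]: (info["label_id"], info["score"], pixel_counts.get(info["id"], 0))
        for info in result["segments_info"]
    }


# Hand-written stand-in for one entry of the returned list.
result = {
    "segmentation": [[1, 1, -1], [2, 2, 2]],
    "segments_info": [
        {"id": 1, "label_id": 17, "score": 0.97},
        {"id": 2, "label_id": 0, "score": 0.88},
    ],
}
areas = segment_areas(result)
empty = segment_areas({"segmentation": None, "segments_info": []})
```

The None check mirrors the documented behavior that segmentation is None when no mask passes the threshold.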

post_process_panoptic_segmentation

( outputs, threshold: float = 0.5, mask_threshold: float = 0.5, overlap_mask_area_threshold: float = 0.8, label_ids_to_fuse: typing.Optional[typing.Set[int]] = None, target_sizes: typing.Union[typing.List[typing.Tuple[int, int]], NoneType] = None ) → List[Dict]

Parameters

  • outputs (MaskFormerForInstanceSegmentationOutput): The outputs from MaskFormerForInstanceSegmentation.

  • threshold (float, optional, defaults to 0.5): The probability score threshold to keep predicted instance masks.

  • mask_threshold (float, optional, defaults to 0.5): Threshold to use when turning the predicted masks into binary values.

  • overlap_mask_area_threshold (float, optional, defaults to 0.8): The overlap mask area threshold to merge or discard small disconnected parts within each binary instance mask.

  • label_ids_to_fuse (Set[int], optional): The labels in this set will have all their instances fused together. For instance, we could say there can only be one sky in an image, but several persons, so the label ID for sky would be in that set, but not the one for person.

  • target_sizes (List[Tuple], optional): List of length (batch_size), where each list item (Tuple[int, int]) corresponds to the requested final size (height, width) of each prediction in the batch. If left to None, predictions will not be resized.

Returns

List[Dict]

A list of dictionaries, one per image, each dictionary containing two keys:

  • segmentation: A tensor of shape (height, width) where each pixel represents a segment_id, set to None if no mask is found above threshold. If target_sizes is specified, segmentation is resized to the corresponding target_sizes entry.

  • segments_info: A dictionary that contains additional information on each segment.

    • id: An integer representing the segment_id.

    • label_id: An integer representing the label / semantic class id corresponding to segment_id.

    • was_fused: A boolean, True if label_id was in label_ids_to_fuse, False otherwise. Multiple instances of the same class / label were fused and assigned a single segment_id.

    • score: Prediction score of segment with segment_id.

Converts the output of MaskFormerForInstanceSegmentationOutput into image panoptic segmentation predictions. Only supports PyTorch.
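The effect of label_ids_to_fuse can be illustrated with a simplified sketch of the id assignment alone. It ignores masks, scores and thresholds, starts segment ids at 1, and uses a hypothetical helper name and hypothetical label ids (0 for sky, 1 for person); none of these details are taken from the library.

```python
def assign_segment_ids(predicted_label_ids, label_ids_to_fuse=None):
    """Give each kept prediction a segment id, sharing one id per fused label.

    Returns (assignment, segments_info): assignment[k] is the segment id of
    prediction k, and segments_info has one entry per distinct segment id.
    """
    label_ids_to_fuse = label_ids_to_fuse or set()
    fused_segment_ids = {}  # label_id -> segment id shared by all its instances
    assignment, segments_info = [], []
    next_id = 0
    for label_id in predicted_label_ids:
        was_fused = label_id in label_ids_to_fuse
        if was_fused and label_id in fused_segment_ids:
            # A later instance of a fused label reuses the existing segment.
            assignment.append(fused_segment_ids[label_id])
            continue
        next_id += 1
        if was_fused:
            fused_segment_ids[label_id] = next_id
        assignment.append(next_id)
        segments_info.append({"id": next_id, "label_id": label_id, "was_fused": was_fused})
    return assignment, segments_info


SKY, PERSON = 0, 1  # hypothetical label ids
assignment, segments_info = assign_segment_ids([SKY, PERSON, SKY, PERSON], {SKY})
```

Both sky predictions end up in the same segment, while each person keeps its own segment id, which matches the sky and person example given for label_ids_to_fuse.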

MaskFormerModel

class transformers.MaskFormerModel

( config: MaskFormerConfig )

Parameters

The bare MaskFormer Model outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( pixel_values: Tensor, pixel_mask: typing.Optional[torch.Tensor] = None, output_hidden_states: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.models.maskformer.modeling_maskformer.MaskFormerModelOutput or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Pixel values. Pixel values can be obtained using AutoImageProcessor. See MaskFormerImageProcessor.__call__() for details.

  • pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional): Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]:

    • 1 for pixels that are real (i.e. not masked),

    • 0 for pixels that are padding (i.e. masked).

    What are attention masks?

  • output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

  • output_attentions (bool, optional): Whether or not to return the attention tensors of DETR’s decoder attention layers.

  • return_dict (bool, optional): Whether or not to return a MaskFormerModelOutput instead of a plain tuple.

Returns

transformers.models.maskformer.modeling_maskformer.MaskFormerModelOutput or tuple(torch.FloatTensor)

A transformers.models.maskformer.modeling_maskformer.MaskFormerModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (MaskFormerConfig) and inputs.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Last hidden states (final feature map) of the last stage of the encoder model (backbone).

  • pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).

  • transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Last hidden states (final feature map) of the last stage of the transformer decoder model.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the encoder model at the output of each stage.

  • pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the pixel decoder model at the output of each stage.

  • transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the transformer decoder at the output of each stage.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor containing encoder_hidden_states, pixel_decoder_hidden_states and transformer_decoder_hidden_states.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights from DETR’s decoder after the attention softmax, used to compute the weighted average in the self-attention heads.

The MaskFormerModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
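The difference between calling the instance and calling forward directly can be shown with a tiny stand-in class. TinyModule below is a simplified, hypothetical analog of the call protocol and is not torch.nn.Module; it only borrows the register_forward_hook name and the hook(module, input, output) calling convention to make the point.

```python
class TinyModule:
    """Minimal stand-in: __call__ wraps forward and runs registered hooks."""

    def __init__(self):
        self._forward_hooks = []

    def register_forward_hook(self, hook):
        self._forward_hooks.append(hook)

    def forward(self, x):
        # The "recipe" for the forward pass lives here.
        return x * 2

    def __call__(self, x):
        output = self.forward(x)
        # Post-processing steps run only when the instance itself is called.
        for hook in self._forward_hooks:
            hook(self, x, output)
        return output


seen_outputs = []
module = TinyModule()
module.register_forward_hook(lambda mod, inp, out: seen_outputs.append(out))

via_call = module(3)             # runs forward, then the hook
via_forward = module.forward(3)  # runs forward only, the hook is skipped
```

Both calls return the same value, but only the first one triggers the hook, which is why calling the instance is the recommended entry point.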

Examples:

MaskFormerForInstanceSegmentation

class transformers.MaskFormerForInstanceSegmentation

( config: MaskFormerConfig )

forward

( pixel_values: Tensor, mask_labels: typing.Optional[typing.List[torch.Tensor]] = None, class_labels: typing.Optional[typing.List[torch.Tensor]] = None, pixel_mask: typing.Optional[torch.Tensor] = None, output_auxiliary_logits: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Pixel values. Pixel values can be obtained using AutoImageProcessor. See MaskFormerImageProcessor.__call__() for details.

  • pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional): Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]:

    • 1 for pixels that are real (i.e. not masked),

    • 0 for pixels that are padding (i.e. masked).

    What are attention masks?

  • output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

  • output_attentions (bool, optional): Whether or not to return the attention tensors of DETR’s decoder attention layers.

  • return_dict (bool, optional): Whether or not to return a MaskFormerModelOutput instead of a plain tuple.

  • mask_labels (List[torch.Tensor], optional): List of mask labels of shape (num_labels, height, width) to be fed to a model.

  • class_labels (List[torch.LongTensor], optional): List of target class labels of shape (num_labels) to be fed to a model. They identify the labels of mask_labels, e.g. the label of mask_labels[i][j] is class_labels[i][j].

Returns

transformers.models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput or tuple(torch.FloatTensor)

A transformers.models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (MaskFormerConfig) and inputs.

  • loss (torch.Tensor, optional): The computed loss, returned when labels are present.

  • class_queries_logits (torch.FloatTensor): A tensor of shape (batch_size, num_queries, num_labels + 1) representing the proposed classes for each query. Note the + 1 is needed because we incorporate the null class.

  • masks_queries_logits (torch.FloatTensor): A tensor of shape (batch_size, num_queries, height, width) representing the proposed masks for each query.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Last hidden states (final feature map) of the last stage of the encoder model (backbone).

  • pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)): Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).

  • transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Last hidden states (final feature map) of the last stage of the transformer decoder model.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the encoder model at the output of each stage.

  • pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the pixel decoder model at the output of each stage.

  • transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the transformer decoder at the output of each stage.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor containing encoder_hidden_states, pixel_decoder_hidden_states and transformer_decoder_hidden_states.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights from DETR’s decoder after the attention softmax, used to compute the weighted average in the self-attention heads.

The MaskFormerForInstanceSegmentation forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

Semantic segmentation example:

Panoptic segmentation example:
