Segment Anything
SAM (Segment Anything Model) was proposed in Segment Anything by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
The model can be used to predict segmentation masks of any object of interest given an input image.
The abstract from the paper is the following:
We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive — often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.
Tips:
The model predicts binary masks that indicate the presence or absence of the object of interest in an image.
The model predicts much better results if input 2D points and/or input bounding boxes are provided.
You can prompt multiple points for the same image and predict a single mask.
Fine-tuning the model is not supported yet.
According to the paper, textual input should also be supported. However, at the time of writing, this does not appear to be supported by the official repository.
This model was contributed by ybelkada and ArthurZ. The original code can be found here.
Below is an example of how to run mask generation given an image and a 2D point:
Resources:
Demo notebook for using the model.
Demo notebook for using the automatic mask generation pipeline.
Demo notebook for inference with MedSAM, a fine-tuned version of SAM on the medical domain.
Demo notebook for fine-tuning the model on custom data.
SamConfig
( vision_config = None, prompt_encoder_config = None, mask_decoder_config = None, initializer_range = 0.02, **kwargs )
Parameters
vision_config (Union[dict, SamVisionConfig], optional) — Dictionary of configuration options used to initialize SamVisionConfig.
prompt_encoder_config (Union[dict, SamPromptEncoderConfig], optional) — Dictionary of configuration options used to initialize SamPromptEncoderConfig.
mask_decoder_config (Union[dict, SamMaskDecoderConfig], optional) — Dictionary of configuration options used to initialize SamMaskDecoderConfig.
kwargs (optional) — Dictionary of keyword arguments.
SamConfig is the configuration class to store the configuration of a SamModel. It is used to instantiate a SAM model according to the specified arguments, defining the vision model, prompt-encoder model and mask decoder configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the SAM-ViT-H facebook/sam-vit-huge architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
SamVisionConfig
( hidden_size = 768, output_channels = 256, num_hidden_layers = 12, num_attention_heads = 12, num_channels = 3, image_size = 1024, patch_size = 16, hidden_act = 'gelu', layer_norm_eps = 1e-06, attention_dropout = 0.0, initializer_range = 1e-10, qkv_bias = True, mlp_ratio = 4.0, use_abs_pos = True, use_rel_pos = True, window_size = 14, global_attn_indexes = [2, 5, 8, 11], num_pos_feats = 128, mlp_dim = None, **kwargs )
Parameters
hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
output_channels (int, optional, defaults to 256) — Dimensionality of the output channels in the Patch Encoder.
num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
num_channels (int, optional, defaults to 3) — Number of channels in the input image.
image_size (int, optional, defaults to 1024) — Expected resolution. Target size of the resized input image.
patch_size (int, optional, defaults to 16) — Size of the patches to be extracted from the input image.
hidden_act (str, optional, defaults to "gelu") — The non-linear activation function (function or string).
layer_norm_eps (float, optional, defaults to 1e-6) — The epsilon used by the layer normalization layers.
attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 1e-10) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
qkv_bias (bool, optional, defaults to True) — Whether to add a bias to query, key, value projections.
mlp_ratio (float, optional, defaults to 4.0) — Ratio of the MLP hidden dimension to the embedding dimension.
use_abs_pos (bool, optional, defaults to True) — Whether to use absolute position embedding.
use_rel_pos (bool, optional, defaults to True) — Whether to use relative position embedding.
window_size (int, optional, defaults to 14) — Window size for relative position.
global_attn_indexes (List[int], optional, defaults to [2, 5, 8, 11]) — The indexes of the global attention layers.
num_pos_feats (int, optional, defaults to 128) — The dimensionality of the position embedding.
mlp_dim (int, optional, defaults to None) — The dimensionality of the MLP layer in the Transformer encoder. If None, defaults to mlp_ratio * hidden_size.
This is the configuration class to store the configuration of a SamVisionModel. It is used to instantiate a SAM vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SAM ViT-H facebook/sam-vit-huge architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
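The relationship between a few of these defaults can be checked with plain arithmetic. The snippet below is a minimal sketch, not library code: it copies the default values from the signature above, applies the documented fallback for mlp_dim, and assumes that the encoder tiles the resized image with non-overlapping square patches. The variable names effective_mlp_dim, patches_per_side and num_patches are illustrative only.

```python
# Default values copied from the SamVisionConfig signature above.
hidden_size = 768
mlp_ratio = 4.0
image_size = 1024
patch_size = 16
mlp_dim = None

# Documented fallback: when mlp_dim is None it becomes mlp_ratio * hidden_size.
effective_mlp_dim = int(mlp_ratio * hidden_size) if mlp_dim is None else mlp_dim

# Assumption: non-overlapping square patches cover the resized input image.
patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2

print(effective_mlp_dim)  # 3072
print(patches_per_side)   # 64
print(num_patches)        # 4096
```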
SamMaskDecoderConfig
( hidden_size = 256, hidden_act = 'relu', mlp_dim = 2048, num_hidden_layers = 2, num_attention_heads = 8, attention_downsample_rate = 2, num_multimask_outputs = 3, iou_head_depth = 3, iou_head_hidden_dim = 256, layer_norm_eps = 1e-06, **kwargs )
Parameters
hidden_size (int, optional, defaults to 256) — Dimensionality of the hidden states.
hidden_act (str, optional, defaults to "relu") — The non-linear activation function used inside the SamMaskDecoder module.
mlp_dim (int, optional, defaults to 2048) — Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
num_hidden_layers (int, optional, defaults to 2) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 8) — Number of attention heads for each attention layer in the Transformer encoder.
attention_downsample_rate (int, optional, defaults to 2) — The downsampling rate of the attention layer.
num_multimask_outputs (int, optional, defaults to 3) — The number of outputs from the SamMaskDecoder module. In the Segment Anything paper, this is set to 3.
iou_head_depth (int, optional, defaults to 3) — The number of layers in the IoU head module.
iou_head_hidden_dim (int, optional, defaults to 256) — The dimensionality of the hidden states in the IoU head module.
layer_norm_eps (float, optional, defaults to 1e-6) — The epsilon used by the layer normalization layers.
This is the configuration class to store the configuration of a SamMaskDecoder. It is used to instantiate a SAM mask decoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SAM ViT-H facebook/sam-vit-huge architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
SamPromptEncoderConfig
( hidden_size = 256, image_size = 1024, patch_size = 16, mask_input_channels = 16, num_point_embeddings = 4, hidden_act = 'gelu', layer_norm_eps = 1e-06, **kwargs )
Parameters
hidden_size (int, optional, defaults to 256) — Dimensionality of the hidden states.
image_size (int, optional, defaults to 1024) — The expected output resolution of the image.
patch_size (int, optional, defaults to 16) — The size (resolution) of each patch.
mask_input_channels (int, optional, defaults to 16) — The number of channels to be fed to the MaskDecoder module.
num_point_embeddings (int, optional, defaults to 4) — The number of point embeddings to be used.
hidden_act (str, optional, defaults to "gelu") — The non-linear activation function in the encoder and pooler.
This is the configuration class to store the configuration of a SamPromptEncoder. The SamPromptEncoder module is used to encode the input 2D points and bounding boxes. Instantiating a configuration with the defaults will yield a similar configuration to that of the SAM ViT-H facebook/sam-vit-huge architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
SamProcessor
( image_processor )
Parameters
image_processor (SamImageProcessor) — An instance of SamImageProcessor. The image processor is a required input.
Constructs a SAM processor which wraps a SAM image processor and a 2D points and bounding boxes processor into a single processor.
SamProcessor offers all the functionalities of SamImageProcessor. See the docstring of __call__() for more information.
SamImageProcessor
( do_resize: bool = True, size: typing.Dict[str, int] = None, resample: Resampling = <Resampling.BILINEAR: 2>, do_rescale: bool = True, rescale_factor: typing.Union[int, float] = 0.00392156862745098, do_normalize: bool = True, image_mean: typing.Union[float, typing.List[float], NoneType] = None, image_std: typing.Union[float, typing.List[float], NoneType] = None, do_pad: bool = True, pad_size: int = None, do_convert_rgb: bool = True, **kwargs )
Parameters
do_resize (bool, optional, defaults to True) — Whether to resize the image’s (height, width) dimensions to the specified size. Can be overridden by the do_resize parameter in the preprocess method.
size (dict, optional, defaults to {"longest_edge": 1024}) — Size of the output image after resizing. Resizes the longest edge of the image to match size["longest_edge"] while maintaining the aspect ratio. Can be overridden by the size parameter in the preprocess method.
resample (PILImageResampling, optional, defaults to PILImageResampling.BILINEAR) — Resampling filter to use if resizing the image. Can be overridden by the resample parameter in the preprocess method.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.
rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Only has an effect if do_rescale is set to True. Can be overridden by the rescale_factor parameter in the preprocess method.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (float or List[float], optional, defaults to IMAGENET_DEFAULT_MEAN) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or List[float], optional, defaults to IMAGENET_DEFAULT_STD) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
do_pad (bool, optional, defaults to True) — Whether to pad the image to the specified pad_size. Can be overridden by the do_pad parameter in the preprocess method.
pad_size (dict, optional, defaults to {"height": 1024, "width": 1024}) — Size of the output image after padding. Can be overridden by the pad_size parameter in the preprocess method.
do_convert_rgb (bool, optional, defaults to True) — Whether to convert the image to RGB.
Constructs a SAM image processor.
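To make the rescale and normalize parameters concrete, here is a minimal sketch of what those two steps do to a single RGB pixel. It is not the library implementation: the helper name rescale_and_normalize is hypothetical, the sketch assumes rescaling is applied before normalization, and the mean and standard deviation are the commonly used ImageNet values written out by hand.

```python
# Commonly used ImageNet statistics (assumed to match IMAGENET_DEFAULT_MEAN / STD).
IMAGENET_DEFAULT_MEAN = [0.485, 0.456, 0.406]
IMAGENET_DEFAULT_STD = [0.229, 0.224, 0.225]


def rescale_and_normalize(pixel, rescale_factor=1 / 255,
                          mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD):
    """Rescale one RGB pixel from [0, 255] to [0, 1], then normalize per channel."""
    rescaled = [channel * rescale_factor for channel in pixel]
    return [(value - m) / s for value, m, s in zip(rescaled, mean, std)]


white = rescale_and_normalize([255, 255, 255])
black = rescale_and_normalize([0, 0, 0])
print([round(v, 3) for v in white])
print([round(v, 3) for v in black])
```

If the input pixels are already in the range 0 to 1, the rescale step should be skipped (do_rescale=False), otherwise the values are scaled down twice.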
filter_masks
( masks, iou_scores, original_size, cropped_box_image, pred_iou_thresh = 0.88, stability_score_thresh = 0.95, mask_threshold = 0, stability_score_offset = 1, return_tensors = 'pt' )
Parameters
masks (Union[torch.Tensor, tf.Tensor]) — Input masks.
iou_scores (Union[torch.Tensor, tf.Tensor]) — List of IoU scores.
original_size (Tuple[int,int]) — Size of the original image.
cropped_box_image (np.array) — The cropped image.
pred_iou_thresh (float, optional, defaults to 0.88) — The threshold for the IoU scores.
stability_score_thresh (float, optional, defaults to 0.95) — The threshold for the stability score.
mask_threshold (float, optional, defaults to 0) — The threshold for the predicted masks.
stability_score_offset (float, optional, defaults to 1) — The offset for the stability score used in the _compute_stability_score method.
return_tensors (str, optional, defaults to pt) — If pt, returns torch.Tensor. If tf, returns tf.Tensor.
Filters the predicted masks by keeping only the ones that meet several criteria. The first criterion is that the IoU score needs to be greater than pred_iou_thresh. The second criterion is that the stability score needs to be greater than stability_score_thresh. The method also converts the predicted masks to bounding boxes and pads the predicted masks if necessary.
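The two filtering criteria can be sketched in plain Python. This is an illustration under stated assumptions, not the library code: the stability score is assumed to be the IoU between the mask binarized at mask_threshold + stability_score_offset and the mask binarized at mask_threshold - stability_score_offset, and the helper names stability_score and keep_mask are hypothetical. Masks are represented as nested lists of logits.

```python
def stability_score(mask_logits, mask_threshold=0.0, offset=1.0):
    """IoU between a strict and a loose binarization of the same mask logits."""
    flat = [value for row in mask_logits for value in row]
    # The strict mask is a subset of the loose mask, so its pixel count is the
    # intersection and the loose pixel count is the union.
    strict = sum(value > mask_threshold + offset for value in flat)
    loose = sum(value > mask_threshold - offset for value in flat)
    return strict / loose if loose else 0.0


def keep_mask(mask_logits, iou_score, pred_iou_thresh=0.88, stability_score_thresh=0.95):
    """Apply both criteria: predicted IoU and stability score."""
    return (iou_score > pred_iou_thresh
            and stability_score(mask_logits) > stability_score_thresh)


crisp = [[5.0, 5.0], [-5.0, -5.0]]  # logits far from the threshold
fuzzy = [[5.0, 0.5], [-0.5, -5.0]]  # logits close to the threshold

print(stability_score(crisp))  # 1.0
print(keep_mask(crisp, iou_score=0.9))  # True
print(keep_mask(fuzzy, iou_score=0.9))  # False
```

A mask whose logits sit close to the threshold changes a lot when the threshold moves, so it gets a low stability score and is dropped even if its predicted IoU is high.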
generate_crop_boxes
( image, target_size, crop_n_layers: int = 0, overlap_ratio: float = 0.3413333333333333, points_per_crop: typing.Optional[int] = 32, crop_n_points_downscale_factor: typing.Optional[typing.List[int]] = 1, device: typing.Optional[ForwardRef('torch.device')] = None, input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None, return_tensors: str = 'pt' )
Parameters
image (np.array) — Input original image.
target_size (int) — Target size of the resized image.
crop_n_layers (int, optional, defaults to 0) — If >0, mask prediction will be run again on crops of the image. Sets the number of layers to run, where each layer has 2**i_layer number of image crops.
overlap_ratio (float, optional, defaults to 512/1500) — Sets the degree to which crops overlap. In the first crop layer, crops will overlap by this fraction of the image length. Later layers with more crops scale down this overlap.
points_per_crop (int, optional, defaults to 32) — Number of points to sample from each crop.
crop_n_points_downscale_factor (List[int], optional, defaults to 1) — The number of points-per-side sampled in layer n is scaled down by crop_n_points_downscale_factor**n.
device (torch.device, optional, defaults to None) — Device to use for the computation. If None, cpu will be used.
input_data_format (str or ChannelDimension, optional) — The channel dimension format of the input image. If not provided, it will be inferred.
return_tensors (str, optional, defaults to pt) — If pt, returns torch.Tensor. If tf, returns tf.Tensor.
Generates a list of crop boxes of different sizes. The i-th layer has (2**i)**2 boxes.
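The box count per layer follows from tiling the image with a 2**i by 2**i grid. The sketch below builds such a grid with overlapping tiles so the counts can be checked. It is an assumption-laden illustration, not the library routine: the function name layer_crop_boxes and the overlap_px parameter (overlap in pixels rather than a ratio) are hypothetical, and the tile-size formula is one reasonable way to make the overlapping tiles cover the image.

```python
import math
from itertools import product


def layer_crop_boxes(width, height, i_layer, overlap_px):
    """Return [x0, y0, x1, y1] crop boxes for one layer of a 2**i by 2**i grid."""
    n = 2 ** i_layer  # crops per side in this layer
    # Tile size chosen so that n tiles overlapping by overlap_px span the image.
    crop_w = math.ceil((overlap_px * (n - 1) + width) / n)
    crop_h = math.ceil((overlap_px * (n - 1) + height) / n)
    xs = [(crop_w - overlap_px) * k for k in range(n)]
    ys = [(crop_h - overlap_px) * k for k in range(n)]
    return [
        [x0, y0, min(x0 + crop_w, width), min(y0 + crop_h, height)]
        for x0, y0 in product(xs, ys)
    ]


for i_layer in range(3):
    boxes = layer_crop_boxes(100, 100, i_layer, overlap_px=20)
    print(i_layer, len(boxes))  # 1, 4 and 16 boxes for layers 0, 1 and 2

print(layer_crop_boxes(100, 100, 1, overlap_px=20))
# [[0, 0, 60, 60], [0, 40, 60, 100], [40, 0, 100, 60], [40, 40, 100, 100]]
```

Layer 0 is the full image, so a call with crop_n_layers = 2 would produce 1 + 4 + 16 = 21 boxes in total under this scheme.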
pad_image
( image: ndarray, pad_size: typing.Dict[str, int], data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None, input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None, **kwargs )
Parameters
image (np.ndarray) — Image to pad.
pad_size (Dict[str, int]) — Size of the output image after padding.
data_format (str or ChannelDimension, optional) — The data format of the image. Can be either “channels_first” or “channels_last”. If None, the data_format of the image will be used.
input_data_format (str or ChannelDimension, optional) — The channel dimension format of the input image. If not provided, it will be inferred.
Pad an image to (pad_size["height"], pad_size["width"]) with zeros to the right and bottom.
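Padding only on the right and bottom keeps the image anchored at the top left corner, so pixel coordinates of points and boxes stay valid after padding. The following is a minimal sketch on a single-channel image stored as a list of rows; the helper name pad_bottom_right is hypothetical and the code is not the library implementation.

```python
def pad_bottom_right(image, pad_height, pad_width, fill=0):
    """Pad a 2D image (list of rows) with `fill` on the right and bottom only."""
    height, width = len(image), len(image[0])
    if height > pad_height or width > pad_width:
        raise ValueError("image is larger than the requested padded size")
    # Extend every existing row to the right, then append full rows at the bottom.
    padded = [row + [fill] * (pad_width - width) for row in image]
    padded += [[fill] * pad_width for _ in range(pad_height - height)]
    return padded


image = [[1, 2, 3],
         [4, 5, 6]]
padded = pad_bottom_right(image, pad_height=4, pad_width=4)
for row in padded:
    print(row)
# [1, 2, 3, 0]
# [4, 5, 6, 0]
# [0, 0, 0, 0]
# [0, 0, 0, 0]
```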
post_process_for_mask_generation
( all_masks, all_scores, all_boxes, crops_nms_thresh, return_tensors = 'pt' )
Parameters
all_masks (Union[List[torch.Tensor], List[tf.Tensor]]) — List of all predicted segmentation masks.
all_scores (Union[List[torch.Tensor], List[tf.Tensor]]) — List of all predicted IoU scores.
all_boxes (Union[List[torch.Tensor], List[tf.Tensor]]) — List of all bounding boxes of the predicted masks.
crops_nms_thresh (float) — Threshold for the NMS (Non Maximum Suppression) algorithm.
return_tensors (str, optional, defaults to pt) — If pt, returns torch.Tensor. If tf, returns tf.Tensor.
Post-processes the generated masks by applying the Non Maximum Suppression algorithm to the predicted masks.
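For readers unfamiliar with Non Maximum Suppression, the following is a minimal sketch of the standard greedy algorithm on bounding boxes in (x1, y1, x2, y2) format. It illustrates the general technique only; the function names box_iou and nms are chosen for this example, and how the library applies the result to the masks is not shown.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.95]
print(nms(boxes, scores, iou_threshold=0.5))  # [2, 0]
```

Box 1 overlaps box 0 with an IoU of 81/119 (about 0.68), which exceeds the threshold, so only the higher-scoring box 0 survives alongside the unrelated box 2.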
post_process_masks
( masks, original_sizes, reshaped_input_sizes, mask_threshold = 0.0, binarize = True, pad_size = None, return_tensors = 'pt' ) → (Union[torch.Tensor, tf.Tensor])
Parameters
masks (Union[List[torch.Tensor], List[np.ndarray], List[tf.Tensor]]) — Batched masks from the mask_decoder in (batch_size, num_channels, height, width) format.
original_sizes (Union[torch.Tensor, tf.Tensor, List[Tuple[int,int]]]) — The original sizes of each image before it was resized to the model’s expected input shape, in (height, width) format.
reshaped_input_sizes (Union[torch.Tensor, tf.Tensor, List[Tuple[int,int]]]) — The size of each image as it is fed to the model, in (height, width) format. Used to remove padding.
mask_threshold (float, optional, defaults to 0.0) — The threshold to use for binarizing the masks.
binarize (bool, optional, defaults to True) — Whether to binarize the masks.
pad_size (int, optional, defaults to self.pad_size) — The target size the images were padded to before being passed to the model. If None, the target size is assumed to be the processor’s pad_size.
return_tensors (str, optional, defaults to "pt") — If "pt", return PyTorch tensors. If "tf", return TensorFlow tensors.
Returns
(Union[torch.Tensor, tf.Tensor])
Batched masks in (batch_size, num_channels, height, width) format, where (height, width) is given by original_size.
Remove padding and upscale masks to the original image size.
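The order of operations can be sketched on a tiny single-channel mask. This is a simplified illustration, not the library code: it assumes the mask is first upscaled to the padded input resolution, then cropped to the reshaped input size, then resized to the original size and finally binarized, and it uses nearest-neighbour resizing in place of the smoother interpolation a real implementation would use. The helper names resize_nearest and post_process_mask are hypothetical.

```python
def resize_nearest(grid, out_h, out_w):
    """Nearest-neighbour resize of a 2D grid (a stand-in for smoother interpolation)."""
    in_h, in_w = len(grid), len(grid[0])
    return [[grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def post_process_mask(mask, original_size, reshaped_input_size, pad_size,
                      mask_threshold=0.0, binarize=True):
    # 1. Upscale the low-resolution mask to the padded model input resolution.
    mask = resize_nearest(mask, pad_size, pad_size)
    # 2. Remove the padding, which sits on the bottom and right.
    height, width = reshaped_input_size
    mask = [row[:width] for row in mask[:height]]
    # 3. Resize to the original image size.
    mask = resize_nearest(mask, original_size[0], original_size[1])
    # 4. Optionally turn logits into a boolean mask.
    if binarize:
        mask = [[value > mask_threshold for value in row] for row in mask]
    return mask


low_res = [[3.0, -2.0],
           [-1.0, -4.0]]
# Toy setting: a 4 x 8 image was resized to 2 x 4 and padded to 4 x 4.
result = post_process_mask(low_res, original_size=(4, 8),
                           reshaped_input_size=(2, 4), pad_size=4)
print(len(result), len(result[0]))  # 4 8
print(result[0])  # [True, True, True, True, False, False, False, False]
```

The bottom half of the low-resolution mask corresponds to padding in this toy setting, so it is cropped away before the final resize.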
preprocess
( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]], do_resize: typing.Optional[bool] = None, size: typing.Union[typing.Dict[str, int], NoneType] = None, resample: typing.Optional[ForwardRef('PILImageResampling')] = None, do_rescale: typing.Optional[bool] = None, rescale_factor: typing.Union[int, float, NoneType] = None, do_normalize: typing.Optional[bool] = None, image_mean: typing.Union[float, typing.List[float], NoneType] = None, image_std: typing.Union[float, typing.List[float], NoneType] = None, do_pad: typing.Optional[bool] = None, pad_size: typing.Union[typing.Dict[str, int], NoneType] = None, do_convert_rgb: bool = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'>, input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None, **kwargs )
Parameters
images (ImageInput) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
size (Dict[str, int], optional, defaults to self.size) — Controls the size of the image after resize. The longest edge of the image is resized to size["longest_edge"] whilst preserving the aspect ratio.
resample (PILImageResampling, optional, defaults to self.resample) — PILImageResampling filter to use when resizing the image e.g. PILImageResampling.BILINEAR.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image pixel values by the rescaling factor.
rescale_factor (int or float, optional, defaults to self.rescale_factor) — Rescale factor to apply to the image pixel values.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
image_mean (float or List[float], optional, defaults to self.image_mean) — Image mean to normalize the image by if do_normalize is set to True.
image_std (float or List[float], optional, defaults to self.image_std) — Image standard deviation to normalize the image by if do_normalize is set to True.
do_pad (bool, optional, defaults to self.do_pad) — Whether to pad the image.
pad_size (Dict[str, int], optional, defaults to self.pad_size) — Controls the size of the padding applied to the image. The image is padded to pad_size["height"] and pad_size["width"] if do_pad is set to True.
do_convert_rgb (bool, optional, defaults to self.do_convert_rgb) — Whether to convert the image to RGB.
return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
Unset: Return a list of np.ndarray.
TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
"channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
"channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
Unset: Use the channel dimension format of the input image.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
"channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
"channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
"none" or ChannelDimension.NONE: image in (height, width) format.
Preprocess an image or batch of images.
resize
( image: ndarray, size: typing.Dict[str, int], resample: Resampling = <Resampling.BICUBIC: 3>, data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None, input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None, **kwargs ) → np.ndarray
Parameters
image (np.ndarray) — Image to resize.
size (Dict[str, int]) — Dictionary in the format {"longest_edge": int} specifying the size of the output image. The longest edge of the image will be resized to the specified size, while the other edge will be resized to maintain the aspect ratio.
resample — PILImageResampling filter to use when resizing the image e.g. PILImageResampling.BILINEAR.
data_format (ChannelDimension or str, optional) — The channel dimension format for the output image. If unset, the channel dimension format of the input image is used. Can be one of:
"channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
"channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
"channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
"channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
Returns
np.ndarray
The resized image.
Resize an image so that its longest edge matches size["longest_edge"], keeping the aspect ratio.
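The output shape of a longest-edge resize can be computed without touching any pixels. The helper below is a minimal sketch with a hypothetical name, longest_edge_output_size; it assumes the shorter side is scaled by the same factor and rounded to the nearest integer, which may differ slightly from the rounding used by the library.

```python
def longest_edge_output_size(height, width, longest_edge=1024):
    """Return (new_height, new_width) so the longest side equals `longest_edge`."""
    scale = longest_edge / max(height, width)
    # Adding 0.5 before truncation rounds to the nearest integer.
    return int(height * scale + 0.5), int(width * scale + 0.5)


print(longest_edge_output_size(480, 640))    # (768, 1024)
print(longest_edge_output_size(1500, 1000))  # (1024, 683)
```

The resized image is generally not square, which is why a padding step is needed afterwards to reach a fixed input shape.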
SamModel
( config )
Parameters
config (SamConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Segment Anything Model (SAM) for generating segmentation masks, given an input image and optional 2D locations and bounding boxes. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( pixel_values: typing.Optional[torch.FloatTensor] = None, input_points: typing.Optional[torch.FloatTensor] = None, input_labels: typing.Optional[torch.LongTensor] = None, input_boxes: typing.Optional[torch.FloatTensor] = None, input_masks: typing.Optional[torch.LongTensor] = None, image_embeddings: typing.Optional[torch.FloatTensor] = None, multimask_output: bool = True, attention_similarity: typing.Optional[torch.FloatTensor] = None, target_embedding: typing.Optional[torch.FloatTensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, **kwargs )
Parameters
pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using SamProcessor. See SamProcessor.__call__() for details.
input_points (torch.FloatTensor of shape (batch_size, point_batch_size, num_points, 2)) — Input 2D spatial points, used by the prompt encoder to encode the prompt. Providing points generally yields much better results. The points can be obtained by passing a list of lists of lists to the processor, which will create corresponding torch tensors of dimension 4. The first dimension is the image batch size, the second dimension is the point batch size (i.e. how many segmentation masks we want the model to predict per input point), the third dimension is the number of points per segmentation mask (it is possible to pass multiple points for a single mask), and the last dimension holds the x (horizontal) and y (vertical) coordinates of the point. If a different number of points is passed either for each image or for each mask, the processor will create “PAD” points that correspond to the (0, 0) coordinate, and the computation of the embedding will be skipped for these points using the labels.
input_labels (torch.LongTensor of shape (batch_size, point_batch_size, num_points)) — Input labels for the points, used by the prompt encoder to encode the prompt. According to the official implementation, there are 3 types of labels:
1: the point is a point that contains the object of interest
0: the point is a point that does not contain the object of interest
-1: the point corresponds to the background
We added the label:
-10: the point is a padding point, thus should be ignored by the prompt encoder
The padding labels should be added automatically by the processor.
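The padding scheme described for input_points and input_labels can be sketched as follows. This is an illustration of the description above, not the processor's code: the helper name pad_points_and_labels is hypothetical, and the pad coordinate (0, 0) and pad label -10 are taken from the parameter descriptions, so the value the library uses internally may differ.

```python
PAD_POINT = [0, 0]  # pad coordinate, as described for input_points
PAD_LABEL = -10     # label telling the prompt encoder to ignore the point


def pad_points_and_labels(points_per_mask, labels_per_mask):
    """Pad every point list (and its labels) to the length of the longest one."""
    max_points = max(len(points) for points in points_per_mask)
    padded_points, padded_labels = [], []
    for points, labels in zip(points_per_mask, labels_per_mask):
        missing = max_points - len(points)
        padded_points.append(points + [list(PAD_POINT) for _ in range(missing)])
        padded_labels.append(labels + [PAD_LABEL] * missing)
    return padded_points, padded_labels


# Two masks for one image: the first prompted with two points, the second with one.
points = [[[450, 600], [500, 620]], [[100, 200]]]
labels = [[1, 0], [1]]

padded_points, padded_labels = pad_points_and_labels(points, labels)
print(padded_points)  # [[[450, 600], [500, 620]], [[100, 200], [0, 0]]]
print(padded_labels)  # [[1, 0], [1, -10]]
```

After padding, both lists are rectangular and can be stacked into tensors, while the -10 labels mark which entries carry no prompt information.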
input_boxes (torch.FloatTensor of shape (batch_size, num_boxes, 4)) — Input boxes for the points, used by the prompt encoder to encode the prompt. Providing boxes generally yields much better generated masks. The boxes can be obtained by passing a list of lists of lists to the processor, which will generate a torch tensor, with each dimension corresponding respectively to the image batch size, the number of boxes per image and the coordinates of the top left and bottom right points of the box, in the order (x1, y1, x2, y2):
x1: the x coordinate of the top left point of the input box
y1: the y coordinate of the top left point of the input box
x2: the x coordinate of the bottom right point of the input box
y2: the y coordinate of the bottom right point of the input box
input_masks (torch.FloatTensor of shape (batch_size, image_size, image_size)) — The SAM model also accepts segmentation masks as input. The mask will be embedded by the prompt encoder to generate a corresponding embedding, which will later be fed to the mask decoder. These masks need to be supplied manually by the user, and they need to be of shape (batch_size, image_size, image_size).
image_embeddings (torch.FloatTensor of shape (batch_size, output_channels, window_size, window_size)) — Image embeddings, used by the mask decoder to generate masks and IoU scores. For more memory-efficient computation, users can first retrieve the image embeddings using the get_image_embeddings method, and then feed them to the forward method instead of feeding the pixel_values.
multimask_output (bool, optional) — In the original implementation and paper, the model always outputs 3 masks per image (or per point / per bounding box if relevant). However, it is possible to output just a single mask, corresponding to the “best” mask, by specifying multimask_output=False.
attention_similarity (torch.FloatTensor, optional) — Attention similarity tensor, to be provided to the mask decoder for target-guided attention in case the model is used for personalization as introduced in PerSAM.
target_embedding (torch.FloatTensor, optional) — Embedding of the target concept, to be provided to the mask decoder for target-semantic prompting in case the model is used for personalization as introduced in PerSAM.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Example —
The SamModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
TFSamModel
( *args, **kwargs )
Parameters
config (SamConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Segment Anything Model (SAM) for generating segmentation masks, given an input image and optional 2D locations and bounding boxes. This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a TensorFlow tf.keras.Model subclass. Use it as a regular TensorFlow Model and refer to the TensorFlow documentation for all matters related to general usage and behavior.
call
( pixel_values: TFModelInputType | None = None, input_points: tf.Tensor | None = None, input_labels: tf.Tensor | None = None, input_boxes: tf.Tensor | None = None, input_masks: tf.Tensor | None = None, image_embeddings: tf.Tensor | None = None, multimask_output: bool = True, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, training: bool = False, **kwargs )
Parameters
pixel_values (tf.Tensor
of shape (batch_size, num_channels, height, width)
) — Pixel values. Pixel values can be obtained using SamProcessor. See SamProcessor.__call__()
for details.
input_points (tf.Tensor
of shape (batch_size, point_batch_size, num_points, 2)
) — Input 2D spatial points, used by the prompt encoder to encode the prompt. Providing them generally yields much better results. The points can be obtained by passing a list of lists of lists to the processor, which creates the corresponding tf tensors of dimension 4. The first dimension is the image batch size, the second dimension is the point batch size (i.e. how many segmentation masks we want the model to predict per input point), the third dimension is the number of points per segmentation mask (it is possible to pass multiple points for a single mask), and the last dimension holds the x (horizontal) and y (vertical) coordinates of the point. If a different number of points is passed for each image or for each mask, the processor creates “PAD” points at the (0, 0) coordinate, and the labels are used to skip the embedding computation for these points.
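To make the four-dimensional layout concrete, the sketch below builds a plain nested list with the same axes (image batch, point batch, points per mask, coordinates) and computes its shape. The helper `nested_shape` and the sample coordinates are illustrative assumptions, not part of the library, and the list mirrors the layout of the resulting tensor rather than the exact input format the processor expects.

```python
def nested_shape(data):
    """Return the shape of a rectangular nested list as a tuple."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        if not data:
            break
        data = data[0]
    return tuple(shape)


# One image, two masks requested, two (x, y) points describing each mask.
points = [
    [  # image 0
        [[450, 600], [500, 620]],  # points for mask 0
        [[100, 200], [120, 240]],  # points for mask 1
    ]
]

shape = nested_shape(points)
print(shape)  # (1, 2, 2, 2)
```

Reading the shape left to right gives the image batch size, the point batch size, the number of points per mask, and the two coordinates.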
input_labels (tf.Tensor
of shape (batch_size, point_batch_size, num_points)
) — Input labels for the points, used by the prompt encoder to encode the prompt. According to the official implementation, there are 3 types of labels:
1: the point contains the object of interest
0: the point does not contain the object of interest
-1: the point corresponds to the background
We added the label:
-10: the point is a padding point and should therefore be ignored by the prompt encoder
The processor adds the padding labels automatically.
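The padding behaviour described for points and labels can be illustrated with a small pure-Python sketch. It pads the point lists of a single image to a common length with (0, 0) points and marks the padded positions with the label -10. The function `pad_prompts` is a hypothetical helper written for this illustration; it is not the processor's actual implementation.

```python
PAD_POINT = [0, 0]  # padding coordinate described above
PAD_LABEL = -10     # padding label described above


def pad_prompts(points, labels):
    """Pad every mask's point list (for one image) to the same length.

    points: list (one entry per mask) of lists of [x, y] pairs
    labels: list (one entry per mask) of lists of int labels
    """
    if len(points) != len(labels):
        raise ValueError("points and labels must describe the same masks")
    max_len = max(len(mask_points) for mask_points in points)
    padded_points, padded_labels = [], []
    for mask_points, mask_labels in zip(points, labels):
        if len(mask_points) != len(mask_labels):
            raise ValueError("each point needs exactly one label")
        missing = max_len - len(mask_points)
        padded_points.append(mask_points + [list(PAD_POINT) for _ in range(missing)])
        padded_labels.append(mask_labels + [PAD_LABEL] * missing)
    return padded_points, padded_labels


# Mask 0 is described by one point, mask 1 by three points.
points = [[[450, 600]], [[100, 200], [120, 240], [90, 180]]]
labels = [[1], [1, 1, 0]]

padded_points, padded_labels = pad_prompts(points, labels)
print(padded_points[0])  # [[450, 600], [0, 0], [0, 0]]
print(padded_labels[0])  # [1, -10, -10]
```

After padding, both structures are rectangular, so they can be converted to tensors, and the -10 labels tell the prompt encoder which positions to ignore.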
input_boxes (tf.Tensor
of shape (batch_size, num_boxes, 4)
) — Input boxes, used by the prompt encoder to encode the prompt. Providing them generally yields much better generated masks. The boxes can be obtained by passing a list of lists of lists to the processor, which generates a tf tensor whose dimensions correspond respectively to the image batch size, the number of boxes per image, and the coordinates of the top-left and bottom-right points of the box, in the order (x1, y1, x2, y2):
x1: the x coordinate of the top-left point of the input box
y1: the y coordinate of the top-left point of the input box
x2: the x coordinate of the bottom-right point of the input box
y2: the y coordinate of the bottom-right point of the input box
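Boxes from other tools often come as (x, y, width, height), so a conversion to the (x1, y1, x2, y2) order is a common preparation step. The sketch below shows one way to do it. Both helpers are hypothetical and written for this illustration, and the check assumes image coordinates with the origin at the top-left corner, so that the top-left point has the smaller x and y values.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, width, height) to (x1, y1, x2, y2)."""
    x, y, width, height = box
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return [x, y, x + width, y + height]


def is_valid_xyxy(box):
    """Check that the first point lies above and to the left of the second."""
    x1, y1, x2, y2 = box
    return x1 < x2 and y1 < y2


# Nesting: image batch -> boxes per image -> 4 coordinates.
boxes_xywh = [[[75, 275, 1650, 575]]]
input_boxes = [
    [xywh_to_xyxy(box) for box in image_boxes] for image_boxes in boxes_xywh
]

print(input_boxes)  # [[[75, 275, 1725, 850]]]
```

The resulting nested list has one entry per image, one entry per box, and four coordinates per box, matching the dimensions described above.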
input_masks (tf.Tensor
of shape (batch_size, image_size, image_size)
) — The SAM model also accepts segmentation masks as input. The mask is embedded by the prompt encoder to generate a corresponding embedding, which is later fed to the mask decoder. These masks need to be supplied manually by the user, and they need to be of shape (batch_size, image_size, image_size).
image_embeddings (tf.Tensor
of shape (batch_size, output_channels, window_size, window_size)
) — Image embeddings, used by the mask decoder to generate masks and IoU scores. For more memory-efficient computation, users can first retrieve the image embeddings using the get_image_embeddings method, and then feed them to the call method instead of feeding the pixel_values.
multimask_output (bool
, optional) — In the original implementation and paper, the model always outputs 3 masks per image (or per point / per bounding box if relevant). However, it is possible to output just a single mask, corresponding to the “best” mask, by specifying multimask_output=False.
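When all three candidate masks are kept, a common follow-up is to pick one of them using the predicted IoU scores. The sketch below shows that selection on made-up values. The scores and mask placeholders are invented for illustration, and the source does not state whether multimask_output=False applies this same rule internally.

```python
def best_mask_index(iou_scores):
    """Return the index of the highest predicted IoU score."""
    return max(range(len(iou_scores)), key=lambda i: iou_scores[i])


# Placeholders standing in for the three candidate masks of one prompt.
candidate_masks = ["mask_a", "mask_b", "mask_c"]
iou_scores = [0.71, 0.93, 0.88]  # made-up scores, one per candidate mask

best = best_mask_index(iou_scores)
print(best, candidate_masks[best])  # 1 mask_b
```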
output_attentions (bool
, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool
, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool
, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
The TFSamModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
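The difference between calling the instance and calling the forward method directly can be seen in a toy class. This is a plain-Python analogy written for illustration only: `ToyLayer` and its pre- and post-processing steps are hypothetical and do not reflect what Keras or the library actually does internally.

```python
class ToyLayer:
    """Toy stand-in for a model whose __call__ wraps the forward recipe."""

    def __call__(self, x):
        x = self.pre_process(x)      # runs only when the instance is called
        y = self.call(x)             # the forward recipe itself
        return self.post_process(y)  # runs only when the instance is called

    def pre_process(self, x):
        return float(x)

    def call(self, x):
        return x * 2

    def post_process(self, y):
        return round(y, 2)


layer = ToyLayer()
via_instance = layer("1.5")     # pre- and post-processing applied
via_method = layer.call("1.5")  # processing skipped without any error

print(via_instance)  # 3.0
print(via_method)    # 1.51.5
```

Calling `layer.call` directly raises no error, yet it skips the conversion step and returns a wrong result, which is the silent failure the note above warns about.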