VQModel
The VQ-VAE model was introduced in Neural Discrete Representation Learning by Aaron van den Oord, Oriol Vinyals and Koray Kavukcuoglu. The model is used in 🤗 Diffusers to decode latent representations into images. Unlike AutoencoderKL, the VQModel works in a quantized latent space.
The abstract from the paper is:
Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of "posterior collapse" -- where the latents are ignored when they are paired with a powerful autoregressive decoder -- typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
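To make the idea of a quantized latent space concrete, the following is a minimal sketch of the nearest-neighbor codebook lookup at the heart of vector quantization. It is an illustration only, not the Diffusers implementation: the helper names (`squared_distance`, `quantize`), the toy codebook, and the toy encoder outputs are all assumptions made for this example.

```python
# Illustrative sketch of vector quantization (not the Diffusers implementation).
# Each continuous encoder output is replaced by its nearest codebook entry.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors, codebook):
    """Return (indices, quantized): the nearest codebook index and entry per vector."""
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Hypothetical codebook with 3 entries of dimension 2
# (the model defaults above use 256 entries instead).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Hypothetical continuous encoder outputs.
encoder_outputs = [[0.1, -0.2], [0.9, 1.3], [-0.8, 1.7]]

indices, quantized = quantize(encoder_outputs, codebook)
print(indices)
print(quantized)
```

The discrete `indices` are what makes the latent space "quantized": every latent vector is one of a fixed number of codebook entries rather than an arbitrary continuous value.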
VQModel
class diffusers.VQModel
( in_channels: int = 3, out_channels: int = 3, down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',), up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',), block_out_channels: typing.Tuple[int] = (64,), layers_per_block: int = 1, act_fn: str = 'silu', latent_channels: int = 3, sample_size: int = 32, num_vq_embeddings: int = 256, norm_num_groups: int = 32, vq_embed_dim: typing.Optional[int] = None, scaling_factor: float = 0.18215, norm_type: str = 'group' )
Parameters
in_channels (int, optional, defaults to 3): Number of channels in the input image.
out_channels (int, optional, defaults to 3): Number of channels in the output.
down_block_types (Tuple[str], optional, defaults to ("DownEncoderBlock2D",)): Tuple of downsample block types.
up_block_types (Tuple[str], optional, defaults to ("UpDecoderBlock2D",)): Tuple of upsample block types.
block_out_channels (Tuple[int], optional, defaults to (64,)): Tuple of block output channels.
act_fn (str, optional, defaults to "silu"): The activation function to use.
latent_channels (int, optional, defaults to 3): Number of channels in the latent space.
sample_size (int, optional, defaults to 32): Sample input size.
num_vq_embeddings (int, optional, defaults to 256): Number of codebook vectors in the VQ-VAE.
vq_embed_dim (int, optional): Hidden dim of codebook vectors in the VQ-VAE.
scaling_factor (float, optional, defaults to 0.18215): The component-wise standard deviation of the trained latent space computed using the first batch of the training set. This is used to scale the latent space to have unit variance when training the diffusion model. The latents are scaled with the formula z = z * scaling_factor before being passed to the diffusion model. When decoding, the latents are scaled back to the original scale with the formula z = 1 / scaling_factor * z. For more details, refer to sections 4.3.2 and D.1 of the High-Resolution Image Synthesis with Latent Diffusion Models paper.
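The two scaling formulas given for scaling_factor are inverses of each other, which a small round trip can illustrate. This is a sketch using plain Python lists in place of tensors; the helper names `scale_latents` and `unscale_latents` and the sample values are hypothetical and are not part of the Diffusers API.

```python
import math

# Default value taken from the class signature above.
scaling_factor = 0.18215


def scale_latents(latents, factor=scaling_factor):
    """Apply z = z * scaling_factor (before passing latents to the diffusion model)."""
    return [z * factor for z in latents]


def unscale_latents(latents, factor=scaling_factor):
    """Apply z = 1 / scaling_factor * z (before decoding)."""
    return [1 / factor * z for z in latents]


# Hypothetical raw latent values standing in for a tensor.
raw = [5.49, -2.0, 0.0]

scaled = scale_latents(raw)
restored = unscale_latents(scaled)

# Up to floating-point error, the round trip recovers the original values.
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(raw, restored)))
```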
A VQ-VAE model for decoding latent representations.
This model inherits from ModelMixin. Check the superclass documentation for its generic methods implemented for all models (such as downloading or saving).
forward
( sample: FloatTensor, return_dict: bool = True ) → VQEncoderOutput or tuple
Parameters
sample (torch.FloatTensor): Input sample.
return_dict (bool, optional, defaults to True): Whether or not to return a models.vq_model.VQEncoderOutput instead of a plain tuple.
Returns
VQEncoderOutput or tuple
If return_dict is True, a VQEncoderOutput is returned; otherwise a plain tuple is returned.
The VQModel forward method.
VQEncoderOutput
class diffusers.models.vq_model.VQEncoderOutput
( latents: FloatTensor )
Parameters
latents (torch.FloatTensor of shape (batch_size, num_channels, height, width)): The encoded output sample from the last layer of the model.
Output of VQModel encoding method.