VQModel

The VQ-VAE model was introduced in Neural Discrete Representation Learning by Aaron van den Oord, Oriol Vinyals and Koray Kavukcuoglu. The model is used in 🌍 Diffusers to decode latent representations into images. Unlike AutoencoderKL, the VQModel works in a quantized latent space.

The abstract from the paper is:

Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of “posterior collapse” -- where the latents are ignored when they are paired with a powerful autoregressive decoder -- typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
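
The vector quantisation step mentioned in the abstract can be pictured as a nearest-neighbour lookup into a codebook. The following is a minimal pure-Python sketch of that idea only, not the Diffusers implementation: the helper names `squared_distance` and `quantize` are hypothetical, and the codebook values are made up for illustration (a trained VQ-VAE learns them).

```python
# Illustrative vector quantisation: snap each latent vector to its nearest
# codebook entry. The codebook below is hand-written for the example; in a
# trained VQ-VAE the entries are learnt.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Return (indices, quantized) for a list of latent vectors."""
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]
    quantized = [codebook[k] for k in indices]
    return indices, quantized


codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]   # 3 entries of dimension 2
latents = [[0.9, 1.2], [-0.1, 0.2], [-0.8, 0.7]]   # continuous encoder outputs

indices, quantized = quantize(latents, codebook)
print(indices)    # discrete codes, one per latent vector
print(quantized)  # the codebook vectors that replace the latents
```

In this sketch the number of codebook entries and their dimension play the roles that `num_vq_embeddings` and `vq_embed_dim` play in the class signature further down.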

VQModel

class diffusers.VQModel

( in_channels: int = 3, out_channels: int = 3, down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',), up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',), block_out_channels: typing.Tuple[int] = (64,), layers_per_block: int = 1, act_fn: str = 'silu', latent_channels: int = 3, sample_size: int = 32, num_vq_embeddings: int = 256, norm_num_groups: int = 32, vq_embed_dim: typing.Optional[int] = None, scaling_factor: float = 0.18215, norm_type: str = 'group' )

Parameters

  • in_channels (int, optional, defaults to 3): Number of channels in the input image.

  • out_channels (int, optional, defaults to 3): Number of channels in the output.

  • down_block_types (Tuple[str], optional, defaults to ("DownEncoderBlock2D",)): Tuple of downsample block types.

  • up_block_types (Tuple[str], optional, defaults to ("UpDecoderBlock2D",)): Tuple of upsample block types.

  • block_out_channels (Tuple[int], optional, defaults to (64,)): Tuple of block output channels.

  • act_fn (str, optional, defaults to "silu"): The activation function to use.

  • latent_channels (int, optional, defaults to 3): Number of channels in the latent space.

  • sample_size (int, optional, defaults to 32): Sample input size.

  • num_vq_embeddings (int, optional, defaults to 256): Number of codebook vectors in the VQ-VAE.

  • vq_embed_dim (int, optional): Hidden dimension of the codebook vectors in the VQ-VAE.

  • scaling_factor (float, optional, defaults to 0.18215): The component-wise standard deviation of the trained latent space, computed using the first batch of the training set. It is used to scale the latent space to unit variance when training the diffusion model. The latents are scaled with the formula z = z * scaling_factor before being passed to the diffusion model. When decoding, the latents are scaled back to the original scale with the formula z = 1 / scaling_factor * z. For more details, refer to sections 4.3.2 and D.1 of the High-Resolution Image Synthesis with Latent Diffusion Models paper.
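
The two scaling_factor formulas above are inverses of each other. The following pure-Python sketch shows the round trip on a handful of made-up latent values. The helper names `scale_latents` and `unscale_latents` are hypothetical and not part of the library; only the two formulas and the default value 0.18215 come from the parameter description.

```python
# Round trip through the two scaling formulas from the parameter description.
SCALING_FACTOR = 0.18215  # default value from the VQModel signature


def scale_latents(z, scaling_factor=SCALING_FACTOR):
    """z = z * scaling_factor, applied before the diffusion model sees z."""
    return [v * scaling_factor for v in z]


def unscale_latents(z, scaling_factor=SCALING_FACTOR):
    """z = 1 / scaling_factor * z, applied before decoding."""
    return [1 / scaling_factor * v for v in z]


raw = [5.49, -2.0, 0.0]             # made-up latent values
scaled = scale_latents(raw)         # what the diffusion model would work with
restored = unscale_latents(scaled)  # back on the autoencoder's original scale

print(scaled)
print(restored)
```

Up to floating-point rounding, `restored` matches `raw`, which is why the same factor has to be used on both sides.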

A VQ-VAE model for decoding latent representations.

This model inherits from ModelMixin. Check the superclass documentation for its generic methods implemented for all models (such as downloading or saving).

forward

( sample: FloatTensor, return_dict: bool = True ) → VQEncoderOutput or tuple

Parameters

  • sample (torch.FloatTensor): Input sample.

  • return_dict (bool, optional, defaults to True): Whether or not to return a models.vq_model.VQEncoderOutput instead of a plain tuple.

Returns

VQEncoderOutput or tuple

If return_dict is True, a VQEncoderOutput is returned, otherwise a plain tuple is returned.

The forward method.

VQEncoderOutput

class diffusers.models.vq_model.VQEncoderOutput

( latents: FloatTensor )

Parameters

  • latents (torch.FloatTensor of shape (batch_size, num_channels, height, width)): The encoded output sample from the last layer of the model.

Output of VQModel encoding method.
