
AsymmetricAutoencoderKL


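An improved, larger variational autoencoder (VAE) model with KL loss for the inpainting task, introduced in Designing a Better Asymmetric VQGAN for StableDiffusion (https://arxiv.org/abs/2306.04632) by Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, Gang Hua.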

The abstract from the paper is:

StableDiffusion is a revolutionary text-to-image generator that is causing a stir in the world of image generation and editing. Unlike traditional methods that learn a diffusion model in pixel space, StableDiffusion learns a diffusion model in the latent space via a VQGAN, ensuring both efficiency and quality. It not only supports image generation tasks, but also enables image editing for real images, such as image inpainting and local editing. However, we have observed that the vanilla VQGAN used in StableDiffusion leads to significant information loss, causing distortion artifacts even in non-edited image regions. To this end, we propose a new asymmetric VQGAN with two simple designs. Firstly, in addition to the input from the encoder, the decoder contains a conditional branch that incorporates information from task-specific priors, such as the unmasked image region in inpainting. Secondly, the decoder is much heavier than the encoder, allowing for more detailed recovery while only slightly increasing the total inference cost. The training cost of our asymmetric VQGAN is cheap, and we only need to retrain a new asymmetric decoder while keeping the vanilla VQGAN encoder and StableDiffusion unchanged. Our asymmetric VQGAN can be widely used in StableDiffusion-based inpainting and local editing methods. Extensive experiments demonstrate that it can significantly improve the inpainting and editing performance, while maintaining the original text-to-image capability. The code is available at https://github.com/buxiangzhiren/Asymmetric_VQGAN.

Evaluation results can be found in section 4.1 of the original paper.

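Available checkpoints

  • https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5
  • https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2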

Example Usage

from io import BytesIO

import requests
from PIL import Image
from diffusers import AsymmetricAutoencoderKL, StableDiffusionInpaintPipeline


def download_image(url: str) -> Image.Image:
    # Fetch the image over HTTP and fail early on a bad response.
    response = requests.get(url)
    response.raise_for_status()
    return Image.open(BytesIO(response.content)).convert("RGB")


prompt = "a photo of a person"
img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png"
mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png"

# The source image and the mask must have the same size.
image = download_image(img_url).resize((256, 256))
mask_image = download_image(mask_url).resize((256, 256))

# Load the inpainting pipeline, then swap its VAE for the asymmetric one.
pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
pipe.vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5")
pipe.to("cuda")

# Run inpainting and save the first generated image.
image = pipe(prompt=prompt, image=image, mask_image=mask_image).images[0]
image.save("image.jpeg")

AsymmetricAutoencoderKL

class diffusers.AsymmetricAutoencoderKL

( in_channels: int = 3, out_channels: int = 3, down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',), down_block_out_channels: typing.Tuple[int] = (64,), layers_per_down_block: int = 1, up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',), up_block_out_channels: typing.Tuple[int] = (64,), layers_per_up_block: int = 1, act_fn: str = 'silu', latent_channels: int = 4, norm_num_groups: int = 32, sample_size: int = 32, scaling_factor: float = 0.18215 )

Parameters

  • in_channels (int, optional, defaults to 3): Number of channels in the input image.

  • out_channels (int, optional, defaults to 3): Number of channels in the output.

  • down_block_types (Tuple[str], optional, defaults to ("DownEncoderBlock2D",)): Tuple of downsample block types.

  • down_block_out_channels (Tuple[int], optional, defaults to (64,)): Tuple of down block output channels.

  • layers_per_down_block (int, optional, defaults to 1): Number of layers per down block.

  • up_block_types (Tuple[str], optional, defaults to ("UpDecoderBlock2D",)): Tuple of upsample block types.

  • up_block_out_channels (Tuple[int], optional, defaults to (64,)): Tuple of up block output channels.

  • layers_per_up_block (int, optional, defaults to 1): Number of layers per up block.

  • act_fn (str, optional, defaults to "silu"): The activation function to use.

  • latent_channels (int, optional, defaults to 4): Number of channels in the latent space.

  • sample_size (int, optional, defaults to 32): Sample input size.

  • norm_num_groups (int, optional, defaults to 32): Number of groups to use for the first normalization layer in ResNet blocks.

forward

( sample: FloatTensor, mask: typing.Optional[torch.FloatTensor] = None, sample_posterior: bool = False, return_dict: bool = True, generator: typing.Optional[torch._C.Generator] = None )

Parameters

  • sample (torch.FloatTensor): Input sample.

  • mask (torch.FloatTensor, optional, defaults to None): Optional inpainting mask.

  • sample_posterior (bool, optional, defaults to False): Whether to sample from the posterior.

  • return_dict (bool, optional, defaults to True): Whether or not to return a DecoderOutput instead of a plain tuple.
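
The return_dict convention described above can be sketched without any dependencies. The snippet below is only an illustration: FakeDecoderOutput and fake_forward are hypothetical stand-ins, not part of Diffusers, and the "reconstruction" is a plain copy of the input. It shows how the same result can be exposed either as a named output object or as a plain tuple.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class FakeDecoderOutput:
    """Hypothetical stand-in for an output object with a `sample` field."""

    sample: List[float]


def fake_forward(
    sample: List[float], return_dict: bool = True
) -> Union[FakeDecoderOutput, Tuple[List[float]]]:
    # Placeholder "reconstruction": a real model would encode and decode here.
    decoded = list(sample)
    if not return_dict:
        # Plain tuple: the decoded sample is the first (and only) element.
        return (decoded,)
    return FakeDecoderOutput(sample=decoded)


as_object = fake_forward([0.1, 0.2])
as_tuple = fake_forward([0.1, 0.2], return_dict=False)
```

With return_dict=True the result is read by attribute (as_object.sample); with return_dict=False it is read by position (as_tuple[0]).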

AutoencoderKLOutput

class diffusers.models.autoencoder_kl.AutoencoderKLOutput

( latent_dist: DiagonalGaussianDistribution )

Parameters

  • latent_dist (DiagonalGaussianDistribution): Encoded outputs of Encoder represented as the mean and logvar of DiagonalGaussianDistribution. DiagonalGaussianDistribution allows for sampling latents from the distribution.

Output of AutoencoderKL encoding method.
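
To make the "mean and logvar" representation concrete, here is a minimal, dependency-free sketch of a diagonal Gaussian. ToyDiagonalGaussian is a hypothetical class written for illustration, operating on Python lists rather than tensors; it is not the Diffusers DiagonalGaussianDistribution. It assumes the usual reparameterization, where a sample is the mean plus the standard deviation times unit Gaussian noise.

```python
import math
import random
from typing import List, Optional


class ToyDiagonalGaussian:
    """Hypothetical stand-in for a diagonal Gaussian posterior."""

    def __init__(self, mean: List[float], logvar: List[float]) -> None:
        self.mean = mean
        self.logvar = logvar
        # std = exp(0.5 * logvar), computed independently per component.
        self.std = [math.exp(0.5 * lv) for lv in logvar]

    def sample(self, rng: Optional[random.Random] = None) -> List[float]:
        # Reparameterization: z = mean + std * eps, with eps ~ N(0, 1).
        rng = rng or random.Random()
        return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(self.mean, self.std)]

    def mode(self) -> List[float]:
        # The mode of a Gaussian is its mean, so this is deterministic.
        return list(self.mean)


posterior = ToyDiagonalGaussian(mean=[0.5, -1.0], logvar=[0.0, -2.0])
latent_sample = posterior.sample(random.Random(0))  # stochastic, seeded here
latent_mode = posterior.mode()                      # deterministic
```

Passing a seeded random.Random plays a role similar to the generator argument of forward: it makes the stochastic draw reproducible.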

DecoderOutput

class diffusers.models.vae.DecoderOutput

( sample: FloatTensor )

Parameters

  • sample (torch.FloatTensor of shape (batch_size, num_channels, height, width)): The decoded output sample from the last layer of the model.

Output of decoding method.

AsymmetricAutoencoderKL parameter scaling_factor (float, optional, defaults to 0.18215): The component-wise standard deviation of the trained latent space, computed using the first batch of the training set. It is used to scale the latent space to have unit variance when training the diffusion model. The latents are scaled with the formula z = z * scaling_factor before being passed to the diffusion model. When decoding, the latents are scaled back to the original scale with the formula z = 1 / scaling_factor * z. For more details, refer to sections 4.3.2 and D.1 of the High-Resolution Image Synthesis with Latent Diffusion Models paper.
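
The two scaling formulas above can be checked with a short round trip. This is a minimal sketch on plain Python lists; scale_latents and unscale_latents are hypothetical helper names chosen for illustration, and the latent values are made up.

```python
import math
from typing import List

SCALING_FACTOR = 0.18215  # default value documented for scaling_factor


def scale_latents(z: List[float], scaling_factor: float = SCALING_FACTOR) -> List[float]:
    # z = z * scaling_factor, applied before the latents reach the diffusion model.
    return [v * scaling_factor for v in z]


def unscale_latents(z: List[float], scaling_factor: float = SCALING_FACTOR) -> List[float]:
    # z = 1 / scaling_factor * z, applied before decoding.
    return [(1 / scaling_factor) * v for v in z]


raw_latents = [5.0, -2.5, 0.0]  # made-up encoder outputs
scaled = scale_latents(raw_latents)
restored = unscale_latents(scaled)
```

Up to floating-point rounding, restored matches raw_latents, which is why the decoder sees latents at the scale the encoder produced.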

AsymmetricAutoencoderKL implements the model described in Designing a Better Asymmetric VQGAN for StableDiffusion: a VAE model with KL loss for encoding images into latents and decoding latent representations into images.

This model inherits from ModelMixin. Check the superclass documentation for its generic methods implemented for all models (such as downloading or saving).

  • Designing a Better Asymmetric VQGAN for StableDiffusion (paper): https://arxiv.org/abs/2306.04632
  • Code: https://github.com/buxiangzhiren/Asymmetric_VQGAN
  • Checkpoint: https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5
  • Checkpoint: https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2
  • High-Resolution Image Synthesis with Latent Diffusion Models (paper cited for scaling_factor)