VQ Diffusion

Vector Quantized Diffusion Model for Text-to-Image Synthesis is by Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo.

The abstract from the paper is:

We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well-suited for text-to-image generation tasks because it not only eliminates the unidirectional bias with existing methods but also allows us to incorporate a mask-and-replace diffusion strategy to avoid the accumulation of errors, which is a serious problem with existing methods. Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results when compared with conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, our VQ-Diffusion can handle more complex scenes and improve the synthesized image quality by a large margin. Finally, we show that the image generation computation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and hence is quite time consuming even for normal size images. The VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with the reparameterization is fifteen times faster than traditional AR methods while achieving a better image quality.

The original codebase can be found at microsoft/VQ-Diffusion.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

VQDiffusionPipeline

class diffusers.VQDiffusionPipeline

<source>

( vqvae: VQModel, text_encoder: CLIPTextModel, tokenizer: CLIPTokenizer, transformer: Transformer2DModel, scheduler: VQDiffusionScheduler, learned_classifier_free_sampling_embeddings: LearnedClassifierFreeSamplingEmbeddings )

Parameters

  • vqvae (VQModel) — Vector Quantized Variational Autoencoder (VAE) model to encode and decode images to and from latent representations.

  • text_encoder (CLIPTextModel) — Frozen text encoder used to embed the prompt.

  • tokenizer (CLIPTokenizer) — A CLIPTokenizer to tokenize the prompt.

  • transformer (Transformer2DModel) — A conditional Transformer2DModel to denoise the encoded image latents.

  • scheduler (VQDiffusionScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.

  • learned_classifier_free_sampling_embeddings (LearnedClassifierFreeSamplingEmbeddings) — Learned embeddings of the empty prompt, used in place of unconditional inputs for classifier-free guidance.

Pipeline for text-to-image generation using VQ Diffusion.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
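
A minimal usage sketch (the microsoft/vq-diffusion-ithq checkpoint and the CUDA device are assumptions; adjust both to your setup):

from diffusers import VQDiffusionPipeline

# Load all pipeline components (VQ-VAE, CLIP text encoder, transformer, scheduler).
pipe = VQDiffusionPipeline.from_pretrained("microsoft/vq-diffusion-ithq")
pipe = pipe.to("cuda")

# Generate an image from a text prompt.
image = pipe("teddy bear playing in the pool").images[0]
image.save("teddy_bear.png")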

__call__

<source>

( prompt: typing.Union[str, typing.List[str]], num_inference_steps: int = 100, guidance_scale: float = 5.0, truncation_rate: float = 1.0, num_images_per_prompt: int = 1, generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None, latents: typing.Optional[torch.FloatTensor] = None, output_type: typing.Optional[str] = 'pil', return_dict: bool = True, callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None, callback_steps: int = 1 ) β†’ ImagePipelineOutput or tuple

Parameters

  • prompt (str or List[str]) β€” The prompt or prompts to guide image generation.

  • num_inference_steps (int, optional, defaults to 100) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.

  • guidance_scale (float, optional, defaults to 5.0) β€” A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

  • truncation_rate (float, optional, defaults to 1.0 (equivalent to no truncation)) β€” Used to β€œtruncate” the predicted classes for x_0 such that the cumulative probability for a pixel is at most truncation_rate. The lowest probabilities that would increase the cumulative probability above truncation_rate are set to zero.

  • num_images_per_prompt (int, optional, defaults to 1) β€” The number of images to generate per prompt.

  • generator (torch.Generator, optional) β€” A torch.Generator to make generation deterministic.

  • latents (torch.FloatTensor of shape (batch), optional) β€” Pre-generated noisy latents to be used as inputs for image generation. Must be valid embedding indices. If not provided, a latents tensor of completely masked latent pixels is generated.

  • output_type (str, optional, defaults to "pil") β€” The output format of the generated image. Choose between PIL.Image or np.array.

  • return_dict (bool, optional, defaults to True) β€” Whether or not to return an ImagePipelineOutput instead of a plain tuple.

  • callback (Callable, optional) β€” A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).

  • callback_steps (int, optional, defaults to 1) β€” The frequency at which the callback function is called. If not specified, the callback is called at every step.

Returns

ImagePipelineOutput or tuple

If return_dict is True, ImagePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images.

The call function to the pipeline for generation.
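
For example, a sketch of a call that exercises these arguments (pipe is the pipeline loaded in the earlier snippet; the values shown are illustrative, not recommended settings):

import torch

# A seeded generator makes the sampling deterministic.
generator = torch.Generator("cuda").manual_seed(0)

output = pipe(
    prompt="a red bird sitting on a branch",
    num_inference_steps=100,
    guidance_scale=5.0,    # values > 1 enable classifier-free guidance
    truncation_rate=0.86,  # keep only the most probable classes per latent pixel
    num_images_per_prompt=2,
    generator=generator,
)
images = output.images  # list of two PIL images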

truncate

<source>

( log_p_x_0: FloatTensor, truncation_rate: float )

Truncates log_p_x_0 such that, for each column vector, the total cumulative probability is at most truncation_rate. The lowest probabilities that would increase the cumulative probability above truncation_rate are set to zero.
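
For concreteness, a minimal re-implementation sketch of this truncation (assuming log_p_x_0 has shape (batch, num_classes, num_latent_pixels) with classes along dim 1; an illustration, not the pipeline's exact source):

import torch

def truncate(log_p_x_0: torch.FloatTensor, truncation_rate: float) -> torch.FloatTensor:
    # Sort class log-probabilities per latent pixel, most probable first.
    sorted_log_p, indices = torch.sort(log_p_x_0, dim=1, descending=True)
    # Keep classes while the running probability mass stays below truncation_rate...
    keep = sorted_log_p.exp().cumsum(dim=1) < truncation_rate
    # ...shifted by one position so the most probable class is always kept.
    keep = torch.cat([torch.ones_like(keep[:, :1]), keep[:, :-1]], dim=1)
    # Undo the sort so the mask lines up with the original class order.
    keep = keep.gather(1, indices.argsort(dim=1))
    out = log_p_x_0.clone()
    out[~keep] = -float("inf")  # log(0): these classes get zero probability
    return out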

ImagePipelineOutput

class diffusers.ImagePipelineOutput

<source>

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )

Parameters

  • images (List[PIL.Image.Image] or np.ndarray) β€” List of denoised PIL images of length batch_size or NumPy array of shape (batch_size, height, width, num_channels).

Output class for image pipelines.
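
A short sketch of the two output forms (pipe as loaded earlier):

# return_dict=True (the default) returns an ImagePipelineOutput.
output = pipe("a corgi wearing a party hat")
image = output.images[0]

# return_dict=False returns a plain tuple whose first element is the image list.
(images,) = pipe("a corgi wearing a party hat", return_dict=False)
image = images[0]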
