UnCLIP

Hierarchical Text-Conditional Image Generation with CLIP Latents is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. The unCLIP model in 🌍 Diffusers comes from kakaobrain's karlo.

The abstract from the paper is:

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.

You can find lucidrains' DALL-E 2 recreation at lucidrains/DALLE2-pytorch.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
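Before diving into the reference below, the following minimal sketch shows how the full pipeline is typically used for text-to-image generation. The checkpoint id (kakaobrain/karlo-v1-alpha) and the CUDA device are assumptions; substitute whatever checkpoint and hardware you are actually using.

```python
import torch
from diffusers import UnCLIPPipeline

# Load the full unCLIP stack: prior, decoder, super resolution UNets and their schedulers.
# "kakaobrain/karlo-v1-alpha" is an assumed checkpoint id; swap in your own if needed.
pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a high-resolution photograph of a red panda wearing a tiny hat"

# The call returns an ImagePipelineOutput; `.images` is a list of PIL images.
image = pipe(prompt).images[0]
image.save("red_panda.png")
```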

UnCLIPPipeline

class diffusers.UnCLIPPipeline

( prior: PriorTransformer, decoder: UNet2DConditionModel, text_encoder: CLIPTextModelWithProjection, tokenizer: CLIPTokenizer, text_proj: UnCLIPTextProjModel, super_res_first: UNet2DModel, super_res_last: UNet2DModel, prior_scheduler: UnCLIPScheduler, decoder_scheduler: UnCLIPScheduler, super_res_scheduler: UnCLIPScheduler )

Parameters

  • text_encoder (CLIPTextModelWithProjection) — Frozen text-encoder.

  • tokenizer (CLIPTokenizer) — A CLIPTokenizer to tokenize text.

  • prior (PriorTransformer) — The canonical unCLIP prior to approximate the image embedding from the text embedding.

  • text_proj (UnCLIPTextProjModel) — Utility class to prepare and combine the embeddings before they are passed to the decoder.

  • decoder (UNet2DConditionModel) — The decoder to invert the image embedding into an image.

  • super_res_first (UNet2DModel) — Super resolution UNet. Used in all but the last step of the super resolution diffusion process.

  • super_res_last (UNet2DModel) — Super resolution UNet. Used in the last step of the super resolution diffusion process.

  • prior_scheduler (UnCLIPScheduler) — Scheduler used in the prior denoising process (a modified DDPMScheduler).

  • decoder_scheduler (UnCLIPScheduler) — Scheduler used in the decoder denoising process (a modified DDPMScheduler).

  • super_res_scheduler (UnCLIPScheduler) — Scheduler used in the super resolution denoising process (a modified DDPMScheduler).

Pipeline for text-to-image generation using unCLIP.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

( prompt: typing.Union[str, typing.List[str], NoneType] = None, num_images_per_prompt: int = 1, prior_num_inference_steps: int = 25, decoder_num_inference_steps: int = 25, super_res_num_inference_steps: int = 7, generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None, prior_latents: typing.Optional[torch.FloatTensor] = None, decoder_latents: typing.Optional[torch.FloatTensor] = None, super_res_latents: typing.Optional[torch.FloatTensor] = None, text_model_output: typing.Union[transformers.models.clip.modeling_clip.CLIPTextModelOutput, typing.Tuple, NoneType] = None, text_attention_mask: typing.Optional[torch.Tensor] = None, prior_guidance_scale: float = 4.0, decoder_guidance_scale: float = 8.0, output_type: typing.Optional[str] = 'pil', return_dict: bool = True ) → ImagePipelineOutput or tuple

Parameters

  • prompt (str or List[str]) — The prompt or prompts to guide image generation. This can only be left undefined if text_model_output and text_attention_mask are passed.

  • num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.

  • prior_num_inference_steps (int, optional, defaults to 25) — The number of denoising steps for the prior. More denoising steps usually lead to a higher quality image at the expense of slower inference.

  • decoder_num_inference_steps (int, optional, defaults to 25) — The number of denoising steps for the decoder. More denoising steps usually lead to a higher quality image at the expense of slower inference.

  • super_res_num_inference_steps (int, optional, defaults to 7) — The number of denoising steps for super resolution. More denoising steps usually lead to a higher quality image at the expense of slower inference.

  • generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.

  • prior_latents (torch.FloatTensor of shape (batch size, embeddings dimension), optional) — Pre-generated noisy latents to be used as inputs for the prior.

  • decoder_latents (torch.FloatTensor of shape (batch size, channels, height, width), optional) — Pre-generated noisy latents to be used as inputs for the decoder.

  • super_res_latents (torch.FloatTensor of shape (batch size, channels, super res height, super res width), optional) — Pre-generated noisy latents to be used as inputs for the decoder.

  • prior_guidance_scale (float, optional, defaults to 4.0) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

  • decoder_guidance_scale (float, optional, defaults to 4.0) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

  • text_model_output (CLIPTextModelOutput, optional) — Pre-defined CLIPTextModel outputs that can be derived from the text encoder. Pre-defined text outputs can be passed for tasks like text embedding interpolations. Make sure to also pass text_attention_mask in this case. prompt can then be left as None.

  • text_attention_mask (torch.Tensor, optional) — Pre-defined CLIP text attention mask that can be derived from the tokenizer. Pre-defined text attention masks are necessary when passing text_model_output.

  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.

  • return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.

Returns

ImagePipelineOutput or tuple

If return_dict is True, ImagePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images.

The call function to the pipeline for generation.
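The per-stage arguments documented above can be combined to trade inference speed for quality and to make runs repeatable. The sketch below is illustrative only and again assumes the kakaobrain/karlo-v1-alpha checkpoint; the step counts and guidance scales simply restate the defaults so they are easy to tweak.

```python
import torch
from diffusers import UnCLIPPipeline

pipe = UnCLIPPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16
).to("cuda")

# A seeded generator makes the prior, decoder, and super resolution stages deterministic.
generator = torch.Generator(device="cuda").manual_seed(0)

images = pipe(
    "an astronaut riding a horse, detailed oil painting",
    num_images_per_prompt=2,          # two samples for the same prompt
    prior_num_inference_steps=25,     # prior denoising steps (default 25)
    decoder_num_inference_steps=25,   # decoder denoising steps (default 25)
    super_res_num_inference_steps=7,  # super resolution denoising steps (default 7)
    prior_guidance_scale=4.0,
    decoder_guidance_scale=8.0,
    generator=generator,
).images

for i, image in enumerate(images):
    image.save(f"astronaut_{i}.png")
```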

UnCLIPImageVariationPipeline

class diffusers.UnCLIPImageVariationPipeline

( decoder: UNet2DConditionModel, text_encoder: CLIPTextModelWithProjection, tokenizer: CLIPTokenizer, text_proj: UnCLIPTextProjModel, feature_extractor: CLIPImageProcessor, image_encoder: CLIPVisionModelWithProjection, super_res_first: UNet2DModel, super_res_last: UNet2DModel, decoder_scheduler: UnCLIPScheduler, super_res_scheduler: UnCLIPScheduler )

Parameters

  • text_encoder (CLIPTextModelWithProjection) — Frozen text-encoder.

  • tokenizer (CLIPTokenizer) — A CLIPTokenizer to tokenize text.

  • feature_extractor (CLIPImageProcessor) — Model that extracts features from generated images to be used as inputs for the image_encoder.

  • text_proj (UnCLIPTextProjModel) — Utility class to prepare and combine the embeddings before they are passed to the decoder.

  • image_encoder (CLIPVisionModelWithProjection) — Frozen CLIP image-encoder (clip-vit-large-patch14).

  • decoder (UNet2DConditionModel) — The decoder to invert the image embedding into an image.

  • super_res_first (UNet2DModel) — Super resolution UNet. Used in all but the last step of the super resolution diffusion process.

  • super_res_last (UNet2DModel) — Super resolution UNet. Used in the last step of the super resolution diffusion process.

  • decoder_scheduler (UnCLIPScheduler) — Scheduler used in the decoder denoising process (a modified DDPMScheduler).

  • super_res_scheduler (UnCLIPScheduler) — Scheduler used in the super resolution denoising process (a modified DDPMScheduler).

Pipeline to generate image variations from an input image using UnCLIP.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

( image: typing.Union[PIL.Image.Image, typing.List[PIL.Image.Image], torch.FloatTensor, NoneType] = None, num_images_per_prompt: int = 1, decoder_num_inference_steps: int = 25, super_res_num_inference_steps: int = 7, generator: typing.Optional[torch._C.Generator] = None, decoder_latents: typing.Optional[torch.FloatTensor] = None, super_res_latents: typing.Optional[torch.FloatTensor] = None, image_embeddings: typing.Optional[torch.Tensor] = None, decoder_guidance_scale: float = 8.0, output_type: typing.Optional[str] = 'pil', return_dict: bool = True ) → ImagePipelineOutput or tuple

Parameters

  • image (PIL.Image.Image or List[PIL.Image.Image] or torch.FloatTensor) — Image or tensor representing an image batch to be used as the starting point. If you provide a tensor, it needs to be compatible with the CLIPImageProcessor configuration. Can be left as None only when image_embeddings are passed.

  • num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.

  • decoder_num_inference_steps (int, optional, defaults to 25) — The number of denoising steps for the decoder. More denoising steps usually lead to a higher quality image at the expense of slower inference.

  • super_res_num_inference_steps (int, optional, defaults to 7) — The number of denoising steps for super resolution. More denoising steps usually lead to a higher quality image at the expense of slower inference.

  • generator (torch.Generator, optional) — A torch.Generator to make generation deterministic.

  • decoder_latents (torch.FloatTensor of shape (batch size, channels, height, width), optional) — Pre-generated noisy latents to be used as inputs for the decoder.

  • super_res_latents (torch.FloatTensor of shape (batch size, channels, super res height, super res width), optional) — Pre-generated noisy latents to be used as inputs for the decoder.

  • decoder_guidance_scale (float, optional, defaults to 4.0) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

  • image_embeddings (torch.Tensor, optional) — Pre-defined image embeddings that can be derived from the image encoder. Pre-defined image embeddings can be passed for tasks like image interpolations. image can be left as None.

  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.

  • return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.

Returns

ImagePipelineOutput or tuple

If return_dict is True, ImagePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images.

The call function to the pipeline for generation.
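As a rough usage sketch for the reference above: the pipeline takes a source image (or pre-computed image_embeddings) and produces variations of it. The kakaobrain/karlo-v1-alpha-image-variations checkpoint id and the input path are assumptions to adapt to your setup.

```python
import torch
from PIL import Image
from diffusers import UnCLIPImageVariationPipeline

pipe = UnCLIPImageVariationPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha-image-variations", torch_dtype=torch.float16
).to("cuda")

# Any RGB image can serve as the starting point; "input.png" is a placeholder path.
init_image = Image.open("input.png").convert("RGB")

variations = pipe(
    image=init_image,
    num_images_per_prompt=4,          # four variations of the same input image
    decoder_num_inference_steps=25,
    super_res_num_inference_steps=7,
).images

for i, image in enumerate(variations):
    image.save(f"variation_{i}.png")
```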

ImagePipelineOutput

class diffusers.ImagePipelineOutput

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )

Parameters

  • images (List[PIL.Image.Image] or np.ndarray) — List of denoised PIL images of length batch_size or NumPy array of shape (batch_size, height, width, num_channels).

Output class for image pipelines.
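Both pipelines on this page return this class by default. Here is a small sketch of how the two return modes differ, assuming pipe is an UnCLIPPipeline loaded as in the earlier examples:

```python
# Default: an ImagePipelineOutput with an `images` attribute.
output = pipe("a watercolor painting of a lighthouse")
images = output.images  # list of PIL.Image.Image of length batch_size

# With return_dict=False a plain tuple is returned; its first element is the image list.
result = pipe("a watercolor painting of a lighthouse", return_dict=False)
images = result[0]
```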

