
Controlled generation


Controlling the outputs generated by diffusion models has long been pursued by the community and is now an active research topic. In many popular diffusion models, subtle changes in the inputs, both images and text prompts, can drastically change the outputs. In an ideal world, we want to be able to control how semantics are preserved and changed.

Most examples of preserving semantics reduce to being able to accurately map a change in input to a change in output. For example, adding an adjective to a subject in a prompt should preserve the entire image, modifying only the changed subject. Or, image variation of a particular subject should preserve the subject’s pose.

Additionally, there are qualities of generated images that we would like to influence beyond semantic preservation. For example, in general we would like our outputs to be of good quality, adhere to a particular style, or be realistic.

We document some of the techniques Diffusers supports to control the generation of diffusion models. Much of this is cutting-edge research and can be quite nuanced. If something needs clarifying or you have a suggestion, don’t hesitate to open a discussion on the forum or open a GitHub issue.

We provide a high-level explanation of how generation can be controlled as well as a brief overview of the technical details. For more in-depth explanations of the technical details, the original papers linked from the pipelines are always the best resources.

Depending on the use case, one should choose a technique accordingly. In many cases, these techniques can be combined. For example, one can combine Textual Inversion with SEGA to provide more semantic guidance to the outputs generated using Textual Inversion.

Unless otherwise mentioned, these are techniques that work with existing models and don’t require their own weights.

For convenience, we provide a table to denote which methods are inference-only and which require fine-tuning/training.

| Method | Inference only | Requires training / fine-tuning | Comments |
|---|---|---|---|
| Instruct Pix2Pix | ✅ | ❌ | Can additionally be fine-tuned for better performance on specific edit instructions. |
| Pix2Pix Zero | ✅ | ❌ | |
| Attend and Excite | ✅ | ❌ | |
| Semantic Guidance (SEGA) | ✅ | ❌ | |
| Self-attention Guidance (SAG) | ✅ | ❌ | |
| Depth2Image | ✅ | ❌ | |
| MultiDiffusion Panorama | ✅ | ❌ | |
| DreamBooth | ❌ | ✅ | |
| Textual Inversion | ❌ | ✅ | |
| ControlNet | ✅ | ❌ | A ControlNet can be trained/fine-tuned on a custom conditioning. |
| Prompt Weighting | ✅ | ❌ | |
| Custom Diffusion | ❌ | ✅ | |
| Model Editing | ✅ | ❌ | |
| DiffEdit | ✅ | ❌ | |
| T2I-Adapter | ✅ | ❌ | |
| Fabric | ✅ | ❌ | |

Instruct Pix2Pix

Instruct Pix2Pix is fine-tuned from Stable Diffusion to support editing input images. It takes as inputs an image and a prompt describing an edit, and it outputs the edited image. Instruct Pix2Pix has been explicitly trained to work well with InstructGPT-like prompts.

See the InstructPix2Pix pipeline documentation for more information on how to use it.
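As a rough sketch of how this can look in code (the checkpoint name and image URL below are illustrative, not prescriptive), the pipeline takes an input image plus an edit instruction:

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

# Illustrative checkpoint; any InstructPix2Pix-style checkpoint works similarly.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = load_image("https://example.com/input.png")  # placeholder input image

edited = pipe(
    "make the sky look like a sunset",   # an edit instruction, not a scene description
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,            # how closely to stick to the input image
).images[0]
edited.save("edited.png")
```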

Pix2Pix Zero

Pix2Pix Zero allows modifying an image so that one concept or subject is translated to another while preserving general image semantics.

The denoising process is guided from one conceptual embedding towards another conceptual embedding. The intermediate latents are optimized during the denoising process to push the attention maps towards reference attention maps. The reference attention maps are from the denoising process of the input image and are used to encourage semantic preservation.

Pix2Pix Zero can be used to edit both synthetic and real images.

To edit synthetic images, one first generates an image given a caption. Next, image captions are generated for the concept that shall be edited and for the new target concept; a model like Flan-T5 can be used for this purpose. Then, “mean” prompt embeddings for both the source and target concepts are created via the text encoder. Finally, the pix2pix-zero algorithm is used to edit the synthetic image.

To edit a real image, one first generates an image caption using a model like BLIP. Then one applies DDIM inversion on the prompt and image to generate “inverse” latents. As before, “mean” prompt embeddings for both source and target concepts are created, and finally the pix2pix-zero algorithm in combination with the “inverse” latents is used to edit the image.

Pix2Pix Zero is the first model that allows “zero-shot” image editing. This means that the model can edit an image in less than a minute on a consumer GPU.

As mentioned above, Pix2Pix Zero includes optimizing the latents (and not any of the UNet, VAE, or the text encoder) to steer the generation toward a specific concept. This means that the overall pipeline might require more memory than a standard StableDiffusionPipeline.

See the Pix2Pix Zero pipeline documentation for more information on how to use it.

Attend and Excite

Attend and Excite allows subjects in the prompt to be faithfully represented in the final image.

A set of token indices is given as input, corresponding to the subjects in the prompt that need to be present in the image. During denoising, each token index is guaranteed to have a minimum attention threshold for at least one patch of the image. The intermediate latents are iteratively optimized during the denoising process to strengthen the attention of the most neglected subject token until the attention threshold is passed for all subject tokens.

Like Pix2Pix Zero, Attend and Excite also involves a mini optimization loop (leaving the pre-trained weights untouched) in its pipeline and can require more memory than the usual StableDiffusionPipeline.

See the Attend and Excite pipeline documentation for more information on how to use it.
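A minimal sketch of how this might be used with the StableDiffusionAttendAndExcitePipeline; the checkpoint is illustrative, and the token indices below are the positions of “cat” and “frog” in the tokenized prompt, so they are specific to this example:

```python
import torch
from diffusers import StableDiffusionAttendAndExcitePipeline

pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a cat and a frog"
# Indices of the subject tokens that must receive attention during denoising.
image = pipe(
    prompt,
    token_indices=[2, 5],
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]
```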

Semantic Guidance (SEGA)

SEGA allows applying or removing one or more concepts from an image. The strength of the concept can also be controlled. For example, the smile concept can be used to incrementally increase or decrease the smile of a portrait.

Similar to how classifier-free guidance provides guidance via empty prompt inputs, SEGA provides guidance on conceptual prompts. Multiple of these conceptual prompts can be applied simultaneously. Each conceptual prompt can either add or remove its concept depending on whether the guidance is applied positively or negatively.

Unlike Pix2Pix Zero or Attend and Excite, SEGA directly interacts with the diffusion process instead of performing any explicit gradient-based optimization.

See the Semantic Guidance pipeline documentation for more information on how to use it.
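As an illustrative sketch (the checkpoint and concept strings are assumptions), the SemanticStableDiffusionPipeline exposes concepts as editing prompts that can be applied positively or negatively:

```python
import torch
from diffusers import SemanticStableDiffusionPipeline

pipe = SemanticStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

out = pipe(
    prompt="a photo of the face of a woman",
    editing_prompt=["smiling, smile"],    # concept to guide towards
    reverse_editing_direction=[False],    # False adds the concept, True removes it
)
image = out.images[0]
```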

Self-attention Guidance (SAG)

Self-attention Guidance improves the general quality of images.

SAG provides guidance from predictions not conditioned on high-frequency details to fully conditioned images. The high-frequency details are extracted out of the UNet self-attention maps.

See the Self-attention Guidance pipeline documentation for more information on how to use it.
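A minimal sketch using the StableDiffusionSAGPipeline (the checkpoint and scale values are illustrative); sag_scale controls how strongly the self-attention guidance is applied:

```python
import torch
from diffusers import StableDiffusionSAGPipeline

pipe = StableDiffusionSAGPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a photo of an astronaut riding a horse on mars",
    sag_scale=0.75,       # strength of self-attention guidance (0 disables it)
    guidance_scale=7.5,   # regular classifier-free guidance still applies
).images[0]
```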

Depth2Image

Depth2Image is fine-tuned from Stable Diffusion to better preserve semantics for text-guided image variation.

It conditions on a monocular depth estimate of the original image.
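A minimal sketch with the StableDiffusionDepth2ImgPipeline (the input image URL is a placeholder); the pipeline estimates a depth map from the input image and conditions generation on it:

```python
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("https://example.com/room.png")  # placeholder input image

image = pipe(
    prompt="a cozy living room, oil painting",
    image=init_image,
    strength=0.7,  # how much the original image is allowed to change
).images[0]
```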

An important distinction between methods like InstructPix2Pix and Pix2Pix Zero is that the former involves fine-tuning the pre-trained weights while the latter does not. This means that you can apply Pix2Pix Zero to any of the available Stable Diffusion models.

See the Depth2Image pipeline documentation for more information on how to use it.

MultiDiffusion Panorama

MultiDiffusion defines a new generation process over a pre-trained diffusion model. This process binds together multiple diffusion generation methods that can be readily applied to generate high-quality and diverse images. Results adhere to user-provided controls, such as the desired aspect ratio (e.g., panorama) and spatial guiding signals, ranging from tight segmentation masks to bounding boxes. MultiDiffusion Panorama allows generating high-quality images at arbitrary aspect ratios (e.g., panoramas).

See the MultiDiffusion Panorama pipeline documentation for more information on how to use it to generate panoramic images.
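As a rough sketch (the checkpoint and resolution are assumptions), the StableDiffusionPanoramaPipeline denoises overlapping views jointly and fuses them, so a wide output width yields a panorama:

```python
import torch
from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler

model_id = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")

pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# A wide output resolution produces a panorama-style image.
image = pipe("a photo of the dolomites", height=512, width=2048).images[0]
```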

Fine-tuning your own models

In addition to pre-trained models, Diffusers has training scripts for fine-tuning models on user-provided data.

DreamBooth

DreamBooth fine-tunes a model to teach it about a new subject. For example, a few pictures of a person can be used to generate images of that person in different styles.

See the DreamBooth training guide for more information on how to use it.
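Once a model has been fine-tuned, inference works like any other pipeline; this sketch assumes a hypothetical local output directory and that the rare identifier token used during training was “sks”:

```python
import torch
from diffusers import DiffusionPipeline

# "path/to/dreambooth-model" is a hypothetical DreamBooth training output directory.
pipe = DiffusionPipeline.from_pretrained(
    "path/to/dreambooth-model", torch_dtype=torch.float16
).to("cuda")

# The identifier token ("sks" here) must match the instance prompt used during training.
image = pipe("a photo of sks dog in a bucket", guidance_scale=7.5).images[0]
```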

Textual Inversion

Textual Inversion fine-tunes a model to teach it about a new concept. For example, a few pictures of a style of artwork can be used to generate images in that style.

See the Textual Inversion training guide for more information on how to use it.
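A learned embedding can then be loaded into a regular pipeline at inference time; this sketch assumes the sd-concepts-library/cat-toy embedding, which defines the placeholder token <cat-toy>:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the learned concept embedding and use its placeholder token in the prompt.
pipe.load_textual_inversion("sd-concepts-library/cat-toy")
image = pipe("a <cat-toy> sitting on a bench, watercolor").images[0]
```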

ControlNet

ControlNet is an auxiliary network which adds an extra condition to a pre-trained diffusion model. There are 8 canonical pre-trained ControlNets trained on different conditionings such as edge detection, scribbles, depth maps, and semantic segmentations.

See the ControlNet pipeline documentation for more information on how to use it.
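A minimal sketch using a Canny-edge ControlNet (the checkpoint names and conditioning image URL are illustrative); the conditioning image is assumed to already be an edge map:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

canny_image = load_image("https://example.com/canny_edges.png")  # placeholder edge map

# The output follows the structure of the edge map while matching the prompt.
image = pipe("a futuristic city at night", image=canny_image).images[0]
```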

Prompt Weighting

Prompt weighting is a simple technique that puts more attention weight on certain parts of the text input.

For a more in-detail explanation and examples, see the prompt weighting guide.
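One common way to do this with Diffusers is to build weighted prompt embeddings with the third-party compel library and pass them via prompt_embeds; the prompt syntax below (++ to up-weight, -- to down-weight) is compel's:

```python
import torch
from compel import Compel
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

compel = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)

# "++" up-weights "red" and "--" down-weights "blurry" in the resulting embeddings.
prompt_embeds = compel("a red++ sports car, blurry--")
image = pipe(prompt_embeds=prompt_embeds).images[0]
```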

Custom Diffusion

Custom Diffusion only fine-tunes the cross-attention maps of a pre-trained text-to-image diffusion model. It also allows for additionally performing textual inversion, and it supports multi-concept training by design. Like DreamBooth and Textual Inversion, Custom Diffusion is used to teach a pre-trained text-to-image diffusion model about new concepts to generate outputs involving the concept(s) of interest.

For more details, check out the official Custom Diffusion documentation.

Model Editing

The text-to-image model editing pipeline helps you mitigate some of the incorrect implicit assumptions a pre-trained text-to-image diffusion model might make about the subjects present in the input prompt. For example, if you prompt Stable Diffusion to generate images for “A pack of roses”, the roses in the generated images are more likely to be red. This pipeline helps you change that assumption.

For more details, check out the official model editing documentation.
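A minimal sketch, assuming the StableDiffusionModelEditingPipeline and its edit_model method (the checkpoint and prompts are illustrative); the edit changes the model's default association for “roses”:

```python
from diffusers import StableDiffusionModelEditingPipeline

pipe = StableDiffusionModelEditingPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4"
).to("cuda")

# Rewire the implicit assumption: "roses" should now default to blue rather than red.
pipe.edit_model("A pack of roses", "A pack of blue roses")

image = pipe("A pack of roses").images[0]
```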

DiffEdit

DiffEdit allows for semantic editing of input images along with input prompts while preserving the original input images as much as possible.

For more details, check out the official DiffEdit documentation.
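A rough sketch of the three-step workflow (mask generation, inversion, masked generation), assuming the StableDiffusionDiffEditPipeline; the checkpoint, image URL, and prompts are illustrative:

```python
import torch
from diffusers import DDIMInverseScheduler, DDIMScheduler, StableDiffusionDiffEditPipeline
from diffusers.utils import load_image

pipe = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)

raw_image = load_image("https://example.com/fruit_bowl.png")  # placeholder input image
source_prompt = "a bowl of fruits"
target_prompt = "a bowl of pears"

# 1) contrast source/target prompts to derive an edit mask,
# 2) invert the input image into latents,
# 3) denoise the masked region towards the target prompt.
mask = pipe.generate_mask(image=raw_image, source_prompt=source_prompt, target_prompt=target_prompt)
inv_latents = pipe.invert(prompt=source_prompt, image=raw_image).latents
image = pipe(prompt=target_prompt, mask_image=mask, image_latents=inv_latents).images[0]
```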

T2I-Adapter

T2I-Adapter is an auxiliary network which adds an extra condition to a pre-trained diffusion model. There are 8 canonical pre-trained adapters trained on different conditionings such as edge detection, sketch, depth maps, and semantic segmentations.

See the T2I-Adapter pipeline documentation for more information on how to use it.
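A minimal sketch with a Canny-conditioned adapter (the checkpoint names and conditioning image URL are illustrative); the adapter is a comparatively lightweight auxiliary network plugged into a standard Stable Diffusion pipeline:

```python
import torch
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

adapter = T2IAdapter.from_pretrained(
    "TencentARC/t2iadapter_canny_sd15v2", torch_dtype=torch.float16
)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", adapter=adapter, torch_dtype=torch.float16
).to("cuda")

canny_image = load_image("https://example.com/canny_edges.png")  # placeholder edge map
image = pipe("a cozy cabin in the woods", image=canny_image).images[0]
```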

Fabric

Fabric is a training-free approach applicable to a wide range of popular diffusion models, which exploits the self-attention layer present in the most widely used architectures to condition the diffusion process on a set of feedback images.

For more details, check out the official Fabric documentation.

