Cycle Diffusion
Cycle Diffusion is a text-guided image-to-image generation model proposed in Unifying Diffusion Models’ Latent Space, with Applications to CycleDiffusion and Guidance by Chen Henry Wu and Fernando De la Torre.
The abstract from the paper is:
Diffusion models have achieved unprecedented performance in generative modeling. The commonly-adopted formulation of the latent code of diffusion models is a sequence of gradually denoised samples, as opposed to the simpler (e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper provides an alternative, Gaussian formulation of the latent space of various diffusion models, as well as an invertible DPM-Encoder that maps images into the latent space. While our formulation is purely based on the definition of diffusion models, we demonstrate several intriguing consequences. (1) Empirically, we observe that a common latent space emerges from two diffusion models trained independently on related domains. In light of this finding, we propose CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation. Furthermore, applying CycleDiffusion to text-to-image diffusion models, we show that large-scale text-to-image diffusion models can be used as zero-shot image-to-image editors. (2) One can guide pre-trained diffusion models and GANs by controlling the latent codes in a unified, plug-and-play formulation based on energy-based models. Using the CLIP model and a face recognition model as guidance, we demonstrate that diffusion models have better coverage of low-density sub-populations and individuals than GANs.
Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
CycleDiffusionPipeline
class diffusers.CycleDiffusionPipeline
( vae: AutoencoderKL, text_encoder: CLIPTextModel, tokenizer: CLIPTokenizer, unet: UNet2DConditionModel, scheduler: DDIMScheduler, safety_checker: StableDiffusionSafetyChecker, feature_extractor: CLIPImageProcessor, requires_safety_checker: bool = True )
Parameters
vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text-encoder (clip-vit-large-patch14).
tokenizer (CLIPTokenizer) — A CLIPTokenizer to tokenize text.
unet (UNet2DConditionModel) — A UNet2DConditionModel to denoise the encoded image latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can only be an instance of DDIMScheduler.
safety_checker (StableDiffusionSafetyChecker) — Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the model card for more details about a model’s potential harms.
feature_extractor (CLIPImageProcessor) — A CLIPImageProcessor to extract features from generated images; used as inputs to the safety_checker.
Pipeline for text-guided image to image generation using Stable Diffusion.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
__call__
( prompt: typing.Union[str, typing.List[str]], source_prompt: typing.Union[str, typing.List[str]], image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.FloatTensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.FloatTensor]] = None, strength: float = 0.8, num_inference_steps: typing.Optional[int] = 50, guidance_scale: typing.Optional[float] = 7.5, source_guidance_scale: typing.Optional[float] = 1, num_images_per_prompt: typing.Optional[int] = 1, eta: typing.Optional[float] = 0.1, generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None, prompt_embeds: typing.Optional[torch.FloatTensor] = None, output_type: typing.Optional[str] = 'pil', return_dict: bool = True, callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None, callback_steps: int = 1, cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None ) → StableDiffusionPipelineOutput or tuple
Parameters
prompt (str or List[str]) — The prompt or prompts to guide the image generation.
image (torch.FloatTensor, np.ndarray, PIL.Image.Image, List[torch.FloatTensor], List[PIL.Image.Image], or List[np.ndarray]) — Image or tensor representing an image batch to be used as the starting point. Image latents can also be passed as image; latents passed directly are not encoded again.
strength (float, optional, defaults to 0.8) — Indicates the extent to which the reference image is transformed. Must be between 0 and 1. image is used as a starting point, and the higher the strength, the more noise is added. The number of denoising steps depends on the amount of noise initially added. When strength is 1, added noise is maximum and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 essentially ignores image.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. This parameter is modulated by strength.
guidance_scale (float, optional, defaults to 7.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
source_guidance_scale (float, optional, defaults to 1) — Guidance scale for the source prompt. This is useful to control the amount of influence the source prompt has for encoding.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
eta (float, optional, defaults to 0.1) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler, and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a StableDiffusionPipelineOutput instead of a plain tuple.
callback (Callable, optional) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
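The interaction between strength and num_inference_steps described above can be sketched in a few lines. This is a minimal illustration, assuming the pipeline follows the usual Stable Diffusion image-to-image rule of truncating the schedule to int(num_inference_steps * strength) steps; the helper name effective_denoising_steps is hypothetical and not part of the library.

```python
def effective_denoising_steps(num_inference_steps: int, strength: float) -> int:
    """Estimate how many denoising steps actually run for a given strength.

    Assumption: mirrors the common image-to-image rule
    min(int(num_inference_steps * strength), num_inference_steps).
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError(f"strength must be between 0 and 1, got {strength}")
    return min(int(num_inference_steps * strength), num_inference_steps)


print(effective_denoising_steps(50, 0.8))  # 40 with the documented defaults
print(effective_denoising_steps(50, 1.0))  # 50: the full schedule runs
```

Under this assumption, lowering strength both preserves more of the input image and shortens the run, which is why the two parameters are best tuned together.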
Returns
StableDiffusionPipelineOutput or tuple
If return_dict is True, StableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.
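Because the return shape depends on return_dict, calling code may want one helper that accepts either form. The sketch below is illustrative only: unpack_pipeline_result is a hypothetical name, and a SimpleNamespace stands in for the real output object so the snippet runs without the library installed.

```python
from types import SimpleNamespace
from typing import Any, List, Optional, Tuple


def unpack_pipeline_result(result: Any) -> Tuple[List[Any], Optional[List[bool]]]:
    """Return (images, nsfw_flags) for either documented return shape."""
    if isinstance(result, tuple):
        # return_dict=False: (images, nsfw_content_detected)
        images, nsfw_flags = result
    else:
        # return_dict=True: an object exposing both fields as attributes
        images, nsfw_flags = result.images, result.nsfw_content_detected
    return images, nsfw_flags


# Stand-ins for the two return shapes (not real pipeline outputs).
as_object = SimpleNamespace(images=["img0"], nsfw_content_detected=[False])
as_tuple = (["img0"], [False])
print(unpack_pipeline_result(as_object) == unpack_pipeline_result(as_tuple))
```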
The call function to the pipeline for generation.
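The callback and callback_steps parameters can be illustrated with a small recorder. In this sketch only the callback signature (step, timestep, latents) comes from the documentation above; make_progress_recorder and simulate_denoising_loop are hypothetical helpers, and the loop is a stand-in that assumes the callback fires whenever step % callback_steps == 0.

```python
from typing import Any, Callable, List, Sequence, Tuple


def make_progress_recorder() -> Tuple[Callable[[int, int, Any], None], List[Tuple[int, int]]]:
    """Build a callback with the documented (step, timestep, latents) signature."""
    seen: List[Tuple[int, int]] = []

    def callback(step: int, timestep: int, latents: Any) -> None:
        # Record progress only; the latents are left untouched.
        seen.append((step, timestep))

    return callback, seen


def simulate_denoising_loop(timesteps: Sequence[int], callback, callback_steps: int = 1) -> None:
    """Hypothetical stand-in for the pipeline loop, used only to show call frequency."""
    for step, timestep in enumerate(timesteps):
        latents = None  # placeholder; the real pipeline passes a torch.FloatTensor
        if step % callback_steps == 0:
            callback(step, timestep, latents)


callback, seen = make_progress_recorder()
simulate_denoising_loop([981, 961, 941, 921], callback, callback_steps=2)
print(seen)  # [(0, 981), (2, 941)]
```

With a real pipeline, the same callback object would be passed as callback=callback together with callback_steps=2.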
Example:
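A minimal usage sketch, under stated assumptions: the checkpoint name "CompVis/stable-diffusion-v1-4", the 512×512 resize, and the specific argument values are illustrative choices rather than values taken from this page, and the helper names build_edit_kwargs and edit_image are hypothetical. Whether CycleDiffusionPipeline can be imported from diffusers may depend on the installed version. The heavy imports are deferred so the snippet can be loaded without a GPU or model download.

```python
def build_edit_kwargs(source_prompt, prompt, image):
    """Collect the call arguments documented above into one dict."""
    return {
        "prompt": prompt,                # what the edited image should show
        "source_prompt": source_prompt,  # what the input image shows
        "image": image,
        "num_inference_steps": 100,
        "eta": 0.1,
        "strength": 0.8,
        "guidance_scale": 2,
        "source_guidance_scale": 1,
    }


def edit_image(image_path, source_prompt, prompt,
               model_id="CompVis/stable-diffusion-v1-4", device="cuda"):
    """Load the pipeline with a DDIMScheduler and edit one image (assumed setup)."""
    # Deferred imports: only needed when the pipeline is actually run.
    from PIL import Image
    from diffusers import CycleDiffusionPipeline, DDIMScheduler

    # The pipeline only accepts a DDIMScheduler, so load one explicitly.
    scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
    pipe = CycleDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler).to(device)

    init_image = Image.open(image_path).convert("RGB").resize((512, 512))
    result = pipe(**build_edit_kwargs(source_prompt, prompt, init_image))
    return result.images[0]
```

A call might look like edit_image("horse.png", "An astronaut riding a horse", "An astronaut riding an elephant"), where the source prompt describes the input image and the prompt describes the desired edit.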
encode_prompt
( prompt, device, num_images_per_prompt, do_classifier_free_guidance, negative_prompt = None, prompt_embeds: typing.Optional[torch.FloatTensor] = None, negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None, lora_scale: typing.Optional[float] = None )
Parameters
prompt (str or List[str], optional) — Prompt to be encoded.
device (torch.device) — Torch device.
num_images_per_prompt (int) — Number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
Encodes the prompt into text encoder hidden states.
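The do_classifier_free_guidance flag ties back to the guidance scale: guidance is documented as enabled when guidance_scale > 1. The sketch below shows the standard classifier-free guidance combination on plain Python lists. Both helper names are hypothetical, and the formula is the generic one; CycleDiffusion's own handling of source and target predictions may differ in detail.

```python
from typing import List


def use_classifier_free_guidance(guidance_scale: float) -> bool:
    """Guidance is documented as enabled when guidance_scale > 1."""
    return guidance_scale > 1.0


def apply_guidance(noise_uncond: List[float], noise_text: List[float],
                   guidance_scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (text - uncond)."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


print(use_classifier_free_guidance(7.5))  # True
print(use_classifier_free_guidance(1.0))  # False
print(apply_guidance([0.0, 1.0], [1.0, 3.0], 2.0))  # [2.0, 5.0]
```

A scale of 1 reproduces the text-conditioned prediction unchanged, which is why guidance is treated as disabled at that value.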
StableDiffusionPipelineOutput
class diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput
( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray], nsfw_content_detected: typing.Optional[typing.List[bool]] )
Parameters
images (List[PIL.Image.Image] or np.ndarray) — List of denoised PIL images of length batch_size or NumPy array of shape (batch_size, height, width, num_channels).
nsfw_content_detected (List[bool]) — List indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content or None if safety checking could not be performed.
Output class for Stable Diffusion pipelines.
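Since nsfw_content_detected can be None when safety checking was not performed, code that filters on it should handle both cases. A minimal sketch follows; keep_unflagged is a hypothetical helper, and the file names are placeholders standing in for image objects.

```python
from typing import Any, List, Optional


def keep_unflagged(images: List[Any], nsfw_content_detected: Optional[List[bool]]) -> List[Any]:
    """Drop images whose flag is True; keep everything if no flags are available."""
    if nsfw_content_detected is None:
        # Safety checking was not performed, so there is nothing to filter on.
        return list(images)
    return [img for img, flagged in zip(images, nsfw_content_detected) if not flagged]


print(keep_unflagged(["a.png", "b.png", "c.png"], [False, True, False]))  # ['a.png', 'c.png']
print(keep_unflagged(["a.png", "b.png"], None))  # ['a.png', 'b.png']
```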