Pix2Pix Zero

Zero-shot Image-to-Image Translation is by Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu.

The abstract from the paper is:

Large-scale text-to-image generative models have shown their remarkable ability to synthesize diverse and high-quality images. However, it is still challenging to directly apply these models for editing real images for two reasons. First, it is hard for users to come up with a perfect text prompt that accurately describes every visual detail in the input image. Second, while existing models can introduce desirable changes in certain regions, they often dramatically alter the input content and introduce unexpected changes in unwanted regions. In this work, we propose pix2pix-zero, an image-to-image translation method that can preserve the content of the original image without manual prompting. We first automatically discover editing directions that reflect desired edits in the text embedding space. To preserve the general content structure after editing, we further propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process. In addition, our method does not need additional training for these edits and can directly use the existing pre-trained text-to-image diffusion model. We conduct extensive experiments and show that our method outperforms existing and concurrent works for both real and synthetic image editing.

You can find additional information about Pix2Pix Zero on the project page and in the original codebase, and try it out in a demo.

Tips

  • The pipeline can be conditioned on real input images. Check out the code examples below to know more.

  • The pipeline exposes two arguments namely source_embeds and target_embeds that let you control the direction of the semantic edits in the final image to be generated. Let’s say, you wanted to translate from “cat” to “dog”. In this case, the edit direction will be “cat -> dog”. To reflect this in the pipeline, you simply have to set the embeddings related to the phrases including “cat” to source_embeds and “dog” to target_embeds. Refer to the code example below for more details.

  • When you’re using this pipeline from a prompt, specify the source concept in the prompt. Taking the above example, a valid input prompt would be: “a high resolution painting of a cat in the style of van gogh”.

  • If you wanted to reverse the direction in the example above, i.e., “dog -> cat”, then it’s recommended to:

    • Swap the source_embeds and target_embeds.

    • Change the input prompt to include “dog”.

  • To learn more about how the source and target embeddings are generated, refer to the original paper. Below, we also provide some directions on how to generate the embeddings.

  • Note that the quality of the outputs generated with this pipeline depends on how good the source_embeds and target_embeds are. Please refer to this discussion for some suggestions on the topic.
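The edit direction described in the tips above can be pictured as the difference between the average target embedding and the average source embedding. The following is a framework-free sketch that uses plain Python lists in place of tensors. The helper names `mean_embedding` and `edit_direction` are hypothetical, and the toy 3-dimensional vectors are made up for illustration. It is meant to show why swapping source_embeds and target_embeds reverses the edit, not to reproduce the pipeline's internal code.

```python
def mean_embedding(embeddings):
    """Average a list of equally sized embedding vectors element-wise."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def edit_direction(source_embeds, target_embeds):
    """Return mean(target) - mean(source), the direction of the edit."""
    source_mean = mean_embedding(source_embeds)
    target_mean = mean_embedding(target_embeds)
    return [t - s for s, t in zip(source_mean, target_mean)]


# Toy embeddings (assumed values): two "cat" sentences and two "dog" sentences.
cat_embeds = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
dog_embeds = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

cat_to_dog = edit_direction(cat_embeds, dog_embeds)
dog_to_cat = edit_direction(dog_embeds, cat_embeds)
```

In this sketch, `dog_to_cat` is the element-wise negation of `cat_to_dog`, which matches the advice to swap the two embedding arguments when reversing the edit.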

Available Pipelines:

Pipeline | Tasks | Demo
StableDiffusionPix2PixZeroPipeline | Text-Based Image Editing | Space

Usage example

Based on an image generated with the input prompt

import requests
import torch

from diffusers import DDIMScheduler, StableDiffusionPix2PixZeroPipeline


def download(embedding_url, local_filepath):
    r = requests.get(embedding_url)
    with open(local_filepath, "wb") as f:
        f.write(r.content)


model_ckpt = "CompVis/stable-diffusion-v1-4"
pipeline = StableDiffusionPix2PixZeroPipeline.from_pretrained(
    model_ckpt, conditions_input_image=False, torch_dtype=torch.float16
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.to("cuda")

prompt = "a high resolution painting of a cat in the style of van gogh"
src_embs_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/embeddings_sd_1.4/cat.pt"
target_embs_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/embeddings_sd_1.4/dog.pt"

for url in [src_embs_url, target_embs_url]:
    download(url, url.split("/")[-1])

src_embeds = torch.load(src_embs_url.split("/")[-1])
target_embeds = torch.load(target_embs_url.split("/")[-1])

images = pipeline(
    prompt,
    source_embeds=src_embeds,
    target_embeds=target_embeds,
    num_inference_steps=50,
    cross_attention_guidance_amount=0.15,
).images
images[0].save("edited_image_dog.png")

Based on an input image

When the pipeline is conditioned on an input image, we first obtain an inverted noise from it using a DDIMInverseScheduler with the help of a generated caption. Then the inverted noise is used to start the generation process.

First, let’s load our pipeline:

import torch
from transformers import BlipForConditionalGeneration, BlipProcessor
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionPix2PixZeroPipeline

captioner_id = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(captioner_id)
model = BlipForConditionalGeneration.from_pretrained(captioner_id, torch_dtype=torch.float16, low_cpu_mem_usage=True)

sd_model_ckpt = "CompVis/stable-diffusion-v1-4"
pipeline = StableDiffusionPix2PixZeroPipeline.from_pretrained(
    sd_model_ckpt,
    caption_generator=model,
    caption_processor=processor,
    torch_dtype=torch.float16,
    safety_checker=None,
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
pipeline.enable_model_cpu_offload()

Then, we load an input image for conditioning and obtain a suitable caption for it:

import requests
from PIL import Image

img_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/test_images/cats/cat_6.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB").resize((512, 512))
caption = pipeline.generate_caption(raw_image)

Then we employ the generated caption and the input image to get the inverted noise:

generator = torch.manual_seed(0)
inv_latents = pipeline.invert(caption, image=raw_image, generator=generator).latents
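The idea behind the inversion step can be illustrated with a scalar toy version of the deterministic DDIM update. This is a sketch under simplifying assumptions: `ddim_move` is a hypothetical helper, the cumulative alpha values are made up, and the noise prediction is held fixed in both directions, which a real UNet only satisfies approximately. It is not the code that `DDIMInverseScheduler` runs.

```python
import math


def ddim_move(sample, noise_pred, alpha_from, alpha_to):
    """Deterministic DDIM update between two cumulative-alpha levels.

    Moving to a smaller alpha adds noise (inversion); moving to a larger
    alpha removes it (generation).
    """
    # Estimate the clean sample implied by the current noise prediction.
    pred_original = (sample - math.sqrt(1.0 - alpha_from) * noise_pred) / math.sqrt(alpha_from)
    # Re-noise that estimate to the target level.
    return math.sqrt(alpha_to) * pred_original + math.sqrt(1.0 - alpha_to) * noise_pred


clean_alpha, noisy_alpha = 0.9, 0.5  # made-up cumulative alphas
noise_pred = 0.3  # stand-in for the UNet output
latent = 1.2

inverted = ddim_move(latent, noise_pred, clean_alpha, noisy_alpha)
recovered = ddim_move(inverted, noise_pred, noisy_alpha, clean_alpha)
```

With an identical noise prediction in both passes, `recovered` matches `latent` up to floating-point error, which is why starting generation from the inverted noise can stay close to the input image.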

Now, generate the image with edit directions:

# These prompts are hand-written. The "Generating source and target embeddings" section
# shows how to generate such captions automatically with a pre-trained model like Flan-T5.
source_prompts = ["a cat sitting on the street", "a cat playing in the field", "a face of a cat"]
target_prompts = ["a dog sitting on the street", "a dog playing in the field", "a face of a dog"]

source_embeds = pipeline.get_embeds(source_prompts, batch_size=2)
target_embeds = pipeline.get_embeds(target_prompts, batch_size=2)


image = pipeline(
    caption,
    source_embeds=source_embeds,
    target_embeds=target_embeds,
    num_inference_steps=50,
    cross_attention_guidance_amount=0.15,
    generator=generator,
    latents=inv_latents,
    negative_prompt=caption,
).images[0]
image.save("edited_image.png")

Generating source and target embeddings

1. Load the generation model:

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto", torch_dtype=torch.float16)

2. Construct a starting prompt:

source_concept = "cat"
target_concept = "dog"

# Parentheses are required so that both string literals are joined into one prompt.
source_text = (
    f"Provide a caption for images containing a {source_concept}. "
    "The captions should be in English and should be no longer than 150 characters."
)

target_text = (
    f"Provide a caption for images containing a {target_concept}. "
    "The captions should be in English and should be no longer than 150 characters."
)

Here, we’re interested in the “cat -> dog” direction.

3. Generate captions:

We can use a utility like the following for this purpose.

def generate_captions(input_prompt):
    input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda")

    outputs = model.generate(
        input_ids, temperature=0.8, num_return_sequences=16, do_sample=True, max_new_tokens=128, top_k=10
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

And then we just call it to generate our captions:

source_captions = generate_captions(source_text)
target_captions = generate_captions(target_text)
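Sampled captions may contain duplicates, empty strings, or sentences that ignore the requested concept or length limit. The sketch below shows one possible cleanup pass before computing embeddings. `clean_captions` is a hypothetical helper that is not part of the original workflow, and the filtering rules (case-insensitive deduplication, substring match on the concept, 150-character cap) are assumptions chosen to mirror the prompt above.

```python
def clean_captions(captions, concept, max_chars=150):
    """Keep unique, non-empty captions that mention the concept and fit the limit."""
    seen = set()
    kept = []
    for caption in captions:
        text = " ".join(caption.split())  # collapse stray whitespace
        key = text.lower()
        if not text or len(text) > max_chars:
            continue
        if concept.lower() not in key or key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept


raw = ["A cat on a sofa.", "a cat on a  sofa.", "A dog in a park.", "", "cat " * 60]
cleaned = clean_captions(raw, "cat")
```

Only the first caption survives here: the second is a duplicate after normalization, the third lacks the concept, the fourth is empty, and the last exceeds the length limit.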

4. Load the embedding model:

Here, we need to use the same text encoder model used by the subsequent Stable Diffusion model.

from diffusers import StableDiffusionPix2PixZeroPipeline 

pipeline = StableDiffusionPix2PixZeroPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipeline = pipeline.to("cuda")
tokenizer = pipeline.tokenizer
text_encoder = pipeline.text_encoder

5. Compute embeddings:

import torch 

def embed_captions(sentences, tokenizer, text_encoder, device="cuda"):
    with torch.no_grad():
        embeddings = []
        for sent in sentences:
            text_inputs = tokenizer(
                sent,
                padding="max_length",
                max_length=tokenizer.model_max_length,
                truncation=True,
                return_tensors="pt",
            )
            text_input_ids = text_inputs.input_ids
            prompt_embeds = text_encoder(text_input_ids.to(device), attention_mask=None)[0]
            embeddings.append(prompt_embeds)
    # Average over all captions, then restore the batch dimension.
    return torch.cat(embeddings, dim=0).mean(dim=0).unsqueeze(0)

source_embeddings = embed_captions(source_captions, tokenizer, text_encoder)
target_embeddings = embed_captions(target_captions, tokenizer, text_encoder)

Now, you can use these embeddings directly while calling the pipeline:

from diffusers import DDIMScheduler

pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)

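# The prompt must mention the source concept ("cat").
prompt = "a high resolution painting of a cat in the style of van gogh"

images = pipeline(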
    prompt,
    source_embeds=source_embeddings,
    target_embeds=target_embeddings,
    num_inference_steps=50,
    cross_attention_guidance_amount=0.15,
).images
images[0].save("edited_image_dog.png")

StableDiffusionPix2PixZeroPipeline

class diffusers.StableDiffusionPix2PixZeroPipeline

( vae: AutoencoderKL, text_encoder: CLIPTextModel, tokenizer: CLIPTokenizer, unet: UNet2DConditionModel, scheduler: typing.Union[diffusers.schedulers.scheduling_ddpm.DDPMScheduler, diffusers.schedulers.scheduling_ddim.DDIMScheduler, diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler, diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler], feature_extractor: CLIPImageProcessor, safety_checker: StableDiffusionSafetyChecker, inverse_scheduler: DDIMInverseScheduler, caption_generator: BlipForConditionalGeneration, caption_processor: BlipProcessor, requires_safety_checker: bool = True )

Parameters

  • feature_extractor (CLIPImageProcessor) — Model that extracts features from generated images to be used as inputs for the safety_checker.

  • requires_safety_checker (bool) — Whether the pipeline requires a safety checker. We recommend setting it to True if you’re using the pipeline publicly.

Pipeline for pixel-level image editing using Pix2Pix Zero. Based on Stable Diffusion.

__call__

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.

  • height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image.

  • width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image.

  • num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.

  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).

  • num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.

  • latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.

  • prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.

  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

  • cross_attention_guidance_amount (float, defaults to 0.1) — Amount of guidance needed from the reference cross-attention maps.

  • callback (Callable, optional) — A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).

  • callback_steps (int, optional, defaults to 1) — The frequency at which the callback function will be called. If not specified, the callback will be called at every step.
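The interaction between callback and callback_steps can be sketched with a stand-in loop. Everything here is illustrative: `make_recording_callback` and `run_fake_denoising` are hypothetical names, the timesteps are made up, and the loop only imitates the documented contract (the callback receives step, timestep, and latents every callback_steps steps) rather than the pipeline's actual denoising code.

```python
def make_recording_callback(history):
    """Build a callback matching callback(step, timestep, latents)."""

    def callback(step, timestep, latents):
        # Record which steps were reported; latents are ignored in this sketch.
        history.append((step, timestep))

    return callback


def run_fake_denoising(timesteps, callback=None, callback_steps=1):
    """Stand-in loop that invokes the callback every `callback_steps` steps."""
    latents = 0.0  # placeholder for the latent tensor
    for step, timestep in enumerate(timesteps):
        latents += 1.0  # pretend denoising update
        if callback is not None and step % callback_steps == 0:
            callback(step, timestep, latents)
    return latents


history = []
run_fake_denoising([801, 601, 401, 201, 1], make_recording_callback(history), callback_steps=2)
```

A callback with the same three-argument signature could be passed to the pipeline through the callback argument, for example to log progress or inspect intermediate latents.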

Returns

StableDiffusionPipelineOutput or tuple

Function invoked when calling the pipeline for generation.

Examples:

>>> import requests
>>> import torch

>>> from diffusers import DDIMScheduler, StableDiffusionPix2PixZeroPipeline


>>> def download(embedding_url, local_filepath):
...     r = requests.get(embedding_url)
...     with open(local_filepath, "wb") as f:
...         f.write(r.content)


>>> model_ckpt = "CompVis/stable-diffusion-v1-4"
>>> pipeline = StableDiffusionPix2PixZeroPipeline.from_pretrained(model_ckpt, torch_dtype=torch.float16)
>>> pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
>>> pipeline.to("cuda")

>>> prompt = "a high resolution painting of a cat in the style of van gogh"
>>> source_emb_url = "https://hf.co/datasets/sayakpaul/sample-datasets/resolve/main/cat.pt"
>>> target_emb_url = "https://hf.co/datasets/sayakpaul/sample-datasets/resolve/main/dog.pt"

>>> for url in [source_emb_url, target_emb_url]:
...     download(url, url.split("/")[-1])

>>> src_embeds = torch.load(source_emb_url.split("/")[-1])
>>> target_embeds = torch.load(target_emb_url.split("/")[-1])
>>> images = pipeline(
...     prompt,
...     source_embeds=src_embeds,
...     target_embeds=target_embeds,
...     num_inference_steps=50,
...     cross_attention_guidance_amount=0.15,
... ).images

>>> images[0].save("edited_image_dog.png")

construct_direction

( embs_source: Tensor, embs_target: Tensor )

Constructs the edit direction to steer the image generation process semantically.

encode_prompt

( prompt, device, num_images_per_prompt, do_classifier_free_guidance, negative_prompt = None, prompt_embeds: typing.Optional[torch.FloatTensor] = None, negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None, lora_scale: typing.Optional[float] = None )

Parameters

  • prompt (str or List[str], optional) — prompt to be encoded

  • device (torch.device) — torch device

  • num_images_per_prompt (int) — number of images that should be generated per prompt

  • do_classifier_free_guidance (bool) — whether to use classifier free guidance or not

  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).

  • prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.

  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

  • lora_scale (float, optional) — A lora scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.

Encodes the prompt into text encoder hidden states.

generate_caption

( images )

Generates caption for a given image.

invert

( prompt: typing.Optional[str] = None, image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.FloatTensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.FloatTensor]] = None, num_inference_steps: int = 50, guidance_scale: float = 1, generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None, latents: typing.Optional[torch.FloatTensor] = None, prompt_embeds: typing.Optional[torch.FloatTensor] = None, cross_attention_guidance_amount: float = 0.1, output_type: typing.Optional[str] = 'pil', return_dict: bool = True, callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None, callback_steps: typing.Optional[int] = 1, cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None, lambda_auto_corr: float = 20.0, lambda_kl: float = 20.0, num_reg_steps: int = 5, num_auto_corr_rolls: int = 5 )

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.

  • image (torch.FloatTensor, np.ndarray, PIL.Image.Image, List[torch.FloatTensor], List[PIL.Image.Image], or List[np.ndarray]) — Image, or tensor representing an image batch, which will be used for conditioning. Image latents can also be passed as image; latents passed directly are not encoded again.

  • num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.

  • latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.

  • prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.

  • cross_attention_guidance_amount (float, defaults to 0.1) — Amount of guidance needed from the reference cross-attention maps.

  • callback (Callable, optional) — A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).

  • callback_steps (int, optional, defaults to 1) — The frequency at which the callback function will be called. If not specified, the callback will be called at every step.

  • lambda_auto_corr (float, optional, defaults to 20.0) — Lambda parameter that controls the strength of the auto-correlation regularization.

  • lambda_kl (float, optional, defaults to 20.0) — Lambda parameter that controls the strength of the Kullback–Leibler divergence regularization.

  • num_reg_steps (int, optional, defaults to 5) — Number of regularization loss steps.

  • num_auto_corr_rolls (int, optional, defaults to 5) — Number of auto-correlation roll steps.
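The two regularization terms named by these parameters can be illustrated on a 1-D toy signal. This is a sketch under stated assumptions, not the pipeline's implementation: `kl_to_unit_gaussian` and `auto_corr_penalty` are hypothetical helpers, the KL-style term is written up to a constant factor, population variance is used for simplicity, and the signal values are made up. How the pipeline weights and applies these terms during inversion is not reproduced here.

```python
import math
import statistics


def kl_to_unit_gaussian(values):
    """KL-style penalty that is (near) zero when mean == 0 and variance == 1."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values)
    return var + mean**2 - 1.0 - math.log(var + 1e-7)


def auto_corr_penalty(values, shift):
    """Squared mean of the product between a signal and its rolled copy."""
    rolled = values[-shift:] + values[:-shift]  # circular shift to the right
    mean_product = sum(a * b for a, b in zip(values, rolled)) / len(values)
    return mean_product**2


# Toy "noise" with zero mean and unit variance (assumed values).
signal = [1.0, 1.0, -1.0, -1.0] * 4

kl_term = kl_to_unit_gaussian(signal)
uncorrelated = auto_corr_penalty(signal, shift=1)
correlated = auto_corr_penalty(signal, shift=2)
```

The toy signal already has zero mean and unit variance, so the KL-style term is close to zero. The auto-correlation penalty is zero at a shift of 1 and large at a shift of 2, where the signal lines up with its own negated copy. Trying several shifts is the role the sketch assigns to the number of rolls.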

Function used to generate inverted latents given a prompt and image.

Examples:

>>> import torch
>>> from transformers import BlipForConditionalGeneration, BlipProcessor
>>> from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionPix2PixZeroPipeline

>>> import requests
>>> from PIL import Image

>>> captioner_id = "Salesforce/blip-image-captioning-base"
>>> processor = BlipProcessor.from_pretrained(captioner_id)
>>> model = BlipForConditionalGeneration.from_pretrained(
...     captioner_id, torch_dtype=torch.float16, low_cpu_mem_usage=True
... )

>>> sd_model_ckpt = "CompVis/stable-diffusion-v1-4"
>>> pipeline = StableDiffusionPix2PixZeroPipeline.from_pretrained(
...     sd_model_ckpt,
...     caption_generator=model,
...     caption_processor=processor,
...     torch_dtype=torch.float16,
...     safety_checker=None,
... )

>>> pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
>>> pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
>>> pipeline.enable_model_cpu_offload()

>>> img_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/test_images/cats/cat_6.png"

>>> raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB").resize((512, 512))
>>> # generate caption
>>> caption = pipeline.generate_caption(raw_image)

>>> # "a photography of a cat with flowers and dai dai daie - daie - daie kasaii"
>>> generator = torch.manual_seed(0)
>>> inv_latents = pipeline.invert(caption, image=raw_image, generator=generator).latents
>>> # we need to generate source and target embeds

>>> source_prompts = ["a cat sitting on the street", "a cat playing in the field", "a face of a cat"]

>>> target_prompts = ["a dog sitting on the street", "a dog playing in the field", "a face of a dog"]

>>> source_embeds = pipeline.get_embeds(source_prompts)
>>> target_embeds = pipeline.get_embeds(target_prompts)
>>> # the latents can then be used to edit a real image
>>> # when using Stable Diffusion 2 or other models that use v-prediction
>>> # set `cross_attention_guidance_amount` to 0.01 or less to avoid input latent gradient explosion

>>> image = pipeline(
...     caption,
...     source_embeds=source_embeds,
...     target_embeds=target_embeds,
...     num_inference_steps=50,
...     cross_attention_guidance_amount=0.15,
...     generator=generator,
...     latents=inv_latents,
...     negative_prompt=caption,
... ).images[0]
>>> image.save("edited_image.png")

Generating source and target embeddings: additional notes

The authors originally used the GPT-3 API to generate the source and target captions for discovering edit directions. Open-source, publicly available models can serve the same purpose: the "Generating source and target embeddings" section provides an end-to-end example that uses the Flan-T5 model for generating captions and CLIP for computing embeddings on the generated captions.

We encourage you to experiment with the different parameters supported by the generate() method (see its documentation) to reach the generation quality you are looking for.

A Colab Notebook is also available that you can use to interact with the entire process.

StableDiffusionPix2PixZeroPipeline: additional parameters

  • vae (AutoencoderKL) — Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.

  • text_encoder (CLIPTextModel) — Frozen text-encoder. Stable Diffusion uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.

  • tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.

  • unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the encoded image latents.

  • scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, EulerAncestralDiscreteScheduler, or DDPMScheduler.

  • safety_checker (StableDiffusionSafetyChecker) — Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the model card for details.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.).

__call__: signature, additional parameters, and return value

( prompt: typing.Union[str, typing.List[str], NoneType] = None, source_embeds: Tensor = None, target_embeds: Tensor = None, height: typing.Optional[int] = None, width: typing.Optional[int] = None, num_inference_steps: int = 50, guidance_scale: float = 7.5, negative_prompt: typing.Union[str, typing.List[str], NoneType] = None, num_images_per_prompt: typing.Optional[int] = 1, eta: float = 0.0, generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None, latents: typing.Optional[torch.FloatTensor] = None, prompt_embeds: typing.Optional[torch.FloatTensor] = None, negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None, cross_attention_guidance_amount: float = 0.1, output_type: typing.Optional[str] = 'pil', return_dict: bool = True, callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None, callback_steps: typing.Optional[int] = 1, cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None ) → StableDiffusionPipelineOutput or tuple

  • source_embeds (torch.Tensor) — Source concept embeddings, generated as described in the original paper. Used in discovering the edit direction.

  • target_embeds (torch.Tensor) — Target concept embeddings, generated as described in the original paper. Used in discovering the edit direction.

  • guidance_scale (float, optional, defaults to 7.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.

  • eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler; it is ignored for other schedulers.

  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.

  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.

  • return_dict (bool, optional, defaults to True) — Whether or not to return a StableDiffusionPipelineOutput instead of a plain tuple.

Returns: StableDiffusionPipelineOutput or tuple. A StableDiffusionPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) content, according to the safety_checker.

invert: additional parameters

  • guidance_scale (float, optional, defaults to 1) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.

  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.

  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.

  • return_dict (bool, optional, defaults to True) — Whether or not to return a StableDiffusionPipelineOutput instead of a plain tuple.