Pix2Pix Zero
Zero-shot Image-to-Image Translation is by Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu.
The abstract from the paper is:
Large-scale text-to-image generative models have shown their remarkable ability to synthesize diverse and high-quality images. However, it is still challenging to directly apply these models for editing real images for two reasons. First, it is hard for users to come up with a perfect text prompt that accurately describes every visual detail in the input image. Second, while existing models can introduce desirable changes in certain regions, they often dramatically alter the input content and introduce unexpected changes in unwanted regions. In this work, we propose pix2pix-zero, an image-to-image translation method that can preserve the content of the original image without manual prompting. We first automatically discover editing directions that reflect desired edits in the text embedding space. To preserve the general content structure after editing, we further propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process. In addition, our method does not need additional training for these edits and can directly use the existing pre-trained text-to-image diffusion model. We conduct extensive experiments and show that our method outperforms existing and concurrent works for both real and synthetic image editing.
You can find additional information about Pix2Pix Zero on the project page and in the original codebase, and you can try it out in a demo.
Tips
The pipeline can be conditioned on real input images. Check out the code examples below to know more.
The pipeline exposes two arguments, source_embeds and target_embeds, that let you control the direction of the semantic edit in the generated image. Say you want to translate from “cat” to “dog”. The edit direction is then “cat -> dog”. To reflect this in the pipeline, set source_embeds to the embeddings of phrases that include “cat” and target_embeds to the embeddings of phrases that include “dog”. Refer to the code example below for more details.
When you’re using this pipeline from a prompt, specify the source concept in the prompt. Taking the above example, a valid input prompt would be: “a high resolution painting of a cat in the style of van Gogh”.
If you want to reverse the direction in the example above, i.e., “dog -> cat”, then it’s recommended to:
Swap the source_embeds and target_embeds.
Change the input prompt to include “dog”.
To learn more about how the source and target embeddings are generated, refer to the original paper. Below, we also provide some directions on how to generate the embeddings.
Note that the quality of the outputs generated with this pipeline depends on how good the source_embeds and target_embeds are. Please refer to this discussion for some suggestions on the topic.
Available Pipelines:
Pipeline | Tasks | Demo |
---|---|---|
StableDiffusionPix2PixZeroPipeline | Text-Based Image Editing | 🌍 Space |
Usage example
Based on an image generated with the input prompt
Based on an input image
When the pipeline is conditioned on an input image, we first obtain inverted noise from it using a DDIMInverseScheduler with the help of a generated caption. The inverted noise is then used to start the generation process.
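To make the idea of inverted noise concrete, here is a minimal scalar sketch of deterministic DDIM inversion (eta = 0). It is not the DDIMInverseScheduler implementation: the function names, the toy alpha values, and the fixed noise prediction eps are assumptions for illustration. In the pipeline, the UNet predicts the noise at every step.

```python
import math


def predict_x0(x_t, eps, alpha_t):
    """Estimate the clean sample from x_t and a noise prediction."""
    return (x_t - math.sqrt(1.0 - alpha_t) * eps) / math.sqrt(alpha_t)


def ddim_move(x_t, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two noise levels (eta = 0).

    alpha_from and alpha_to are cumulative alpha products;
    a smaller value means a noisier sample.
    """
    x0 = predict_x0(x_t, eps, alpha_from)
    return math.sqrt(alpha_to) * x0 + math.sqrt(1.0 - alpha_to) * eps


# Toy values (assumptions): a scalar "latent" and a fixed noise prediction.
x_start = 0.8
eps = -0.3
alpha_low_noise, alpha_high_noise = 0.9, 0.5

# Inversion: move toward the noisier level.
x_inverted = ddim_move(x_start, eps, alpha_low_noise, alpha_high_noise)

# Sampling: move back toward the cleaner level with the same noise prediction.
x_recovered = ddim_move(x_inverted, eps, alpha_high_noise, alpha_low_noise)
```

In this toy setting the round trip recovers the starting value because the same noise prediction is reused in both directions. With a real model the prediction changes from step to step, so the reconstruction is only approximate.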
First, let’s load our pipeline:
Then, we load an input image for conditioning and obtain a suitable caption for it:
Then we employ the generated caption and the input image to get the inverted noise:
Now, generate the image with edit directions:
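The caption, inversion, and edit steps can be sketched as one helper. This is a hedged outline, not the original example: edit_real_image is a hypothetical name, pipeline is assumed to be an already loaded StableDiffusionPix2PixZeroPipeline, and reading the inverted noise from a .latents attribute on the inversion output is an assumption. The methods and keyword arguments are the ones documented further down this page.

```python
def edit_real_image(pipeline, image, source_embeds, target_embeds,
                    num_inference_steps=50,
                    cross_attention_guidance_amount=0.1):
    """Caption -> invert -> edit for a real input image."""
    # 1. Caption the input image with the pipeline's caption generator.
    caption = pipeline.generate_caption(image)

    # 2. Invert the image into noise latents, guided by that caption.
    #    Assumption: the inversion output exposes them as `.latents`.
    inverted_latents = pipeline.invert(caption, image=image).latents

    # 3. Start generation from the inverted latents and apply the edit
    #    direction defined by the source and target embeddings.
    output = pipeline(
        caption,
        source_embeds=source_embeds,
        target_embeds=target_embeds,
        num_inference_steps=num_inference_steps,
        cross_attention_guidance_amount=cross_attention_guidance_amount,
        latents=inverted_latents,
    )
    return output.images
```

Passing the pipeline in as an argument keeps the helper independent of how the model was loaded, so the same function works with any checkpoint or device setup.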
Generating source and target embeddings
The authors originally used the GPT-3 API to generate the source and target captions for discovering edit directions. However, we can also leverage open source and public models for the same purpose. Below, we provide an end-to-end example with the Flan-T5 model for generating captions and CLIP for computing embeddings on the generated captions.
1. Load the generation model:
2. Construct a starting prompt:
Here, we’re interested in the “cat -> dog” direction.
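A starting prompt for each concept could be built with a small helper like the one below. The helper name, the exact instruction wording, and the 150-character limit are assumptions for illustration. Any instruction that asks the text model for short captions containing the concept should serve the same purpose.

```python
def build_caption_prompt(concept: str, max_chars: int = 150) -> str:
    """Return an instruction asking a text model for captions about `concept`."""
    # Pick a matching indefinite article for readability.
    article = "an" if concept[:1].lower() in "aeiou" else "a"
    return (
        f"Provide a caption for images containing {article} {concept}. "
        f"The captions should be in English and should be no longer "
        f"than {max_chars} characters."
    )


# One prompt per side of the "cat -> dog" edit direction.
source_text = build_caption_prompt("cat")
target_text = build_caption_prompt("dog")
```

Using the same template for both concepts keeps the generated captions comparable, so the difference between the two sets mostly reflects the concept itself.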
3. Generate captions:
We can use a utility like so for this purpose.
And then we just call it to generate our captions:
We encourage you to play around with the different parameters supported by the generate() method (documentation) to get the generation quality you are looking for.
4. Load the embedding model:
Here, we need to use the same text encoder model used by the subsequent Stable Diffusion model.
5. Compute embeddings:
And you’re done! Here is a Colab Notebook that you can use to interact with the entire process.
Now, you can use these embeddings directly while calling the pipeline:
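A minimal sketch of such a call is shown here. edit_from_prompt and its optional source_concept check are hypothetical additions, and pipeline is assumed to be an already loaded StableDiffusionPix2PixZeroPipeline. The keyword arguments match the __call__ signature documented below.

```python
def edit_from_prompt(pipeline, prompt, source_embeds, target_embeds,
                     source_concept=None, num_inference_steps=50,
                     cross_attention_guidance_amount=0.1):
    """Run a prompt-based edit and return the list of generated images."""
    # Optional sanity check: the prompt should mention the source concept,
    # e.g. "cat" when editing in the "cat -> dog" direction.
    if source_concept is not None and source_concept.lower() not in prompt.lower():
        raise ValueError(
            f"prompt should mention the source concept {source_concept!r}"
        )

    output = pipeline(
        prompt,
        source_embeds=source_embeds,
        target_embeds=target_embeds,
        num_inference_steps=num_inference_steps,
        cross_attention_guidance_amount=cross_attention_guidance_amount,
    )
    # With return_dict=True (the default) the images are in `.images`.
    return output.images
```

The returned list holds one image per generated sample. With the default output_type these are PIL images, so something like images[0].save("edited.png") would write the first result to disk.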
StableDiffusionPix2PixZeroPipeline
class diffusers.StableDiffusionPix2PixZeroPipeline
( vae: AutoencoderKL, text_encoder: CLIPTextModel, tokenizer: CLIPTokenizer, unet: UNet2DConditionModel, scheduler: typing.Union[diffusers.schedulers.scheduling_ddpm.DDPMScheduler, diffusers.schedulers.scheduling_ddim.DDIMScheduler, diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler, diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler], feature_extractor: CLIPImageProcessor, safety_checker: StableDiffusionSafetyChecker, inverse_scheduler: DDIMInverseScheduler, caption_generator: BlipForConditionalGeneration, caption_processor: BlipProcessor, requires_safety_checker: bool = True )
Parameters
vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text-encoder. Stable Diffusion uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.
tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the encoded image latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, EulerAncestralDiscreteScheduler, or DDPMScheduler.
safety_checker (StableDiffusionSafetyChecker) — Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the model card for details.
feature_extractor (CLIPImageProcessor) — Model that extracts features from generated images to be used as inputs for the safety_checker.
requires_safety_checker (bool) — Whether the pipeline requires a safety checker. We recommend setting it to True if you’re using the pipeline publicly.
Pipeline for pixel-level image editing using Pix2Pix Zero. Based on Stable Diffusion.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.).
__call__
( prompt: typing.Union[str, typing.List[str], NoneType] = None, source_embeds: Tensor = None, target_embeds: Tensor = None, height: typing.Optional[int] = None, width: typing.Optional[int] = None, num_inference_steps: int = 50, guidance_scale: float = 7.5, negative_prompt: typing.Union[str, typing.List[str], NoneType] = None, num_images_per_prompt: typing.Optional[int] = 1, eta: float = 0.0, generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None, latents: typing.Optional[torch.FloatTensor] = None, prompt_embeds: typing.Optional[torch.FloatTensor] = None, negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None, cross_attention_guidance_amount: float = 0.1, output_type: typing.Optional[str] = 'pil', return_dict: bool = True, callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None, callback_steps: typing.Optional[int] = 1, cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None ) → StableDiffusionPipelineOutput or tuple
Parameters
prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
source_embeds (torch.Tensor) — Source concept embeddings, generated as described in the original paper. Used in discovering the edit direction.
target_embeds (torch.Tensor) — Target concept embeddings, generated as described in the original paper. Used in discovering the edit direction.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 7.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler and is ignored for other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
cross_attention_guidance_amount (float, defaults to 0.1) — Amount of guidance needed from the reference cross-attention maps.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image and np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a StableDiffusionPipelineOutput instead of a plain tuple.
callback (Callable, optional) — A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function will be called. If not specified, the callback will be called at every step.
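The effect of guidance_scale can be illustrated with a small sketch of classifier-free guidance. It uses plain Python lists in place of tensors, and the function name and toy values are assumptions. The pipeline applies the same kind of combination to the UNet's unconditional and text-conditioned noise predictions.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and text-conditioned noise predictions.

    Implements uncond + w * (text - uncond), with w = guidance_scale.
    """
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_uncond, noise_text)
    ]


# Toy noise predictions (assumptions), two values each.
noise_uncond = [0.0, 1.0]
noise_text = [1.0, 3.0]

# w = 1 reproduces the text-conditioned prediction: no extra guidance.
unguided = apply_guidance(noise_uncond, noise_text, 1.0)

# w > 1 pushes the result further along the text-conditioned direction.
guided = apply_guidance(noise_uncond, noise_text, 2.0)
```

This is why guidance only takes effect for guidance_scale > 1: at exactly 1 the combination collapses to the text-conditioned prediction.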
Returns
StableDiffusionPipelineOutput or tuple
StableDiffusionPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) content, according to the safety_checker.
Function invoked when calling the pipeline for generation.
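As a sketch of the callback argument described above, the helper below builds a function with the documented callback(step, timestep, latents) signature that records progress. make_step_logger and the toy call are assumptions for illustration.

```python
def make_step_logger(history):
    """Return a callback that appends (step, timestep) pairs to `history`."""

    def callback(step, timestep, latents):
        # `latents` is ignored here; a real callback could inspect or
        # decode it to preview intermediate results.
        history.append((step, int(timestep)))

    return callback


history = []
log_step = make_step_logger(history)

# Simulate one invocation the way the pipeline would call it.
log_step(0, 981, None)
```

The resulting function would be passed as callback=log_step, optionally together with callback_steps to control how often it runs.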
construct_direction
( embs_source: Tensor, embs_target: Tensor )
Constructs the edit direction to steer the image generation process semantically.
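A minimal sketch of the idea follows. It assumes the edit direction is the difference between the mean target embedding and the mean source embedding, as the paper describes. It uses plain Python lists and two-dimensional toy vectors instead of the text-encoder tensors the pipeline operates on, and the helper names are hypothetical.

```python
def mean_embedding(embeddings):
    """Average a list of equally sized embedding vectors element-wise."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def edit_direction(source_embeddings, target_embeddings):
    """Mean target embedding minus mean source embedding."""
    source_mean = mean_embedding(source_embeddings)
    target_mean = mean_embedding(target_embeddings)
    return [t - s for s, t in zip(source_mean, target_mean)]


# Toy 2-D "caption embeddings" (assumptions), two captions per concept.
cat_embeddings = [[1.0, 0.0], [3.0, 2.0]]
dog_embeddings = [[2.0, 4.0], [4.0, 2.0]]

cat_to_dog = edit_direction(cat_embeddings, dog_embeddings)
dog_to_cat = edit_direction(dog_embeddings, cat_embeddings)
```

Swapping the source and target embeddings negates the direction, which matches the tip about reversing an edit from “cat -> dog” to “dog -> cat”.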
encode_prompt
( prompt, device, num_images_per_prompt, do_classifier_free_guidance, negative_prompt = None, prompt_embeds: typing.Optional[torch.FloatTensor] = None, negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None, lora_scale: typing.Optional[float] = None )
Parameters
prompt (str or List[str], optional) — Prompt to be encoded.
device (torch.device) — Torch device.
num_images_per_prompt (int) — Number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
Encodes the prompt into text encoder hidden states.
generate_caption
( images )
Generates caption for a given image.
invert
( prompt: typing.Optional[str] = None, image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.FloatTensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.FloatTensor]] = None, num_inference_steps: int = 50, guidance_scale: float = 1, generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None, latents: typing.Optional[torch.FloatTensor] = None, prompt_embeds: typing.Optional[torch.FloatTensor] = None, cross_attention_guidance_amount: float = 0.1, output_type: typing.Optional[str] = 'pil', return_dict: bool = True, callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None, callback_steps: typing.Optional[int] = 1, cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None, lambda_auto_corr: float = 20.0, lambda_kl: float = 20.0, num_reg_steps: int = 5, num_auto_corr_rolls: int = 5 )
Parameters
prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
image (torch.FloatTensor, np.ndarray, PIL.Image.Image, List[torch.FloatTensor], List[PIL.Image.Image], or List[np.ndarray]) — Image, or tensor representing an image batch, which will be used for conditioning. Can also accept image latents as image. If latents are passed directly, they will not be encoded again.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 1) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
cross_attention_guidance_amount (float, defaults to 0.1) — Amount of guidance needed from the reference cross-attention maps.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image and np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a StableDiffusionPipelineOutput instead of a plain tuple.
callback (Callable, optional) — A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function will be called. If not specified, the callback will be called at every step.
lambda_auto_corr (float, optional, defaults to 20.0) — Lambda parameter to control auto correction.
lambda_kl (float, optional, defaults to 20.0) — Lambda parameter to control Kullback–Leibler divergence output.
num_reg_steps (int, optional, defaults to 5) — Number of regularization loss steps.
num_auto_corr_rolls (int, optional, defaults to 5) — Number of auto correction roll steps.
Function used to generate inverted latents given a prompt and image.
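To give a feel for what a weight such as lambda_kl scales, here is a hedged sketch of a Kullback–Leibler penalty that measures how far a set of noise values is from a standard normal distribution, based only on their empirical mean and variance. The function name, the use of plain lists, and the 0.5 factor from the closed-form Gaussian KL are assumptions. The pipeline's own regularizer may use a different scaling or reduction.

```python
import math


def kl_to_standard_normal(values, eps=1e-7):
    """KL( N(mean, var) || N(0, 1) ) from the empirical mean and variance."""
    count = len(values)
    mean = sum(values) / count
    var = sum((v - mean) ** 2 for v in values) / count
    # eps keeps the logarithm finite when the variance is zero.
    return 0.5 * (var + mean ** 2 - 1.0 - math.log(var + eps))


# Values with mean 0 and variance 1 incur (almost) no penalty.
centered = kl_to_standard_normal([-1.0, 1.0])

# Shifting the mean to 2 while keeping variance 1 is penalized.
shifted = kl_to_standard_normal([1.0, 3.0])
```

A penalty of this kind is near zero when the inverted noise looks like standard Gaussian noise and grows as its statistics drift away, which is the behavior a noise regularizer during inversion aims for.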