Gaudi Stable Diffusion Pipeline
The GaudiStableDiffusionPipeline class enables text-to-image generation on HPUs. It inherits from the GaudiDiffusionPipeline class, which is the parent of any kind of diffuser pipeline. To get the most out of it, it should be paired with a scheduler optimized for HPUs, such as GaudiDDIMScheduler.
( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor requires_safety_checker: bool = True use_habana: bool = False use_hpu_graphs: bool = False gaudi_config: typing.Union[str, optimum.habana.transformers.gaudi_configuration.GaudiConfig] = None bf16_full_eval: bool = False )
Parameters
vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text-encoder.
tokenizer (CLIPTokenizer) — A CLIPTokenizer to tokenize text.
unet (UNet2DConditionModel) — A UNet2DConditionModel to denoise the encoded image latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
safety_checker (StableDiffusionSafetyChecker) — Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the model card for more details about a model's potential harms.
feature_extractor (CLIPImageProcessor) — A CLIPImageProcessor to extract features from generated images; used as inputs to the safety_checker.
use_habana (bool, defaults to False) — Whether to use Gaudi (True) or CPU (False).
use_hpu_graphs (bool, defaults to False) — Whether to use HPU graphs or not.
gaudi_config (Union[str, GaudiConfig], defaults to None) — Gaudi configuration to use. Can be a string to download it from the Hub, or a previously initialized config can be passed.
bf16_full_eval (bool, defaults to False) — Whether to use full bfloat16 evaluation instead of 32-bit. This will be faster and save memory compared to fp32/mixed precision, but can harm generated images.
Extends the StableDiffusionPipeline class:
- Generation is performed by batches
- Two mark_step() calls were added to support lazy mode
- Added support for HPU graphs
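As an illustration, here is a minimal instantiation sketch in the spirit of the Optimum Habana examples (the checkpoint name and the Habana/stable-diffusion Gaudi configuration are assumptions, not requirements):

from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionPipeline

model_name = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint; any compatible SD model should work

# Use the HPU-optimized DDIM scheduler recommended above
scheduler = GaudiDDIMScheduler.from_pretrained(model_name, subfolder="scheduler")

pipeline = GaudiStableDiffusionPipeline.from_pretrained(
    model_name,
    scheduler=scheduler,
    use_habana=True,        # run on Gaudi rather than CPU
    use_hpu_graphs=True,    # capture HPU graphs to reduce host overhead
    gaudi_config="Habana/stable-diffusion",  # Gaudi configuration fetched from the Hub
)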
__call__
( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 guidance_scale: float = 7.5 negative_prompt: typing.Union[typing.List[str], str, NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 batch_size: int = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None callback_steps: int = 1 cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None guidance_rescale: float = 0.0 ) → GaudiStableDiffusionPipelineOutput or tuple
Parameters
prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated images.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated images.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 7.5) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
negative_prompt (str or List[str], optional) — The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
batch_size (int, optional, defaults to 1) — The number of images in a batch.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.FloatTensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a GaudiStableDiffusionPipelineOutput instead of a plain tuple.
callback (Callable, optional) — A function called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
guidance_rescale (float, optional, defaults to 0.0) — Guidance rescale factor from Common Diffusion Noise Schedules and Sample Steps are Flawed. Guidance rescale should fix overexposure when using zero terminal SNR.
Returns
GaudiStableDiffusionPipelineOutput or tuple
If return_dict is True, a GaudiStableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content.
The call function to the pipeline for generation.
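For example, a hedged sketch of a batched call (continuing the pipeline object from the instantiation sketch above; the prompt is illustrative):

outputs = pipeline(
    prompt=["High quality photo of an astronaut riding a horse in space"],
    num_images_per_prompt=4,  # 4 images for this prompt...
    batch_size=2,             # ...computed in batches of 2, since generation is performed by batches
)
images = outputs.images       # list of PIL images with the default output_type="pil"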
GaudiDiffusionPipeline
( use_habana: bool = False use_hpu_graphs: bool = False gaudi_config: typing.Union[str, optimum.habana.transformers.gaudi_configuration.GaudiConfig] = None bf16_full_eval: bool = False )
Parameters
use_habana (bool, defaults to False) — Whether to use Gaudi (True) or CPU (False).
use_hpu_graphs (bool, defaults to False) — Whether to use HPU graphs or not.
gaudi_config (Union[str, GaudiConfig], defaults to None) — Gaudi configuration to use. Can be a string to download it from the Hub, or a previously initialized config can be passed.
bf16_full_eval (bool, defaults to False) — Whether to use full bfloat16 evaluation instead of 32-bit. This will be faster and save memory compared to fp32/mixed precision, but can harm generated images.
Extends the DiffusionPipeline class:
- The pipeline is initialized on Gaudi if use_habana=True.
- The pipeline's Gaudi configuration is saved and pushed to the Hub.
from_pretrained
( pretrained_model_name_or_path: typing.Union[str, os.PathLike, NoneType] **kwargs )
More information is available in the Diffusers documentation.
save_pretrained
( save_directory: typing.Union[str, os.PathLike] safe_serialization: bool = True variant: typing.Optional[str] = None push_to_hub: bool = False **kwargs )
Save the pipeline and Gaudi configurations. More information is available in the Diffusers documentation.
Parameters
save_directory (str or os.PathLike) — Directory to which to save. Will be created if it doesn't exist.
safe_serialization (bool, optional, defaults to True) — Whether to save the model using safetensors or the traditional PyTorch way (that uses pickle).
variant (str, optional) — If specified, weights are saved in the format pytorch_model.<variant>.bin.
push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
kwargs (Dict[str, Any], optional) — Additional keyword arguments passed along to the ~utils.PushToHubMixin.push_to_hub method.
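A minimal round-trip sketch, assuming the pipeline object from above (the directory name is illustrative):

# Save the pipeline weights and its Gaudi configuration locally
pipeline.save_pretrained("./sd-gaudi", safe_serialization=True)

# Reload it later on Gaudi; depending on the saved files,
# gaudi_config may need to be passed explicitly again
pipeline = GaudiStableDiffusionPipeline.from_pretrained("./sd-gaudi", use_habana=True)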
GaudiDDIMScheduler
( num_train_timesteps: int = 1000 beta_start: float = 0.0001 beta_end: float = 0.02 beta_schedule: str = 'linear' trained_betas: typing.Union[numpy.ndarray, typing.List[float], NoneType] = None clip_sample: bool = True set_alpha_to_one: bool = True steps_offset: int = 0 prediction_type: str = 'epsilon' thresholding: bool = False dynamic_thresholding_ratio: float = 0.995 clip_sample_range: float = 1.0 sample_max_value: float = 1.0 timestep_spacing: str = 'leading' rescale_betas_zero_snr: bool = False )
Parameters
num_train_timesteps (int, defaults to 1000) — The number of diffusion steps to train the model.
beta_start (float, defaults to 0.0001) — The starting beta value of inference.
beta_end (float, defaults to 0.02) — The final beta value.
beta_schedule (str, defaults to "linear") — The beta schedule, a mapping from a beta range to a sequence of betas for stepping the model. Choose from linear, scaled_linear, or squaredcos_cap_v2.
trained_betas (np.ndarray, optional) — Pass an array of betas directly to the constructor to bypass beta_start and beta_end.
clip_sample (bool, defaults to True) — Clip the predicted sample for numerical stability.
clip_sample_range (float, defaults to 1.0) — The maximum magnitude for sample clipping. Valid only when clip_sample=True.
set_alpha_to_one (bool, defaults to True) — Each diffusion step uses the alphas product value at that step and at the previous one. For the final step there is no previous alpha. When this option is True the previous alpha product is fixed to 1, otherwise it uses the alpha value at step 0.
steps_offset (int, defaults to 0) — An offset added to the inference steps. You can use a combination of offset=1 and set_alpha_to_one=False to make the last step use step 0 for the previous alpha product, as in Stable Diffusion.
prediction_type (str, optional, defaults to "epsilon") — Prediction type of the scheduler function; can be epsilon (predicts the noise of the diffusion process), sample (directly predicts the noisy sample), or v_prediction (see section 2.4 of the Imagen Video paper).
thresholding (bool, defaults to False) — Whether to use the "dynamic thresholding" method. This is unsuitable for latent-space diffusion models such as Stable Diffusion.
dynamic_thresholding_ratio (float, defaults to 0.995) — The ratio for the dynamic thresholding method. Valid only when thresholding=True.
sample_max_value (float, defaults to 1.0) — The threshold value for dynamic thresholding. Valid only when thresholding=True.
timestep_spacing (str, defaults to "leading") — The way the timesteps should be scaled. Refer to Table 2 of the Common Diffusion Noise Schedules and Sample Steps are Flawed paper for more information.
rescale_betas_zero_snr (bool, defaults to False) — Whether to rescale the betas to have zero terminal SNR. This enables the model to generate very bright and dark samples instead of limiting it to samples with medium brightness.
Extends Diffusers' DDIMScheduler to run optimally on Gaudi:
- All time-dependent parameters are generated at the beginning
- At each time step, tensors are rolled to update the values of the time-dependent parameters
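A short sketch of loading the scheduler with a couple of the config fields above overridden (the checkpoint and the override values are illustrative; overriding is optional):

from optimum.habana.diffusers import GaudiDDIMScheduler

scheduler = GaudiDDIMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint
    subfolder="scheduler",
    timestep_spacing="leading",         # config override, see the parameters above
    rescale_betas_zero_snr=False,
)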
step
( model_output: FloatTensor sample: FloatTensor eta: float = 0.0 use_clipped_model_output: bool = False generator = None variance_noise: typing.Optional[torch.FloatTensor] = None return_dict: bool = True ) → diffusers.schedulers.scheduling_utils.DDIMSchedulerOutput or tuple
Parameters
model_output (torch.FloatTensor) — The direct output from the learned diffusion model.
timestep (float) — The current discrete timestep in the diffusion chain.
sample (torch.FloatTensor) — A current instance of a sample created by the diffusion process.
eta (float) — The weight of noise for added noise in the diffusion step.
use_clipped_model_output (bool, defaults to False) — If True, computes a "corrected" model_output from the clipped predicted original sample. Necessary because the predicted original sample is clipped to [-1, 1] when self.config.clip_sample is True. If no clipping has happened, the "corrected" model_output would coincide with the one provided as input and use_clipped_model_output has no effect.
generator (torch.Generator, optional) — A random number generator.
variance_noise (torch.FloatTensor) — Alternative to generating noise with generator by directly providing the noise for the variance itself. Useful for methods such as CycleDiffusion.
return_dict (bool, optional, defaults to True) — Whether or not to return a DDIMSchedulerOutput or tuple.
Returns
diffusers.schedulers.scheduling_utils.DDIMSchedulerOutput or tuple
If return_dict is True, DDIMSchedulerOutput is returned, otherwise a tuple is returned where the first element is the sample tensor.
Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion process from the learned model outputs (most often the predicted noise).
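As a reference for what this step computes, Eq. (12) of the DDIM paper can be sketched as follows, where \alpha_t denotes the cumulative alpha product, \epsilon_\theta(x_t) is the model output, and \sigma_t is scaled by eta (\sigma_t = 0 gives the deterministic DDIM update):

x_{t-1} = \sqrt{\alpha_{t-1}}\,\frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(x_t)}{\sqrt{\alpha_t}} + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t) + \sigma_t\,\varepsilon_t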
The GaudiStableDiffusionUpscalePipeline is used to enhance the resolution of input images by a factor of 4 on HPUs. It inherits from the GaudiDiffusionPipeline class, which is the parent of any kind of diffuser pipeline.
( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel low_res_scheduler: DDPMScheduler scheduler: KarrasDiffusionSchedulers safety_checker: typing.Optional[typing.Any] = None feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] = None use_habana: bool = False use_hpu_graphs: bool = False gaudi_config: typing.Union[str, optimum.habana.transformers.gaudi_configuration.GaudiConfig] = None bf16_full_eval: bool = False watermarker: typing.Optional[typing.Any] = None max_noise_level: int = 350 )
Parameters
vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text-encoder. Stable Diffusion uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.
tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the encoded image latents.
low_res_scheduler (SchedulerMixin) — A scheduler used to add initial noise to the low-resolution conditioning image. It must be an instance of DDPMScheduler.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
safety_checker (StableDiffusionSafetyChecker) — Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the model card for details.
feature_extractor (CLIPImageProcessor) — Model that extracts features from generated images to be used as inputs for the safety_checker.
use_habana (bool, defaults to False) — Whether to use Gaudi (True) or CPU (False).
use_hpu_graphs (bool, defaults to False) — Whether to use HPU graphs or not.
gaudi_config (Union[str, GaudiConfig], defaults to None) — Gaudi configuration to use. Can be a string to download it from the Hub, or a previously initialized config can be passed.
bf16_full_eval (bool, defaults to False) — Whether to use full bfloat16 evaluation instead of 32-bit. This will be faster and save memory compared to fp32/mixed precision, but can harm generated images.
Pipeline for text-guided image super-resolution using Stable Diffusion 2.
Extends the StableDiffusionUpscalePipeline class:
- Generation is performed by batches
- Two mark_step() calls were added to support lazy mode
- Added support for HPU graphs
__call__
( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.FloatTensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.FloatTensor]] = None num_inference_steps: int = 75 guidance_scale: float = 9.0 noise_level: int = 20 negative_prompt: typing.Union[typing.List[str], str, NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 batch_size: int = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None callback_steps: int = 1 cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None ) → GaudiStableDiffusionPipelineOutput or tuple
Parameters
prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
image (torch.FloatTensor, PIL.Image.Image, np.ndarray, List[torch.FloatTensor], List[PIL.Image.Image], or List[np.ndarray]) — Image or tensor representing an image batch to be upscaled.
num_inference_steps (int, optional, defaults to 75) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 9.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages generating images that are closely linked to the text prompt, usually at the expense of lower image quality.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
batch_size (int, optional, defaults to 1) — The number of images in a batch.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper. Only applies to schedulers.DDIMScheduler, will be ignored for others.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch.Generator(s) to make generation deterministic.
latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated randomly.
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a GaudiStableDiffusionPipelineOutput instead of a plain tuple.
callback (Callable, optional) — A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function will be called. If not specified, the callback will be called at every step.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
Returns
GaudiStableDiffusionPipelineOutput or tuple
GaudiStableDiffusionPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) content, according to the safety_checker.
Function invoked when calling the pipeline for generation.
Examples:
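A sketch in the spirit of the Diffusers upscaling example (the checkpoint, image URL, and Gaudi configuration name are assumptions):

import requests
from io import BytesIO
from PIL import Image
from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionUpscalePipeline

model_id = "stabilityai/stable-diffusion-x4-upscaler"  # assumed upscaler checkpoint
scheduler = GaudiDDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipeline = GaudiStableDiffusionUpscalePipeline.from_pretrained(
    model_id,
    scheduler=scheduler,
    use_habana=True,
    use_hpu_graphs=True,
    gaudi_config="Habana/stable-diffusion",
)

# Fetch a low-resolution image and upscale it by a factor of 4
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
low_res_img = Image.open(BytesIO(requests.get(url).content)).convert("RGB").resize((128, 128))

upscaled = pipeline(prompt="a white cat", image=low_res_img).images[0]
upscaled.save("upsampled_cat.png")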