Diffusers BOINC AI docs
  • 🌍GET STARTED
    • Diffusers
    • Quicktour
    • Effective and efficient diffusion
    • Installation
  • 🌍TUTORIALS
    • Overview
    • Understanding models and schedulers
    • AutoPipeline
    • Train a diffusion model
  • 🌍USING DIFFUSERS
    • 🌍LOADING & HUB
      • Overview
      • Load pipelines, models, and schedulers
      • Load and compare different schedulers
      • Load community pipelines
      • Load safetensors
      • Load different Stable Diffusion formats
      • Push files to the Hub
    • 🌍TASKS
      • Unconditional image generation
      • Text-to-image
      • Image-to-image
      • Inpainting
      • Depth-to-image
    • 🌍TECHNIQUES
      • Textual inversion
      • Distributed inference with multiple GPUs
      • Improve image quality with deterministic generation
      • Control image brightness
      • Prompt weighting
    • 🌍PIPELINES FOR INFERENCE
      • Overview
      • Stable Diffusion XL
      • ControlNet
      • Shap-E
      • DiffEdit
      • Distilled Stable Diffusion inference
      • Create reproducible pipelines
      • Community pipelines
      • How to contribute a community pipeline
    • 🌍TRAINING
      • Overview
      • Create a dataset for training
      • Adapt a model to a new task
      • Unconditional image generation
      • Textual Inversion
      • DreamBooth
      • Text-to-image
      • Low-Rank Adaptation of Large Language Models (LoRA)
      • ControlNet
      • InstructPix2Pix Training
      • Custom Diffusion
      • T2I-Adapters
    • 🌍TAKING DIFFUSERS BEYOND IMAGES
      • Other Modalities
  • 🌍OPTIMIZATION/SPECIAL HARDWARE
    • Overview
    • Memory and Speed
    • Torch2.0 support
    • Stable Diffusion in JAX/Flax
    • xFormers
    • ONNX
    • OpenVINO
    • Core ML
    • MPS
    • Habana Gaudi
    • Token Merging
  • 🌍CONCEPTUAL GUIDES
    • Philosophy
    • Controlled generation
    • How to contribute?
    • Diffusers' Ethical Guidelines
    • Evaluating Diffusion Models
  • 🌍API
    • 🌍MAIN CLASSES
      • Attention Processor
      • Diffusion Pipeline
      • Logging
      • Configuration
      • Outputs
      • Loaders
      • Utilities
      • VAE Image Processor
    • 🌍MODELS
      • Overview
      • UNet1DModel
      • UNet2DModel
      • UNet2DConditionModel
      • UNet3DConditionModel
      • VQModel
      • AutoencoderKL
      • AsymmetricAutoencoderKL
      • Tiny AutoEncoder
      • Transformer2D
      • Transformer Temporal
      • Prior Transformer
      • ControlNet
    • 🌍PIPELINES
      • Overview
      • AltDiffusion
      • Attend-and-Excite
      • Audio Diffusion
      • AudioLDM
      • AudioLDM 2
      • AutoPipeline
      • Consistency Models
      • ControlNet
      • ControlNet with Stable Diffusion XL
      • Cycle Diffusion
      • Dance Diffusion
      • DDIM
      • DDPM
      • DeepFloyd IF
      • DiffEdit
      • DiT
      • IF
      • PaInstructPix2Pix
      • Kandinsky
      • Kandinsky 2.2
      • Latent Diffusionge
      • MultiDiffusion
      • MusicLDM
      • PaintByExample
      • Parallel Sampling of Diffusion Models
      • Pix2Pix Zero
      • PNDM
      • RePaint
      • Score SDE VE
      • Self-Attention Guidance
      • Semantic Guidance
      • Shap-E
      • Spectrogram Diffusion
      • 🌍STABLE DIFFUSION
        • Overview
        • Text-to-image
        • Image-to-image
        • Inpainting
        • Depth-to-image
        • Image variation
        • Safe Stable Diffusion
        • Stable Diffusion 2
        • Stable Diffusion XL
        • Latent upscaler
        • Super-resolution
        • LDM3D Text-to-(RGB, Depth)
        • Stable Diffusion T2I-adapter
        • GLIGEN (Grounded Language-to-Image Generation)
      • Stable unCLIP
      • Stochastic Karras VE
      • Text-to-image model editing
      • Text-to-video
      • Text2Video-Zero
      • UnCLIP
      • Unconditional Latent Diffusion
      • UniDiffuser
      • Value-guided sampling
      • Versatile Diffusion
      • VQ Diffusion
      • Wuerstchen
    • 🌍SCHEDULERS
      • Overview
      • CMStochasticIterativeScheduler
      • DDIMInverseScheduler
      • DDIMScheduler
      • DDPMScheduler
      • DEISMultistepScheduler
      • DPMSolverMultistepInverse
      • DPMSolverMultistepScheduler
      • DPMSolverSDEScheduler
      • DPMSolverSinglestepScheduler
      • EulerAncestralDiscreteScheduler
      • EulerDiscreteScheduler
      • HeunDiscreteScheduler
      • IPNDMScheduler
      • KarrasVeScheduler
      • KDPM2AncestralDiscreteScheduler
      • KDPM2DiscreteScheduler
      • LMSDiscreteScheduler
      • PNDMScheduler
      • RePaintScheduler
      • ScoreSdeVeScheduler
      • ScoreSdeVpScheduler
      • UniPCMultistepScheduler
      • VQDiffusionScheduler
Powered by GitBook
On this page
  • Text-to-image
  • Hardware requirements
  • Upload model to Hub
  • Save and load checkpoints
  • Fine-tuning
  • Training with Min-SNR weighting
  • LoRA
  • Inference
  • Stable Diffusion XL
  1. USING DIFFUSERS
  2. TRAINING

Text-to-image

PreviousDreamBoothNextLow-Rank Adaptation of Large Language Models (LoRA)

Last updated 1 year ago

Text-to-image

The text-to-image fine-tuning script is experimental. It’s easy to overfit and run into issues like catastrophic forgetting. We recommend you explore different hyperparameters to get the best results on your dataset.

Text-to-image models like Stable Diffusion generate an image from a text prompt. This guide will show you how to finetune the model on your own dataset with PyTorch and Flax. All the training scripts for text-to-image finetuning used in this guide can be found in this if you’re interested in taking a closer look.

Before running the scripts, make sure to install the library’s training dependencies:

Copied

pip install git+https://github.com/boincai/diffusers.git
pip install -U -r requirements.txt

And initialize an 🌍 environment with:

Copied

accelerate config

If you have already cloned the repo, then you won’t need to go through these steps. Instead, you can pass the path to your local checkout to the training script and it will be loaded from there.

Hardware requirements

Using gradient_checkpointing and mixed_precision, it should be possible to finetune the model on a single 24GB GPU. For higher batch_size’s and faster training, it’s better to use GPUs with more than 30GB of GPU memory. You can also use JAX/Flax for fine-tuning on TPUs or GPUs, which will be covered .

You can reduce your memory footprint even more by enabling memory efficient attention with xFormers. Make sure you have and pass the --enable_xformers_memory_efficient_attention flag to the training script.

xFormers is not available for Flax.

Upload model to Hub

Store your model on the Hub by adding the following argument to the training script:

Copied

  --push_to_hub

Save and load checkpoints

It is a good idea to regularly save checkpoints in case anything happens during training. To save a checkpoint, pass the following argument to the training script:

Copied

  --checkpointing_steps=500

Every 500 steps, the full training state is saved in a subfolder in the output_dir. The checkpoint has the format checkpoint- followed by the number of steps trained so far. For example, checkpoint-1500 is a checkpoint saved after 1500 training steps.

To load a checkpoint to resume training, pass the argument --resume_from_checkpoint to the training script and specify the checkpoint you want to resume from. For example, the following argument resumes training from the checkpoint saved after 1500 training steps:

Copied

  --resume_from_checkpoint="checkpoint-1500"

Fine-tuning

PytorchHide Pytorch content

Copied

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export dataset_name="lambdalabs/pokemon-blip-captions"

accelerate launch --mixed_precision="fp16"  train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model" \
  --push_to_hub

Modify the script if you want to use custom loading logic. We left pointers in the code in the appropriate places to help you. 🌍 The example script below shows how to finetune on a local dataset in TRAIN_DIR and where to save the model to in OUTPUT_DIR:

Copied

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export TRAIN_DIR="path_to_your_dataset"
export OUTPUT_DIR="path_to_save_model"

accelerate launch train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$TRAIN_DIR \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --mixed_precision="fp16" \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" 
  --lr_warmup_steps=0 \
  --output_dir=${OUTPUT_DIR} \
  --push_to_hub

Training with multiple GPUs

Copied

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export dataset_name="lambdalabs/pokemon-blip-captions"

accelerate launch --mixed_precision="fp16" --multi_gpu  train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \ 
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model" \
  --push_to_hub

JAXHide JAX content

Before running the script, make sure you have the requirements installed:

Copied

pip install -U -r requirements_flax.txt

Copied

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export dataset_name="lambdalabs/pokemon-blip-captions"

python train_text_to_image_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --output_dir="sd-pokemon-model" \
  --push_to_hub

Modify the script if you want to use custom loading logic. We left pointers in the code in the appropriate places to help you. 🌍 The example script below shows how to finetune on a local dataset in TRAIN_DIR:

Copied

export MODEL_NAME="duongna/stable-diffusion-v1-4-flax"
export TRAIN_DIR="path_to_your_dataset"

python train_text_to_image_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$TRAIN_DIR \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --mixed_precision="fp16" \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --output_dir="sd-pokemon-model" \
  --push_to_hub

Training with Min-SNR weighting

  • Training without the Min-SNR weighting strategy

  • Training with the Min-SNR weighting strategy (snr_gamma set to 5.0)

  • Training with the Min-SNR weighting strategy (snr_gamma set to 1.0)

For our small Pokemons dataset, the effects of Min-SNR weighting strategy might not appear to be pronounced, but for larger datasets, we believe the effects will be more pronounced.

Also, note that in this example, we either predict epsilon (i.e., the noise) or the v_prediction. For both of these cases, the formulation of the Min-SNR weighting strategy that we have used holds.

Training with Min-SNR weighting strategy is only supported in PyTorch.

LoRA

Inference

PytorchHide Pytorch contentCopied

from diffusers import StableDiffusionPipeline

model_path = "path_to_saved_model"
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16, use_safetensors=True)
pipe.to("cuda")

image = pipe(prompt="yoda").images[0]
image.save("yoda-pokemon.png")

JAXHide JAX contentCopied

import jax
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline

model_path = "path_to_saved_model"
pipe, params = FlaxStableDiffusionPipeline.from_pretrained(model_path, dtype=jax.numpy.bfloat16)

prompt = "yoda pokemon"
prng_seed = jax.random.PRNGKey(0)
num_inference_steps = 50

num_samples = jax.device_count()
prompt = num_samples * [prompt]
prompt_ids = pipeline.prepare_inputs(prompt)

# shard inputs and rng
params = replicate(params)
prng_seed = jax.random.split(prng_seed, jax.device_count())
prompt_ids = shard(prompt_ids)

images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
image.save("yoda-pokemon.png")

Stable Diffusion XL

Launch the for a fine-tuning run on the dataset like this.

Specify the MODEL_NAME environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the argument.

To finetune on your own dataset, prepare the dataset according to the format required by 🌍 . You can , or you can .

accelerate allows for seamless multi-GPU training. Follow the instructions for running distributed training with accelerate. Here is an example command:

With Flax, it’s possible to train a Stable Diffusion model faster on TPUs and GPUs thanks to . This is very efficient on TPU hardware but works great on GPUs too. The Flax training script doesn’t support features like gradient checkpointing or gradient accumulation yet, so you’ll need a GPU with at least 30GB of memory or a TPU v3.

Specify the MODEL_NAME environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the argument.

Now you can launch the like this:

To finetune on your own dataset, prepare the dataset according to the format required by 🌍 . You can , or you can .

We support training with the Min-SNR weighting strategy proposed in which helps to achieve faster convergence by rebalancing the loss. In order to use it, one needs to set the --snr_gamma argument. The recommended value when using it is 5.0.

You can find that compares the loss surfaces of the following setups:

You can also use Low-Rank Adaptation of Large Language Models (LoRA), a fine-tuning technique for accelerating training large models, for fine-tuning text-to-image models. For more details, take a look at the guide.

Now you can load the fine-tuned model for inference by passing the model path or model name on the Hub to the :

We support fine-tuning the UNet shipped in via the train_text_to_image_sdxl.py script. Please refer to the docs .

We also support fine-tuning of the UNet and Text Encoder shipped in with LoRA via the train_text_to_image_lora_sdxl.py script. Please refer to the docs .

🌍
🌍
CompVis/stable-diffusion-v1-4
repository
Accelerate
below
xFormers installed
PyTorch training script
PokΓ©mon BLIP captions
pretrained_model_name_or_path
Datasets
upload your dataset to the Hub
prepare a local folder with your files
here
@duongna211
pretrained_model_name_or_path
Flax training script
Datasets
upload your dataset to the Hub
prepare a local folder with your files
Efficient Diffusion Training via Min-SNR Weighting Strategy
this project on Weights and Biases
LoRA training
StableDiffusionPipeline
Stable Diffusion XL
here
Stable Diffusion XL
here