Textual Inversion


Textual Inversion is a technique for capturing novel concepts from a small number of example images. While the technique was originally demonstrated with a latent diffusion model, it has since been applied to other model variants like Stable Diffusion. The learned concepts can be used to better control the images generated from text-to-image pipelines. It learns new “words” in the text encoder’s embedding space, which are used within text prompts for personalized image generation.

Textual Inversion example

By using just 3-5 images you can teach new concepts to a model such as Stable Diffusion for personalized image generation.

This guide will show you how to train a model with Textual Inversion. All the training scripts for Textual Inversion used in this guide can be found here if you’re interested in taking a closer look at how things work under the hood.

There is a community-created collection of trained Textual Inversion models in the Stable Diffusion Textual Inversion Concepts Library which are readily available for inference. Over time, this’ll hopefully grow into a useful resource as more concepts are added!

Before you begin, make sure you install the library’s training dependencies:

pip install diffusers accelerate transformers

After all the dependencies have been set up, initialize a 🌍 Accelerate environment with:

accelerate config

To set up a default 🌍 Accelerate environment without choosing any configurations:

accelerate config default

Or if your environment doesn’t support an interactive shell like a notebook, you can use:


from accelerate.utils import write_basic_config

write_basic_config()

Finally, you can try to install xFormers to reduce your memory footprint with xFormers memory-efficient attention. Once you have xFormers installed, add the --enable_xformers_memory_efficient_attention argument to the training script. xFormers is not supported for Flax.

Upload model to Hub

If you want to store your model on the Hub, add the following argument to the training script:


--push_to_hub

Save and load checkpoints

It is often a good idea to regularly save checkpoints of your model during training. This way, you can resume training from a saved checkpoint if your training is interrupted for any reason. To save a checkpoint, pass the following argument to the training script to save the full training state in a subfolder in output_dir every 500 steps:


--checkpointing_steps=500

To resume training from a saved checkpoint, pass the following argument to the training script along with the specific checkpoint you’d like to resume from:

--resume_from_checkpoint="checkpoint-1500"
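For example, here is a sketch of a resumed run that reuses the same values as the launch command shown in the Finetuning section below, with both flags added. The checkpoint name is illustrative and assumes a checkpoint-1500 folder already exists inside output_dir:

# identical to the launch command in the Finetuning section, plus checkpointing and a resume point
accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --train_data_dir="./cat" \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="textual_inversion_cat" \
  --checkpointing_steps=500 \
  --resume_from_checkpoint="checkpoint-1500"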

Finetuning

For your training dataset, download these images of a cat toy and store them in a directory. To use your own dataset, take a look at the Create a dataset for training guide.

from huggingface_hub import snapshot_download

local_dir = "./cat"
snapshot_download(
    "diffusers/cat_toy_example", local_dir=local_dir, repo_type="dataset", ignore_patterns=".gitattributes"
)

PyTorch

Specify the MODEL_NAME environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the pretrained_model_name_or_path argument, and set the DATA_DIR environment variable to the path of the directory containing the images.

Now you can launch the training script. The script creates and saves the following files to your repository: learned_embeds.bin, token_identifier.txt, and type_of_concept.txt.

💡 A full training run takes ~1 hour on one V100 GPU. While you’re waiting for the training to complete, feel free to check out how Textual Inversion works in the section below if you’re curious!

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATA_DIR="./cat"

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="textual_inversion_cat" \
  --push_to_hub

💡 If you want to increase the trainable capacity, you can associate your placeholder token, e.g. <cat-toy>, with multiple embedding vectors. This can help the model better capture the style of more complex images. To enable training multiple embedding vectors, simply pass:

--num_vectors=5

JAX

If you have access to TPUs, try out the Flax training script to train even faster (this’ll also work for GPUs). With the same configuration settings, the Flax training script should be at least 70% faster than the PyTorch training script! ⚡️

Before you begin, make sure you install the Flax-specific dependencies:

pip install -U -r requirements_flax.txt

Specify the MODEL_NAME environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the pretrained_model_name_or_path argument.

Then you can launch the training script:

export MODEL_NAME="duongna/stable-diffusion-v1-4-flax"
export DATA_DIR="./cat"

python textual_inversion_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --output_dir="textual_inversion_cat" \
  --push_to_hub

Intermediate logging

If you’re interested in following along with your model training progress, you can save the generated images from the training process. Add the following arguments to the training script to enable intermediate logging:

  • validation_prompt, the prompt used to generate samples (this is set to None by default, which disables intermediate logging)

  • num_validation_images, the number of sample images to generate

  • validation_steps, the number of steps before generating num_validation_images from the validation_prompt


--validation_prompt="A <cat-toy> backpack"
--num_validation_images=4
--validation_steps=100

Inference

Once you have trained a model, you can use it for inference with the StableDiffusionPipeline.

💡 The community has created a large library of different textual inversion embedding vectors, called sd-concepts-library. Instead of training textual inversion embeddings from scratch, you can also check whether a fitting textual inversion embedding has already been added to the library.

The textual inversion script will by default only save the textual inversion embedding vector(s) that have been added to the text encoder embedding matrix and consequently been trained.

PyTorch

To load the textual inversion embeddings, you first need to load the base model that was used when training your textual inversion embedding vectors. Here we assume that runwayml/stable-diffusion-v1-5 was used as a base model, so we load it first:

from diffusers import StableDiffusionPipeline
import torch

model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True).to("cuda")

Next, we need to load the textual inversion embedding vector, which can be done via the TextualInversionLoaderMixin.load_textual_inversion function. Here we’ll load the embeddings of the “<cat-toy>” example from before.

pipe.load_textual_inversion("sd-concepts-library/cat-toy")

Now we can run the pipeline making sure that the placeholder token <cat-toy> is used in our prompt.


prompt = "A <cat-toy> backpack"

image = pipe(prompt, num_inference_steps=50).images[0]
image.save("cat-backpack.png")

The TextualInversionLoaderMixin.load_textual_inversion function can not only load textual embedding vectors saved in Diffusers’ format, but also embedding vectors saved in Automatic1111 format. To do so, first download an embedding vector from civitAI and then load it locally:

pipe.load_textual_inversion("./charturnerv2.pt")
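The embeddings you trained yourself can be loaded the same way. As a minimal sketch, assuming you kept the default output_dir from the training command above (textual_inversion_cat, which contains learned_embeds.bin):

# load the locally trained <cat-toy> embedding from the training output directory
pipe.load_textual_inversion("./textual_inversion_cat")
image = pipe("A <cat-toy> backpack", num_inference_steps=50).images[0]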

JAX

Currently there is no load_textual_inversion function for Flax, so you have to make sure the textual inversion embedding vector is saved as part of the model after training.

The model can then be run just like any other Flax model:


import jax
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline

model_path = "path-to-your-trained-model"
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(model_path, dtype=jax.numpy.bfloat16)

prompt = "A <cat-toy> backpack"
prng_seed = jax.random.PRNGKey(0)
num_inference_steps = 50

num_samples = jax.device_count()
prompt = num_samples * [prompt]
prompt_ids = pipeline.prepare_inputs(prompt)

# shard inputs and rng
params = replicate(params)
prng_seed = jax.random.split(prng_seed, jax.device_count())
prompt_ids = shard(prompt_ids)

images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
images[0].save("cat-backpack.png")

How it works

Architecture overview from the Textual Inversion blog post

Usually, text prompts are tokenized into an embedding before being passed to a model, which is often a transformer. Textual Inversion does something similar, but it learns a new token embedding, v*, from a special token S* in the diagram above. The model output is used to condition the diffusion model, which helps the diffusion model understand the prompt and new concepts from just a few example images.

To do this, Textual Inversion uses a generator model and noisy versions of the training images. The generator tries to predict less noisy versions of the images, and the token embedding v* is optimized based on how well the generator does. If the token embedding successfully captures the new concept, it gives more useful information to the diffusion model and helps create clearer images with less noise. This optimization process typically occurs after several thousand steps of exposure to a variety of prompt and image variants.
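In other words, the embedding v* is chosen to minimize the expected noise-prediction error of the frozen diffusion model when the prompt contains the new token, roughly v* = argmin_v E[ ||ε − ε_θ(z_t, t, c(y; v))||² ]. The snippet below is a deliberately simplified sketch of that objective, not the actual textual_inversion.py script; the model id, the prompt template, and the pixel_values batch (preprocessed training images) are assumptions for illustration:

import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# add the placeholder token S* and initialize its embedding v* from the "toy" token
tokenizer.add_tokens("<cat-toy>")
text_encoder.resize_token_embeddings(len(tokenizer))
embeddings = text_encoder.get_input_embeddings().weight.data
placeholder_id = tokenizer.convert_tokens_to_ids("<cat-toy>")
initializer_id = tokenizer.convert_tokens_to_ids("toy")
embeddings[placeholder_id] = embeddings[initializer_id].clone()

# freeze everything except the token embedding table
vae.requires_grad_(False)
unet.requires_grad_(False)
text_encoder.text_model.encoder.requires_grad_(False)
text_encoder.text_model.final_layer_norm.requires_grad_(False)
text_encoder.text_model.embeddings.position_embedding.requires_grad_(False)

optimizer = torch.optim.AdamW(text_encoder.get_input_embeddings().parameters(), lr=5e-4)

def training_step(pixel_values):
    # pixel_values: a batch of training images scaled to [-1, 1], shape (batch, 3, 512, 512)
    latents = vae.encode(pixel_values).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # condition the UNet on a prompt that contains the placeholder token
    input_ids = tokenizer(
        ["a photo of a <cat-toy>"] * latents.shape[0],
        padding="max_length", max_length=tokenizer.model_max_length, return_tensors="pt",
    ).input_ids
    encoder_hidden_states = text_encoder(input_ids)[0]

    # the UNet is frozen, so gradients only flow back into the token embedding table
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
    loss = F.mse_loss(noise_pred, noise)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # the real script additionally restores every embedding row except <cat-toy>,
    # so only v* is effectively updated
    return loss.item()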
