Diffusers BOINC AI docs

AudioLDM

Last updated 1 year ago

AudioLDM was proposed in AudioLDM: Text-to-Audio Generation with Latent Diffusion Models by Haohe Liu et al. Inspired by Stable Diffusion, AudioLDM is a text-to-audio latent diffusion model (LDM) that learns continuous audio representations from CLAP latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech, and music.

The abstract from the paper is:

Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at https://audioldm.github.io.

The original codebase can be found at haoheliu/AudioLDM.

Tips

When constructing a prompt, keep in mind:

  • Descriptive prompt inputs work best; you can use adjectives to describe the sound (for example, “high quality” or “clear”) and make the prompt context specific (for example, “water stream in a forest” instead of “stream”).

  • It’s best to use general terms like “cat” or “dog” instead of specific names or abstract objects the model may not be familiar with.

During inference:

  • The quality of the predicted audio sample can be controlled by the num_inference_steps argument; more steps give higher-quality audio at the expense of slower inference.

  • The length of the predicted audio sample can be controlled by varying the audio_length_in_s argument.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
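
The prompt and inference tips above can be gathered into one place before calling the pipeline. The sketch below is only an illustration: build_call_kwargs is a hypothetical helper, not part of Diffusers, and the prompt strings are examples. It uses only argument names documented on this page and runs without a model download.

```python
from typing import Any, Dict, Optional


def build_call_kwargs(
    prompt: str,
    negative_prompt: Optional[str] = None,
    num_inference_steps: int = 10,
    audio_length_in_s: float = 5.12,
    guidance_scale: float = 2.5,
) -> Dict[str, Any]:
    """Hypothetical helper: gather keyword arguments for an AudioLDMPipeline call."""
    if num_inference_steps < 1:
        raise ValueError("num_inference_steps must be at least 1")
    if audio_length_in_s <= 0:
        raise ValueError("audio_length_in_s must be positive")

    kwargs: Dict[str, Any] = {
        "prompt": prompt,
        "num_inference_steps": num_inference_steps,
        "audio_length_in_s": audio_length_in_s,
        "guidance_scale": guidance_scale,
    }
    # A negative prompt only matters when guidance is enabled (guidance_scale > 1).
    if negative_prompt is not None and guidance_scale > 1:
        kwargs["negative_prompt"] = negative_prompt
    return kwargs


# Fast draft: default step count and length, context-specific prompt.
draft = build_call_kwargs("water stream in a forest, high quality, clear")

# Slower, longer render: more denoising steps and a longer clip.
final = build_call_kwargs(
    "water stream in a forest, high quality, clear",
    negative_prompt="low quality",
    num_inference_steps=50,
    audio_length_in_s=10.0,
)
```

With a pipeline loaded as in the Examples section, the result would be passed as pipe(**final).audios[0].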

AudioLDMPipeline

class diffusers.AudioLDMPipeline

( vae: AutoencoderKL, text_encoder: ClapTextModelWithProjection, tokenizer: typing.Union[transformers.models.roberta.tokenization_roberta.RobertaTokenizer, transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast], unet: UNet2DConditionModel, scheduler: KarrasDiffusionSchedulers, vocoder: SpeechT5HifiGan )

Parameters

  • tokenizer (PreTrainedTokenizer) — A RobertaTokenizer to tokenize text.

  • vocoder (SpeechT5HifiGan) — Vocoder of class SpeechT5HifiGan.

Pipeline for text-to-audio generation using AudioLDM.

__call__

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to guide audio generation. If not defined, you need to pass prompt_embeds.

  • audio_length_in_s (float, optional, defaults to 5.12) — The length of the generated audio sample in seconds.

  • num_inference_steps (int, optional, defaults to 10) — The number of denoising steps. More denoising steps usually lead to higher-quality audio at the expense of slower inference.

  • guidance_scale (float, optional, defaults to 2.5) — A higher guidance scale value encourages the model to generate audio that is closely linked to the text prompt at the expense of lower sound quality. Guidance scale is enabled when guidance_scale > 1.

  • negative_prompt (str or List[str], optional) — The prompt or prompts to guide what to not include in audio generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale <= 1).

  • num_waveforms_per_prompt (int, optional, defaults to 1) — The number of waveforms to generate per prompt.

  • latents (torch.FloatTensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for audio generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.

  • prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.

  • negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.

  • callback (Callable, optional) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).

  • callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.

  • output_type (str, optional, defaults to "np") — The output format of the generated audio. Choose between "np" to return a NumPy np.ndarray or "pt" to return a PyTorch torch.Tensor object.
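
As an illustration of the callback and callback_steps arguments, the sketch below defines a callable that matches the documented callback(step, timestep, latents) signature and records what it receives. LatentLogger and _FakeLatents are hypothetical names introduced here; the stand-in latents object and its shape exist only so the sketch runs without torch or a model download.

```python
from typing import Any, List, Tuple


class LatentLogger:
    """Records (step, timestep, latent shape) each time it is invoked."""

    def __init__(self) -> None:
        self.records: List[Tuple[int, int, Tuple[int, ...]]] = []

    def __call__(self, step: int, timestep: int, latents: Any) -> None:
        # In a real run `latents` is a torch.FloatTensor; only `.shape` is read here.
        self.records.append((int(step), int(timestep), tuple(latents.shape)))


class _FakeLatents:
    """Stand-in for a latents tensor; the shape is a made-up example."""

    shape = (1, 8, 128, 16)


# Simulate three denoising steps calling the callback.
logger = LatentLogger()
for step, timestep in enumerate([999, 899, 799]):
    logger(step, timestep, _FakeLatents())
```

With a loaded pipeline, the same object would be passed as pipe(prompt, callback=logger, callback_steps=2), in which case it is invoked every second step.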

The call function to the pipeline for generation.

Examples:

>>> from diffusers import AudioLDMPipeline
>>> import torch
>>> import scipy.io.wavfile

>>> repo_id = "cvssp/audioldm-s-full-v2"
>>> pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda")

>>> prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
>>> audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]

>>> # save the audio sample as a .wav file
>>> scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)

disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method will go back to computing decoding in one step.

enable_vae_slicing

( )

Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
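
The idea behind sliced decoding can be shown with a toy example. This is a conceptual sketch only, not the library's implementation: decode_in_slices and toy_decode are hypothetical names, and plain Python lists stand in for tensors. The batch is split into slices, each slice is decoded separately, and the results are concatenated, so the decoder never sees more than slice_size items at once.

```python
from typing import Callable, List, Sequence


def decode_in_slices(
    decode_batch: Callable[[Sequence[float]], List[float]],
    latents: Sequence[float],
    slice_size: int = 1,
) -> List[float]:
    """Decode `latents` a slice at a time and concatenate the results."""
    if slice_size < 1:
        raise ValueError("slice_size must be at least 1")
    decoded: List[float] = []
    for start in range(0, len(latents), slice_size):
        decoded.extend(decode_batch(latents[start:start + slice_size]))
    return decoded


# Toy decoder that records how many items it was handed per call.
seen_batch_sizes: List[int] = []


def toy_decode(batch: Sequence[float]) -> List[float]:
    seen_batch_sizes.append(len(batch))
    return [2.0 * x for x in batch]


latents = [0.1, 0.2, 0.3, 0.4]
sliced = decode_in_slices(toy_decode, latents)
```

The output matches decoding the whole batch at once, while each call handles a single item, which is why slicing lowers peak memory and allows larger batch sizes.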

AudioPipelineOutput

class diffusers.AudioPipelineOutput

( audios: ndarray )

Parameters

  • audios (np.ndarray) — Denoised audio samples as a NumPy array of shape (batch_size, num_channels, sample_rate).

Output class for audio pipelines.
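
If SciPy is not available, a single generated waveform can also be written with Python's standard wave module. The sketch below rests on several assumptions: write_wav_16bit is a hypothetical helper, the input is a one-dimensional sequence of floats (such as audios[0] in the example above) with values roughly in [-1, 1], and the 16000 Hz rate is taken from that example. A synthetic tone stands in for real pipeline output.

```python
import array
import math
import os
import tempfile
import wave
from typing import Iterable


def write_wav_16bit(path: str, samples: Iterable[float], sample_rate: int = 16000) -> int:
    """Write mono float samples as 16-bit PCM and return the number of frames."""
    pcm = array.array("h")  # signed 16-bit integers
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overflow
        pcm.append(int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)


# Stand-in for `pipe(...).audios[0]`: one second of a 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
wav_path = os.path.join(tempfile.gettempdir(), "audioldm_tone.wav")
frames_written = write_wav_16bit(wav_path, tone)
```

Converting to 16-bit PCM loses a little precision compared with writing floats through scipy.io.wavfile.write, but the resulting file plays in almost any audio tool.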

Additional AudioLDMPipeline parameters

  • vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.

  • text_encoder (ClapTextModelWithProjection) — Frozen text-encoder (ClapTextModelWithProjection, specifically the laion/clap-htsat-unfused variant).

  • unet (UNet2DConditionModel) — A UNet2DConditionModel to denoise the encoded audio latents.

  • scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded audio latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

Full __call__ signature

( prompt: typing.Union[str, typing.List[str]] = None, audio_length_in_s: typing.Optional[float] = None, num_inference_steps: int = 10, guidance_scale: float = 2.5, negative_prompt: typing.Union[str, typing.List[str], NoneType] = None, num_waveforms_per_prompt: typing.Optional[int] = 1, eta: float = 0.0, generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None, latents: typing.Optional[torch.FloatTensor] = None, prompt_embeds: typing.Optional[torch.FloatTensor] = None, negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None, return_dict: bool = True, callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None, callback_steps: typing.Optional[int] = 1, cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None, output_type: typing.Optional[str] = 'np' ) → AudioPipelineOutput or tuple

Additional __call__ parameters

  • eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler, and is ignored in other schedulers.

  • generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.

  • return_dict (bool, optional, defaults to True) — Whether or not to return an AudioPipelineOutput instead of a plain tuple.

  • cross_attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined in self.processor.

Returns

AudioPipelineOutput or tuple

If return_dict is True, AudioPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated audio.
