Diffusers BOINC AI docs
Audio Diffusion


Last updated 1 year ago

Audio Diffusion is by Robert Dargavel Smith, and it leverages recent advances in image generation from diffusion models by converting audio samples to and from Mel spectrogram images.

The original codebase, training scripts and example notebooks can be found at teticio/audio-diffusion.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

AudioDiffusionPipeline

class diffusers.AudioDiffusionPipeline

( vqvae: AutoencoderKL, unet: UNet2DConditionModel, mel: Mel, scheduler: typing.Union[diffusers.schedulers.scheduling_ddim.DDIMScheduler, diffusers.schedulers.scheduling_ddpm.DDPMScheduler] )

Parameters

  • vqvae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.

  • unet (UNet2DConditionModel) — A UNet2DConditionModel to denoise the encoded image latents.

  • mel (Mel) — Transform audio into a spectrogram.

  • scheduler (DDIMScheduler or DDPMScheduler) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler or DDPMScheduler.

Pipeline for audio diffusion.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

( batch_size: int = 1, audio_file: str = None, raw_audio: np.ndarray = None, slice: int = 0, start_step: int = 0, steps: int = None, generator: torch.Generator = None, mask_start_secs: float = 0, mask_end_secs: float = 0, step_generator: torch.Generator = None, eta: float = 0, noise: torch.Tensor = None, encoding: torch.Tensor = None, return_dict: bool = True ) → List[PIL Image]

Parameters

  • batch_size (int) — Number of samples to generate.

  • audio_file (str) — An audio file that must be on disk due to Librosa limitation.

  • raw_audio (np.ndarray) — The raw audio file as a NumPy array.

  • slice (int) — Slice number of audio to convert.

  • start_step (int) — Step to start diffusion from.

  • steps (int) — Number of denoising steps (defaults to 50 for DDIM and 1000 for DDPM).

  • generator (torch.Generator) — A torch.Generator to make generation deterministic.

  • mask_start_secs (float) — Number of seconds of audio to mask (not generate) at start.

  • mask_end_secs (float) — Number of seconds of audio to mask (not generate) at end.

  • step_generator (torch.Generator) — A torch.Generator used to denoise.

  • eta (float) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler, and is ignored in other schedulers.

  • noise (torch.Tensor) — A noise tensor of shape (batch_size, 1, height, width) or None.

  • encoding (torch.Tensor) — A tensor for UNet2DConditionModel of shape (batch_size, seq_length, cross_attention_dim).

  • return_dict (bool) — Whether or not to return an AudioPipelineOutput, ImagePipelineOutput or a plain tuple.

Returns

List[PIL Image]

A list of Mel spectrogram images, together with a tuple of the sample rate and raw audio (float, List[np.ndarray]).

The call function to the pipeline for generation.

Examples:

For audio diffusion:


import torch
from IPython.display import Audio, display
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256").to(device)

output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))

For latent audio diffusion:


import torch
from IPython.display import Audio, display
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/latent-audio-diffusion-256").to(device)

output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))

For other tasks like variation, inpainting, outpainting, etc:


output = pipe(
    raw_audio=output.audios[0, 0],
    start_step=int(pipe.get_default_steps() / 2),
    mask_start_secs=1,
    mask_end_secs=1,
)
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
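Internally, mask_start_secs and mask_end_secs are converted into a number of spectrogram columns before masking. A minimal sketch of that conversion, assuming the default Mel settings (sample_rate=22050, hop_length=512); the helper name secs_to_pixels is illustrative, not part of the API:

```python
def secs_to_pixels(secs: float, sample_rate: int = 22050, hop_length: int = 512) -> int:
    # Each spectrogram column (one pixel along the time axis) advances
    # hop_length audio samples, so seconds map to columns like this:
    return int(secs * sample_rate / hop_length)

# At the defaults, one second of audio covers about 43 spectrogram columns.
```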

encode

( images: typing.List[PIL.Image.Image], steps: int = 50 ) → np.ndarray

Parameters

  • images (List[PIL Image]) — List of images to encode.

  • steps (int) — Number of encoding steps to perform (defaults to 50).

Returns

np.ndarray

A noise tensor of shape (batch_size, 1, height, width).

Reverse the denoising step process to recover a noisy image from the generated image.

get_default_steps

( ) → int

Returns

int

The number of steps.

Returns default number of steps recommended for inference.

slerp

( x0: torch.Tensor, x1: torch.Tensor, alpha: float ) → torch.Tensor

Parameters

  • x0 (torch.Tensor) — The first tensor to interpolate between.

  • x1 (torch.Tensor) — Second tensor to interpolate between.

  • alpha (float) — Interpolation factor between 0 and 1.

Returns

torch.Tensor

The interpolated tensor.

Spherical Linear intERPolation.
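The formula behind slerp can be illustrated with a small NumPy sketch (an assumption-level re-implementation for intuition, not the pipeline's torch method; parallel inputs would need a linear-interpolation fallback):

```python
import numpy as np

def slerp(x0: np.ndarray, x1: np.ndarray, alpha: float) -> np.ndarray:
    # Angle between the two flattened tensors
    theta = np.arccos(np.clip(
        np.dot(x0.ravel(), x1.ravel())
        / (np.linalg.norm(x0) * np.linalg.norm(x1)),
        -1.0, 1.0,
    ))
    # Spherical interpolation keeps the result on the arc between x0 and x1,
    # which preserves the norm of Gaussian noise better than a straight lerp
    return (np.sin((1 - alpha) * theta) * x0 + np.sin(alpha * theta) * x1) / np.sin(theta)
```

Interpolating between two noise tensors this way, then denoising each intermediate, produces a smooth morph between two generated audio samples.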

AudioPipelineOutput

class diffusers.AudioPipelineOutput

( audios: ndarray )

Parameters

  • audios (np.ndarray) — List of denoised audio samples as a NumPy array of shape (batch_size, num_channels, sample_rate).

Output class for audio pipelines.

ImagePipelineOutput

class diffusers.ImagePipelineOutput

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )

Parameters

  • images (List[PIL.Image.Image] or np.ndarray) — List of denoised PIL images of length batch_size or NumPy array of shape (batch_size, height, width, num_channels).

Output class for image pipelines.

Mel

class diffusers.Mel

( x_res: int = 256, y_res: int = 256, sample_rate: int = 22050, n_fft: int = 2048, hop_length: int = 512, top_db: int = 80, n_iter: int = 32 )

Parameters

  • x_res (int) — x resolution of spectrogram (time).

  • y_res (int) — y resolution of spectrogram (frequency bins).

  • sample_rate (int) — Sample rate of audio.

  • n_fft (int) — Number of samples per Fast Fourier Transform (FFT window size).

  • hop_length (int) — Hop length (a higher number is recommended if y_res < 256).

  • top_db (int) — Loudest decibel value.

  • n_iter (int) — Number of iterations for Griffin-Lim Mel inversion.

audio_slice_to_image

( slice: int ) → PIL Image

Parameters

  • slice (int) — Slice number of audio to convert (out of get_number_of_slices()).

Returns

PIL Image

A grayscale image of x_res x y_res.

Convert slice of audio to spectrogram.

get_audio_slice

( slice: int = 0 ) → np.ndarray

Parameters

  • slice (int) — Slice number of audio (out of get_number_of_slices()).

Returns

np.ndarray

The audio slice as a NumPy array.

Get slice of audio.

get_number_of_slices

( ) → int

Returns

int

Number of spectrograms the audio can be sliced into.

Get number of slices in audio.
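The slice count follows directly from the Mel resolution: each slice covers x_res spectrogram columns of hop_length samples each, and any trailing audio shorter than a full slice is dropped. A sketch assuming the default settings (the helper names are illustrative, not part of the Mel API):

```python
def samples_per_slice(x_res: int = 256, hop_length: int = 512) -> int:
    # One spectrogram image spans x_res columns, each advancing hop_length samples
    return x_res * hop_length

def number_of_slices(audio_length: int, x_res: int = 256, hop_length: int = 512) -> int:
    # Integer division: trailing audio shorter than a full slice is discarded
    return audio_length // samples_per_slice(x_res, hop_length)

# At 22050 Hz, one slice is 256 * 512 / 22050 ≈ 5.94 seconds of audio
```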

get_sample_rate

( ) → int

Returns

int

Sample rate of audio.

Get sample rate.

image_to_audio

( image: Image ) → audio (np.ndarray)

Parameters

  • image (PIL Image) — A grayscale image of x_res x y_res.

Returns

audio (np.ndarray)

The audio as a NumPy array.

Converts spectrogram to audio.

load_audio

( audio_file: str = None, raw_audio: np.ndarray = None )

Parameters

  • audio_file (str) — An audio file that must be on disk due to Librosa limitation.

  • raw_audio (np.ndarray) — The raw audio file as a NumPy array.

Load audio.

set_resolution

( x_res: int, y_res: int )

Parameters

  • x_res (int) — x resolution of spectrogram (time).

  • y_res (int) — y resolution of spectrogram (frequency bins).

Set resolution.
