# Audio Diffusion

[Audio Diffusion](https://github.com/teticio/audio-diffusion) is by Robert Dargavel Smith, and it leverages the recent advances in image generation from diffusion models by converting audio samples to and from Mel spectrogram images.

The original codebase, training scripts and example notebooks can be found at [teticio/audio-diffusion](https://github.com/teticio/audio-diffusion).

Make sure to check out the Schedulers [guide](https://huggingface.co/docs/diffusers/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](https://huggingface.co/docs/diffusers/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
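
For example, here is a minimal sketch of swapping in a [DDIMScheduler](https://huggingface.co/docs/diffusers/v0.21.0/en/api/schedulers/ddim#diffusers.DDIMScheduler) for faster sampling, assuming the `teticio/audio-diffusion-256` checkpoint used in the examples below:

```
from diffusers import DiffusionPipeline, DDIMScheduler

# Load the pipeline and swap the scheduler, reusing the old scheduler's config.
# DDIM needs fewer denoising steps (50 by default) than DDPM (1000).
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
```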

### AudioDiffusionPipeline

#### class diffusers.AudioDiffusionPipeline

[\<source>](https://github.com/huggingface/diffusers/blob/v0.21.0/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py#L30)

( vqvae: AutoencoderKL, unet: UNet2DConditionModel, mel: Mel, scheduler: typing.Union\[diffusers.schedulers.scheduling\_ddim.DDIMScheduler, diffusers.schedulers.scheduling\_ddpm.DDPMScheduler] )

Parameters

* **vqvae** ([AutoencoderKL](https://huggingface.co/docs/diffusers/v0.21.0/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
* **unet** ([UNet2DConditionModel](https://huggingface.co/docs/diffusers/v0.21.0/en/api/models/unet2d-cond#diffusers.UNet2DConditionModel)) — A `UNet2DConditionModel` to denoise the encoded image latents.
* **mel** ([Mel](https://huggingface.co/docs/diffusers/v0.21.0/en/api/pipelines/audio_diffusion#diffusers.Mel)) — Transform audio into a spectrogram.
* **scheduler** ([DDIMScheduler](https://huggingface.co/docs/diffusers/v0.21.0/en/api/schedulers/ddim#diffusers.DDIMScheduler) or [DDPMScheduler](https://huggingface.co/docs/diffusers/v0.21.0/en/api/schedulers/ddpm#diffusers.DDPMScheduler)) — A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of [DDIMScheduler](https://huggingface.co/docs/diffusers/v0.21.0/en/api/schedulers/ddim#diffusers.DDIMScheduler) or [DDPMScheduler](https://huggingface.co/docs/diffusers/v0.21.0/en/api/schedulers/ddpm#diffusers.DDPMScheduler).

Pipeline for audio diffusion.

This model inherits from [DiffusionPipeline](https://huggingface.co/docs/diffusers/v0.21.0/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

**\_\_call\_\_**

[\<source>](https://github.com/huggingface/diffusers/blob/v0.21.0/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py#L70)

( batch\_size: int = 1, audio\_file: str = None, raw\_audio: ndarray = None, slice: int = 0, start\_step: int = 0, steps: int = None, generator: Generator = None, mask\_start\_secs: float = 0, mask\_end\_secs: float = 0, step\_generator: Generator = None, eta: float = 0, noise: Tensor = None, encoding: Tensor = None, return\_dict = True ) → `List[PIL Image]`

Parameters

* **batch\_size** (`int`) — Number of samples to generate.
* **audio\_file** (`str`) — An audio file that must be on disk due to a [Librosa](https://librosa.org/) limitation.
* **raw\_audio** (`np.ndarray`) — The raw audio file as a NumPy array.
* **slice** (`int`) — Slice number of audio to convert.
* **start\_step** (`int`) — Step to start diffusion from.
* **steps** (`int`) — Number of denoising steps (defaults to `50` for DDIM and `1000` for DDPM).
* **generator** (`torch.Generator`) — A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
* **mask\_start\_secs** (`float`) — Number of seconds of audio to mask (not generate) at start.
* **mask\_end\_secs** (`float`) — Number of seconds of audio to mask (not generate) at end.
* **step\_generator** (`torch.Generator`) — A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) used during the denoising steps; defaults to `generator` if not set.
* **eta** (`float`) — Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies to the [DDIMScheduler](https://huggingface.co/docs/diffusers/v0.21.0/en/api/schedulers/ddim#diffusers.DDIMScheduler), and is ignored in other schedulers.
* **noise** (`torch.Tensor`) — A noise tensor of shape `(batch_size, 1, height, width)` or `None`.
* **encoding** (`torch.Tensor`) — A tensor for [UNet2DConditionModel](https://huggingface.co/docs/diffusers/v0.21.0/en/api/models/unet2d-cond#diffusers.UNet2DConditionModel) of shape `(batch_size, seq_length, cross_attention_dim)`.
* **return\_dict** (`bool`) — Whether or not to return a [AudioPipelineOutput](https://huggingface.co/docs/diffusers/v0.21.0/en/api/pipelines/dance_diffusion#diffusers.AudioPipelineOutput), [ImagePipelineOutput](https://huggingface.co/docs/diffusers/v0.21.0/en/api/pipelines/latent_diffusion_uncond#diffusers.ImagePipelineOutput) or a plain tuple.

Returns

`List[PIL Image]`

A list of Mel spectrogram images, together with the sample rate and raw audio (`float`, `List[np.ndarray]`).

The call function to the pipeline for generation.

Examples:

For audio diffusion:

```
import torch
from IPython.display import Audio, display
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256").to(device)

output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
```

For latent audio diffusion:

```
import torch
from IPython.display import Audio, display
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/latent-audio-diffusion-256").to(device)

output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
```

For other tasks like variation, inpainting, outpainting, etc:

```
output = pipe(
    raw_audio=output.audios[0, 0],
    start_step=int(pipe.get_default_steps() / 2),
    mask_start_secs=1,
    mask_end_secs=1,
)
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
```

**encode**

[\<source>](https://github.com/huggingface/diffusers/blob/v0.21.0/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py#L270)

( images: typing.List\[PIL.Image.Image], steps: int = 50 ) → `np.ndarray`

Parameters

* **images** (`List[PIL Image]`) — List of images to encode.
* **steps** (`int`) — Number of encoding steps to perform (defaults to `50`).

Returns

`np.ndarray`

A noise tensor of shape `(batch_size, 1, height, width)`.

Reverse the denoising step process to recover a noisy image from the generated image.
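A minimal sketch of inverting a generated spectrogram back to noise and regenerating from it via the pipeline's `noise` argument (this assumes `pipe` and `output` from the examples above, and a DDIM scheduler, since `encode()` reverses DDIM denoising):

```
import torch

# Reverse the denoising process to recover the noise behind the image.
noise = pipe.encode(output.images, steps=50)

# Feeding the noise back in approximately reproduces the original sample.
regenerated = pipe(noise=torch.as_tensor(noise).to(pipe.device), steps=50)
```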

**get\_default\_steps**

[\<source>](https://github.com/huggingface/diffusers/blob/v0.21.0/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py#L61)

( ) → `int`

Returns

`int`

The number of steps.

Returns default number of steps recommended for inference.

**slerp**

[\<source>](https://github.com/huggingface/diffusers/blob/v0.21.0/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py#L311)

( x0: Tensor, x1: Tensor, alpha: float ) → `torch.Tensor`

Parameters

* **x0** (`torch.Tensor`) — The first tensor to interpolate between.
* **x1** (`torch.Tensor`) — Second tensor to interpolate between.
* **alpha** (`float`) — Interpolation factor between 0 and 1.

Returns

`torch.Tensor`

The interpolated tensor.

Spherical Linear intERPolation.
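A hedged sketch of interpolating between two generated samples: encode each back to noise, slerp between the noise tensors, and denoise the blend (assumes `pipe`, `output_a`, and `output_b` come from earlier pipeline calls as in the examples above):

```
import torch

# Recover the noise tensors behind two generated spectrograms.
noise_a = torch.as_tensor(pipe.encode(output_a.images)).to(pipe.device)
noise_b = torch.as_tensor(pipe.encode(output_b.images)).to(pipe.device)

# Spherically interpolate halfway between them and generate the blend.
blended = pipe.slerp(noise_a, noise_b, alpha=0.5)
output = pipe(noise=blended, steps=50)
```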

### AudioPipelineOutput

#### class diffusers.AudioPipelineOutput

[\<source>](https://github.com/huggingface/diffusers/blob/v0.21.0/src/diffusers/pipelines/pipeline_utils.py#L126)

( audios: ndarray )

Parameters

* **audios** (`np.ndarray`) — List of denoised audio samples as a NumPy array of shape `(batch_size, num_channels, sample_rate)`.

Output class for audio pipelines.

### ImagePipelineOutput

#### class diffusers.ImagePipelineOutput

[\<source>](https://github.com/huggingface/diffusers/blob/v0.21.0/src/diffusers/pipelines/pipeline_utils.py#L112)

( images: typing.Union\[typing.List\[PIL.Image.Image], numpy.ndarray] )

Parameters

* **images** (`List[PIL.Image.Image]` or `np.ndarray`) — List of denoised PIL images of length `batch_size` or NumPy array of shape `(batch_size, height, width, num_channels)`.

Output class for image pipelines.

### Mel

#### class diffusers.Mel

[\<source>](https://github.com/huggingface/diffusers/blob/v0.21.0/src/diffusers/pipelines/audio_diffusion/mel.py#L37)

( x\_res: int = 256, y\_res: int = 256, sample\_rate: int = 22050, n\_fft: int = 2048, hop\_length: int = 512, top\_db: int = 80, n\_iter: int = 32 )

Parameters

* **x\_res** (`int`) — x resolution of spectrogram (time).
* **y\_res** (`int`) — y resolution of spectrogram (frequency bins).
* **sample\_rate** (`int`) — Sample rate of audio.
* **n\_fft** (`int`) — Length of the Fast Fourier Transform window, in samples.
* **hop\_length** (`int`) — Hop length (a higher number is recommended if `y_res` < 256).
* **top\_db** (`int`) — Loudest decibel value.
* **n\_iter** (`int`) — Number of iterations for Griffin-Lim Mel inversion.
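
A minimal sketch of using `Mel` on its own to round-trip audio through a spectrogram image (the audio path is a hypothetical placeholder, and Librosa must be installed):

```
from diffusers import Mel

mel = Mel(x_res=256, y_res=256)

# Load audio from disk (placeholder path) and inspect the slicing.
mel.load_audio(audio_file="clip.wav")
print(mel.get_number_of_slices(), mel.get_sample_rate())

# Convert the first slice to a grayscale spectrogram image,
# then invert it back to audio with Griffin-Lim.
image = mel.audio_slice_to_image(slice=0)
audio = mel.image_to_audio(image)
```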

**audio\_slice\_to\_image**

[\<source>](https://github.com/huggingface/diffusers/blob/v0.21.0/src/diffusers/pipelines/audio_diffusion/mel.py#L143)

( slice: int ) → `PIL Image`

Parameters

* **slice** (`int`) — Slice number of audio to convert (out of `get_number_of_slices()`).

Returns

`PIL Image`

A grayscale image of `x_res x y_res`.

Convert slice of audio to spectrogram.

**get\_audio\_slice**

[\<source>](https://github.com/huggingface/diffusers/blob/v0.21.0/src/diffusers/pipelines/audio_diffusion/mel.py#L121)

( slice: int = 0 ) → `np.ndarray`

Parameters

* **slice** (`int`) — Slice number of audio (out of `get_number_of_slices()`).

Returns

`np.ndarray`

The audio slice as a NumPy array.

Get slice of audio.

**get\_number\_of\_slices**

[\<source>](https://github.com/huggingface/diffusers/blob/v0.21.0/src/diffusers/pipelines/audio_diffusion/mel.py#L112)

( ) → `int`

Returns

`int`

Number of spectrograms the audio can be sliced into.

Get number of slices in audio.

**get\_sample\_rate**

[\<source>](https://github.com/huggingface/diffusers/blob/v0.21.0/src/diffusers/pipelines/audio_diffusion/mel.py#L134)

( ) → `int`

Returns

`int`

Sample rate of audio.

Get sample rate.

**image\_to\_audio**

[\<source>](https://github.com/huggingface/diffusers/blob/v0.21.0/src/diffusers/pipelines/audio_diffusion/mel.py#L162)

( image: Image ) → audio (`np.ndarray`)

Parameters

* **image** (`PIL Image`) — A grayscale image of `x_res x y_res`.

Returns

audio (`np.ndarray`)

The audio as a NumPy array.

Convert a spectrogram image back to audio using Griffin-Lim inversion.

**load\_audio**

[\<source>](https://github.com/huggingface/diffusers/blob/v0.21.0/src/diffusers/pipelines/audio_diffusion/mel.py#L94)

( audio\_file: str = None, raw\_audio: ndarray = None )

Parameters

* **audio\_file** (`str`) — An audio file that must be on disk due to a [Librosa](https://librosa.org/) limitation.
* **raw\_audio** (`np.ndarray`) — The raw audio file as a NumPy array.

Load audio.

**set\_resolution**

[\<source>](https://github.com/huggingface/diffusers/blob/v0.21.0/src/diffusers/pipelines/audio_diffusion/mel.py#L80)

( x\_res: int, y\_res: int )

Parameters

* **x\_res** (`int`) — x resolution of spectrogram (time).
* **y\_res** (`int`) — y resolution of spectrogram (frequency bins).

Set resolution.

