Audio Diffusion
Audio Diffusion is by Robert Dargavel Smith, and it leverages the recent advances in image generation from diffusion models by converting audio samples to and from Mel spectrogram images.
The original codebase, training scripts and example notebooks can be found at teticio/audio-diffusion.
Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
AudioDiffusionPipeline
class diffusers.AudioDiffusionPipeline
( vqvae: AutoencoderKL, unet: UNet2DConditionModel, mel: Mel, scheduler: typing.Union[diffusers.schedulers.scheduling_ddim.DDIMScheduler, diffusers.schedulers.scheduling_ddpm.DDPMScheduler] )
Parameters
vqvae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
unet (UNet2DConditionModel) — A UNet2DConditionModel to denoise the encoded image latents.
mel (Mel) — Transform audio into a spectrogram.
scheduler (DDIMScheduler or DDPMScheduler) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler or DDPMScheduler.
Pipeline for audio diffusion.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
__call__
( batch_size: int = 1, audio_file: str = None, raw_audio: ndarray = None, slice: int = 0, start_step: int = 0, steps: int = None, generator: Generator = None, mask_start_secs: float = 0, mask_end_secs: float = 0, step_generator: Generator = None, eta: float = 0, noise: Tensor = None, encoding: Tensor = None, return_dict = True ) → List[PIL Image]
Parameters
batch_size (int) — Number of samples to generate.
audio_file (str) — An audio file that must be on disk due to a Librosa limitation.
raw_audio (np.ndarray) — The raw audio file as a NumPy array.
slice (int) — Slice number of audio to convert.
start_step (int) — Step to start diffusion from.
steps (int) — Number of denoising steps (defaults to 50 for DDIM and 1000 for DDPM).
generator (torch.Generator) — A torch.Generator to make generation deterministic.
mask_start_secs (float) — Number of seconds of audio to mask (not generate) at start.
mask_end_secs (float) — Number of seconds of audio to mask (not generate) at end.
step_generator (torch.Generator) — A torch.Generator used to denoise, or None.
eta (float) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler, and is ignored in other schedulers.
noise (torch.Tensor) — A noise tensor of shape (batch_size, 1, height, width) or None.
encoding (torch.Tensor) — A tensor for UNet2DConditionModel of shape (batch_size, seq_length, cross_attention_dim).
return_dict (bool) — Whether or not to return an AudioPipelineOutput, ImagePipelineOutput or a plain tuple.
Returns
List[PIL Image]
A list of Mel spectrograms (float, List[np.ndarray]) with the sample rate and raw audio.
The call function to the pipeline for generation.
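As a rough illustration of how mask_start_secs and mask_end_secs might translate into spectrogram columns, the sketch below assumes that each spectrogram column covers hop_length audio samples, so one second of audio spans sample_rate / hop_length columns. The helper mask_columns is hypothetical and not part of the library; its defaults are borrowed from the Mel class documented on this page.

```python
def mask_columns(mask_start_secs, mask_end_secs,
                 sample_rate=22050, hop_length=512, x_res=256):
    """Return (start, stop): columns [start, stop) would be generated,
    columns outside that range would be kept from the input audio.

    Assumption: one spectrogram column per hop_length samples.
    """
    columns_per_second = sample_rate / hop_length
    # Truncate to whole columns and never exceed the image width.
    start = min(int(mask_start_secs * columns_per_second), x_res)
    end = min(int(mask_end_secs * columns_per_second), x_res)
    return start, max(start, x_res - end)


# Keep the first second and the last two seconds of a 256-column slice.
start, stop = mask_columns(1.0, 2.0)
print(start, stop)
```

With no masking (both values 0), the whole width of the spectrogram falls inside the generated range.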
Examples:
For audio diffusion:
For latent audio diffusion:
For other tasks like variation, inpainting, outpainting, etc:
encode
( images: typing.List[PIL.Image.Image], steps: int = 50 ) → np.ndarray
Parameters
images (List[PIL Image]) — List of images to encode.
steps (int) — Number of encoding steps to perform (defaults to 50).
Returns
np.ndarray
A noise tensor of shape (batch_size, 1, height, width).
Reverse the denoising step process to recover a noisy image from the generated image.
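To make the idea of reversing a denoising step concrete, here is a minimal sketch of a deterministic DDIM update (eta = 0) and its algebraic inverse. It is an illustration under simplifying assumptions, not the pipeline's code: scalars stand in for image tensors, alpha_t and alpha_prev are assumed cumulative alpha products at two neighbouring timesteps, and a fixed eps stands in for the UNet's noise prediction. The function names are hypothetical.

```python
import math


def ddim_step(x_t, eps, alpha_t, alpha_prev):
    """Deterministic DDIM update from a noisier sample to a cleaner one."""
    pred_x0 = (x_t - math.sqrt(1 - alpha_t) * eps) / math.sqrt(alpha_t)
    return math.sqrt(alpha_prev) * pred_x0 + math.sqrt(1 - alpha_prev) * eps


def ddim_inverse_step(x_prev, eps, alpha_t, alpha_prev):
    """The same update run backwards, from a cleaner sample to a noisier one."""
    pred_x0 = (x_prev - math.sqrt(1 - alpha_prev) * eps) / math.sqrt(alpha_prev)
    return math.sqrt(alpha_t) * pred_x0 + math.sqrt(1 - alpha_t) * eps


x_t, eps = 0.8, -0.3              # toy sample and toy noise prediction
alpha_t, alpha_prev = 0.5, 0.9    # toy cumulative alpha products
x_prev = ddim_step(x_t, eps, alpha_t, alpha_prev)
recovered = ddim_inverse_step(x_prev, eps, alpha_t, alpha_prev)
print(abs(recovered - x_t) < 1e-9)
```

In this toy setting the round trip is exact because the same eps is used in both directions. With a real model the noise prediction is recomputed at each step, so the recovered noise is only an approximation.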
get_default_steps
( ) → int
Returns
int
The number of steps.
Returns default number of steps recommended for inference.
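The defaults quoted for the steps parameter (50 for DDIM, 1000 for DDPM) suggest a simple dispatch on the scheduler type. The sketch below shows that idea with hypothetical stand-in classes; it does not import or reproduce the diffusers schedulers.

```python
class FakeDDIMScheduler:
    """Hypothetical stand-in for a DDIM scheduler."""


class FakeDDPMScheduler:
    """Hypothetical stand-in for a DDPM scheduler."""


def default_steps(scheduler):
    """Return 50 for a DDIM-style scheduler and 1000 otherwise."""
    return 50 if isinstance(scheduler, FakeDDIMScheduler) else 1000


print(default_steps(FakeDDIMScheduler()), default_steps(FakeDDPMScheduler()))
```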
slerp
( x0: Tensor, x1: Tensor, alpha: float ) → torch.Tensor
Parameters
x0 (torch.Tensor) — The first tensor to interpolate between.
x1 (torch.Tensor) — The second tensor to interpolate between.
alpha (float) — Interpolation factor between 0 and 1.
Returns
torch.Tensor
The interpolated tensor.
Spherical Linear intERPolation.
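For readers unfamiliar with spherical linear interpolation, the following is a minimal pure-Python sketch of the standard formula, operating on flat lists of floats rather than torch tensors. It is not the library's implementation, and the fallback for (anti)parallel inputs is an added assumption to avoid dividing by zero.

```python
import math


def slerp(x0, x1, alpha):
    """Spherical linear interpolation between two flat vectors."""
    dot = sum(a * b for a, b in zip(x0, x1))
    norm0 = math.sqrt(sum(a * a for a in x0))
    norm1 = math.sqrt(sum(b * b for b in x1))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    theta = math.acos(cos_theta)
    if math.sin(theta) < 1e-8:
        # sin(theta) is ~0 for parallel or antiparallel vectors;
        # fall back to linear interpolation to avoid dividing by zero.
        return [(1 - alpha) * a + alpha * b for a, b in zip(x0, x1)]
    w0 = math.sin((1 - alpha) * theta) / math.sin(theta)
    w1 = math.sin(alpha * theta) / math.sin(theta)
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]


# Halfway between two orthogonal unit vectors lies on the unit circle.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(midpoint)
```

Unlike plain linear interpolation, the midpoint of two unit vectors keeps unit length, which is why slerp is commonly preferred when blending Gaussian noise tensors.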
AudioPipelineOutput
class diffusers.AudioPipelineOutput
( audios: ndarray )
Parameters
audios (np.ndarray) — Denoised audio samples as a NumPy array of shape (batch_size, num_channels, sample_rate).
Output class for audio pipelines.
ImagePipelineOutput
class diffusers.ImagePipelineOutput
( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )
Parameters
images (List[PIL.Image.Image] or np.ndarray) — List of denoised PIL images of length batch_size or NumPy array of shape (batch_size, height, width, num_channels).
Output class for image pipelines.
Mel
class diffusers.Mel
( x_res: int = 256, y_res: int = 256, sample_rate: int = 22050, n_fft: int = 2048, hop_length: int = 512, top_db: int = 80, n_iter: int = 32 )
Parameters
x_res (int) — x resolution of spectrogram (time).
y_res (int) — y resolution of spectrogram (frequency bins).
sample_rate (int) — Sample rate of audio.
n_fft (int) — Length of the FFT window, in samples.
hop_length (int) — Hop length (a higher number is recommended if y_res < 256).
top_db (int) — Loudest decibel value.
n_iter (int) — Number of iterations for Griffin-Lim Mel inversion.
audio_slice_to_image
( slice: int ) → PIL Image
Parameters
slice (int) — Slice number of audio to convert (out of get_number_of_slices()).
Returns
PIL Image
A grayscale image of x_res x y_res.
Convert slice of audio to spectrogram.
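One plausible way to store a log-scaled spectrogram as a grayscale image is to map decibel values linearly onto 8-bit pixel intensities. The sketch below assumes log-mel values lie in the range [-top_db, 0] dB, with top_db defaulting to 80 as in the constructor above. The helper names db_to_pixel and pixel_to_db are hypothetical, and the exact scaling used by the library may differ.

```python
def db_to_pixel(db, top_db=80):
    """Map a value in [-top_db, 0] dB to an integer in [0, 255]."""
    scaled = (db + top_db) * 255 / top_db
    # Clip to the valid range, then round to the nearest integer.
    return int(min(max(scaled, 0), 255) + 0.5)


def pixel_to_db(pixel, top_db=80):
    """Approximate inverse of db_to_pixel (exact up to quantisation)."""
    return pixel * top_db / 255 - top_db


for db in (0, -20, -80, -120):
    pixel = db_to_pixel(db)
    print(db, pixel, round(pixel_to_db(pixel), 3))
```

Values quieter than -top_db are clipped to black, so the conversion is lossy at the quiet end in addition to the 8-bit quantisation.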
get_audio_slice
( slice: int = 0 ) → np.ndarray
Parameters
slice (int) — Slice number of audio (out of get_number_of_slices()).
Returns
np.ndarray
The audio slice as a NumPy array.
Get slice of audio.
get_number_of_slices
( ) → int
Returns
int
Number of spectrograms the audio can be sliced into.
Get number of slices in audio.
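To get a feel for how much audio one slice covers, the sketch below assumes a slice spans roughly x_res * hop_length samples (one spectrogram column per hop). The helpers slice_seconds and number_of_slices are hypothetical illustrations using the constructor defaults; the library's exact slice size may differ slightly.

```python
def slice_seconds(x_res=256, hop_length=512, sample_rate=22050):
    """Approximate duration in seconds of one spectrogram slice."""
    return x_res * hop_length / sample_rate


def number_of_slices(num_samples, x_res=256, hop_length=512):
    """Approximate count of whole slices in num_samples audio samples."""
    return num_samples // (x_res * hop_length)


one_minute = 60 * 22050  # samples in one minute at the default sample rate
print(round(slice_seconds(), 2), number_of_slices(one_minute))
```

Under these assumptions, a 256-column slice at the default settings covers a little under six seconds of audio.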
get_sample_rate
( ) → int
Returns
int
Sample rate of audio.
Get sample rate.
image_to_audio
( image: Image ) → audio (np.ndarray)
Parameters
image (PIL Image) — A grayscale image of x_res x y_res.
Returns
audio (np.ndarray)
The audio as a NumPy array.
Converts spectrogram to audio.
load_audio
( audio_file: str = None, raw_audio: ndarray = None )
Parameters
audio_file (str) — An audio file that must be on disk due to a Librosa limitation.
raw_audio (np.ndarray) — The raw audio file as a NumPy array.
Load audio.
set_resolution
( x_res: int, y_res: int )
Parameters
x_res (int) — x resolution of spectrogram (time).
y_res (int) — y resolution of spectrogram (frequency bins).
Set resolution.