Audio Spectrogram Transformer
The Audio Spectrogram Transformer model was proposed in AST: Audio Spectrogram Transformer by Yuan Gong, Yu-An Chung, James Glass. The Audio Spectrogram Transformer applies a Vision Transformer to audio, by turning audio into an image (spectrogram). The model obtains state-of-the-art results for audio classification.
The abstract from the paper is the following:
In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
Tips:
When fine-tuning the Audio Spectrogram Transformer (AST) on your own dataset, it is recommended to take care of the input normalization (to make sure the input has a mean of 0 and a std of 0.5). ASTFeatureExtractor takes care of this. Note that it uses the AudioSet mean and std by default. You can check how the authors compute the stats for a downstream dataset in their original repository (see the sketch after these tips).
Note that the AST needs a low learning rate (the authors use a learning rate that is 10 times smaller than the one used for their earlier CNN model) and converges quickly, so please search for a suitable learning rate and learning rate scheduler for your task.
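As an illustration of the first tip, here is a minimal sketch (not the authors' exact recipe) of estimating dataset statistics with ASTFeatureExtractor and plugging them back in. It assumes 16 kHz mono waveforms and that torchaudio is installed; the dummy random waveforms only stand in for real training audio.

```python
import numpy as np
from transformers import ASTFeatureExtractor

# Stand-ins for real 16 kHz mono training clips (~10 s each); replace with your own data.
waveforms = [np.random.randn(160000).astype(np.float32) for _ in range(8)]

# First pass: extract un-normalized log-mel features to estimate the dataset statistics.
# For short clips you may want to exclude padded frames from these statistics.
extractor = ASTFeatureExtractor(do_normalize=False)
features = extractor(waveforms, sampling_rate=16000, return_tensors="np")["input_values"]
dataset_mean, dataset_std = float(features.mean()), float(features.std())

# Second pass: use those statistics so normalized inputs have roughly mean 0 and std 0.5.
feature_extractor = ASTFeatureExtractor(mean=dataset_mean, std=dataset_std)
```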
Audio Spectrogram Transformer architecture. Taken from the original paper.
A list of official BOINC AI and community (indicated by 🌎) resources to help you get started with the Audio Spectrogram Transformer.
Audio Classification
If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
ASTConfig ( hidden_size = 768, num_hidden_layers = 12, num_attention_heads = 12, intermediate_size = 3072, hidden_act = 'gelu', hidden_dropout_prob = 0.0, attention_probs_dropout_prob = 0.0, initializer_range = 0.02, layer_norm_eps = 1e-12, patch_size = 16, qkv_bias = True, frequency_stride = 10, time_stride = 10, max_length = 1024, num_mel_bins = 128, **kwargs )
Parameters
hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 3072) — Dimensionality of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-12) — The epsilon used by the layer normalization layers.
patch_size (int, optional, defaults to 16) — The size (resolution) of each patch.
qkv_bias (bool, optional, defaults to True) — Whether to add a bias to the queries, keys and values.
frequency_stride (int, optional, defaults to 10) — Frequency stride to use when patchifying the spectrograms.
time_stride (int, optional, defaults to 10) — Temporal stride to use when patchifying the spectrograms.
max_length (int, optional, defaults to 1024) — Temporal dimension of the spectrograms.
num_mel_bins (int, optional, defaults to 128) — Frequency dimension of the spectrograms (number of Mel-frequency bins).
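The patch-related parameters jointly determine the model's input sequence length. The following is a rough sketch of that arithmetic, under the assumption that patches of size patch_size are extracted with the given frequency and time strides and no padding, and that two special tokens are prepended (as in DeiT-style Vision Transformers):

```python
# Patch-grid arithmetic for the default ASTConfig values (assumed: no padding,
# two prepended special tokens).
patch_size, frequency_stride, time_stride = 16, 10, 10
num_mel_bins, max_length = 128, 1024

frequency_patches = (num_mel_bins - patch_size) // frequency_stride + 1  # 12
time_patches = (max_length - patch_size) // time_stride + 1              # 101
sequence_length = frequency_patches * time_patches + 2                   # 1212 + 2 = 1214

print(frequency_patches, time_patches, sequence_length)
```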
Example:
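A minimal sketch of instantiating an AST model from a (default) configuration:

```python
from transformers import ASTConfig, ASTModel

# Initializing an AST configuration with the default values documented above
configuration = ASTConfig()

# Initializing a model (with random weights) from that configuration
model = ASTModel(configuration)

# Accessing the model configuration
configuration = model.config
```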
ASTFeatureExtractor ( feature_size = 1, sampling_rate = 16000, num_mel_bins = 128, max_length = 1024, padding_value = 0.0, do_normalize = True, mean = -4.2677393, std = 4.5689974, return_attention_mask = False, **kwargs )
Parameters
feature_size (int, optional, defaults to 1) — The feature dimension of the extracted features.
sampling_rate (int, optional, defaults to 16000) — The sampling rate at which the audio files should be digitized, expressed in hertz (Hz).
num_mel_bins (int, optional, defaults to 128) — Number of Mel-frequency bins.
max_length (int, optional, defaults to 1024) — Maximum length to which to pad/truncate the extracted features.
do_normalize (bool, optional, defaults to True) — Whether or not to normalize the log-Mel features using mean and std.
mean (float, optional, defaults to -4.2677393) — The mean value used to normalize the log-Mel features. Uses the AudioSet mean by default.
std (float, optional, defaults to 4.5689974) — The standard deviation value used to normalize the log-Mel features. Uses the AudioSet standard deviation by default.
Constructs an Audio Spectrogram Transformer (AST) feature extractor.
This class extracts mel-filter bank features from raw speech using TorchAudio, pads/truncates them to a fixed length and normalizes them using a mean and standard deviation.
__call__
( raw_speech: typing.Union[numpy.ndarray, typing.List[float], typing.List[numpy.ndarray], typing.List[typing.List[float]]], sampling_rate: typing.Optional[int] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, **kwargs )
Parameters
raw_speech (np.ndarray, List[float], List[np.ndarray], List[List[float]]) — The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float values, a list of numpy arrays or a list of lists of float values. Must be mono channel audio, not stereo, i.e. a single float per timestep.
sampling_rate (int, optional) — The sampling rate at which the raw_speech input was sampled. It is strongly recommended to pass sampling_rate at the forward call to prevent silent errors.
return_tensors (str or TensorType, optional) — If set, will return tensors instead of lists of Python integers. Acceptable values are:
'tf': Return TensorFlow tf.constant objects.
'pt': Return PyTorch torch.Tensor objects.
'np': Return NumPy np.ndarray objects.
Main method to featurize and prepare for the model one or several sequence(s).
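As an illustration of this method, a minimal sketch (assuming a 16 kHz mono waveform and that torchaudio is installed):

```python
import numpy as np
from transformers import ASTFeatureExtractor

feature_extractor = ASTFeatureExtractor()  # uses the AudioSet mean/std by default

# A one-second dummy waveform standing in for real 16 kHz mono audio
waveform = np.random.randn(16000).astype(np.float32)

inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
print(inputs["input_values"].shape)  # (batch_size, max_length, num_mel_bins) = (1, 1024, 128)
```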
ASTModel ( config: ASTConfig )
Parameters
forward
Parameters
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
1 indicates the head is not masked,
0 indicates the head is masked.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
Returns
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
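A minimal sketch of extracting hidden states with the bare model, assuming the publicly available MIT/ast-finetuned-audioset-10-10-0.4593 checkpoint (any AST checkpoint works the same way) and a 16 kHz mono waveform:

```python
import numpy as np
import torch
from transformers import ASTFeatureExtractor, ASTModel

checkpoint = "MIT/ast-finetuned-audioset-10-10-0.4593"  # assumed checkpoint name
feature_extractor = ASTFeatureExtractor.from_pretrained(checkpoint)
model = ASTModel.from_pretrained(checkpoint)

waveform = np.random.randn(16000).astype(np.float32)  # stand-in for real audio
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```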
ASTForAudioClassification ( config: ASTConfig )
Parameters
Audio Spectrogram Transformer model with an audio classification head on top (a linear layer on top of the pooled output) e.g. for datasets like AudioSet, Speech Commands v2.
forward
Parameters
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
1 indicates the head is not masked,
0 indicates the head is masked.
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the audio classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Squared loss); if config.num_labels > 1 a classification loss is computed (Cross-Entropy).
Returns
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
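A minimal sketch of audio classification, again assuming the publicly available MIT/ast-finetuned-audioset-10-10-0.4593 checkpoint (an AudioSet-finetuned AST) and a 16 kHz mono waveform:

```python
import numpy as np
import torch
from transformers import ASTFeatureExtractor, ASTForAudioClassification

checkpoint = "MIT/ast-finetuned-audioset-10-10-0.4593"  # assumed checkpoint name
feature_extractor = ASTFeatureExtractor.from_pretrained(checkpoint)
model = ASTForAudioClassification.from_pretrained(checkpoint)

waveform = np.random.randn(16000).astype(np.float32)  # stand-in for real 16 kHz audio
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = int(logits.argmax(-1))
print(model.config.id2label[predicted_class_id])
```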
The original code for this model can be found in the authors' GitHub repository.
A notebook illustrating inference with AST for audio classification is also available.
ASTForAudioClassification is supported by the example audio classification script and an accompanying notebook.
See also: the audio classification task guide.
This is the configuration class to store the configuration of an ASTModel. It is used to instantiate an AST model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the AST architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
return_attention_mask (bool, optional, defaults to False) — Whether or not __call__() should return attention_mask.
This feature extractor inherits from SequenceFeatureExtractor, which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
config (ASTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare AST Model transformer outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_values: typing.Optional[torch.Tensor] = None, head_mask: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → BaseModelOutputWithPooling or tuple(torch.FloatTensor)
input_values (torch.FloatTensor of shape (batch_size, max_length, num_mel_bins)) — Float values of mel features extracted from the raw audio waveform. A raw audio waveform can be obtained by loading a .flac or .wav audio file into an array of type List[float] or a numpy.ndarray, e.g. via the soundfile library (pip install soundfile). To prepare the array into input_features, the ASTFeatureExtractor should be used for extracting the mel features, padding and conversion into a tensor of type torch.FloatTensor. See ASTFeatureExtractor.__call__().
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns: BaseModelOutputWithPooling or tuple(torch.FloatTensor) — A BaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ASTConfig) and inputs.
The ASTModel forward method overrides the __call__ special method.
config (ASTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_values: typing.Optional[torch.Tensor] = None, head_mask: typing.Optional[torch.Tensor] = None, labels: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → SequenceClassifierOutput or tuple(torch.FloatTensor)
input_values (torch.FloatTensor of shape (batch_size, max_length, num_mel_bins)) — Float values of mel features extracted from the raw audio waveform. A raw audio waveform can be obtained by loading a .flac or .wav audio file into an array of type List[float] or a numpy.ndarray, e.g. via the soundfile library (pip install soundfile). To prepare the array into input_features, the ASTFeatureExtractor should be used for extracting the mel features, padding and conversion into a tensor of type torch.FloatTensor. See ASTFeatureExtractor.__call__().
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns: SequenceClassifierOutput or tuple(torch.FloatTensor) — A SequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ASTConfig) and inputs.
The ASTForAudioClassification forward method overrides the __call__ special method.