Audio classification
Audio classification, just like text classification, assigns a class label to the input data. The only difference is that instead of text inputs, you have raw audio waveforms. Some practical applications of audio classification include identifying speaker intent, classifying languages, and even identifying animal species by their sounds.
This guide will show you how to:
Finetune Wav2Vec2 on the MInDS-14 dataset to classify speaker intent.
Use your finetuned model for inference.
The task illustrated in this tutorial is supported by the following model architectures:
Audio Spectrogram Transformer, Data2VecAudio, Hubert, SEW, SEW-D, UniSpeech, UniSpeechSat, Wav2Vec2, Wav2Vec2-Conformer, WavLM, Whisper
Before you begin, make sure you have all the necessary libraries installed:
We encourage you to log in to your BOINC AI account so you can upload and share your model with the community. When prompted, enter your token to log in:
Load MInDS-14 dataset
Start by loading the MInDS-14 dataset from the 🌍 Datasets library:
Split the dataset’s train split into a smaller train and test set with the train_test_split method. This’ll give you a chance to experiment and make sure everything works before spending more time on the full dataset.
Then take a look at the dataset:
While the dataset contains a lot of useful information, like lang_id and english_transcription, you’ll focus on the audio and intent_class in this guide. Remove the other columns with the remove_columns method:
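The loading, splitting, and column-removal steps above could be sketched together as follows. The dataset identifier `PolyAI/minds14`, the `en-US` configuration, the 20% test size, and the `path` and `transcription` column names are assumptions that this guide does not state, so check the dataset card before relying on them. The import sits inside the function so the definition loads even where 🌍 Datasets is not installed.

```python
def load_small_minds14(test_size=0.2):
    """Load MInDS-14, carve out a small test split, keep audio + intent_class."""
    # Lazy import: needs the Datasets library and network access when called.
    from datasets import load_dataset

    # Assumed dataset id and configuration name (not given in this guide).
    minds = load_dataset("PolyAI/minds14", name="en-US", split="train")

    # train_test_split returns a DatasetDict with "train" and "test" keys.
    minds = minds.train_test_split(test_size=test_size)

    # Assumed names of the unused columns; the text only mentions
    # lang_id and english_transcription.
    unused = ["path", "transcription", "english_transcription", "lang_id"]
    return minds.remove_columns(unused)
```

Calling `load_small_minds14()` would then give you a train and a test split that each contain only the audio and intent_class columns.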
Take a look at an example now:
There are two fields:
audio: a 1-dimensional array of the speech signal that must be called to load and resample the audio file.
intent_class: represents the class id of the speaker’s intent.
To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name to an integer and vice versa:
Now you can convert the label id to a label name:
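As a minimal sketch of the two mappings and the lookup, assume a short, hypothetical list of class names; in a real run the names would come from the dataset's intent_class feature rather than being typed by hand. Storing the ids as strings is also a choice of this sketch.

```python
# Hypothetical stand-in for the dataset's class names.
labels = ["abroad", "address", "app_error"]

# Name -> id and id -> name, with ids stored as strings.
label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}

print(id2label["2"])  # app_error
```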
Preprocess
The next step is to load a Wav2Vec2 feature extractor to process the audio signal:
The MInDS-14 dataset has a sampling rate of 8000 Hz (you can find this information in its dataset card), which means you’ll need to resample the dataset to 16000 Hz to use the pretrained Wav2Vec2 model:
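One way to do the resampling, sketched under the assumption that the dataset is a 🌍 Datasets object with an audio column, is to cast that column to a new sampling rate. The helper name `resample_audio_column` is hypothetical.

```python
TARGET_SAMPLING_RATE = 16_000  # Hz, the rate the pretrained model expects


def resample_audio_column(dataset, sampling_rate=TARGET_SAMPLING_RATE):
    """Return the dataset with its audio column decoded at sampling_rate."""
    # Lazy import: needs the Datasets library when called.
    from datasets import Audio

    # cast_column does not rewrite the files; the audio is resampled
    # on the fly whenever an example is accessed.
    return dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
```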
Now create a preprocessing function that:
Calls the audio column to load, and if necessary, resample the audio file.
Checks if the sampling rate of the audio file matches the sampling rate of the audio data a model was pretrained with. You can find this information in the Wav2Vec2 model card.
Sets a maximum input length to batch longer inputs without truncating them.
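The three steps above might look like the sketch below. It assumes each decoded audio example is a dictionary with `array` and `sampling_rate` keys and that the feature extractor exposes a `sampling_rate` attribute, as the Wav2Vec2 feature extractor does. The 16000-sample `max_length` and `truncation=True` are assumptions of this sketch; note that `truncation=True` does cut clips longer than `max_length`, so drop it if you want to keep clips whole.

```python
def preprocess_function(examples, feature_extractor, max_length=16_000):
    """Turn a batch of decoded audio examples into model inputs."""
    # Accessing the audio column decodes (and, if needed, resamples) each file.
    audio_arrays = [audio["array"] for audio in examples["audio"]]

    # Check that every clip matches the rate the extractor was built for.
    rates = {audio["sampling_rate"] for audio in examples["audio"]}
    if rates != {feature_extractor.sampling_rate}:
        raise ValueError(
            f"Expected {feature_extractor.sampling_rate} Hz audio, got {sorted(rates)}"
        )

    return feature_extractor(
        audio_arrays,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=max_length,
        truncation=True,  # assumption: clips longer than max_length are cut
    )
```

Because this version takes the feature extractor as a second argument, you would bind it before mapping, for example with `functools.partial(preprocess_function, feature_extractor=feature_extractor)`.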
To apply the preprocessing function over the entire dataset, use the 🌍 Datasets map function. You can speed up map by setting batched=True to process multiple elements of the dataset at once. Remove the columns you don’t need, and rename intent_class to label because that’s the name the model expects:
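A minimal sketch of this step, assuming `dataset` is a 🌍 Datasets object and `preprocess_fn` is any function that takes a batch of examples; the helper name `encode_dataset` is hypothetical.

```python
def encode_dataset(dataset, preprocess_fn):
    """Preprocess in batches, drop the raw audio, and rename the label column."""
    # batched=True hands preprocess_fn many examples per call.
    encoded = dataset.map(preprocess_fn, remove_columns="audio", batched=True)

    # The model expects the target column to be called "label".
    return encoded.rename_column("intent_class", "label")
```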
Evaluate
Including a metric during training is often helpful for evaluating your model’s performance. You can quickly load an evaluation method with the 🌍 Evaluate library. For this task, load the accuracy metric (see the 🌍 Evaluate quick tour to learn more about how to load and compute a metric):
Then create a function that passes your predictions and labels to compute to calculate the accuracy:
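To show the shape of such a function without downloading anything, the sketch below computes accuracy by hand as a stand-in for the 🌍 Evaluate metric; in a real run you would pass the predictions and labels to the loaded metric's compute instead. It assumes `eval_pred` unpacks into a pair of per-example logits and integer labels, and it works on nested lists as well as arrays.

```python
def compute_metrics(eval_pred):
    """Return the accuracy for a (logits, labels) pair as a dictionary."""
    logits, labels = eval_pred

    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]

    # Fraction of predictions that match the reference labels.
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}
```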
Your compute_metrics function is ready to go now, and you’ll return to it when you set up your training.
Train
If you aren’t familiar with finetuning a model with the Trainer, take a look at the basic tutorial here!
You’re ready to start training your model now! Load Wav2Vec2 with AutoModelForAudioClassification along with the number of expected labels, and the label mappings:
At this point, only three steps remain:
Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. You’ll push this model to the Hub by setting push_to_hub=True (you need to be signed in to BOINC AI to upload your model). At the end of each epoch, the Trainer will evaluate the accuracy and save the training checkpoint.
Pass the training arguments to Trainer along with the model, dataset, tokenizer, data collator, and compute_metrics function.
Call train() to finetune your model.
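Loading the model and wiring up the Trainer could be sketched as below. The checkpoint name `facebook/wav2vec2-base`, the output directory, and every hyperparameter value are illustrative assumptions, not settings taken from this guide. The sketch also assumes a recent Transformers release, where the arguments are named `eval_strategy` and `processing_class`; older releases used `evaluation_strategy` and `tokenizer` instead. The feature extractor is passed in the slot the text calls the tokenizer.

```python
def build_trainer(encoded, feature_extractor, compute_metrics, label2id, id2label,
                  checkpoint="facebook/wav2vec2-base",   # assumed checkpoint
                  output_dir="my_awesome_mind_model"):   # hypothetical name
    """Load the model with its label mappings and return a configured Trainer."""
    # Lazy import: needs the Transformers library (and PyTorch) when called.
    from transformers import AutoModelForAudioClassification, Trainer, TrainingArguments

    model = AutoModelForAudioClassification.from_pretrained(
        checkpoint, num_labels=len(id2label), label2id=label2id, id2label=id2label
    )

    training_args = TrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",            # evaluate at the end of each epoch
        save_strategy="epoch",            # save a checkpoint at the same cadence
        learning_rate=3e-5,               # illustrative value
        per_device_train_batch_size=32,   # illustrative value
        num_train_epochs=10,              # illustrative value
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        push_to_hub=True,                 # requires being signed in
    )

    return Trainer(
        model=model,
        args=training_args,
        train_dataset=encoded["train"],
        eval_dataset=encoded["test"],
        processing_class=feature_extractor,
        compute_metrics=compute_metrics,
    )
```

With the trainer built, `trainer.train()` starts finetuning.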
Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:
For a more in-depth example of how to finetune a model for audio classification, take a look at the corresponding PyTorch notebook.
Inference
Great, now that you’ve finetuned a model, you can use it for inference!
Load an audio file you’d like to run inference on. Remember to resample the audio file to match the sampling rate of the model if you need to!
The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for audio classification with your model, and pass your audio file to it:
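A minimal sketch of the pipeline route, where `model_id` stands for wherever your finetuned model lives (a Hub repository name or a local directory) and `audio_file` is a path to an audio file; both are placeholders you supply, and the helper name is hypothetical.

```python
def classify_with_pipeline(audio_file, model_id):
    """Classify one audio file with a finetuned audio-classification model."""
    # Lazy import: needs the Transformers library when called.
    from transformers import pipeline

    classifier = pipeline("audio-classification", model=model_id)

    # Returns a list of {"score": ..., "label": ...} dicts, best match first.
    return classifier(audio_file)
```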
You can also manually replicate the results of the pipeline if you’d like:
Load a feature extractor to preprocess the audio file and return the input as PyTorch tensors:
Pass your inputs to the model and return the logits:
Get the class with the highest probability, and use the model’s id2label mapping to convert it to a label:
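The manual steps above (preprocess, forward pass, pick the best class) can be sketched as one function. `model_id` is again a placeholder for your finetuned model, and both helper names are hypothetical. The sketch assumes the audio has already been decoded to a 1-dimensional array at the model's sampling rate.

```python
def top_label(logits, id2label):
    """Map a flat sequence of class scores to the name of the best class."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best]


def classify_manually(audio_array, sampling_rate, model_id):
    """Replicate the pipeline by hand for a single decoded audio clip."""
    # Lazy imports: need PyTorch and the Transformers library when called.
    import torch
    from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

    feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
    model = AutoModelForAudioClassification.from_pretrained(model_id)

    # Preprocess the clip and return PyTorch tensors.
    inputs = feature_extractor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")

    # Forward pass without tracking gradients.
    with torch.no_grad():
        logits = model(**inputs).logits

    # logits has shape (1, num_labels); the largest logit is also the most
    # probable class because softmax preserves ordering.
    return top_label(logits[0].tolist(), model.config.id2label)
```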