Automatic speech recognition
Automatic speech recognition (ASR) converts a speech signal to text, mapping a sequence of audio inputs to text outputs. Virtual assistants like Siri and Alexa use ASR models to help users every day, and there are many other useful user-facing applications like live captioning and note-taking during meetings.
This guide will show you how to:
Finetune Wav2Vec2 on the MInDS-14 dataset to transcribe audio to text.
Use your finetuned model for inference.
The task illustrated in this tutorial is supported by the following model architectures:
Data2VecAudio, Hubert, M-CTC-T, SEW, SEW-D, UniSpeech, UniSpeechSat, Wav2Vec2, Wav2Vec2-Conformer, WavLM
Before you begin, make sure you have all the necessary libraries installed:
pip install transformers datasets evaluate jiwer
We encourage you to log in to your BOINC AI account so you can upload and share your model with the community. When prompted, enter your token to log in:
>>> from boincai_hub import notebook_login
>>> notebook_login()
Load MInDS-14 dataset
Start by loading a smaller subset of the MInDS-14 dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.
Split the dataset's train split into a train and test set with the Dataset.train_test_split method:
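As a rough illustration of what a train/test split does, the following pure-Python sketch shuffles a list of examples with a fixed seed and holds out a fraction of them for testing. The function name `split_examples` and the toy rows are hypothetical and are not part of the 🤗 Datasets API; on a real dataset you would call its `train_test_split` method instead.

```python
import random

def split_examples(examples, test_size=0.2, seed=42):
    """Shuffle examples and hold out a fraction as a test set (illustrative only)."""
    shuffled = list(examples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_size))
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}

# Toy stand-ins for dataset rows (hypothetical data)
rows = [{"id": i, "transcription": f"utterance {i}"} for i in range(10)]
splits = split_examples(rows, test_size=0.2)
print(len(splits["train"]), len(splits["test"]))
```

Every row ends up in exactly one of the two splits, which is the property you rely on when evaluating on held-out data.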
Then take a look at the dataset:
While the dataset contains a lot of useful information, like lang_id and english_transcription, you'll focus on the audio and transcription in this guide. Remove the other columns with the remove_columns method:
Take a look at the example again:
There are two fields:
audio: a 1-dimensional array of the speech signal that must be called to load and resample the audio file.
transcription: the target text.
Preprocess
The next step is to load a Wav2Vec2 processor to process the audio signal:
The MInDS-14 dataset has a sampling rate of 8000 Hz (8 kHz; you can find this information in its dataset card), which means you'll need to resample the dataset to 16000 Hz (16 kHz) to use the pretrained Wav2Vec2 model:
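To make the idea of resampling concrete, here is a minimal pure-Python sketch that upsamples a signal from 8 kHz to 16 kHz by linear interpolation. The helper `resample_linear` is hypothetical and deliberately naive: real resamplers also apply anti-aliasing filters, and in practice you would let your audio library do this for you.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Resample a 1-D signal by linear interpolation (illustrative only)."""
    if not samples:
        return []
    n_out = int(len(samples) * target_rate / orig_rate)
    out = []
    for i in range(n_out):
        pos = i * orig_rate / target_rate   # fractional position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)  # clamp at the final sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

audio_8k = [0.0, 1.0, 0.0, -1.0]            # toy signal sampled at 8 kHz
audio_16k = resample_linear(audio_8k, 8_000, 16_000)
print(len(audio_8k), len(audio_16k))
```

Doubling the sampling rate doubles the number of samples while the duration of the audio stays the same.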
As you can see in the transcription above, the text contains a mix of uppercase and lowercase characters. The Wav2Vec2 tokenizer is only trained on uppercase characters, so you'll need to make sure the text matches the tokenizer's vocabulary:
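A minimal sketch of such a normalization step might look like the following. The function name `uppercase` and the sample sentence are assumptions for illustration; a function of this shape is the kind of thing you would typically apply to every row of a dataset.

```python
def uppercase(example):
    """Return a copy of the example with its transcription in uppercase."""
    return {**example, "transcription": example["transcription"].upper()}

# Hypothetical row with mixed-case text
sample = {"transcription": "hi I'm trying to use the banking app"}
print(uppercase(sample)["transcription"])
```

The original row is left unchanged, and any other fields it carries are passed through as-is.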
Now create a preprocessing function that:
Calls the audio column to load and resample the audio file.
Extracts the input_values from the audio file and tokenizes the transcription column with the processor.
To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. You can speed up map by increasing the number of processes with the num_proc parameter. Remove the columns you don't need with the remove_columns method:
🤗 Transformers doesn't have a data collator for ASR, so you'll need to adapt the DataCollatorWithPadding to create a batch of examples. It'll also dynamically pad your text and labels to the length of the longest element in its batch (instead of the entire dataset) so they are a uniform length. While it is possible to pad your text in the tokenizer function by setting padding=True, dynamic padding is more efficient.
Unlike other data collators, this specific data collator needs to apply a different padding method to input_values and labels:
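The idea of padding the two fields differently can be sketched in plain Python as follows. This is not the collator used in this guide: `pad_batch` is a hypothetical helper working on lists rather than tensors, and the padding values (0.0 for audio, -100 for labels) are assumptions based on the common convention that label positions set to -100 are ignored when computing the loss.

```python
def pad_batch(features, input_pad_value=0.0, label_pad_value=-100):
    """Pad input_values and labels separately to the longest item in the batch."""
    max_input_len = max(len(f["input_values"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)
    batch = {"input_values": [], "labels": []}
    for f in features:
        n_in = max_input_len - len(f["input_values"])
        n_lab = max_label_len - len(f["labels"])
        # Audio is padded with silence-like zeros
        batch["input_values"].append(list(f["input_values"]) + [input_pad_value] * n_in)
        # Labels are padded with a sentinel (assumed -100) meant to be ignored by the loss
        batch["labels"].append(list(f["labels"]) + [label_pad_value] * n_lab)
    return batch

# Toy batch: two examples with different audio and label lengths
batch = pad_batch([
    {"input_values": [0.1, 0.2, 0.3], "labels": [5, 6]},
    {"input_values": [0.4], "labels": [7, 8, 9]},
])
print(batch)
```

Because the lengths are computed per batch, each batch is only as long as its own longest example, which is what makes dynamic padding cheaper than padding to the longest example in the whole dataset.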
Now instantiate your DataCollatorForCTCWithPadding:
Evaluate
Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the word error rate (WER) metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):
Then create a function that passes your predictions and labels to compute to calculate the WER:
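If you'd like to see what the WER measures, the sketch below computes it in plain Python as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words, summed over all examples. The function `word_error_rate` is a hypothetical illustration, not the implementation used by the 🤗 Evaluate library.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits, total_words = 0, 0
    for ref, pred in zip(references, predictions):
        ref_words, pred_words = ref.split(), pred.split()
        # Levenshtein distance over words, keeping one DP row at a time
        prev = list(range(len(pred_words) + 1))
        for i, r in enumerate(ref_words, start=1):
            curr = [i]
            for j, p in enumerate(pred_words, start=1):
                cost = 0 if r == p else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution or match
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references must contain at least one word")
    return total_edits / total_words

# One substitution (SAT -> SIT) and one deletion (THE) over six reference words
wer = word_error_rate(["THE CAT SAT ON THE MAT"], ["THE CAT SIT ON MAT"])
print(wer)
```

A lower WER is better: 0.0 means every word matched, and values can exceed 1.0 when the prediction contains many extra words.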
Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.
Train
If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial here!
You're ready to start training your model now! Load Wav2Vec2 with AutoModelForCTC. Specify the reduction to apply with the ctc_loss_reduction parameter. It is often better to use the average instead of the default summation:
At this point, only three steps remain:
Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to BOINC AI to upload your model). At the end of each epoch, the Trainer will evaluate the WER and save the training checkpoint.
Pass the training arguments to Trainer along with the model, dataset, tokenizer, data collator, and compute_metrics function.
Call train() to finetune your model.
Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:
For a more in-depth example of how to finetune a model for automatic speech recognition, take a look at this blog post for English ASR and this post for multilingual ASR.
Inference
Great, now that you've finetuned a model, you can use it for inference!
Load an audio file you'd like to run inference on. Remember to resample the audio file to match the sampling rate of the model if you need to!
The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for automatic speech recognition with your model, and pass your audio file to it:
The transcription is decent, but it could be better! Try finetuning your model on more examples to get even better results!
You can also manually replicate the results of the pipeline if you'd like:
Load a processor to preprocess the audio file and transcription and return the input as PyTorch tensors:
Pass your inputs to the model and return the logits:
Get the predicted input_ids with the highest probability, and use the processor to decode the predicted input_ids back into text:
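Conceptually, this decoding step can be sketched in plain Python as greedy CTC decoding: take the highest-scoring token at each time step, collapse consecutive repeats, and drop the blank token. The function `greedy_ctc_decode`, the three-token vocabulary, and the choice of index 0 as the blank are all assumptions for illustration; in practice the processor handles decoding against the model's real vocabulary.

```python
def greedy_ctc_decode(logits, vocab, blank_id=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    # Pick the highest-scoring token id for every time step
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    decoded, previous = [], None
    for token_id in best_ids:
        # Emit a token only when it differs from the previous frame and is not blank
        if token_id != previous and token_id != blank_id:
            decoded.append(vocab[token_id])
        previous = token_id
    return "".join(decoded)

vocab = ["<pad>", "H", "I"]   # hypothetical vocabulary; index 0 acts as the CTC blank
logits = [
    [0.1, 2.0, 0.3],  # H
    [0.2, 1.5, 0.1],  # H again (repeat, collapsed)
    [3.0, 0.1, 0.2],  # blank
    [0.1, 0.2, 2.5],  # I
    [0.3, 0.1, 1.9],  # I again (repeat, collapsed)
]
print(greedy_ctc_decode(logits, vocab))
```

The blank token is what lets CTC represent genuinely doubled letters: two identical tokens separated by a blank are kept as two characters, while two adjacent identical tokens are merged into one.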