Create an audio dataset
You can share a dataset with your team or with anyone in the community by creating a dataset repository on the BOINC AI Hub.
There are several methods for creating and sharing an audio dataset:
Create an audio dataset from local files in Python with push_to_hub(). This is an easy way that requires only a few steps in Python.
Create an audio dataset repository with the AudioFolder builder. This is a no-code solution for quickly creating an audio dataset with several thousand audio files.
Create an audio dataset by writing a loading script. This method is for advanced users and requires more effort and coding, but you have greater flexibility over how a dataset is defined, downloaded, and generated, which can be useful for more complex or large-scale audio datasets.
You can control access to your dataset by requiring users to share their contact information first. Check out the Hub documentation for more information about how to enable this feature.
You can load your own dataset using the paths to your audio files. Use the cast_column() function to take a column of audio file paths and cast it to the Audio feature.
Uploading the dataset to the Hub with push_to_hub() then creates a dataset repository containing your audio dataset.
AudioFolder is a dataset builder designed to quickly load an audio dataset with several thousand audio files without requiring you to write any code. Any additional information about your dataset, such as transcription, speaker accent, or speaker intent, is automatically loaded by AudioFolder as long as you include this information in a metadata file (metadata.csv/metadata.jsonl).
Create a dataset repository on the BOINC AI Hub and upload your dataset directory following the AudioFolder structure, which consists of a metadata file and a folder containing the audio files. This guide calls that folder data, but it can have any name you want.
It can be helpful to store your metadata as a jsonl file if the data columns contain a more complex format (like a list of floats) to avoid parsing errors or reading complex values as strings.
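The difference can be seen with the standard library alone. In this sketch the file name and the pitch_contour column are hypothetical, chosen only to illustrate a list-valued field:

```python
import csv
import io
import json

# Hypothetical metadata row with a list-valued column.
row = {"file_name": "first_audio_file.mp3", "pitch_contour": [0.12, 0.5, 0.33]}

# JSON Lines: one JSON object per line, so the list survives a round trip.
jsonl_line = json.dumps(row)
from_jsonl = json.loads(jsonl_line)

# CSV: every cell is text, so the list comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

print(type(from_jsonl["pitch_contour"]).__name__)  # list
print(type(from_csv["pitch_contour"]).__name__)    # str
```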
The metadata file should include a file_name column to link an audio file to its metadata.
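As a rough sketch, such a metadata file could be written with Python's csv module. The file names, the transcription column, and the transcriptions themselves are made up for illustration, and empty placeholder files stand in for real recordings:

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical file names and transcriptions, for illustration only.
rows = [
    {"file_name": "first_audio_file.mp3", "transcription": "hello world"},
    {"file_name": "second_audio_file.mp3", "transcription": "good morning"},
]

dataset_dir = Path(tempfile.mkdtemp()) / "data"
dataset_dir.mkdir()

# Empty placeholder files stand in for real recordings.
for row in rows:
    (dataset_dir / row["file_name"]).touch()

# metadata.csv sits next to the audio files, so file_name is just the filename.
with open(dataset_dir / "metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["file_name", "transcription"])
    writer.writeheader()
    writer.writerows(rows)

print(sorted(p.name for p in dataset_dir.iterdir()))
```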
Then store the metadata file in your dataset directory together with the audio files it describes.
You can also use audiofolder to load datasets involving multiple splits. To do so, organize your dataset directory into one subdirectory per split.
Note that if the audio files are not located right next to the metadata file, the file_name column should contain the full relative path to each audio file, not just its filename.
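One way to compute such relative paths is with pathlib. The layout below (a metadata file at the top of the dataset directory and recordings in nested folders) is a hypothetical example:

```python
from pathlib import Path

# Hypothetical layout: metadata.csv at the dataset root, recordings in nested folders.
dataset_root = Path("my_dataset")
metadata_file = dataset_root / "metadata.csv"
audio_files = [
    dataset_root / "data" / "speaker_1" / "recording_1.wav",
    dataset_root / "data" / "speaker_2" / "recording_1.wav",
]

# file_name is expressed relative to the directory that holds the metadata file.
file_names = [p.relative_to(metadata_file.parent).as_posix() for p in audio_files]
print(file_names)
```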
For audio datasets that don’t have any associated metadata, AudioFolder automatically infers the class labels of the dataset based on the directory name. This can be useful for audio classification tasks. In that case, the audio files in your dataset directory are grouped into one subdirectory per class.
Load the dataset with AudioFolder, and it will create a label column from the directory name (language id).
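The idea of deriving a label from the directory name can be illustrated in plain Python. This is not AudioFolder's implementation, only a sketch of the convention, and the file paths are hypothetical:

```python
from pathlib import PurePosixPath

# Hypothetical files grouped into one directory per language id.
audio_files = ["train/ca/clip_1.wav", "train/ca/clip_2.wav", "train/es/clip_1.wav"]

# Illustration only: use the name of the directory that contains each file as its label.
labels = [PurePosixPath(path).parent.name for path in audio_files]
print(labels)  # ['ca', 'ca', 'es']
```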
If all audio files are contained in a single directory, or if they are not all at the same level of the directory structure, the label column won’t be added automatically. If you need it, set drop_labels=False explicitly.
Write a dataset loading script to manually create a dataset. It defines a dataset’s splits and configurations, and handles downloading and generating the dataset examples. The script should have the same name as your dataset folder or repository.
The data folder can be any name you want; it doesn’t have to be data. This folder is optional, unless you’re hosting your dataset on the Hub.
This directory structure allows your dataset to be loaded in one line.
Users can preview a dataset in the dataset viewer.
The example in this guide uses TAR archives.
In addition to learning how to create a streamable dataset, you’ll also learn how to:
Create a dataset builder class.
Create dataset configurations.
Add dataset metadata.
Download and define the dataset splits.
Generate the dataset.
Upload the dataset to the Hub.
_info stores information about your dataset like its description, license, and features.
_split_generators downloads the dataset and defines its splits.
_generate_examples generates the dataset’s samples containing the audio data and other features specified in _info for each split.
Multiple configurations
Now if users want to load the Balinese (bal) configuration, they can use the configuration name.
There is a lot of information you can include about your dataset, but some important ones are:
description provides a concise description of the dataset.
homepage provides a link to the dataset homepage.
license specifies the permissions for using a dataset as defined by the license type.
citation is a BibTeX citation of the dataset.
Now that you’ve added some information about your dataset, the next step is to download the dataset and define the splits.
a relative path to a file inside a Hub dataset repository (for example, in the data/ folder)
a URL to a file hosted somewhere else
a (nested) list or dictionary of file names or URLs
Files inside TAR archives are accessed and yielded sequentially. This means you need to have the metadata associated with the audio files in the TAR file in hand first so you can yield it with its corresponding audio file.
Put these two steps together to form the whole _generate_examples method.
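The two steps (load the metadata first, then walk the archive once) can be sketched with the standard library. Here iter_tar is a stand-in built on tarfile rather than the library's own archive iterator, generate_examples is written as a plain function instead of a builder method, and the prompts file format (one "clip_id sentence" pair per line) is an assumption:

```python
import tarfile


def iter_tar(archive_path):
    """Stand-in archive iterator: yields (path, file object) pairs in archive order."""
    with tarfile.open(archive_path) as archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member)


def generate_examples(prompts_path, path_to_clips, audio_files):
    # Step 1: read all metadata up front, keyed by the audio path inside the archive.
    examples = {}
    with open(prompts_path, encoding="utf-8") as f:
        for line in f:
            clip_id, sentence = line.strip().split(" ", 1)
            examples[f"{path_to_clips}/{clip_id}.wav"] = {"sentence": sentence}

    # Step 2: walk the archive once and pair each audio file with its metadata.
    for key, (path, f) in enumerate(audio_files):
        if path in examples:
            yield key, {**examples[path], "audio": {"path": path, "bytes": f.read()}}
```

Each yielded item is a (key, example) pair, where the key is the position of the file in the archive and the audio entry carries the in-archive path together with the raw bytes.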
Congratulations, you can now load your dataset from the Hub! 🥳
Download and define the dataset splits
Generate the dataset
Here _generate_examples accepts local_extracted_archive, audio_files, metadata_path, and path_to_clips from the previous method as arguments.
TAR files are accessed and yielded sequentially. This means you first need to have the metadata in metadata_path associated with the audio files in the TAR file in hand, so that you can yield it together with its corresponding audio file.
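Reading the metadata up front might look like the sketch below. It assumes the metadata file is a CSV with a file_name column; the helper name load_metadata is hypothetical:

```python
import csv


def load_metadata(metadata_path):
    """Read the whole metadata file into a dict keyed by audio file name."""
    with open(metadata_path, encoding="utf-8", newline="") as f:
        return {row["file_name"]: row for row in csv.DictReader(f)}
```

With the dictionary in memory, each file coming out of the archive can be matched to its metadata by a simple lookup.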
Put both of these steps together to form the whole _generate_examples method.
Then upload the dataset to the BOINC AI Hub using push_to_hub().
💡 Take a look at the Hub documentation to learn more about how AudioFolder creates dataset splits based on your dataset repository structure.
Users can now load your dataset and the associated metadata by specifying audiofolder in load_dataset() and the dataset directory in data_dir.
Some audio datasets have separate metadata files for each split. Provided the metadata features are the same for each split, audiofolder can be used to load all splits at once. If the metadata features differ across splits, you should load them with separate load_dataset() calls.
This guide will show you how to create a dataset loading script for audio datasets, which is a bit different from creating a loading script for text datasets. Audio datasets are commonly stored in tar.gz archives, which require a particular approach to support streaming mode. While streaming is not required, we highly encourage implementing streaming support in your audio dataset because:
Users without a lot of disk space can use your dataset without downloading it. Learn more about streaming in the streaming guide!
The best way to learn is to open up an existing audio dataset loading script, like Vivos, and follow along!
This guide shows how to process audio data stored in TAR archives, the most frequent case for audio datasets. Audio loading scripts can also use ZIP archives.
To help you get started, we created a loading script you can copy and use as a starting point!
GeneratorBasedBuilder is the base class for datasets generated from a dictionary generator. Within this class, there are three methods to help create your dataset:
Start by creating your dataset class as a subclass of GeneratorBasedBuilder and add the three methods. Don’t worry about filling in each of these methods yet; you’ll develop those over the next few sections.
In some cases, a dataset may have more than one configuration. For example, a multilingual speech dataset may have several configurations corresponding to different languages.
To create different configurations, use the BuilderConfig class to create a configuration subclass for your dataset. The only required parameter is the name of the configuration, which must be passed to the configuration’s superclass __init__(). Otherwise, you can specify any custom parameters you want in your configuration class.
Define your configurations in the BUILDER_CONFIGS class variable inside your GeneratorBasedBuilder subclass. In this example, the author imports the languages from a separate release_stats.py file in their repository, and then loops through each language to create a configuration.
Typically, users need to specify a configuration to load in load_dataset(), otherwise a ValueError is raised. You can avoid this by setting a default dataset configuration to load in DEFAULT_CONFIG_NAME.
Adding information about your dataset helps users to learn more about it. This information is stored in the DatasetInfo class, which is returned by the _info method. Users can access this information through the info attribute of a loaded dataset.
features specifies the dataset column types. Since you’re creating an audio loading script, you’ll need to include the Audio feature and the sampling_rate of the dataset.
You’ll notice a lot of the dataset information is defined earlier in the loading script, which can make it easier to read. There are also other Features you can include, so be sure to check out the full list in the documentation for more details.
Use the download() method to download the metadata file at _PROMPTS_URLS and the audio TAR archive at _DATA_URL. This method returns the path to the local file/archive. In streaming mode, it doesn’t download the file(s) and just returns a URL to stream the data from. This method accepts a relative path inside a Hub dataset repository, a URL, or a (nested) list or dictionary of either.
After you’ve downloaded the dataset, use SplitGenerator to organize the audio files and sentence prompts in each split. Name each split with a standard name like Split.TRAIN, Split.TEST, and Split.VALIDATION.
In the gen_kwargs parameter, specify the file paths for prompts_path and path_to_clips. For audio_files, you’ll need to use iter_archive() to iterate over the audio files in the TAR archive. This enables streaming for your dataset. All of these file paths are passed on to the next step where you’ll actually generate the dataset.
This implementation does not extract downloaded archives. If you want to extract files after download, you need to additionally use extract(); see the part of this guide on extracting TAR archives locally.
The last method in the GeneratorBasedBuilder class actually generates the samples in the dataset. It yields a dataset according to the structure specified in features from the _info method. Here, _generate_examples accepts the prompts_path, path_to_clips, and audio_files from the previous method as arguments.
Finally, iterate over the files in audio_files and yield them along with their corresponding metadata. iter_archive() yields a tuple of (path, f) where path is a relative path to a file inside the TAR archive and f is the file object itself.
Once your script is ready, upload it to the Hub.
In the example above, downloaded archives are not extracted and therefore examples do not contain information about where they are stored locally. To explain how to do the extraction in a way that also supports streaming, we will briefly go through a loading script that extracts its archives locally.
Use the download() method to download the audio data at _AUDIO_URL.
To extract the audio TAR archive locally, use extract(). You can use this method only in non-streaming mode (when dl_manager.is_streaming=False). This returns a local path to the extracted archive directory.
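The guard on dl_manager.is_streaming can be sketched as follows. StubDownloadManager is a hypothetical stand-in used only to keep the sketch runnable without the library, and maybe_extract is an illustrative helper name:

```python
class StubDownloadManager:
    """Hypothetical stand-in that mimics the two members this pattern relies on."""

    def __init__(self, is_streaming):
        self.is_streaming = is_streaming

    def extract(self, path):
        # A real download manager would unpack the archive and return the output directory.
        return path + ".extracted"


def maybe_extract(dl_manager, audio_path):
    # Extraction is only possible in non-streaming mode; otherwise there is no local directory.
    return dl_manager.extract(audio_path) if not dl_manager.is_streaming else None
```

Returning None in streaming mode lets later steps tell whether a local copy of the files exists.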
Use the iter_archive() method to iterate over the archive at audio_path, just like in the Vivos example above. iter_archive() doesn’t provide any information about the full paths of files from the archive, even if it has been extracted. As a result, you need to pass the local_extracted_archive path to the next step in gen_kwargs, in order to preserve information about where the archive was extracted to. This is required to construct the correct paths to the local files when you generate the examples.
The reason you need to use a combination of download() and iter_archive() is that files in TAR archives can’t be accessed directly by their paths. Instead, you’ll need to iterate over the files within the archive! You can use download_and_extract() and extract() with TAR archives only in non-streaming mode; otherwise an error is thrown.
Use the download() method to download the metadata file specified in _METADATA_URL. This method returns a path to a local file in non-streaming mode. In streaming mode, it doesn’t download the file locally and returns the same URL.
Now use SplitGenerator to organize the audio files and metadata in each split. Name each split with a standard name like Split.TRAIN, Split.TEST, and Split.VALIDATION.
In the gen_kwargs parameter, specify the file paths to local_extracted_archive, audio_files, metadata_path, and path_to_clips. Remember, for audio_files, you need to use iter_archive() to iterate over the audio files in the TAR archives. This enables streaming for your dataset! All of these file paths are passed on to the next step where the dataset samples are generated.
Now you can yield the files in the audio_files archive. When you use iter_archive(), it yields a tuple of (path, f) where path is a relative path to a file inside the archive, and f is the file object itself. To get the full path to the locally extracted file, join the path of the directory where the archive is extracted to (local_extracted_archive) and the relative audio file path (path).
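The path join could be written as in the sketch below. The helper name local_audio_path is hypothetical, and falling back to the in-archive path when nothing was extracted is an assumption about how a script might handle streaming mode:

```python
import os


def local_audio_path(local_extracted_archive, path):
    """Return the path of an extracted file, or the in-archive path when nothing was extracted."""
    if local_extracted_archive is None:  # for example, in streaming mode
        return path
    return os.path.join(local_extracted_archive, path)
```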