Create a dataset
Sometimes, you may need to create a dataset if you’re working with your own data. Creating a dataset with 🌍 Datasets confers all the advantages of the library to your dataset: fast loading and processing, streaming of enormous datasets, memory-mapping, and more. You can easily and rapidly create a dataset with 🌍 Datasets low-code approaches, reducing the time it takes to start training a model. In many cases, it is as easy as dragging and dropping your data files into a dataset repository on the Hub.
In this tutorial, you’ll learn how to use 🌍 Datasets low-code methods for creating all types of datasets:
Folder-based builders for quickly creating an image or audio dataset
from_ methods for creating datasets from local files
There are two folder-based builders, ImageFolder and AudioFolder. These are low-code methods for quickly creating an image or speech and audio dataset with several thousand examples. They are great for rapidly prototyping computer vision and speech models before scaling to a larger dataset. Folder-based builders take your data and automatically generate the dataset’s features, splits, and labels. Under the hood:
ImageFolder uses the Image feature to decode an image file. Many image formats are supported, such as jpg and png. You can check the complete list of supported image extensions.
AudioFolder uses the Audio feature to decode an audio file. Audio extensions such as wav and mp3 are supported, and you can check the complete list of supported audio extensions.
The dataset splits are generated from the repository structure, and the label names are automatically inferred from the directory name.
For example, if your image dataset (it is the same for an audio dataset) is stored like this:
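As an illustration, the sketch below builds one possible layout on disk and prints it. The dataset name pokemon, the class folders, and the file names are all hypothetical, and the files are empty placeholders rather than real images.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: <split>/<label>/<file>. All names are illustrative.
RELATIVE_PATHS = [
    "train/grass/bulbasaur.png",
    "train/fire/charmander.png",
    "train/water/squirtle.png",
    "test/grass/ivysaur.png",
    "test/fire/charmeleon.png",
    "test/water/wartortle.png",
]


def build_layout(root: Path) -> list[str]:
    """Create empty placeholder files under root and return their relative paths."""
    for relative_path in RELATIVE_PATHS:
        file_path = root / relative_path
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.touch()  # placeholder only, not a decodable image
    return sorted(
        path.relative_to(root.parent).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )


with tempfile.TemporaryDirectory() as tmp_dir:
    layout = build_layout(Path(tmp_dir) / "pokemon")

for line in layout:
    print(line)
```

The first directory level names the split and the second names the label.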
Then this is how the folder-based builder generates an example:
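The mapping from a file path to an example can be sketched in plain Python. This is not the library's implementation: the helper describe_example and the file names are hypothetical, and the sketch assumes labels are encoded as indices into the sorted class names. The real builder also decodes the file into an image object instead of keeping the file name.

```python
from pathlib import PurePosixPath

# Hypothetical relative paths following <split>/<label>/<file>.
PATHS = [
    "train/grass/bulbasaur.png",
    "train/fire/charmander.png",
    "train/water/squirtle.png",
]

# Assumption: class names are the sorted, de-duplicated directory names.
CLASS_NAMES = sorted({PurePosixPath(p).parts[1] for p in PATHS})


def describe_example(relative_path: str) -> dict:
    """Return the split and an example dict inferred from a relative path."""
    split, label_name, file_name = PurePosixPath(relative_path).parts
    return {
        "split": split,
        "example": {"image": file_name, "label": CLASS_NAMES.index(label_name)},
    }


described = [describe_example(p) for p in PATHS]
for item in described:
    print(item)
```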
Any additional information about your dataset, such as text captions or transcriptions, can be included with a metadata.csv file in the folder containing your dataset. The metadata file needs to have a file_name column that links the image or audio file to its corresponding metadata:
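A minimal sketch of such a metadata file, written with Python's standard csv module. Only the file_name column comes from the description above; the text column name, the file names, and the captions are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical file names and captions, for illustration only.
ROWS = [
    {"file_name": "bulbasaur.png", "text": "A small green creature with a bulb on its back."},
    {"file_name": "charmander.png", "text": "An orange lizard with a flame on its tail."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    metadata_path = Path(tmp_dir) / "metadata.csv"
    with metadata_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file_name", "text"])
        writer.writeheader()
        writer.writerows(ROWS)
    # Read the file back so the resulting CSV content can be inspected.
    metadata_text = metadata_path.read_text(encoding="utf-8")

print(metadata_text)
```

In a real dataset, metadata.csv would sit in the same folder as the files it describes.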
You can also create a dataset from local files by specifying the path to the data files. There are two ways you can create a dataset using the from_ methods: from_generator() and from_dict().
We didn’t mention this in the tutorial, but you can also create a dataset with a loading script. A loading script is a more manual and code-intensive method for creating a dataset, but it also gives you the most flexibility and control over how a dataset is generated. It lets you configure additional options such as creating multiple configurations within a dataset, or enabling your dataset to be streamed.
Now that you know how to create a dataset, consider sharing it on the Hub so the community can also benefit from your work! Go on to the next section to learn how to share your dataset.
Create the image dataset by specifying imagefolder in load_dataset().
An audio dataset is created in the same way, except you specify audiofolder in load_dataset() instead.
To learn more about each of these folder-based builders, check out the ImageFolder or AudioFolder guides.
The from_generator() method is the most memory-efficient way to create a dataset from a generator due to a generator’s iterative behavior. This is especially useful when you’re working with a really large dataset that may not fit in memory, since the dataset is generated on disk progressively and then memory-mapped.
A generator-based IterableDataset needs to be iterated over, for example with a for loop.
The from_dict() method is a straightforward way to create a dataset from a dictionary.
To create an image or audio dataset, chain the from_dict() method with cast_column() and specify the column and feature type. For example, an audio dataset is created by casting a column of file paths to the Audio feature.
To learn more about how to write loading scripts, take a look at the image, audio, and text loading script guides.