Datasets
  • 🌍GET STARTED
    • Datasets
    • Quickstart
    • Installation
  • 🌍TUTORIALS
    • Overview
    • Load a dataset from the Hub
    • Know your dataset
    • Preprocess
    • Evaluate predictions
    • Create a data
    • Share a dataset to the Hub
  • 🌍HOW-TO GUIDES
    • Overview
    • 🌍GENERAL USAGE
      • Load
      • Process
      • Stream
      • Use with TensorFlow
      • Use with PyTorch
      • Use with JAX
      • Use with Spark
      • Cache management
      • Cloud storage
      • Search index
      • Metrics
      • Beam Datasets
    • 🌍AUDIO
      • Load audio data
      • Process audio data
      • Create an audio dataset
    • 🌍VISION
      • Load image data
      • Process image data
      • Create an image dataset
      • Depth estimation
      • Image classification
      • Semantic segmentation
      • Object detection
    • 🌍TEXT
      • Load text data
      • Process text data
    • 🌍TABULAR
      • Load tabular data
    • 🌍DATASET REPOSITORY
      • Share
      • Create a dataset card
      • Structure your repository
      • Create a dataset loading script
  • 🌍CONCEPTUAL GUIDES
    • Datasets with Arrow
    • The cache
    • Dataset or IterableDataset
    • Dataset features
    • Build and load
    • Batch mapping
    • All about metrics
  • 🌍REFERENCE
    • Main classes
    • Builder classes
    • Loading methods
    • Table Classes
    • Logging methods
    • Task templates
Powered by GitBook
On this page
  • Create a dataset
  • Folder-based builders
  • From local files
  • Next steps
  1. TUTORIALS

Create a data

PreviousEvaluate predictionsNextShare a dataset to the Hub

Last updated 1 year ago

Create a dataset

Sometimes, you may need to create a dataset if you’re working with your own data. Creating a dataset with 🌍 Datasets confers all the advantages of the library to your dataset: fast loading and processing, , , and more. You can easily and rapidly create a dataset with 🌍 Datasets low-code approaches, reducing the time it takes to start training a model. In many cases, it is as easy as your data files into a dataset repository on the Hub.

In this tutorial, you’ll learn how to use 🌍 Datasets low-code methods for creating all types of datasets:

  • Folder-based builders for quickly creating an image or audio dataset

  • from_ methods for creating datasets from local files

Folder-based builders

There are two folder-based builders, ImageFolder and AudioFolder. These are low-code methods for quickly creating an image or speech and audio dataset with several thousand examples. They are great for rapidly prototyping computer vision and speech models before scaling to a larger dataset. Folder-based builders takes your data and automatically generates the dataset’s features, splits, and labels. Under the hood:

  • ImageFolder uses the feature to decode an image file. Many image extension formats are supported, such as jpg and png, but other formats are also supported. You can check the complete of supported image extensions.

  • AudioFolder uses the feature to decode an audio file. Audio extensions such as wav and mp3 are supported, and you can check the complete of supported audio extensions.

The dataset splits are generated from the repository structure, and the label names are automatically inferred from the directory name.

For example, if your image dataset (it is the same for an audio dataset) is stored like this:

Copied

pokemon/train/grass/bulbasaur.png
pokemon/train/fire/charmander.png
pokemon/train/water/squirtle.png

pokemon/test/grass/ivysaur.png
pokemon/test/fire/charmeleon.png
pokemon/test/water/wartortle.png

Then this is how the folder-based builder generates an example:

Copied

>>> from datasets import ImageFolder

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")

Copied

>>> from datasets import AudioFolder

>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder")

Any additional information about your dataset, such as text captions or transcriptions, can be included with a metadata.csv file in the folder containing your dataset. The metadata file needs to have a file_name column that links the image or audio file to its corresponding metadata:

Copied

file_name, text
bulbasaur.png, There is a plant seed on its back right from the day this Pokémon is born.
charmander.png, It has a preference for hot things.
squirtle.png, When it retracts its long neck into its shell, it squirts out water with vigorous force.

From local files

You can also create a dataset from local files by specifying the path to the data files. There are two ways you can create a dataset using the from_ methods:

  • Copied

    >>> from datasets import Dataset
    >>> def gen():
    ...     yield {"pokemon": "bulbasaur", "type": "grass"}
    ...     yield {"pokemon": "squirtle", "type": "water"}
    >>> ds = Dataset.from_generator(gen)
    >>> ds[0]
    {"pokemon": "bulbasaur", "type": "grass"}

    Copied

    >>> from datasets import IterableDataset
    >>> ds = IterableDataset.from_generator(gen)
    >>> for example in ds:
    ...     print(example)
    {"pokemon": "bulbasaur", "type": "grass"}
    {"pokemon": "squirtle", "type": "water"}
  • Copied

    >>> from datasets import Dataset
    >>> ds = Dataset.from_dict({"pokemon": ["bulbasaur", "squirtle"], "type": ["grass", "water"]})
    >>> ds[0]
    {"pokemon": "bulbasaur", "type": "grass"}

    Copied

    >>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())

Next steps

We didn’t mention this in the tutorial, but you can also create a dataset with a loading script. A loading script is a more manual and code-intensive method for creating a dataset, but it also gives you the most flexibility and control over how a dataset is generated. It lets you configure additional options such as creating multiple configurations within a dataset, or enabling your dataset to be streamed.

Now that you know how to create a dataset, consider sharing it on the Hub so the community can also benefit from your work! Go on to the next section to learn how to share your dataset.

Create the image dataset by specifying imagefolder in :

An audio dataset is created in the same way, except you specify audiofolder in instead:

To learn more about each of these folder-based builders, check out the and or guides.

The method is the most memory-efficient way to create a dataset from a due to a generators iterative behavior. This is especially useful when you’re working with a really large dataset that may not fit in memory, since the dataset is generated on disk progressively and then memory-mapped.

A generator-based needs to be iterated over with a for loop for example:

The method is a straightforward way to create a dataset from a dictionary:

To create an image or audio dataset, chain the method with and specify the column and feature type. For example, to create an audio dataset:

To learn more about how to write loading scripts, take a look at the , , and guides.

🌍
stream enormous datasets
memory-mapping
dragging and dropping
Image
list
Audio
list
load_dataset()
load_dataset()
ImageFolder
AudioFolder
from_generator()
generator
IterableDataset
from_dict()
cast_column()
from_dict()
image loading script
audio loading script
text loading script