Create an image dataset
There are two methods for creating and sharing an image dataset. This guide will show you how to:

- Create an image dataset with ImageFolder and some metadata. This is a no-code solution for quickly creating an image dataset with several thousand images.
- Create an image dataset by writing a loading script. This method is a bit more involved, but you have greater flexibility over how a dataset is defined, downloaded, and generated, which can be useful for more complex or large-scale image datasets.
You can control access to your dataset by requiring users to share their contact information first. Check out the Gated datasets guide for more information about how to enable this feature on the Hub.
ImageFolder

The ImageFolder is a dataset builder designed to quickly load an image dataset with several thousand images without requiring you to write any code.

💡 Take a look at the split pattern hierarchy documentation to learn more about how ImageFolder creates dataset splits based on your dataset repository structure.

ImageFolder automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:
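```
folder/train/dog/golden_retriever.png
folder/train/dog/german_shepherd.png
folder/train/dog/chihuahua.png
folder/train/cat/maine_coon.png
folder/train/cat/bengal.png
folder/train/cat/birman.png
```

The file names here are only illustrative; what matters is that the parent directories (dog and cat) name the classes.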
Then users can load your dataset by specifying imagefolder in load_dataset() and the directory in data_dir:
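```python
from datasets import load_dataset

# "/path/to/folder" is a placeholder for the directory created above
dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
```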
You can also use imagefolder to load datasets involving multiple splits. To do so, your dataset directory should have the following structure:
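```
folder/train/dog/golden_retriever.png
folder/train/cat/maine_coon.png
folder/test/dog/german_shepherd.png
folder/test/cat/bengal.png
```

Again, the file names are illustrative; the top-level train and test directories define the splits.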
If all image files are contained in a single directory or if they are not on the same level of the directory structure, the label column won’t be added automatically. If you need it, set drop_labels=False explicitly.
If there is additional information you’d like to include about your dataset, like text captions or bounding boxes, add it as a metadata.csv file in your folder. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection. You can also use a JSONL file, metadata.jsonl.
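```
folder/train/metadata.csv
folder/train/0001.png
folder/train/0002.png
folder/train/0003.png
```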
You can also zip your images:
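```
folder/metadata.csv
folder/train.zip
folder/test.zip
folder/valid.zip
```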
Your metadata.csv file must have a file_name column which links image files with their metadata:
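```csv
file_name,additional_feature
0001.png,This is a first value of a text feature you added to your images
0002.png,This is a second value of a text feature you added to your images
0003.png,This is a third value of a text feature you added to your images
```

Here additional_feature is a placeholder column name; use whatever metadata columns your task needs.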
or using metadata.jsonl:
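```jsonl
{"file_name": "0001.png", "additional_feature": "This is a first value of a text feature you added to your images"}
{"file_name": "0002.png", "additional_feature": "This is a second value of a text feature you added to your images"}
{"file_name": "0003.png", "additional_feature": "This is a third value of a text feature you added to your images"}
```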
If metadata files are present, the inferred labels based on the directory name are dropped by default. To include those labels, set drop_labels=False in load_dataset.
Image captions

Image captioning datasets have text describing an image. An example metadata.csv may look like:
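```csv
file_name,text
0001.png,This is a golden retriever playing with a ball
0002.png,A german shepherd
0003.png,One chihuahua
```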
Load the dataset with ImageFolder, and it will create a text column for the image captions:
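```python
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
dataset[0]["text"]
# e.g. 'This is a golden retriever playing with a ball'
```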
Object detection

Object detection datasets have bounding boxes and categories identifying objects in an image. An example metadata.jsonl may look like:
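```jsonl
{"file_name": "0001.png", "objects": {"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}}
{"file_name": "0002.png", "objects": {"bbox": [[810.0, 100.0, 57.0, 28.0]], "categories": [1]}}
{"file_name": "0003.png", "objects": {"bbox": [[160.0, 31.0, 248.0, 616.0], [741.0, 68.0, 202.0, 401.0]], "categories": [2, 2]}}
```

The bounding box coordinates and category ids above are illustrative values.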
Load the dataset with ImageFolder, and it will create an objects column with the bounding boxes and the categories:
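```python
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
dataset[0]["objects"]
# e.g. {'bbox': [[302.0, 109.0, 73.0, 52.0]], 'categories': [0]}
```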
Upload dataset to the Hub

Once you’ve created a dataset, you can share it to the Hub with the push_to_hub() method. Make sure you have the Hub client library installed and you’re logged in to your BOINC AI account (see the upload tutorial for more details). Upload your dataset with push_to_hub():
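```python
from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
# "<username>/my-image-dataset" is a placeholder repository name
dataset.push_to_hub("<username>/my-image-dataset")
```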
Loading script

Write a dataset loading script to share a dataset. It defines a dataset’s splits and configurations, and handles downloading and generating a dataset. The script is located in the same folder or repository as the dataset and should have the same name.
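```
my_dataset/
├── README.md
├── my_dataset.py
└── data/
```

Here my_dataset is a placeholder name; the script file must share the dataset's name.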
This structure allows your dataset to be loaded in one line:
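```python
from datasets import load_dataset

dataset = load_dataset("path/to/my_dataset")
```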
This guide will show you how to create a dataset loading script for image datasets, which is a bit different from creating a loading script for text datasets. You’ll learn how to:

- Create a dataset builder class.
- Create dataset configurations.
- Add dataset metadata.
- Download and define the dataset splits.
- Generate the dataset.
- Generate the dataset metadata (optional).
- Upload the dataset to the Hub.

The best way to learn is to open up an existing image dataset loading script, like Food-101, and follow along!

To help you get started, we created a loading script template you can copy and use as a starting point!

Create a dataset builder class

GeneratorBasedBuilder is the base class for datasets generated from a dictionary generator. Within this class, there are three methods to help create your dataset:
- info stores information about your dataset like its description, license, and features.
- split_generators downloads the dataset and defines its splits.
- generate_examples generates the images and labels for each split.
Start by creating your dataset class as a subclass of GeneratorBasedBuilder and add the three methods. Don’t worry about filling in each of these methods yet; you’ll develop them over the next few sections:
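```python
import datasets


class Food101(datasets.GeneratorBasedBuilder):
    """Food-101 images dataset."""

    def _info(self):
        # Dataset description, features, homepage, license, ...
        ...

    def _split_generators(self, dl_manager):
        # Download the data and define the splits
        ...

    def _generate_examples(self, images, metadata_path):
        # Yield (key, example) tuples for one split
        ...
```

Note that each method is implemented with a leading underscore (_info, _split_generators, _generate_examples).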
Multiple configurations
In some cases, a dataset may have more than one configuration. For example, if you check out the Imagenette dataset, you’ll notice there are three subsets.

To create different configurations, use the BuilderConfig class to create a subclass for your dataset. Provide the links to download the images and labels in data_url and metadata_urls:
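```python
class Food101Config(datasets.BuilderConfig):
    """Builder config for Food-101."""

    def __init__(self, data_url, metadata_urls, **kwargs):
        """BuilderConfig for Food-101.

        Args:
          data_url: `string`, URL to download the images from.
          metadata_urls: dictionary with keys 'train' and 'validation' containing the metadata file URLs.
          **kwargs: keyword arguments forwarded to super.
        """
        super().__init__(version=datasets.Version("1.0.0"), **kwargs)
        self.data_url = data_url
        self.metadata_urls = metadata_urls
```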
Now you can define your subsets at the top of your GeneratorBasedBuilder. Imagine you want to create two subsets in the Food-101 dataset based on whether it is a breakfast or dinner food.

1. Define your subsets with Food101Config in a list in BUILDER_CONFIGS.
2. For each configuration, provide a name, description, and where to download the images and labels from.
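For example (the download URLs below are placeholders):

```python
class Food101(datasets.GeneratorBasedBuilder):
    """Food-101 images dataset."""

    BUILDER_CONFIGS = [
        Food101Config(
            name="breakfast",
            description="Food types commonly eaten during breakfast.",
            data_url="https://link-to-breakfast-foods.zip",
            metadata_urls={
                "train": "https://link-to-breakfast-foods-train.txt",
                "validation": "https://link-to-breakfast-foods-validation.txt",
            },
        ),
        Food101Config(
            name="dinner",
            description="Food types commonly eaten during dinner.",
            data_url="https://link-to-dinner-foods.zip",
            metadata_urls={
                "train": "https://link-to-dinner-foods-train.txt",
                "validation": "https://link-to-dinner-foods-validation.txt",
            },
        ),
    ]
```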
Now if users want to load the breakfast configuration, they can use the configuration name:
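```python
from datasets import load_dataset

ds = load_dataset("food101", "breakfast", split="train")
```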
Add dataset metadata

Adding information about your dataset is useful for users to learn more about it. This information is stored in the DatasetInfo class, which is returned by the info method. Users can access this information by:
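```python
from datasets import load_dataset_builder

ds_builder = load_dataset_builder("food101")
ds_builder.info
```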
There is a lot of information you can specify about your dataset, but some important ones to include are:
- description provides a concise description of the dataset.
- features specify the dataset column types. Since you’re creating an image loading script, you’ll need to include the Image feature.
- supervised_keys specify the input feature and label.
- homepage provides a link to the dataset homepage.
- citation is a BibTeX citation of the dataset.
- license states the dataset’s license.
You’ll notice a lot of the dataset information is defined earlier in the loading script, which makes it easier to read. There are also other Features you can input, so be sure to check out the full list for more details.
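A sketch of what this can look like, assuming the _DESCRIPTION, _NAMES, _HOMEPAGE, _CITATION, and _LICENSE constants are defined earlier in the script:

```python
def _info(self):
    return datasets.DatasetInfo(
        description=_DESCRIPTION,
        features=datasets.Features(
            {
                "image": datasets.Image(),
                "label": datasets.ClassLabel(names=_NAMES),
            }
        ),
        supervised_keys=("image", "label"),
        homepage=_HOMEPAGE,
        citation=_CITATION,
        license=_LICENSE,
    )
```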
Download and define the dataset splits

Now that you’ve added some information about your dataset, the next step is to download the dataset and generate the splits.
1. Use the download() method to download the dataset and any other metadata you’d like to associate with it. This method accepts:

   - a name of a file inside a Hub dataset repository (in other words, the data/ folder)
   - a URL to a file hosted somewhere else
   - a list or dictionary of file names or URLs

   In the Food-101 loading script, you’ll notice again the URLs are defined earlier in the script.
2. After you’ve downloaded the dataset, use SplitGenerator to organize the images and labels in each split. Name each split with a standard name like Split.TRAIN, Split.TEST, and Split.VALIDATION.

   In the gen_kwargs parameter, specify the file paths to the images to iterate over and load. If necessary, you can use iter_archive() to iterate over images in TAR archives. You can also specify the associated labels in the metadata_path. The images and metadata_path are actually passed on to the next step, where you’ll actually generate the dataset.

To stream a TAR archive file, you need to use iter_archive()! The download_and_extract() function does not support TAR archives in streaming mode.
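A sketch of the method under those assumptions (_BASE_URL and _METADATA_URLS are assumed to be defined earlier in the script, and the metadata key names depend on your URLs dictionary):

```python
def _split_generators(self, dl_manager):
    # Download the image archive and the per-split metadata files
    archive_path = dl_manager.download(_BASE_URL)
    split_metadata_paths = dl_manager.download(_METADATA_URLS)
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={
                "images": dl_manager.iter_archive(archive_path),
                "metadata_path": split_metadata_paths["train"],
            },
        ),
        datasets.SplitGenerator(
            name=datasets.Split.VALIDATION,
            gen_kwargs={
                "images": dl_manager.iter_archive(archive_path),
                "metadata_path": split_metadata_paths["validation"],
            },
        ),
    ]
```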
Generate the dataset

The last method in the GeneratorBasedBuilder class actually generates the images and labels in the dataset. It yields a dataset according to the structure specified in features from the info method. As you can see, generate_examples accepts the images and metadata_path from the previous method as arguments.

To stream a TAR archive file, the metadata_path needs to be opened and read first. TAR files are accessed and yielded sequentially. This means you need to have the metadata information in hand first so you can yield it with its corresponding image.
Now you can write a function for opening and loading examples from the dataset:
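```python
def _generate_examples(self, images, metadata_path):
    """Generate images and labels for splits."""
    # Read the metadata first so each image can be yielded with its label
    with open(metadata_path, encoding="utf-8") as f:
        files_to_keep = set(f.read().split("\n"))
    for file_path, file_obj in images:
        # _IMAGES_DIR is assumed to be defined earlier in the script,
        # e.g. the prefix of image paths inside the TAR archive
        if file_path.startswith(_IMAGES_DIR):
            if file_path[len(_IMAGES_DIR) : -len(".jpg")] in files_to_keep:
                label = file_path.split("/")[2]
                yield file_path, {
                    "image": {"path": file_path, "bytes": file_obj.read()},
                    "label": label,
                }
```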
Generate the dataset metadata (optional)

The dataset metadata can be generated and stored in the dataset card (the README.md file).
Run the following command to generate your dataset metadata in README.md and make sure your new loading script works correctly:
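```bash
datasets-cli test path/to/<your-dataset-loading-script> --save_info --all_configs
```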
If your loading script passed the test, you should now have the dataset_info YAML fields in the header of the README.md file in your dataset folder.
Upload the dataset to the Hub

Once your script is ready, create a dataset card and upload it to the Hub.

Congratulations, you can now load your dataset from the Hub! 🥳
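```python
from datasets import load_dataset

# "<username>/my_dataset" is a placeholder repository name
dataset = load_dataset("<username>/my_dataset")
```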