Datasets
  • 🌍GET STARTED
    • Datasets
    • Quickstart
    • Installation
  • 🌍TUTORIALS
    • Overview
    • Load a dataset from the Hub
    • Know your dataset
    • Preprocess
    • Evaluate predictions
    • Create a data
    • Share a dataset to the Hub
  • 🌍HOW-TO GUIDES
    • Overview
    • 🌍GENERAL USAGE
      • Load
      • Process
      • Stream
      • Use with TensorFlow
      • Use with PyTorch
      • Use with JAX
      • Use with Spark
      • Cache management
      • Cloud storage
      • Search index
      • Metrics
      • Beam Datasets
    • 🌍AUDIO
      • Load audio data
      • Process audio data
      • Create an audio dataset
    • 🌍VISION
      • Load image data
      • Process image data
      • Create an image dataset
      • Depth estimation
      • Image classification
      • Semantic segmentation
      • Object detection
    • 🌍TEXT
      • Load text data
      • Process text data
    • 🌍TABULAR
      • Load tabular data
    • 🌍DATASET REPOSITORY
      • Share
      • Create a dataset card
      • Structure your repository
      • Create a dataset loading script
  • 🌍CONCEPTUAL GUIDES
    • Datasets with Arrow
    • The cache
    • Dataset or IterableDataset
    • Dataset features
    • Build and load
    • Batch mapping
    • All about metrics
  • 🌍REFERENCE
    • Main classes
    • Builder classes
    • Loading methods
    • Table Classes
    • Logging methods
    • Task templates
Powered by GitBook
On this page
  • Create an image dataset
  • ImageFolder
  • Loading script
  1. HOW-TO GUIDES
  2. VISION

Create an image dataset

PreviousProcess image dataNextDepth estimation

Last updated 1 year ago

Create an image dataset

There are two methods for creating and sharing an image dataset. This guide will show you how to:

  • Create an image dataset with ImageFolder and some metadata. This is a no-code solution for quickly creating an image dataset with several thousand images.

  • Create an image dataset by writing a loading script. This method is a bit more involved, but you have greater flexibility over how a dataset is defined, downloaded, and generated which can be useful for more complex or large scale image datasets.

You can control access to your dataset by requiring users to share their contact information first. Check out the guide for more information about how to enable this feature on the Hub.

ImageFolder

The ImageFolder is a dataset builder designed to quickly load an image dataset with several thousand images without requiring you to write any code.

💡 Take a look at the to learn more about how ImageFolder creates dataset splits based on your dataset repository structure.

ImageFolder automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:

Copied

folder/train/dog/golden_retriever.png
folder/train/dog/german_shepherd.png
folder/train/dog/chihuahua.png

folder/train/cat/maine_coon.png
folder/train/cat/bengal.png
folder/train/cat/birman.png

Copied

>>> from datasets import load_dataset

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder")

You can also use imagefolder to load datasets involving multiple splits. To do so, your dataset directory should have the following structure:

Copied

folder/train/dog/golden_retriever.png
folder/train/cat/maine_coon.png
folder/test/dog/german_shepherd.png
folder/test/cat/bengal.png

If all image files are contained in a single directory or if they are not on the same level of directory structure, label column won’t be added automatically. If you need it, set drop_labels=False explicitly.

If there is additional information you’d like to include about your dataset, like text captions or bounding boxes, add it as a metadata.csv file in your folder. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection. You can also use a JSONL file metadata.jsonl.

Copied

folder/train/metadata.csv
folder/train/0001.png
folder/train/0002.png
folder/train/0003.png

You can also zip your images:

Copied

folder/metadata.csv
folder/train.zip
folder/test.zip
folder/valid.zip

Your metadata.csv file must have a file_name column which links image files with their metadata:

Copied

file_name,additional_feature
0001.png,This is a first value of a text feature you added to your images
0002.png,This is a second value of a text feature you added to your images
0003.png,This is a third value of a text feature you added to your images

or using metadata.jsonl:

Copied

{"file_name": "0001.png", "additional_feature": "This is a first value of a text feature you added to your images"}
{"file_name": "0002.png", "additional_feature": "This is a second value of a text feature you added to your images"}
{"file_name": "0003.png", "additional_feature": "This is a third value of a text feature you added to your images"}

If metadata files are present, the inferred labels based on the directory name are dropped by default. To include those labels, set drop_labels=False in load_dataset.

Image captioning

Image captioning datasets have text describing an image. An example metadata.csv may look like:

Copied

file_name,text
0001.png,This is a golden retriever playing with a ball
0002.png,A german shepherd
0003.png,One chihuahua

Load the dataset with ImageFolder, and it will create a text column for the image captions:

Copied

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["text"]
"This is a golden retriever playing with a ball"

Object detection

Object detection datasets have bounding boxes and categories identifying objects in an image. An example metadata.jsonl may look like:

Copied

{"file_name": "0001.png", "objects": {"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}}
{"file_name": "0002.png", "objects": {"bbox": [[810.0, 100.0, 57.0, 28.0]], "categories": [1]}}
{"file_name": "0003.png", "objects": {"bbox": [[160.0, 31.0, 248.0, 616.0], [741.0, 68.0, 202.0, 401.0]], "categories": [2, 2]}}

Load the dataset with ImageFolder, and it will create a objects column with the bounding boxes and the categories:

Copied

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["objects"]
{"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}

Upload dataset to the Hub

Copied

>>> from datasets import load_dataset

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset.push_to_hub("stevhliu/my-image-captioning-dataset")

Loading script

Write a dataset loading script to share a dataset. It defines a dataset’s splits and configurations, and handles downloading and generating a dataset. The script is located in the same folder or repository as the dataset and should have the same name.

Copied

my_dataset/
├── README.md
├── my_dataset.py
└── data/  # optional, may contain your images or TAR archives

This structure allows your dataset to be loaded in one line:

Copied

>>> from datasets import load_dataset
>>> dataset = load_dataset("path/to/my_dataset")
  • Create a dataset builder class.

  • Create dataset configurations.

  • Add dataset metadata.

  • Download and define the dataset splits.

  • Generate the dataset.

  • Generate the dataset metadata (optional).

  • Upload the dataset to the Hub.

Create a dataset builder class

  • info stores information about your dataset like its description, license, and features.

  • split_generators downloads the dataset and defines its splits.

  • generate_examples generates the images and labels for each split.

Copied

class Food101(datasets.GeneratorBasedBuilder):
    """Food-101 Images dataset"""

    def _info(self):

    def _split_generators(self, dl_manager):

    def _generate_examples(self, images, metadata_path):

Multiple configurations

Copied

class Food101Config(datasets.BuilderConfig):
    """Builder Config for Food-101"""
 
    def __init__(self, data_url, metadata_urls, **kwargs):
        """BuilderConfig for Food-101.
        Args:
          data_url: `string`, url to download the zip file from.
          metadata_urls: dictionary with keys 'train' and 'validation' containing the archive metadata URLs
          **kwargs: keyword arguments forwarded to super.
        """
        super(Food101Config, self).__init__(version=datasets.Version("1.0.0"), **kwargs)
        self.data_url = data_url
        self.metadata_urls = metadata_urls
  1. Define your subsets with Food101Config in a list in BUILDER_CONFIGS.

  2. For each configuration, provide a name, description, and where to download the images and labels from.

Copied

class Food101(datasets.GeneratorBasedBuilder):
    """Food-101 Images dataset"""
 
    BUILDER_CONFIGS = [
        Food101Config(
            name="breakfast",
            description="Food types commonly eaten during breakfast.",
            data_url="https://link-to-breakfast-foods.zip",
            metadata_urls={
                "train": "https://link-to-breakfast-foods-train.txt", 
                "validation": "https://link-to-breakfast-foods-validation.txt"
            },
        ,
        Food101Config(
            name="dinner",
            description="Food types commonly eaten during dinner.",
            data_url="https://link-to-dinner-foods.zip",
            metadata_urls={
                "train": "https://link-to-dinner-foods-train.txt", 
                "validation": "https://link-to-dinner-foods-validation.txt"
            },
        )...
    ]

Now if users want to load the breakfast configuration, they can use the configuration name:

Copied

>>> from datasets import load_dataset
>>> ds = load_dataset("food101", "breakfast", split="train")

Add dataset metadata

Copied

>>> from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder("food101")
>>> ds_builder.info

There is a lot of information you can specify about your dataset, but some important ones to include are:

  1. description provides a concise description of the dataset.

  2. supervised_keys specify the input feature and label.

  3. homepage provides a link to the dataset homepage.

  4. citation is a BibTeX citation of the dataset.

  5. license states the dataset’s license.

You’ll notice a lot of the dataset information is defined earlier in the loading script which makes it easier to read. There are also other ~Datasets.Features you can input, so be sure to check out the full list for more details.

Copied

def _info(self):
    return datasets.DatasetInfo(
        description=_DESCRIPTION,
        features=datasets.Features(
            {
                "image": datasets.Image(),
                "label": datasets.ClassLabel(names=_NAMES),
            }
        ),
        supervised_keys=("image", "label"),
        homepage=_HOMEPAGE,
        citation=_CITATION,
        license=_LICENSE,
        task_templates=[ImageClassification(image_column="image", label_column="label")],
    )

Download and define the dataset splits

Now that you’ve added some information about your dataset, the next step is to download the dataset and generate the splits.

    • a name to a file inside a Hub dataset repository (in other words, the data/ folder)

    • a URL to a file hosted somewhere else

    • a list or dictionary of file names or URLs

    In the Food-101 loading script, you’ll notice again the URLs are defined earlier in the script.

Copied

def _split_generators(self, dl_manager):
    archive_path = dl_manager.download(_BASE_URL)
    split_metadata_paths = dl_manager.download(_METADATA_URLS)
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={
                "images": dl_manager.iter_archive(archive_path),
                "metadata_path": split_metadata_paths["train"],
            },
        ),
        datasets.SplitGenerator(
            name=datasets.Split.VALIDATION,
            gen_kwargs={
                "images": dl_manager.iter_archive(archive_path),
                "metadata_path": split_metadata_paths["test"],
            },
        ),
    ]

Generate the dataset

To stream a TAR archive file, the metadata_path needs to be opened and read first. TAR files are accessed and yielded sequentially. This means you need to have the metadata information in hand first so you can yield it with its corresponding image.

Now you can write a function for opening and loading examples from the dataset:

Copied

def _generate_examples(self, images, metadata_path):
    """Generate images and labels for splits."""
    with open(metadata_path, encoding="utf-8") as f:
        files_to_keep = set(f.read().split("\n"))
    for file_path, file_obj in images:
        if file_path.startswith(_IMAGES_DIR):
            if file_path[len(_IMAGES_DIR) : -len(".jpg")] in files_to_keep:
                label = file_path.split("/")[2]
                yield file_path, {
                    "image": {"path": file_path, "bytes": file_obj.read()},
                    "label": label,
                }

Generate the dataset metadata (optional)

The dataset metadata can be generated and stored in the dataset card (README.md file).

Run the following command to generate your dataset metadata in README.md and make sure your new loading script works correctly:

Copied

datasets-cli test path/to/<your-dataset-loading-script> --save_info --all_configs

If your loading script passed the test, you should now have the dataset_info YAML fields in the header of the README.md file in your dataset folder.

Upload the dataset to the Hub

Congratulations, you can now load your dataset from the Hub! 🥳

Copied

>>> from datasets import load_dataset
>>> load_dataset("<username>/my_dataset")

Then users can load your dataset by specifying imagefolder in and the directory in data_dir:

Once you’ve created a dataset, you can share it to the Hub with the method. Make sure you have the library installed and you’re logged in to your BOINC AI account (see the for more details).

Upload your dataset with :

This guide will show you how to create a dataset loading script for image datasets, which is a bit different from . You’ll learn how to:

The best way to learn is to open up an existing image dataset loading script, like , and follow along!

To help you get started, we created a loading script you can copy and use as a starting point!

is the base class for datasets generated from a dictionary generator. Within this class, there are three methods to help create your dataset:

Start by creating your dataset class as a subclass of and add the three methods. Don’t worry about filling in each of these methods yet, you’ll develop those over the next few sections:

In some cases, a dataset may have more than one configuration. For example, if you check out the , you’ll notice there are three subsets.

To create different configurations, use the class to create a subclass for your dataset. Provide the links to download the images and labels in data_url and metadata_urls:

Now you can define your subsets at the top of . Imagine you want to create two subsets in the Food-101 dataset based on whether it is a breakfast or dinner food.

Adding information about your dataset is useful for users to learn more about it. This information is stored in the class which is returned by the info method. Users can access this information by:

features specify the dataset column types. Since you’re creating an image loading script, you’ll need to include the feature.

Use the method to download the dataset and any other metadata you’d like to associate with it. This method accepts:

After you’ve downloaded the dataset, use the to organize the images and labels in each split. Name each split with a standard name like: Split.TRAIN, Split.TEST, and SPLIT.Validation.

In the gen_kwargs parameter, specify the file paths to the images to iterate over and load. If necessary, you can use to iterate over images in TAR archives. You can also specify the associated labels in the metadata_path. The images and metadata_path are actually passed onto the next step where you’ll actually generate the dataset.

To stream a TAR archive file, you need to use ! The function does not support TAR archives in streaming mode.

The last method in the class actually generates the images and labels in the dataset. It yields a dataset according to the stucture specified in features from the info method. As you can see, generate_examples accepts the images and metadata_path from the previous method as arguments.

Once your script is ready, and .

🌍
🌍
Gated datasets
Split pattern hierarchy
load_dataset()
push_to_hub()
boincai_hub
Upload with Python tutorial
push_to_hub()
creating a loading script for text datasets
Food-101
template
GeneratorBasedBuilder
GeneratorBasedBuilder
Imagenette dataset
BuilderConfig
GeneratorBasedBuilder
DatasetInfo
Image
DownloadManager.download()
SplitGenerator
DownloadManager.iter_archive()
DownloadManager.iter_archive()
DownloadManager.download_and_extract()
GeneratorBasedBuilder
create a dataset card
upload it to the Hub