Create a dataset for training

There are many datasets on the Hub to train a model on, but if you can’t find one you’re interested in or want to use your own, you can create a dataset with the 🌍 Datasets library. The dataset structure depends on the task you want to train your model on. The most basic dataset structure is a directory of images for tasks like unconditional image generation. Another dataset structure may be a directory of images and a text file containing their corresponding text captions for tasks like text-to-image generation.

This guide will show you two ways to create a dataset to finetune on:

  • provide a folder of images to the --train_data_dir argument

  • upload a dataset to the Hub and pass the dataset repository id to the --dataset_name argument

💡 Learn more about how to create an image dataset for training in the Create an image dataset guide.

Provide a dataset as a folder

For unconditional generation, you can provide your own dataset as a folder of images. The training script uses the ImageFolder builder from 🌍 Datasets to automatically build a dataset from the folder. Your directory structure should look like:

data_dir/xxx.png
data_dir/xxy.png
data_dir/[...]/xxz.png

Pass the path to the dataset directory to the --train_data_dir argument, and then you can start training:

accelerate launch train_unconditional.py \
    --train_data_dir <path-to-train-directory> \
    <other-arguments>
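
For text-to-image training, the folder needs captions alongside the images. One way to sketch this, assuming the ImageFolder builder's metadata.jsonl convention (a file_name column that links each row to an image) and a caption column named text, is the helper below. The function write_caption_metadata and the example file names are hypothetical, and the caption column your training script expects may differ.

```python
import json
from pathlib import Path


def write_caption_metadata(data_dir, captions):
    """Write a metadata.jsonl file that pairs image file names with captions.

    `captions` maps a file name relative to `data_dir` to its caption text.
    The "text" column name is an assumption; adjust it to your training script.
    """
    metadata_path = Path(data_dir) / "metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as f:
        for file_name, text in captions.items():
            # One JSON object per line; "file_name" ties the row to an image.
            f.write(json.dumps({"file_name": file_name, "text": text}) + "\n")
    return metadata_path


# Example call with placeholder names:
# write_caption_metadata("data_dir", {"xxx.png": "a photo of a cat"})
```

The resulting file sits next to the images inside the directory you pass to --train_data_dir.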

Upload your data to the Hub

💡 For more details and context about creating and uploading a dataset to the Hub, take a look at the Image search with 🌍 Datasets post.

Start by creating a dataset with the ImageFolder feature, which creates an image column containing the PIL-encoded images.

You can use the data_dir or data_files parameters to specify the location of the dataset. The data_files parameter also supports mapping specific files to dataset splits like train or test.

Then use the push_to_hub method to upload the dataset to the Hub.
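
A hedged sketch of the upload step follows. It assumes you have already authenticated with the Hub (for example with the huggingface-cli login command) and that dataset is the object returned by load_dataset. The helper upload_to_hub and the repository name are placeholders.

```python
def upload_to_hub(dataset, repo_id, private=False):
    """Upload `dataset` to the Hub repository `repo_id`.

    Set `private=True` to create the repository as private.
    Requires prior authentication with the Hub.
    """
    dataset.push_to_hub(repo_id, private=private)


# Example calls with a placeholder repository name:
# upload_to_hub(dataset, "name_of_your_dataset")
# upload_to_hub(dataset, "name_of_your_dataset", private=True)
```

The thin wrapper only forwards to push_to_hub; calling dataset.push_to_hub directly works just as well.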

Now the dataset is available for training by passing its name to the --dataset_name argument of the training script, in place of the --train_data_dir argument used for a local folder.

Next steps

Now that you’ve created a dataset, you can plug it into the --train_data_dir (if your dataset is local) or --dataset_name (if your dataset is on the Hub) argument of a training script.

For your next steps, feel free to try and use your dataset to train a model for unconditional generation or text-to-image generation!
