Build and load

Nearly every deep learning workflow begins with loading a dataset, which makes it one of the most important steps. With 🌍 Datasets, there are more than 900 datasets available to help you get started with your NLP task. All you have to do is call load_dataset() to take your first step. This function is a true workhorse in every sense because it builds and loads every dataset you use.
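
For example, loading a dataset hosted on the Hub is a single call. This is a minimal sketch using the "squad" dataset (also mentioned in the Security section below); any other dataset name from the Hub works the same way:

```python
from datasets import load_dataset

# Downloads the data files (or reuses the local cache), builds the dataset,
# and returns a DatasetDict with one Dataset per split.
dataset = load_dataset("squad")
print(dataset)
```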

ELI5: load_dataset

Let's begin with a basic Explain Like I'm Five.

A dataset is a directory that contains:

  • Some data files in generic formats (JSON, CSV, Parquet, text, etc.)

  • A dataset card named README.md that contains documentation about the dataset as well as a YAML header to define the dataset's tags and configurations

  • An optional dataset script if it requires some code to read the data files. This is sometimes used to load files of specific formats and structures.

The load_dataset() function fetches the requested dataset locally or from the BOINC AI Hub. The Hub is a central repository where all the BOINC AI datasets and models are stored.

If the dataset only contains data files, then load_dataset() automatically infers how to load the data files from their extensions (json, csv, parquet, txt, etc.). Under the hood, 🌍 Datasets will use an appropriate DatasetBuilder based on the data file format. There is one builder per data file format in 🌍 Datasets:

  • datasets.packaged_modules.text.Text for text

  • datasets.packaged_modules.csv.Csv for CSV and TSV

  • datasets.packaged_modules.json.Json for JSON and JSONL

  • datasets.packaged_modules.parquet.Parquet for Parquet

  • datasets.packaged_modules.arrow.Arrow for Arrow (streaming file format)

  • datasets.packaged_modules.sql.Sql for SQL databases

  • datasets.packaged_modules.imagefolder.ImageFolder for image folders

  • datasets.packaged_modules.audiofolder.AudioFolder for audio folders
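
For instance, pointing load_dataset() at a local CSV file selects the Csv builder automatically. This is a minimal sketch; the file names below are placeholders:

```python
from datasets import load_dataset

# The "csv" format selects datasets.packaged_modules.csv.Csv under the hood.
# "my_data.csv" is a hypothetical local file.
dataset = load_dataset("csv", data_files="my_data.csv")

# A JSON Lines file would select the Json builder instead:
# dataset = load_dataset("json", data_files="my_data.jsonl")
```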

If the dataset has a dataset script, then it downloads and imports it from the BOINC AI Hub. Code in the dataset script defines a custom DatasetBuilder, which specifies the dataset information (description, features, URL to the original files, etc.) and tells 🌍 Datasets how to generate and display examples from it.

Read the Share section to learn more about how to share a dataset. This section also provides a step-by-step guide on how to write your own dataset loading script!

🌍 Datasets downloads the dataset files from the original URL, generates the dataset and caches it in an Arrow table on your drive. If you've downloaded the dataset before, then 🌍 Datasets will reload it from the cache to save you the trouble of downloading it again.

Now that you have a high-level understanding about how datasets are built, let's take a closer look at the nuts and bolts of how all this works.

Building a dataset

When you load a dataset for the first time, 🌍 Datasets takes the raw data file and builds it into a table of rows and typed columns. There are two main classes responsible for building a dataset: BuilderConfig and DatasetBuilder.

BuilderConfig

BuilderConfig is the configuration class of DatasetBuilder. The BuilderConfig contains the following basic attributes about a dataset:

Attribute    Description
name         Short name of the dataset.
version      Dataset version identifier.
data_dir     Stores the path to a local folder containing the data files.
data_files   Stores paths to local data files.
description  Description of the dataset.

If you want to add additional attributes to your dataset such as the class labels, you can subclass the base BuilderConfig class. There are two ways to populate the attributes of a BuilderConfig class or subclass:

  • Provide a list of predefined BuilderConfig class (or subclass) instances in the dataset's DatasetBuilder.BUILDER_CONFIGS attribute.

  • When you call load_dataset(), any keyword arguments that are not specific to the method will be used to set the associated attributes of the BuilderConfig class. This will override the predefined attributes if a specific configuration was selected.

You can also set DatasetBuilder.BUILDER_CONFIG_CLASS to any custom subclass of BuilderConfig.
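
As a sketch of this pattern (the class, attribute, and dataset names here are hypothetical, not part of the library), a config subclass can add a custom attribute and be populated in either of the two ways described above:

```python
import datasets

# Hypothetical config subclass adding a "language" attribute on top of the
# basic BuilderConfig fields (name, version, data_dir, data_files, description).
class MyConfig(datasets.BuilderConfig):
    def __init__(self, language="en", **kwargs):
        super().__init__(**kwargs)
        self.language = language

# Way 1: predefine instances inside your DatasetBuilder subclass, e.g.
#     BUILDER_CONFIG_CLASS = MyConfig
#     BUILDER_CONFIGS = [
#         MyConfig(name="en", language="en"),
#         MyConfig(name="fr", language="fr"),
#     ]
#
# Way 2: pass extra keyword arguments to load_dataset(); anything that is not
# a load_dataset() parameter is forwarded to the config, e.g.
#     load_dataset("username/my_dataset", language="fr")  # hypothetical dataset
```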

DatasetBuilder

DatasetBuilder accesses all the attributes inside BuilderConfig to build the actual dataset.

There are three main methods in DatasetBuilder:

  1. DatasetBuilder._info() is in charge of defining the dataset attributes. When you call dataset.info, 🌍 Datasets returns the information stored here. Likewise, the Features are also specified here. Remember, the Features are like the skeleton of the dataset, providing the names and types of each column.

  2. DatasetBuilder._split_generators downloads or retrieves the requested data files, organizes them into splits, and defines specific arguments for the generation process. This method has a DownloadManager that downloads files or fetches them from your local filesystem. Within the DownloadManager, there is a DownloadManager.download_and_extract() method that accepts a dictionary of URLs to the original data files and downloads the requested files. Accepted inputs include a single URL or path, or a list/dictionary of URLs or paths. Any compressed file types like TAR, GZIP and ZIP archives are automatically extracted.

     Once the files are downloaded, SplitGenerator organizes them into splits. The SplitGenerator contains the name of the split and any keyword arguments that are provided to the DatasetBuilder._generate_examples method. The keyword arguments can be specific to each split, and typically comprise at least the local path to the data files for each split.

  3. DatasetBuilder._generate_examples reads and parses the data files for a split. Then it yields dataset examples according to the format specified in the features from DatasetBuilder._info(). The input of DatasetBuilder._generate_examples is actually the filepath provided in the keyword arguments of the previous method.

The dataset is generated with a Python generator, which doesn't load all the data in memory. As a result, the generator can handle large datasets. However, before the generated samples are flushed to the dataset file on disk, they are stored in an ArrowWriter buffer, so the generated samples are written by batch. If your dataset samples consume a lot of memory (images or videos), then make sure to specify a low value for the DEFAULT_WRITER_BATCH_SIZE attribute in DatasetBuilder. We recommend not exceeding a size of 200 MB.
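
To make these three methods concrete, here is a minimal, hypothetical dataset script based on GeneratorBasedBuilder (a DatasetBuilder subclass for datasets generated from a Python generator); the URL, field names and labels are placeholders:

```python
import json
import datasets

_URL = "https://example.com/data/train.jsonl"  # hypothetical download URL

class MyDataset(datasets.GeneratorBasedBuilder):
    # Lower this if individual samples are large (e.g. images or videos)
    DEFAULT_WRITER_BATCH_SIZE = 256

    def _info(self):
        # Defines the dataset attributes, including the Features "skeleton"
        return datasets.DatasetInfo(
            description="A toy dataset used only for illustration.",
            features=datasets.Features(
                {
                    "text": datasets.Value("string"),
                    "label": datasets.ClassLabel(names=["neg", "pos"]),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # The DownloadManager fetches (and extracts) the raw files
        path = dl_manager.download_and_extract(_URL)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": path},  # forwarded to _generate_examples
            )
        ]

    def _generate_examples(self, filepath):
        # Reads the raw file and yields (key, example) pairs matching the Features
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                record = json.loads(line)
                yield idx, {"text": record["text"], "label": record["label"]}
```

A script like this would then be loaded by pointing load_dataset() at it (for example at the script file path), exactly as described in the ELI5 section above.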

Maintaining integrity

To ensure a dataset is complete, load_dataset() will perform a series of tests on the downloaded files to make sure everything is there. This way, you don't encounter any surprises when your requested dataset doesn't get generated as expected. load_dataset() verifies:

  • The number of splits in the generated DatasetDict.

  • The number of samples in each split of the generated DatasetDict.

  • The list of downloaded files.

  • The SHA256 checksums of the downloaded files (disabled by default).

If it is your own dataset, you'll need to recompute the information above and update the README.md file in your dataset repository. Take a look at this section to learn how to generate and update this metadata.

If the dataset doesn't pass the verifications, it is likely that the original host of the dataset made some changes in the data files. In this case, an error is raised to alert you that the dataset has changed. To ignore the error, one needs to specify verification_mode="no_checks" in load_dataset(). Anytime you see a verification error, feel free to open a discussion or pull request in the corresponding dataset "Community" tab, so that the integrity checks for that dataset are updated.
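
If you need to bypass these checks while a fix is pending, the override mentioned above looks like this; the dataset name is a placeholder:

```python
from datasets import load_dataset

# Skip split-size and checksum verification for a dataset whose upstream files changed;
# "username/dataset_name" is a hypothetical repository.
dataset = load_dataset("username/dataset_name", verification_mode="no_checks")
```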

Security

The dataset repositories on the Hub are scanned for malware; see more information here.

Moreover, the datasets without a namespace (originally contributed on our GitHub repository) have all been reviewed by our maintainers. The code of these datasets is considered safe. This concerns datasets that are not under a namespace, e.g. "squad" or "glue", unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".
