Load a dataset from the Hub

Finding high-quality datasets that are reproducible and accessible can be difficult. One of 🌍 Datasets' main goals is to provide a simple way to load a dataset of any format or type. The easiest way to get started is to discover an existing dataset on the BOINC AI Hub - a community-driven collection of datasets for tasks in NLP, computer vision, and audio - and use 🌍 Datasets to download and generate the dataset.

This tutorial uses the rotten_tomatoes and MInDS-14 datasets, but feel free to load any dataset you want and follow along. Head over to the Hub now and find a dataset for your task!

Load a dataset

Before you take the time to download a dataset, it's often helpful to quickly get some general information about it. A dataset's information is stored inside DatasetInfo and can include details such as the dataset description, features, and dataset size.

Use the load_dataset_builder() function to load a dataset builder and inspect a dataset's attributes without committing to downloading it:

>>> from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder("rotten_tomatoes")

# Inspect dataset description
>>> ds_builder.info.description
Movie Review Dataset. This is a dataset of containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.'', Proceedings of the ACL, 2005.

# Inspect dataset features
>>> ds_builder.info.features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}
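
Beyond the description and features, the same builder exposes other DatasetInfo attributes. As a minimal sketch (which attributes are populated can vary from dataset to dataset), the split names are usually available before any download:

>>> # Split metadata is read from the dataset's info, not from downloaded files
>>> list(ds_builder.info.splits)
['train', 'validation', 'test']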

If you're happy with the dataset, then load it with load_dataset():

>>> from datasets import load_dataset

>>> dataset = load_dataset("rotten_tomatoes", split="train")
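
The returned object is indexable like a list, so a quick sanity check is to look at the first row (the row contents here are elided):

>>> # Each row is a dict mapping feature names to values
>>> dataset[0]
{'text': '...', 'label': 1}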

Splits

A split is a specific subset of a dataset like train and test. List a dataset's split names with the get_dataset_split_names() function:

>>> from datasets import get_dataset_split_names

>>> get_dataset_split_names("rotten_tomatoes")
['train', 'validation', 'test']

Then you can load a specific split with the split parameter. Loading a dataset split returns a Dataset object:

>>> from datasets import load_dataset

>>> dataset = load_dataset("rotten_tomatoes", split="train")
>>> dataset
Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})
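
Since num_rows is just the number of rows in the split, the same figure is available through Python's built-in len():

>>> len(dataset)
8530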

If you don't specify a split, 🌍 Datasets returns a DatasetDict object instead:

>>> from datasets import load_dataset

>>> dataset = load_dataset("rotten_tomatoes")
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})
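
A DatasetDict is a dictionary keyed by split name, so you can still pull out an individual split after loading everything:

>>> dataset["train"]
Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})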

Configurations

Some datasets contain several sub-datasets. For example, the MInDS-14 dataset has several sub-datasets, each one containing audio data in a different language. These sub-datasets are known as configurations, and you must explicitly select one when loading the dataset. If you don't provide a configuration name, 🌍 Datasets will raise a ValueError and remind you to choose a configuration.

Use the get_dataset_config_names() function to retrieve a list of all the possible configurations available to your dataset:

>>> from datasets import get_dataset_config_names

>>> configs = get_dataset_config_names("PolyAI/minds14")
>>> print(configs)
['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN', 'all']

Then load the configuration you want:

>>> from datasets import load_dataset

>>> mindsFR = load_dataset("PolyAI/minds14", "fr-FR", split="train")
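
get_dataset_split_names() also accepts a configuration name, which is useful for checking what a particular sub-dataset provides before loading it. A minimal sketch (the split names shown are illustrative):

>>> from datasets import get_dataset_split_names

>>> get_dataset_split_names("PolyAI/minds14", "fr-FR")
['train']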
