Load a dataset from the Hub
Finding high-quality datasets that are reproducible and accessible can be difficult. One of the main goals of 🤗 Datasets is to provide a simple way to load a dataset of any format or type. The easiest way to get started is to discover an existing dataset on the Hugging Face Hub - a community-driven collection of datasets for tasks in NLP, computer vision, and audio - and use 🤗 Datasets to download and generate the dataset.
This tutorial uses two example datasets from the Hub, but feel free to load any dataset you want and follow along. Head over to the Hub now and find a dataset for your task!
Before you take the time to download a dataset, it's often helpful to quickly get some general information about it. A dataset's information is stored inside `DatasetInfo` and can include the dataset description, features, and dataset size.
Use the `load_dataset_builder()` function to load a dataset builder and inspect a dataset's attributes without committing to downloading it.
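As a minimal sketch of this inspection step, the helper below collects a few `DatasetInfo` fields from a dataset builder. The helper name `describe_dataset` and its `config_name` parameter are hypothetical and chosen for illustration. The library is imported inside the function because it is a heavy import, and the call itself needs network access to the Hub.

```python
def describe_dataset(path, config_name=None):
    """Return a few DatasetInfo fields without downloading the data files."""
    # Imported lazily: `datasets` is a heavy import and must be installed.
    from datasets import load_dataset_builder

    # Loads the builder and its metadata only; the data files are not fetched.
    builder = load_dataset_builder(path, name=config_name)
    info = builder.info  # a DatasetInfo object

    return {
        "description": info.description,
        "features": info.features,
        "dataset_size": info.dataset_size,
    }
```

Call it with the repository id of any dataset on the Hub, for example `describe_dataset("some_user/some_dataset")` (a placeholder id). Some fields may be empty or `None` when the dataset does not publish that metadata.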
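If the dataset has more than one configuration (configurations are described further below), pass the name of the configuration you want when you load it.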
If you're happy with the dataset, then load it with `load_dataset()`.
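One way to read "happy with the dataset" is that its download size fits a budget. The sketch below checks the reported size before loading. The helper name `load_if_small_enough` and the `max_bytes` threshold are assumptions made for illustration, not part of the library.

```python
def load_if_small_enough(path, max_bytes=1_000_000_000):
    """Load a dataset only if its reported download size fits the budget."""
    # Imported lazily: `datasets` is a heavy import and must be installed.
    from datasets import load_dataset, load_dataset_builder

    info = load_dataset_builder(path).info
    size = info.download_size  # may be None when no size metadata is published

    if size is not None and size > max_bytes:
        raise RuntimeError(
            f"{path} needs about {size} bytes, over the {max_bytes} byte budget"
        )

    # Downloads the data files and returns the loaded dataset.
    return load_dataset(path)
```

When the size is unknown, this sketch loads the dataset anyway; a stricter policy could refuse instead.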
A split is a specific subset of a dataset, like `train` and `test`. List a dataset's split names with the `get_dataset_split_names()` function.
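A short sketch of listing split names and choosing one follows. The helper name `first_available_split` and its default preference order are hypothetical. The call needs network access to the Hub.

```python
def first_available_split(path, preferred=("train", "validation", "test"), config_name=None):
    """Return the first preferred split that the dataset provides, or None."""
    # Imported lazily: `datasets` is a heavy import and must be installed.
    from datasets import get_dataset_split_names

    # Returns a list of split names, such as ["train", "test"].
    available = get_dataset_split_names(path, config_name=config_name)

    for name in preferred:
        if name in available:
            return name
    return None
```

Checking the split names first avoids requesting a split that the dataset does not define.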
Then you can load a specific split with the `split` parameter. Loading a dataset split returns a `Dataset` object.
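To illustrate loading a single split, the sketch below returns the row count and the first few rows of the resulting `Dataset`. The helper name `preview_split` and the parameter `n` are hypothetical.

```python
def preview_split(path, split="train", n=3):
    """Load one split and return its row count plus its first n rows."""
    # Imported lazily: `datasets` is a heavy import and must be installed.
    from datasets import load_dataset

    # Because `split` is given, the result is a single Dataset object.
    dataset = load_dataset(path, split=split)

    # Indexing a Dataset with an integer returns one row as a dict.
    count = min(n, dataset.num_rows)
    rows = [dataset[i] for i in range(count)]
    return dataset.num_rows, rows
```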
If you don't specify a `split`, 🤗 Datasets returns a `DatasetDict` object instead.
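A `DatasetDict` behaves like a dictionary that maps split names to `Dataset` objects. The sketch below loads every split and reports the number of rows in each. The helper name `rows_per_split` is hypothetical.

```python
def rows_per_split(path):
    """Load every split of a dataset and count the rows in each one."""
    # Imported lazily: `datasets` is a heavy import and must be installed.
    from datasets import load_dataset

    # No `split` argument, so the result maps split names to Dataset objects.
    dataset_dict = load_dataset(path)

    return {name: split.num_rows for name, split in dataset_dict.items()}
```

An individual split can then be taken from the loaded object by key, for example `dataset_dict["train"]`, provided the dataset defines that split.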
Some datasets contain several sub-datasets. For example, a multilingual speech dataset can have several sub-datasets, each one containing audio data in a different language. These sub-datasets are known as configurations, and you must explicitly select one when loading the dataset. If you don't provide a configuration name, 🤗 Datasets will raise a `ValueError` and remind you to choose a configuration.
Use the `get_dataset_config_names()` function to retrieve a list of all the possible configurations available to your dataset.
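Once you know the configuration names, pass the one you want as the `name` argument of `load_dataset()`. The sketch below combines both steps and validates the requested name first. The helper name `load_configuration` is hypothetical, and any repository id or configuration name passed to it is up to the caller.

```python
def load_configuration(path, config_name, split=None):
    """Load one configuration of a dataset after checking that it exists."""
    # Imported lazily: `datasets` is a heavy import and must be installed.
    from datasets import get_dataset_config_names, load_dataset

    # Returns a list of the configuration names the dataset defines.
    available = get_dataset_config_names(path)
    if config_name not in available:
        raise ValueError(
            f"Unknown configuration {config_name!r} for {path}; choose one of {available}"
        )

    # `name` selects the configuration; `split` is optional, as before.
    return load_dataset(path, name=config_name, split=split)
```

Validating the name up front gives an error that lists the valid choices before any data is downloaded.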