Load

Your data can be stored in various places: on your local machine’s disk, in a GitHub repository, or in in-memory data structures like Python dictionaries and Pandas DataFrames. Wherever a dataset is stored, 🌍 Datasets can help you load it.

This guide will show you how to load a dataset from:

  • The Hub without a dataset loading script

  • Local loading script

  • Local files

  • In-memory data

  • Offline

  • A specific slice of a split

For more details specific to loading other dataset modalities, take a look at the load audio dataset guide, the load image dataset guide, or the load text dataset guide.

BOINC AI Hub

Datasets are loaded from a dataset loading script that downloads and generates the dataset. However, you can also load a dataset from any dataset repository on the Hub without a loading script! Begin by creating a dataset repository and uploading your data files. Now you can use the load_dataset() function to load the dataset.

Refer to the Upload a dataset to the Hub tutorial for more details on how to create a dataset repository on the Hub, and how to upload your data files.

For example, try loading the files from this demo repository by providing the repository namespace and dataset name. This dataset repository contains CSV files, and the code below loads the dataset from the CSV files:

>>> from datasets import load_dataset
>>> dataset = load_dataset("lhoestq/demo1")

Some datasets may have more than one version based on Git tags, branches, or commits. Use the revision parameter to specify the dataset version you want to load:

>>> dataset = load_dataset(
...   "lhoestq/custom_squad",
...   revision="main"  # tag name, or branch name, or commit hash
... )

A dataset without a loading script by default loads all the data into the train split. Use the data_files parameter to map data files to splits like train, validation and test:

>>> data_files = {"train": "train.csv", "test": "test.csv"}
>>> dataset = load_dataset("namespace/your_dataset_name", data_files=data_files)

You can also load a specific subset of the files with the data_files or data_dir parameter. These parameters can accept a relative path which resolves to the base path corresponding to where the dataset is loaded from.

If you don’t specify which data files to use, load_dataset() will return all the data files. This can take a long time if you load a large dataset like C4, which is approximately 13TB of data.

>>> from datasets import load_dataset

# load files that match the glob pattern
>>> c4_subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")

# load dataset from the en directory on the Hub
>>> c4_subset = load_dataset("allenai/c4", data_dir="en")

The split parameter can also map a data file to a specific split:

>>> data_files = {"validation": "en/c4-validation.*.json.gz"}
>>> c4_validation = load_dataset("allenai/c4", data_files=data_files, split="validation")

Local loading script

You may have a 🌍 Datasets loading script locally on your computer. In this case, load the dataset by passing one of the following paths to load_dataset():

  • The local path to the loading script file.

  • The local path to the directory containing the loading script file (only if the script file has the same name as the directory).

>>> dataset = load_dataset("path/to/local/loading_script/loading_script.py", split="train")
>>> dataset = load_dataset("path/to/local/loading_script", split="train")  # equivalent because the file has the same name as the directory

Edit loading script

You can also edit a loading script from the Hub to add your own modifications. Download the dataset repository locally so any data files referenced by a relative path in the loading script can be loaded:

git clone https://huggingface.co/datasets/eli5

Make your edits to the loading script and then load it by passing its local path to load_dataset():

>>> from datasets import load_dataset
>>> eli5 = load_dataset("path/to/local/eli5")

Local and remote files

Datasets can be loaded from local files stored on your computer and from remote files. The datasets are most likely stored as a CSV, JSON, TXT, or Parquet file. The load_dataset() function can load each of these file types.

CSV

🌍 Datasets can read a dataset made up of one or several CSV files (in this case, pass your CSV files as a list). For more details, check out the how to load tabular datasets from CSV files guide:

>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="my_file.csv")

JSON

JSON files are loaded directly with load_dataset() as shown below:

>>> from datasets import load_dataset
>>> dataset = load_dataset("json", data_files="my_file.json")

JSON files have diverse formats, but the most efficient format to load is JSON Lines, where each line is a separate JSON object representing an individual row of data. For example:

{"a": 1, "b": 2.0, "c": "foo", "d": false}
{"a": 4, "b": -5.5, "c": null, "d": true}

Another JSON format you may encounter is a nested field, in which case you’ll need to specify the field argument as shown in the following:

{"version": "0.1.0",
 "data": [{"a": 1, "b": 2.0, "c": "foo", "d": false},
          {"a": 4, "b": -5.5, "c": null, "d": true}]
}

>>> from datasets import load_dataset
>>> dataset = load_dataset("json", data_files="my_file.json", field="data")

To load remote JSON files via HTTP, pass the URLs instead:

>>> base_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/"
>>> dataset = load_dataset("json", data_files={"train": base_url + "train-v1.1.json", "validation": base_url + "dev-v1.1.json"}, field="data")

While these are the most common JSON formats, you’ll see other datasets that are formatted differently. 🌍 Datasets recognizes these other formats and will fall back accordingly to the Python JSON loading methods to handle them.

Parquet

Parquet files are stored in a columnar format, unlike row-based files like CSV. Large datasets may be stored in Parquet files because it is more efficient and faster at returning your queries.

To load a Parquet file:

>>> from datasets import load_dataset
>>> dataset = load_dataset("parquet", data_files={'train': 'train.parquet', 'test': 'test.parquet'})

To load remote Parquet files via HTTP, pass the URLs instead:

>>> base_url = "https://storage.googleapis.com/huggingface-nlp/cache/datasets/wikipedia/20200501.en/1.0.0/"
>>> data_files = {"train": base_url + "wikipedia-train.parquet"}
>>> wiki = load_dataset("parquet", data_files=data_files, split="train")

Arrow

Arrow files are stored in an in-memory columnar format, unlike row-based formats like CSV and compressed columnar formats like Parquet.

To load an Arrow file:

>>> from datasets import load_dataset
>>> dataset = load_dataset("arrow", data_files={'train': 'train.arrow', 'test': 'test.arrow'})

To load remote Arrow files via HTTP, pass the URLs instead:

>>> base_url = "https://storage.googleapis.com/huggingface-nlp/cache/datasets/wikipedia/20200501.en/1.0.0/"
>>> data_files = {"train": base_url + "wikipedia-train.arrow"}
>>> wiki = load_dataset("arrow", data_files=data_files, split="train")

Arrow is the file format used by 🌍 Datasets under the hood, therefore you can load a local Arrow file using Dataset.from_file() directly:

>>> from datasets import Dataset
>>> dataset = Dataset.from_file("data.arrow")

Unlike load_dataset(), Dataset.from_file() memory maps the Arrow file without preparing the dataset in the cache, saving you disk space. The cache directory to store intermediate processing results will be the Arrow file directory in that case.

For now only the Arrow streaming format is supported. The Arrow IPC file format (also known as Feather V2) is not supported.

SQL

Read database contents with from_sql() by specifying the URI to connect to your database. You can read both table names and queries. For more details, check out the how to load tabular datasets from SQL databases guide:

>>> from datasets import Dataset
# load entire table
>>> dataset = Dataset.from_sql("data_table_name", con="sqlite:///sqlite_file.db")
# load from query
>>> dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con="sqlite:///sqlite_file.db")

Multiprocessing

When a dataset is made of several files (that we call “shards”), it is possible to significantly speed up the dataset downloading and preparation step.

You can choose how many processes you’d like to use to prepare a dataset in parallel using num_proc. In this case, each process is given a subset of shards to prepare:

from datasets import load_dataset

oscar_afrikaans = load_dataset("oscar-corpus/OSCAR-2201", "af", num_proc=8)
imagenet = load_dataset("imagenet-1k", num_proc=8)
ml_librispeech_spanish = load_dataset("facebook/multilingual_librispeech", "spanish", num_proc=8)

In-memory data

🌍 Datasets will also allow you to create a Dataset directly from in-memory data structures like Python dictionaries and Pandas DataFrames.

Python dictionary

Load Python dictionaries with from_dict():

>>> from datasets import Dataset
>>> my_dict = {"a": [1, 2, 3]}
>>> dataset = Dataset.from_dict(my_dict)

Python list of dictionaries

Load a list of Python dictionaries with from_list():

>>> from datasets import Dataset
>>> my_list = [{"a": 1}, {"a": 2}, {"a": 3}]
>>> dataset = Dataset.from_list(my_list)

Python generator

Create a dataset from a Python generator with from_generator():

>>> from datasets import Dataset
>>> def my_gen():
...     for i in range(1, 4):
...         yield {"a": i}
...
>>> dataset = Dataset.from_generator(my_gen)

This approach supports loading data larger than available memory.

You can also define a sharded dataset by passing lists to gen_kwargs:

>>> from datasets import IterableDataset
>>> def gen(shards):
...     for shard in shards:
...         with open(shard) as f:
...             for line in f:
...                 yield {"line": line}
...
>>> shards = [f"data{i}.txt" for i in range(32)]
>>> ds = IterableDataset.from_generator(gen, gen_kwargs={"shards": shards})
>>> ds = ds.shuffle(seed=42, buffer_size=10_000)  # shuffles the shards order + uses a shuffle buffer
>>> from torch.utils.data import DataLoader
>>> dataloader = DataLoader(ds.with_format("torch"), num_workers=4)  # give each worker a subset of 32/4=8 shards

Pandas DataFrame

Load Pandas DataFrames with from_pandas():

>>> from datasets import Dataset
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>> dataset = Dataset.from_pandas(df)

For more details, check out the how to load tabular datasets from Pandas DataFrames guide.

Offline

Even if you don’t have an internet connection, it is still possible to load a dataset. As long as you’ve downloaded a dataset from the Hub repository before, it should be cached. This means you can reload the dataset from the cache and use it offline.

If you know you won’t have internet access, you can run 🌍 Datasets in full offline mode. This saves time because instead of waiting for the Dataset builder download to time out, 🌍 Datasets will look directly in the cache. Set the environment variable HF_DATASETS_OFFLINE to 1 to enable full offline mode.
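For example, you could export HF_DATASETS_OFFLINE=1 in your shell before launching your script, or set it from Python before importing the library. A minimal sketch, assuming the dataset was already downloaded and cached earlier (the repository name reuses the demo example from above):

>>> import os
>>> os.environ["HF_DATASETS_OFFLINE"] = "1"  # must be set before the datasets library is imported

>>> from datasets import load_dataset
>>> dataset = load_dataset("lhoestq/demo1")  # resolves from the local cache instead of the network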

Slice splits

You can also choose to load only specific slices of a split. There are two options for slicing a split: using strings or the ReadInstruction API. Strings are more compact and readable for simple cases, while ReadInstruction is easier to use with variable slicing parameters.

Concatenate a train and test split by:

>>> train_test_ds = datasets.load_dataset("bookcorpus", split="train+test")

Select specific rows of the train split:

>>> train_10_20_ds = datasets.load_dataset("bookcorpus", split="train[10:20]")

Or select a percentage of a split with:

>>> train_10pct_ds = datasets.load_dataset("bookcorpus", split="train[:10%]")

Select a combination of percentages from each split:

>>> train_10_80pct_ds = datasets.load_dataset("bookcorpus", split="train[:10%]+train[-80%:]")
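The same instructions can be written with the ReadInstruction API, the class used in the rounding examples further below. A rough sketch of the equivalents, following the from_/to/unit arguments shown there:

>>> import datasets

# concatenate the train and test splits
>>> ri = datasets.ReadInstruction("train") + datasets.ReadInstruction("test")
>>> train_test_ds = datasets.load_dataset("bookcorpus", split=ri)

# rows 10 (included) to 20 (excluded) of the train split
>>> train_10_20_ds = datasets.load_dataset("bookcorpus", split=datasets.ReadInstruction("train", from_=10, to=20, unit="abs"))

# the first 10% of the train split
>>> train_10pct_ds = datasets.load_dataset("bookcorpus", split=datasets.ReadInstruction("train", to=10, unit="%"))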

Finally, you can even create cross-validated splits. The example below creates 10-fold cross-validated splits. Each validation dataset is a 10% chunk, and the training dataset makes up the remaining complementary 90% chunk:

>>> val_ds = datasets.load_dataset("bookcorpus", split=[f"train[{k}%:{k+10}%]" for k in range(0, 100, 10)])
>>> train_ds = datasets.load_dataset("bookcorpus", split=[f"train[:{k}%]+train[{k+10}%:]" for k in range(0, 100, 10)])

Percent slicing and rounding

The default behavior is to round the boundaries to the nearest integer for datasets where the requested slice boundaries do not divide evenly by 100. As shown below, some slices may contain more examples than others. For instance, if the following train split includes 999 records, then:

# 19 records, from 500 (included) to 519 (excluded).
>>> train_50_52_ds = datasets.load_dataset("bookcorpus", split="train[50%:52%]")
# 20 records, from 519 (included) to 539 (excluded).
>>> train_52_54_ds = datasets.load_dataset("bookcorpus", split="train[52%:54%]")

If you want equal sized splits, use pct1_dropremainder rounding instead. This treats the specified percentage boundaries as multiples of 1%.

# 18 records, from 450 (included) to 468 (excluded).
>>> train_50_52pct1_ds = datasets.load_dataset("bookcorpus", split=datasets.ReadInstruction("train", from_=50, to=52, unit="%", rounding="pct1_dropremainder"))
# 18 records, from 468 (included) to 486 (excluded).
>>> train_52_54pct1_ds = datasets.load_dataset("bookcorpus", split=datasets.ReadInstruction("train", from_=52, to=54, unit="%", rounding="pct1_dropremainder"))
# Or equivalently:
>>> train_50_52pct1_ds = datasets.load_dataset("bookcorpus", split="train[50%:52%](pct1_dropremainder)")
>>> train_52_54pct1_ds = datasets.load_dataset("bookcorpus", split="train[52%:54%](pct1_dropremainder)")

pct1_dropremainder rounding may truncate the last examples in a dataset if the number of examples in your dataset doesn’t divide evenly by 100.

Troubleshooting

Sometimes, you may get unexpected results when you load a dataset. Two of the most common issues you may encounter are manually downloading a dataset and specifying features of a dataset.

Manual download

Certain datasets require you to manually download the dataset files due to licensing incompatibility, or if the files are hidden behind a login page. This causes load_dataset() to throw an AssertionError. But 🌍 Datasets provides detailed instructions for downloading the missing files. After you’ve downloaded the files, use the data_dir argument to specify the path to the files you just downloaded.

For example, if you try to download a configuration from the MATINF dataset:

>>> dataset = load_dataset("matinf", "summarization")
Downloading and preparing dataset matinf/summarization (download: Unknown size, generated: 246.89 MiB, post-processed: Unknown size, total: 246.89 MiB) to /root/.cache/huggingface/datasets/matinf/summarization/1.0.0/82eee5e71c3ceaf20d909bca36ff237452b4e4ab195d3be7ee1c78b53e6f540e...
AssertionError: The dataset matinf with config summarization requires manual data. 
Please follow the manual download instructions: To use MATINF you have to download it manually. Please fill this google form (https://forms.gle/nkH4LVE4iNQeDzsc9). You will receive a download link and a password once you complete the form. Please extract all files in one folder and load the dataset with: *datasets.load_dataset('matinf', data_dir='path/to/folder/folder_name')*. 
Manual data can be loaded with `datasets.load_dataset(matinf, data_dir='<path/to/manual/data>') 
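After completing the form and extracting the files, point load_dataset() at the folder with data_dir, as the error message suggests. A short sketch with a placeholder path:

>>> from datasets import load_dataset
>>> dataset = load_dataset("matinf", "summarization", data_dir="path/to/matinf/folder")  # placeholder path to the manually downloaded files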

If you’ve already downloaded a dataset from the Hub with a loading script to your computer, then you need to pass an absolute path to the data_dir or data_files parameter to load that dataset. Otherwise, if you pass a relative path, load_dataset() will load the directory from the repository on the Hub instead of the local directory.

Specify features

When you create a dataset from local files, the Features are automatically inferred by Apache Arrow. However, the dataset’s features may not always align with your expectations, or you may want to define the features yourself. The following example shows how you can add custom labels with the ClassLabel feature.

Start by defining your own labels with the Features class:

>>> from datasets import Features, Value, ClassLabel
>>> class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
>>> emotion_features = Features({'text': Value('string'), 'label': ClassLabel(names=class_names)})

Next, specify the features parameter in load_dataset() with the features you just created:

>>> # file_dict maps split names to your local CSV files, e.g. {"train": "train.csv"} (hypothetical paths)
>>> dataset = load_dataset('csv', data_files=file_dict, delimiter=';', column_names=['text', 'label'], features=emotion_features)

Now when you look at your dataset features, you can see it uses the custom labels you defined:

>>> dataset['train'].features
{'text': Value(dtype='string', id=None),
'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], names_file=None, id=None)}

Metrics

Metrics is deprecated in 🌍 Datasets. To learn more about how to use metrics, take a look at the 🌍 Evaluate library! In addition to metrics, you can find more tools for evaluating models and datasets.

When the metric you want to use is not supported by 🌍 Datasets, you can write and use your own metric script. Load your metric by providing the path to your local metric loading script:

>>> from datasets import load_metric
>>> metric = load_metric('PATH/TO/MY/METRIC/SCRIPT')

>>> # Example of typical usage
>>> for batch in dataset:
...     inputs, references = batch
...     predictions = model(inputs)
...     metric.add_batch(predictions=predictions, references=references)
>>> score = metric.compute()

See the Metrics guide for more details on how to write your own metric loading script.

Load configurations

It is possible for a metric to have different configurations. The configurations are stored in the config_name parameter in the MetricInfo attribute. When you load a metric, provide the configuration name as shown in the following:

>>> from datasets import load_metric
>>> metric = load_metric('bleurt', name='bleurt-base-128')
>>> metric = load_metric('bleurt', name='bleurt-base-512')

Distributed setup

When working in a distributed or parallel processing environment, loading and computing a metric can be tricky because these processes are executed in parallel on separate subsets of the data. 🌍 Datasets supports distributed usage with a few additional arguments when you load a metric.

For example, imagine you are training and evaluating on eight parallel processes. Here’s how you would load a metric in this distributed setting:

  1. Define the total number of processes with the num_process argument.

  2. Set the process rank as an integer between zero and num_process - 1.

Load your metric with load_metric() using these arguments:

>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc', num_process=num_process, process_id=rank)

In some instances, you may be simultaneously running multiple independent distributed evaluations on the same server and files. To avoid any conflicts, it is important to provide an experiment_id to distinguish the separate evaluations:

>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc', num_process=num_process, process_id=process_id, experiment_id="My_experiment_10")

Once you’ve loaded a metric for distributed usage, you can compute the metric as usual. Behind the scenes, Metric.compute() gathers all the predictions and references from the nodes and computes the final metric.
