Know your dataset

There are two types of dataset objects: a regular Dataset and an ✨ IterableDataset ✨. A Dataset provides fast random access to its rows and uses memory-mapping, so loading even large datasets uses only a relatively small amount of device memory. But for really, really big datasets that won’t fit on disk or in memory, an IterableDataset lets you access and use the dataset without waiting for it to download completely!

This tutorial will show you how to load and access a Dataset and an IterableDataset.

Dataset

When you load a dataset split, you’ll get a Dataset object. You can do many things with a Dataset object, which is why it’s important to learn how to manipulate and interact with the data stored inside.

This tutorial uses the rotten_tomatoes dataset, but feel free to load any dataset you’d like and follow along!

>>> from datasets import load_dataset

>>> dataset = load_dataset("rotten_tomatoes", split="train")

Indexing

A Dataset contains columns of data, and each column can be a different type of data. The index, or axis label, is used to access examples from the dataset. For example, indexing by the row returns a dictionary of an example from the dataset:

# Get the first row in the dataset
>>> dataset[0]
{'label': 1,
 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}

Use a negative index (the - operator) to count from the end of the dataset:

Indexing by the column name returns a list of all the values in the column:

You can combine row and column name indexing to return a specific value at a position:

But it is important to remember that indexing order matters, especially when working with large audio and image datasets. Indexing by the column name returns all the values in the column first, then loads the value at that position. For large datasets, it may be slower to index by the column name first.

Slicing

Slicing returns a slice - or subset - of the dataset, which is useful for viewing several rows at once. To slice a dataset, use the : operator to specify a range of positions.

IterableDataset

An IterableDataset is loaded when you set the streaming parameter to True in load_dataset():

You can also create an IterableDataset from an existing Dataset. This is faster than streaming mode because the data is read from local files:

An IterableDataset progressively iterates over a dataset one example at a time, so you don’t have to wait for the whole dataset to download before you can use it. As you can imagine, this is quite useful for large datasets you want to use immediately!

However, this means an IterableDataset behaves differently from a regular Dataset. You don’t get random access to examples in an IterableDataset. Instead, you iterate over its elements, for example by calling next(iter(dataset)) or by using a for loop, to return the next item from the IterableDataset:

You can return a subset of the dataset with a specific number of examples in it with IterableDataset.take():

But unlike slicing, IterableDataset.take() creates a new IterableDataset.

Next steps

Interested in learning more about the differences between these two types of datasets? Learn more about them in the Differences between Dataset and IterableDataset conceptual guide.

To get more hands-on with these dataset types, check out the Process guide to learn how to preprocess a Dataset or the Stream guide to learn how to preprocess an IterableDataset.
