Know your dataset

There are two types of dataset objects, a regular Dataset and then an ✨ IterableDataset ✨. A Dataset provides fast random access to the rows, and memory-mapping so that loading even large datasets only uses a relatively small amount of device memory. But for really, really big datasets that won't even fit on disk or in memory, an IterableDataset allows you to access and use the dataset without waiting for it to download completely!

This tutorial will show you how to load and access a Dataset and an IterableDataset.

Dataset

When you load a dataset split, you'll get a Dataset object. You can do many things with a Dataset object, which is why it's important to learn how to manipulate and interact with the data stored inside.

This tutorial uses the rotten_tomatoes dataset, but feel free to load any dataset you'd like and follow along!

>>> from datasets import load_dataset

>>> dataset = load_dataset("rotten_tomatoes", split="train")
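>>> # (Illustrative) the split is downloaded once, stored as an Arrow file, and memory-mapped,
>>> # so even large datasets use little RAM; cache_files shows the backing file on disk
>>> dataset.cache_files  # exact path is machine-specific
[{'filename': '.../huggingface/datasets/rotten_tomatoes/.../rotten_tomatoes-train.arrow'}]
>>> dataset.num_rows
8530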

Indexing

A Dataset contains columns of data, and each column can be a different type of data. The index, or axis label, is used to access examples from the dataset. For example, indexing by the row returns a dictionary of an example from the dataset:

# Get the first row in the dataset
>>> dataset[0]
{'label': 1,
 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}

Use the - operator to start from the end of the dataset:

# Get the last row in the dataset
>>> dataset[-1]
{'label': 0,
 'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .'}

Indexing by the column name returns a list of all the values in the column:

>>> dataset["text"]
['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
 'effective but too-tepid biopic',
 ...,
 'things really get weird , though not particularly scary : the movie is all portent and no content .']

You can combine row and column name indexing to return a specific value at a position:

>>> dataset[0]["text"]
'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'

But it is important to remember that indexing order matters, especially when working with large audio and image datasets. Indexing by the column name returns all the values in the column first, then loads the value at that position. For large datasets, it may be slower to index by the column name first.

>>> from codetiming import Timer  # any simple timing context manager works; codetiming is one option

>>> with Timer():
...     dataset[0]['text']
Elapsed time: 0.0031 seconds

>>> with Timer():
...     dataset["text"][0]
Elapsed time: 0.0094 seconds

Slicing

Slicing returns a slice - or subset - of the dataset, which is useful for viewing several rows at once. To slice a dataset, use the : operator to specify a range of positions.

# Get the first three rows
>>> dataset[:3]
{'label': [1, 1, 1],
 'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'effective but too-tepid biopic']}

# Get rows between three and six
>>> dataset[3:6]
{'label': [1, 1, 1],
 'text': ['if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
  "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",
  'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .']}
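>>> # (Illustrative) a slice is returned as a plain dict of columns, so you can also
>>> # pull out a single column of the slice by name
>>> dataset[3:6]["text"]
['if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
 "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",
 'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .']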

IterableDataset

An IterableDataset is loaded when you set the streaming parameter to True in load_dataset():

>>> from datasets import load_dataset

>>> iterable_dataset = load_dataset("food101", split="train", streaming=True)
>>> for example in iterable_dataset:
...     print(example)
...     break
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F5C520>, 'label': 6}

You can also create an IterableDataset from an existing Dataset with to_iterable_dataset(); this is faster than streaming mode because the dataset is streamed from local files:

>>> from datasets import load_dataset

>>> dataset = load_dataset("rotten_tomatoes", split="train")
>>> iterable_dataset = dataset.to_iterable_dataset()
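>>> # (Illustrative) to_iterable_dataset() also accepts a num_shards argument, which splits
>>> # the dataset into shards so several dataloader workers can stream it in parallel
>>> sharded_iterable_dataset = dataset.to_iterable_dataset(num_shards=64)
>>> sharded_iterable_dataset.n_shards
64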

An IterableDataset progressively iterates over a dataset one example at a time, so you don't have to wait for the whole dataset to download before you can use it. As you can imagine, this is quite useful for large datasets you want to use immediately!

However, this means an IterableDataset's behavior is different from a regular Dataset. You don't get random access to examples in an IterableDataset. Instead, you should iterate over its elements, for example, by calling next(iter()) or with a for loop to return the next item from the IterableDataset:

>>> next(iter(iterable_dataset))
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F59B50>,
 'label': 6}

>>> for example in iterable_dataset:
...     print(example)
...     break
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F7479DE82B0>, 'label': 6}

You can return a subset of the dataset with a specific number of examples in it with IterableDataset.take():

# Get first three examples
>>> list(iterable_dataset.take(3))
[{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F7479DEE9D0>,
  'label': 6},
 {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F7479DE8190>,
  'label': 6},
 {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x383 at 0x7F7479DE8310>,
  'label': 6}]

But unlike slicing, IterableDataset.take() creates a new IterableDataset.
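For example, you can carve a small evaluation subset off the front of the stream with take() and keep the rest for training with its companion IterableDataset.skip() (a minimal sketch; the split size of 1,000 is arbitrary):

>>> # Both calls return new IterableDatasets and leave the original untouched
>>> eval_subset = iterable_dataset.take(1000)
>>> train_subset = iterable_dataset.skip(1000)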

Next steps

Interested in learning more about the differences between these two types of datasets? Learn more about them in the Differences between Dataset and IterableDataset conceptual guide.

To get more hands-on with these dataset types, check out the Process guide to learn how to preprocess a Dataset or the Stream guide to learn how to preprocess an IterableDataset.
