Stream

Dataset streaming lets you work with a dataset without downloading it. The data is streamed as you iterate over the dataset. This is especially helpful when:

  • You don’t want to wait for an extremely large dataset to download.

  • The dataset size exceeds the amount of available disk space on your computer.

  • You want to quickly explore just a few samples of a dataset.

For example, the English split of the oscar-corpus/OSCAR-2201 dataset is 1.2 terabytes, but you can use it instantly with streaming. Stream a dataset by setting streaming=True in load_dataset() as shown below:

>>> from datasets import load_dataset
>>> dataset = load_dataset('oscar-corpus/OSCAR-2201', 'en', split='train', streaming=True)
>>> print(next(iter(dataset)))
{'id': 0, 'text': 'Founded in 2015, Golden Bees is a leading programmatic recruitment platform dedicated to employers, HR agencies and job boards. The company has developed unique HR-custom technologies and predictive algorithms to identify and attract the best candidates for a job opportunity.', ...

Dataset streaming also lets you work with a dataset made of local files without doing any conversion. In this case, the data is streamed from the local files as you iterate over the dataset. This is especially helpful when:

  • You don’t want to wait for an extremely large local dataset to be converted to Arrow.

  • The size of the converted files would exceed the amount of available disk space on your computer.

  • You want to quickly explore just a few samples of a dataset.

For example, you can stream a local dataset of hundreds of compressed JSONL files like oscar-corpus/OSCAR-2201 to use it instantly:

>>> from datasets import load_dataset
>>> data_files = {'train': 'path/to/OSCAR-2201/compressed/en_meta/*.jsonl.gz'}
>>> dataset = load_dataset('json', data_files=data_files, split='train', streaming=True)
>>> print(next(iter(dataset)))
{'id': 0, 'text': 'Founded in 2015, Golden Bees is a leading programmatic recruitment platform dedicated to employers, HR agencies and job boards. The company has developed unique HR-custom technologies and predictive algorithms to identify and attract the best candidates for a job opportunity.', ...

Loading a dataset in streaming mode creates a new dataset type instance (instead of the classic Dataset object), known as an IterableDataset. This special type of dataset has its own set of processing methods, shown below.

An IterableDataset is useful for iterative jobs like training a model. You shouldn't use an IterableDataset for jobs that require random access to examples, because you have to iterate over it with a for loop. Getting the last example in an iterable dataset requires you to iterate over all the previous examples. You can find more details in the Dataset vs. IterableDataset guide.

Convert from a Dataset

If you have an existing Dataset object, you can convert it to an IterableDataset with the to_iterable_dataset() function. This is actually faster than setting the streaming=True argument in load_dataset(), because the data is streamed from local files.

>>> from datasets import load_dataset

# faster πŸ‡
>>> dataset = load_dataset("food101")
>>> iterable_dataset = dataset.to_iterable_dataset()

# slower 🐒
>>> iterable_dataset = load_dataset("food101", streaming=True)

The to_iterable_dataset() function supports sharding when the IterableDataset is instantiated. This is useful when you are working with a big dataset and want to shuffle it or enable fast parallel loading with a PyTorch DataLoader.

>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("food101")
>>> iterable_dataset = dataset.to_iterable_dataset(num_shards=64) # shard the dataset
>>> iterable_dataset = iterable_dataset.shuffle(buffer_size=10_000)  # shuffles the shards order and use a shuffle buffer when you start iterating
>>> dataloader = torch.utils.data.DataLoader(iterable_dataset, num_workers=4)  # assigns 64 / 4 = 16 shards from the shuffled list of shards to each worker when you start iterating

Shuffle

Like a regular Dataset object, you can also shuffle an IterableDataset with IterableDataset.shuffle().

The buffer_size argument controls the size of the buffer to randomly sample examples from. Let's say your dataset has one million examples, and you set the buffer_size to ten thousand. IterableDataset.shuffle() will randomly select examples from the first ten thousand examples in the buffer. Selected examples in the buffer are replaced with new examples. By default, the buffer size is 1,000.

>>> from datasets import load_dataset
>>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
>>> shuffled_dataset = dataset.shuffle(seed=42, buffer_size=10_000)

IterableDataset.shuffle() will also shuffle the order of the shards if the dataset is sharded into multiple files.

Reshuffle

Sometimes you may want to reshuffle the dataset after each epoch. This will require you to set a different seed for each epoch. Use IterableDataset.set_epoch() in between epochs to tell the dataset what epoch you’re on.

Your seed effectively becomes: initial seed + current epoch.

>>> for epoch in range(epochs):
...     shuffled_dataset.set_epoch(epoch)
...     for example in shuffled_dataset:
...         ...

Split dataset

You can split your dataset one of two ways:

IterableDataset.take() returns the first n examples in a dataset:

>>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
>>> dataset_head = dataset.take(2)
>>> list(dataset_head)
[{'id': 0, 'text': 'Mtendere Village was...'}, {'id': 1, 'text': 'Lily James cannot fight the music...'}]

IterableDataset.skip() omits the first n examples in a dataset and returns the remaining examples:

>>> train_dataset = shuffled_dataset.skip(1000)

take and skip prevent future calls to shuffle because they lock in the order of the shards. You should shuffle your dataset before splitting it.
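
For instance, here is a minimal sketch of carving held-out and training splits out of the shuffled stream. The 1,000-example cut-off is an arbitrary choice for illustration; shuffle first, then use the same count for take and skip so the two splits don't overlap:

>>> from datasets import load_dataset
>>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
>>> shuffled_dataset = dataset.shuffle(seed=42, buffer_size=10_000)  # shuffle before splitting
>>> validation_dataset = shuffled_dataset.take(1000)  # first 1,000 examples
>>> train_dataset = shuffled_dataset.skip(1000)       # everything after the first 1,000 examples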

Interleave

interleave_datasets() can combine an IterableDataset with other datasets. The combined dataset returns alternating examples from each of the original datasets.

>>> from datasets import interleave_datasets
>>> en_dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
>>> fr_dataset = load_dataset('oscar', "unshuffled_deduplicated_fr", split='train', streaming=True)

>>> multilingual_dataset = interleave_datasets([en_dataset, fr_dataset])
>>> list(multilingual_dataset.take(2))
[{'text': 'Mtendere Village was inspired by the vision...'}, {'text': "MΓ©dia de dΓ©bat d'idΓ©es, de culture et de littΓ©rature..."}]

Define sampling probabilities from each of the original datasets for more control over how each of them is sampled and combined. Set the probabilities argument with your desired sampling probabilities:

>>> multilingual_dataset_with_oversampling = interleave_datasets([en_dataset, fr_dataset], probabilities=[0.8, 0.2], seed=42)
>>> list(multilingual_dataset_with_oversampling.take(2))
[{'text': 'Mtendere Village was inspired by the vision...'}, {'text': 'Lily James cannot fight the music...'}]

Around 80% of the final dataset is made of the en_dataset, and 20% of the fr_dataset.

You can also specify the stopping_strategy. The default strategy, first_exhausted, is a subsampling strategy: the dataset construction stops as soon as one of the datasets runs out of samples. You can specify stopping_strategy=all_exhausted to execute an oversampling strategy instead. In this case, the dataset construction stops once every sample in every dataset has been added at least once. In practice, this means that if one dataset is exhausted, it restarts from its beginning until the stop criterion is reached. Note that if no sampling probabilities are specified, the new dataset will have max_length_datasets * nb_dataset samples.
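
For example, here is a minimal sketch of the oversampling strategy, reusing the en_dataset and fr_dataset defined above:

>>> from datasets import interleave_datasets
>>> # with all_exhausted, the shorter dataset is cycled until every example
>>> # from every dataset has appeared at least once
>>> multilingual_dataset = interleave_datasets([en_dataset, fr_dataset], stopping_strategy="all_exhausted")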

Rename, remove, and cast

The following methods allow you to modify the columns of a dataset. These methods are useful for renaming or removing columns and changing columns to a new set of features.

Rename

Use IterableDataset.rename_column() when you need to rename a column in your dataset. Features associated with the original column are actually moved under the new column name, instead of just replacing the original column in-place.

Provide IterableDataset.rename_column() with the name of the original column, and the new column name:

>>> from datasets import load_dataset
>>> dataset = load_dataset('mc4', 'en', streaming=True, split='train')
>>> dataset = dataset.rename_column("text", "content")

Remove

When you need to remove one or more columns, give IterableDataset.remove_columns() the name of the column to remove. Remove more than one column by providing a list of column names:

>>> from datasets import load_dataset
>>> dataset = load_dataset('mc4', 'en', streaming=True, split='train')
>>> dataset = dataset.remove_columns('timestamp')
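
For example, a quick sketch of dropping several columns in one call by passing a list (the column names match the mc4 example above):

>>> from datasets import load_dataset
>>> dataset = load_dataset('mc4', 'en', streaming=True, split='train')
>>> dataset = dataset.remove_columns(['timestamp', 'url'])  # drop both columns at once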

Cast

IterableDataset.cast() changes the feature type of one or more columns. This method takes your new Features as its argument. The following sample code shows how to change the feature types of ClassLabel and Value:

>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train', streaming=True)
>>> dataset.features
{'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
'idx': Value(dtype='int32', id=None)}

>>> from datasets import ClassLabel, Value
>>> new_features = dataset.features.copy()
>>> new_features["label"] = ClassLabel(names=['negative', 'positive'])
>>> new_features["idx"] = Value('int64')
>>> dataset = dataset.cast(new_features)
>>> dataset.features
{'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
'label': ClassLabel(num_classes=2, names=['negative', 'positive'], names_file=None, id=None),
'idx': Value(dtype='int64', id=None)}

Casting only works if the original feature type and new feature type are compatible. For example, you can cast a column with the feature type Value('int32') to Value('bool') if the original column only contains ones and zeros.
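
As a minimal sketch of that example, assuming a hypothetical streamed dataset flags_dataset whose flag column only contains 0s and 1s:

>>> from datasets import Value
>>> new_features = flags_dataset.features.copy()
>>> new_features["flag"] = Value('bool')  # int32 -> bool is only valid because the column holds just 0s and 1s
>>> flags_dataset = flags_dataset.cast(new_features)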

Use IterableDataset.cast_column() to change the feature type of just one column. Pass the column name and its new feature type as arguments:

>>> dataset.features
{'audio': Audio(sampling_rate=44100, mono=True, id=None)}

>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
>>> dataset.features
{'audio': Audio(sampling_rate=16000, mono=True, id=None)}

Map

Similar to the Dataset.map() function for a regular Dataset, 🌍 Datasets features IterableDataset.map() for processing an IterableDataset. IterableDataset.map() applies processing on-the-fly when examples are streamed. It allows you to apply a processing function to each example in a dataset, independently or in batches. This function can even create new rows and columns.

The following example demonstrates a simple processing function for an IterableDataset. The function needs to accept and output a dict:

>>> def add_prefix(example):
...     example['text'] = 'My text: ' + example['text']
...     return example

Next, apply this function to the dataset with IterableDataset.map():

>>> from datasets import load_dataset
>>> dataset = load_dataset('oscar', 'unshuffled_deduplicated_en', streaming=True, split='train')
>>> updated_dataset = dataset.map(add_prefix)
>>> list(updated_dataset.take(3))
[{'id': 0, 'text': 'My text: Mtendere Village was inspired by...'},
 {'id': 1, 'text': 'My text: Lily James cannot fight the music...'},
 {'id': 2, 'text': 'My text: "I\'d love to help kickstart...'}]

Let's take a look at another example, except this time, you will remove a column with IterableDataset.map(). When you remove a column, it is only removed after the example has been provided to the mapped function. This allows the mapped function to use the content of the columns before they are removed.

Specify the column to remove with the remove_columns argument in IterableDataset.map():

>>> updated_dataset = dataset.map(add_prefix, remove_columns=["id"])
>>> list(updated_dataset.take(3))
[{'text': 'My text: Mtendere Village was inspired by...'},
 {'text': 'My text: Lily James cannot fight the music...'},
 {'text': 'My text: "I\'d love to help kickstart...'}]

Batch processing

IterableDataset.map() also supports working with batches of examples. Operate on batches by setting batched=True. The default batch size is 1000, but you can adjust it with the batch_size argument. This opens the door to many interesting applications such as tokenization, splitting long sentences into shorter chunks, and data augmentation.

Tokenization

>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> dataset = load_dataset("mc4", "en", streaming=True, split="train")
>>> tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
>>> def encode(examples):
...     return tokenizer(examples['text'], truncation=True, padding='max_length')
>>> dataset = dataset.map(encode, batched=True, remove_columns=["text", "timestamp", "url"])
>>> next(iter(dataset))
{'input_ids': [101, 8466, 1018, 1010, 4029, 2475, 2062, 18558, 3100, 2061, ..., 1106, 3739, 102],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..., 1, 1]}
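
Batched map can also return a different number of rows than it received, which is how you would split long texts into shorter chunks. A rough sketch, assuming an arbitrary chunk length of 512 characters:

>>> def chunk_text(examples):
...     # each input text may produce several output rows
...     chunks = [text[i:i + 512] for text in examples['text'] for i in range(0, len(text), 512)]
...     return {'text': chunks}
>>> dataset = load_dataset('mc4', 'en', streaming=True, split='train')
>>> chunked_dataset = dataset.map(chunk_text, batched=True, remove_columns=['timestamp', 'url'])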

See other examples of batch processing in the batched map processing documentation. They work the same for iterable datasets.

Filter

You can filter rows in the dataset based on a predicate function using Dataset.filter(). It returns rows that match a specified condition:

>>> from datasets import load_dataset
>>> dataset = load_dataset('oscar', 'unshuffled_deduplicated_en', streaming=True, split='train')
>>> start_with_ar = dataset.filter(lambda example: example['text'].startswith('Ar'))
>>> next(iter(start_with_ar))
{'id': 4, 'text': 'Are you looking for Number the Stars (Essential Modern Classics)?...'}

Dataset.filter() can also filter by indices if you set with_indices=True:

>>> even_dataset = dataset.filter(lambda example, idx: idx % 2 == 0, with_indices=True)
>>> list(even_dataset.take(3))
[{'id': 0, 'text': 'Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, ...'},
 {'id': 2, 'text': '"I\'d love to help kickstart continued development! And 0 EUR/month...'},
 {'id': 4, 'text': 'Are you looking for Number the Stars (Essential Modern Classics)? Normally, ...'}]

Stream in a training loop

An IterableDataset can be integrated into a training loop. First, shuffle the dataset:

>>> seed, buffer_size = 42, 10_000
>>> dataset = dataset.shuffle(seed, buffer_size=buffer_size)

Lastly, create a simple training loop and start training:

>>> import torch
>>> from torch.utils.data import DataLoader
>>> from transformers import AutoModelForMaskedLM, DataCollatorForLanguageModeling
>>> from tqdm import tqdm
>>> dataset = dataset.with_format("torch")
>>> dataloader = DataLoader(dataset, collate_fn=DataCollatorForLanguageModeling(tokenizer))
>>> device = 'cuda' if torch.cuda.is_available() else 'cpu' 
>>> model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
>>> model.train().to(device)
>>> optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-5)
>>> for epoch in range(3):
...     dataset.set_epoch(epoch)
...     for i, batch in enumerate(tqdm(dataloader, total=5)):
...         if i == 5:
...             break
...         batch = {k: v.to(device) for k, v in batch.items()}
...         outputs = model(**batch)
...         loss = outputs[0]
...         loss.backward()
...         optimizer.step()
...         optimizer.zero_grad()
...         if i % 10 == 0:
...             print(f"loss: {loss}")
