
Cache management

When you download a dataset, the processing scripts and data are stored locally on your computer. The cache allows 🌍 Datasets to avoid re-downloading or processing the entire dataset every time you use it.

This guide will show you how to:

  • Change the cache directory.

  • Control how a dataset is loaded from the cache.

  • Clean up cache files in the directory.

  • Enable or disable caching.

Cache directory

The default cache directory is ~/.cache/boincai/datasets. Change the cache location by setting the shell environment variable HF_DATASETS_CACHE to another directory:

$ export HF_DATASETS_CACHE="/path/to/another/directory"
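If you prefer to set the variable from Python, assign it before importing the library, since the cache location is typically read when the module is first imported. A minimal sketch:

```python
import os

# Must run before `import datasets` -- the cache path is read on import,
# so setting the variable afterwards has no effect.
os.environ["HF_DATASETS_CACHE"] = "/path/to/another/directory"
```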

When you load a dataset, you also have the option to change where the data is cached. Set the cache_dir parameter to the path you want:

>>> from datasets import load_dataset
>>> dataset = load_dataset('LOADING_SCRIPT', cache_dir="PATH/TO/MY/CACHE/DIR")

Similarly, you can change where a metric is cached with the cache_dir parameter:

>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc', cache_dir="MY/CACHE/DIRECTORY")

Download mode

After you download a dataset, control how it is loaded with the download_mode parameter. By default, 🌍 Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below. Refer to DownloadMode for a full list of download modes:

>>> from datasets import load_dataset
>>> dataset = load_dataset('squad', download_mode='force_redownload')
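The reuse behavior behind these modes can be summarized in a small stand-alone sketch. The mode names below match the library's at the time of writing, but treat this as an illustration rather than the library's own definition:

```python
from enum import Enum

class DownloadMode(Enum):
    # Default: reuse both the raw downloads and the prepared dataset.
    REUSE_DATASET_IF_EXISTS = "reuse_dataset_if_exists"
    # Reuse the raw downloads, but rebuild the prepared dataset.
    REUSE_CACHE_IF_EXISTS = "reuse_cache_if_exists"
    # Discard everything and download the files again.
    FORCE_REDOWNLOAD = "force_redownload"

def redownloads_files(mode: DownloadMode) -> bool:
    # Only FORCE_REDOWNLOAD throws away previously downloaded raw files.
    return mode is DownloadMode.FORCE_REDOWNLOAD
```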

Cache files

Clean up the cache files in the directory with Dataset.cleanup_cache_files():

# Returns the number of removed cache files
>>> dataset.cleanup_cache_files()
2
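Conceptually, cleanup deletes the intermediate cache files written by earlier processing steps and reports how many were removed. Here is a minimal stand-alone sketch of that idea; the `cache-*.arrow` naming is an assumption for illustration, not the library's actual scheme:

```python
import pathlib
import tempfile

def cleanup_cache_files(cache_dir: str) -> int:
    """Delete intermediate cache files in cache_dir; return how many were removed."""
    removed = 0
    for f in pathlib.Path(cache_dir).glob("cache-*.arrow"):
        f.unlink()
        removed += 1
    return removed

# Usage: create two fake cache files in a temporary directory and clean up.
with tempfile.TemporaryDirectory() as d:
    for name in ("cache-123.arrow", "cache-456.arrow"):
        (pathlib.Path(d) / name).touch()
    print(cleanup_cache_files(d))  # 2
```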

Enable or disable caching

If you're using a cached file locally, 🌍 Datasets will automatically reload the dataset with any previous transforms you applied. Disable this behavior by setting the argument load_from_cache_file=False in Dataset.map():

>>> updated_dataset = small_dataset.map(add_prefix, load_from_cache_file=False)

In the example above, 🌍 Datasets executes the function add_prefix over the entire dataset again instead of loading the dataset from its previous state.

Disable caching on a global scale with disable_caching():

>>> from datasets import disable_caching
>>> disable_caching()

When you disable caching, 🌍 Datasets will no longer reload cached files when applying transforms to datasets. Any transform you apply on your dataset will need to be reapplied.

If you want to reuse a dataset from scratch, try setting the download_mode parameter in load_dataset() instead.
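This cache reuse works by fingerprinting: each dataset state is keyed by a hash of the previous state plus the applied transform, so repeating an identical transform can be served from cache while a different one cannot. A simplified stand-alone sketch of that idea (the hashing scheme here is hypothetical, not the library's implementation):

```python
import hashlib

def update_fingerprint(previous: str, transform_name: str) -> str:
    # New state key = hash of (previous state key, transform identity).
    h = hashlib.sha256(f"{previous}:{transform_name}".encode())
    return h.hexdigest()[:16]

base = "squad-train-v1"
fp1 = update_fingerprint(base, "add_prefix")
fp2 = update_fingerprint(base, "add_prefix")
fp3 = update_fingerprint(base, "lowercase")
print(fp1 == fp2, fp1 == fp3)  # True False
```

Because `fp1 == fp2`, applying the same transform twice would hit the same cache entry; `fp3` differs, so a different transform triggers fresh computation.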

You can also avoid caching your metric entirely, and keep it in CPU memory instead:

>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc', keep_in_memory=True)

Keeping the predictions in-memory is not possible in a distributed setting since the CPU memory spaces of the various processes are not shared.

Improve performance

Disabling the cache and copying the dataset in-memory will speed up dataset operations. There are two options for copying the dataset in-memory:

  1. Set datasets.config.IN_MEMORY_MAX_SIZE to a nonzero value (in bytes) that fits in your RAM.

  2. Set the environment variable HF_DATASETS_IN_MEMORY_MAX_SIZE to a nonzero value. Note that the first method takes precedence over the second.
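The precedence rule between the two options can be expressed as a small helper. This is an illustrative sketch (the function name is hypothetical), not the library's internal logic:

```python
import os

def max_in_memory_dataset_size(config_value: int) -> int:
    # The config attribute, when nonzero, wins over the environment variable.
    if config_value:
        return config_value
    return int(os.environ.get("HF_DATASETS_IN_MEMORY_MAX_SIZE", "0"))

os.environ["HF_DATASETS_IN_MEMORY_MAX_SIZE"] = str(200 * 1024 * 1024)
print(max_in_memory_dataset_size(0))                  # falls back to the env var
print(max_in_memory_dataset_size(500 * 1024 * 1024))  # config value wins
```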

