Cache management
When you download a dataset, the processing scripts and data are stored locally on your computer. The cache allows π Datasets to avoid re-downloading or processing the entire dataset every time you use it.
This guide will show you how to:
Change the cache directory.
Control how a dataset is loaded from the cache.
Clean up cache files in the directory.
Enable or disable caching.
Cache directory
The default cache directory is ~/.cache/boincai/datasets. Change the cache location by setting the shell environment variable HF_DATASETS_CACHE to another directory.
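The same setting can be applied from Python. This is a minimal sketch, assuming the variable is set before the library is imported for the first time, since the cache location is read from the environment at import time. The path is a placeholder.

```python
import os

# Placeholder path; replace it with a directory you can write to.
os.environ["HF_DATASETS_CACHE"] = "/path/to/another/directory"

# Import the library only after the variable is set, because the
# cache location is read from the environment when it is imported.
# import datasets
```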
When you load a dataset, you also have the option to change where the data is cached. Change the cache_dir parameter to the path you want.
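A minimal sketch of passing cache_dir to load_dataset. The wrapper name load_with_custom_cache is hypothetical and only keeps the example self-contained; the dataset name and path you pass in are up to you.

```python
def load_with_custom_cache(name, cache_dir):
    """Load a dataset and cache it under cache_dir instead of the default."""
    # Lazy import: the library is only needed when the function is called.
    from datasets import load_dataset

    return load_dataset(name, cache_dir=cache_dir)
```

For example, load_with_custom_cache("squad", "/path/to/my/cache") would store the downloaded and processed files under that directory rather than the default one.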
Similarly, you can change where a metric is cached with the cache_dir parameter.
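A minimal sketch for metrics, assuming a release of the library that still provides load_metric (it was later deprecated in favour of the separate evaluate library). The wrapper name load_metric_with_cache is hypothetical.

```python
def load_metric_with_cache(path, config_name, cache_dir):
    """Load a metric and store its cache files under cache_dir."""
    # Lazy import: load_metric is only available in older releases.
    from datasets import load_metric

    return load_metric(path, config_name, cache_dir=cache_dir)
```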
Download mode
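The download_mode parameter of load_dataset controls whether an existing cached copy is reused. A minimal sketch, assuming you want the original files fetched again: "force_redownload" is one of the library's download modes, and the wrapper name redownload_dataset is hypothetical.

```python
def redownload_dataset(name):
    """Load a dataset, downloading the files again instead of reusing the cache."""
    # Lazy import: the library is only needed when the function is called.
    from datasets import load_dataset

    return load_dataset(name, download_mode="force_redownload")
```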
Cache files
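Cache files in the directory can be cleaned up with the cleanup_cache_files method of a loaded dataset, which returns the number of files it removed. A minimal sketch; the wrapper name clean_cache and the printed message are illustrative only.

```python
def clean_cache(dataset):
    """Remove unused cache files for a loaded dataset and report the count."""
    # cleanup_cache_files() returns the number of cache files it deleted.
    removed = dataset.cleanup_cache_files()
    print(f"Removed {removed} cache file(s)")
    return removed
```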
Enable or disable caching
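A transform applied with map is normally reloaded from its cache file when one exists. A minimal sketch of turning this off for a single call with load_from_cache_file=False; the column name "sentence1" and the wrapper name remap_without_cache are assumptions made for illustration.

```python
def add_prefix(example):
    # "sentence1" is an assumed column name used only for illustration.
    example["sentence1"] = "My sentence: " + example["sentence1"]
    return example


def remap_without_cache(dataset):
    """Apply add_prefix, recomputing it even if a cached result exists."""
    # load_from_cache_file=False makes map() run the function again
    # instead of reloading a previously cached result.
    return dataset.map(add_prefix, load_from_cache_file=False)
```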
When load_from_cache_file=False is passed to map(), π Datasets will execute the function add_prefix over the entire dataset again instead of loading the dataset from its previous state.
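Caching can also be switched off globally with disable_caching. A minimal sketch; the wrapper name turn_off_caching is hypothetical, and it returns the value of is_caching_enabled so the effect can be checked.

```python
def turn_off_caching():
    """Disable caching globally and return the resulting caching state."""
    # Lazy import: the library is only needed when the function is called.
    from datasets import disable_caching, is_caching_enabled

    disable_caching()
    return is_caching_enabled()
```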
When you disable caching, π Datasets will no longer reload cached files when applying transforms to datasets. Any transform you apply on your dataset will need to be reapplied.
You can also avoid caching your metric entirely, and keep it in CPU memory instead:
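A minimal sketch using the keep_in_memory parameter, assuming a release of the library that still provides load_metric. The wrapper name load_metric_in_memory is hypothetical.

```python
def load_metric_in_memory(path, config_name):
    """Load a metric that keeps its data in CPU memory instead of cache files."""
    # Lazy import: load_metric is only available in older releases.
    from datasets import load_metric

    return load_metric(path, config_name, keep_in_memory=True)
```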
Keeping the predictions in-memory is not possible in a distributed setting since the CPU memory spaces of the various processes are not shared.
Improve performance
Disabling the cache and copying the dataset in-memory will speed up dataset operations. There are two options for copying the dataset in-memory:
Set datasets.config.IN_MEMORY_MAX_SIZE to a nonzero value (in bytes) that fits in your RAM.
Set the environment variable HF_DATASETS_IN_MEMORY_MAX_SIZE to a nonzero value. Note that the first method takes higher precedence.
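A minimal sketch of the first option, assuming a limit of 2 GiB is appropriate for your machine. The wrapper name set_in_memory_max_size is hypothetical.

```python
def set_in_memory_max_size(num_bytes):
    """Set the in-memory size limit (in bytes) and return the new value."""
    # Lazy import: the library is only needed when the function is called.
    import datasets

    datasets.config.IN_MEMORY_MAX_SIZE = num_bytes
    return datasets.config.IN_MEMORY_MAX_SIZE


# Example value: 2 GiB expressed in bytes.
TWO_GIB = 2 * 1024**3
```

Calling set_in_memory_max_size(TWO_GIB) before loading a dataset applies the limit for the current process only.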