# Cache management

When you download a dataset, the processing scripts and data are stored locally on your computer. The cache allows 🌍 Datasets to avoid re-downloading or processing the entire dataset every time you use it.

This guide will show you how to:

* Change the cache directory.
* Control how a dataset is loaded from the cache.
* Clean up cache files in the directory.
* Enable or disable caching.

### Cache directory

The default cache directory is `~/.cache/huggingface/datasets`. Change the cache location by setting the shell environment variable `HF_DATASETS_CACHE` to another directory:

```
$ export HF_DATASETS_CACHE="/path/to/another/directory"
```
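
The same setting can be applied from Python, as long as it happens before `datasets` is imported, because the library reads the variable at import time. The sketch below is only an illustration: `resolve_cache_dir` is a hypothetical helper (not part of the `datasets` API) that returns the directory the variable points to, falling back to the usual default, and it ignores other variables that may also affect the location.

```python
import os
from pathlib import Path


def resolve_cache_dir(env=None):
    """Return the directory HF_DATASETS_CACHE points to, or the usual default.

    Hypothetical helper for illustration; not part of the datasets API.
    """
    env = os.environ if env is None else env
    default = Path.home() / ".cache" / "huggingface" / "datasets"
    return Path(env.get("HF_DATASETS_CACHE", str(default))).expanduser()


# Set the variable before importing datasets so the library picks it up.
os.environ["HF_DATASETS_CACHE"] = "/path/to/another/directory"
print(resolve_cache_dir())
```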

You can also choose where the data is cached when you load a dataset. Set the `cache_dir` parameter to the path you want:

```
>>> from datasets import load_dataset
>>> dataset = load_dataset('LOADING_SCRIPT', cache_dir="PATH/TO/MY/CACHE/DIR")
```

Similarly, you can change where a metric is cached with the `cache_dir` parameter:

```
>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc', cache_dir="MY/CACHE/DIRECTORY")
```

### Download mode

After you download a dataset, control how it is loaded by [load\_dataset()](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/loading_methods#datasets.load_dataset) with the `download_mode` parameter. By default, 🌍 Datasets reuses a dataset if it already exists in the cache. If you need the original dataset without any processing functions applied, force a re-download of the files:

```
>>> from datasets import load_dataset
>>> dataset = load_dataset('squad', download_mode='force_redownload')
```

Refer to [DownloadMode](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/builder_classes#datasets.DownloadMode) for a full list of download modes.

### Cache files

Clean up the cache files in the directory with [Dataset.cleanup\_cache\_files()](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Dataset.cleanup_cache_files):

```
# Returns the number of removed cache files
>>> dataset.cleanup_cache_files()
2
```
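
Before cleaning up, it can help to see how much disk space the cache actually uses. The following is a minimal sketch using only the standard library; `cache_size_bytes` is a hypothetical helper, and the fallback path is an assumption that only applies when `HF_DATASETS_CACHE` is unset.

```python
import os
from pathlib import Path


def cache_size_bytes(cache_dir):
    """Sum the sizes of all regular files under cache_dir (0 if it does not exist)."""
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


# Use the configured cache directory, falling back to the usual default (assumed).
cache_dir = os.environ.get("HF_DATASETS_CACHE", "~/.cache/huggingface/datasets")
print(f"{cache_size_bytes(cache_dir) / 1024**2:.1f} MiB")
```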

### Enable or disable caching

When a cached file exists locally, 🌍 Datasets automatically reloads the dataset with any transforms you previously applied to it. Disable this behavior by setting the argument `load_from_cache_file=False` in [Dataset.map()](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Dataset.map):

```
>>> updated_dataset = small_dataset.map(add_prefix, load_from_cache_file=False)
```

In the example above, 🌍 Datasets will execute the function `add_prefix` over the entire dataset again instead of loading the dataset from its previous state.

Disable caching on a global scale with [disable\_caching()](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.disable_caching):

```
>>> from datasets import disable_caching
>>> disable_caching()
```

When you disable caching, 🌍 Datasets will no longer reload cached files when applying transforms to datasets. Any transform you apply to your dataset will need to be reapplied.

If you want to rebuild a dataset from scratch, try setting the `download_mode` parameter in [load\_dataset()](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/loading_methods#datasets.load_dataset) instead.

You can also avoid caching your metric entirely, and keep it in CPU memory instead:

```
>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc', keep_in_memory=True)
```

Keeping the predictions in-memory is not possible in a distributed setting since the CPU memory spaces of the various processes are not shared.

### Improve performance

Disabling the cache and copying the dataset in-memory will speed up dataset operations. There are two options for copying the dataset in-memory:

1. Set `datasets.config.IN_MEMORY_MAX_SIZE` to a nonzero value (in bytes) that fits in your RAM.
2. Set the environment variable `HF_DATASETS_IN_MEMORY_MAX_SIZE` to a nonzero value (in bytes). Note that the first method takes precedence.
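
Since both options expect a size in bytes, a small conversion helper avoids arithmetic mistakes. The sketch below is illustrative: `gib_to_bytes` is a hypothetical helper and the 2 GiB budget is an arbitrary assumption. The environment variable should be set before `datasets` is imported, while the config attribute is assigned after the import.

```python
import os


def gib_to_bytes(gib):
    """Convert a size in GiB to a whole number of bytes."""
    return int(gib * 1024**3)


# Option 2: environment variable, set before importing datasets.
# The 2 GiB budget is an assumption; pick a value that fits in your RAM.
os.environ["HF_DATASETS_IN_MEMORY_MAX_SIZE"] = str(gib_to_bytes(2))

# Option 1 (takes precedence), once the library is imported:
# import datasets
# datasets.config.IN_MEMORY_MAX_SIZE = gib_to_bytes(2)
```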
