# Download files

## Download files from the Hub

The `boincai_hub` library provides functions to download files from the repositories stored on the Hub. You can use these functions independently or integrate them into your own library, making it more convenient for your users to interact with the Hub. This guide will show you how to:

* Download and cache a single file.
* Download and cache an entire repository.
* Download files to a local folder.

### Download a single file

The [hf\_hub\_download()](https://huggingface.co/docs/huggingface_hub/v0.18.0.rc0/en/package_reference/file_download#huggingface_hub.hf_hub_download) function is the main function for downloading files from the Hub. It downloads the remote file, caches it on disk (in a version-aware way), and returns its local file path.

The returned filepath is a pointer to the HF local cache. Therefore, it is important to not modify the file to avoid having a corrupted cache. If you are interested in getting to know more about how files are cached, please refer to our [caching guide](https://huggingface.co/docs/huggingface_hub/guides/manage-cache).
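If you need to change a downloaded file, one option is to work on a copy outside the cache. The sketch below is a minimal illustration, not part of `boincai_hub`: the `copy_for_editing` helper is hypothetical, and a temporary directory stands in for the real cache so the snippet runs without network access.

```python
import shutil
import tempfile
from pathlib import Path


def copy_for_editing(cached_path, dest_dir):
    """Copy a cached file into `dest_dir` and return the path of the editable copy."""
    # resolve() follows symlinks, so the copy is a regular file and not a link into the cache.
    source = Path(cached_path).resolve()
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / Path(cached_path).name
    shutil.copy2(source, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a path returned by hf_hub_download() (hypothetical content).
    cached = Path(tmp) / "cache" / "config.json"
    cached.parent.mkdir()
    cached.write_text('{"n_layer": 12}')

    editable = copy_for_editing(cached, Path(tmp) / "work")
    editable.write_text('{"n_layer": 6}')  # edit the copy only

    cache_unchanged = cached.read_text() == '{"n_layer": 12}'
    print(cache_unchanged)
```

In real use, you would pass the path returned by `hf_hub_download()` as `cached_path`.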

#### From latest version

Select the file to download using the `repo_id`, `repo_type` and `filename` parameters. By default, the file will be considered as being part of a `model` repo.

```
>>> from boincai_hub import hf_hub_download
>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json")
'/root/.cache/boincai/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade/config.json'

# Download from a dataset
>>> hf_hub_download(repo_id="google/fleurs", filename="fleurs.py", repo_type="dataset")
'/root/.cache/boincai/hub/datasets--google--fleurs/snapshots/199e4ae37915137c555b1765c01477c216287d34/fleurs.py'
```

#### From specific version

By default, the latest version from the `main` branch is downloaded. However, in some cases you want to download a file at a particular version (e.g. from a specific branch, a PR, a tag or a commit hash). To do so, use the `revision` parameter:

```
# Download from the `v1.0` tag
>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="v1.0")

# Download from the `test-branch` branch
>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="test-branch")

# Download from Pull Request #3
>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="refs/pr/3")

# Download from a specific commit hash
>>> hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="877b84a8f93f2d619faa2a6e514a32beef88ab0a")
```

**Note:** When using the commit hash, it must be the full-length hash instead of a 7-character commit hash.
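To catch an abbreviated hash before starting a download, you could validate the string first. This is a minimal sketch under one assumption: commit hashes are 40-character lowercase hexadecimal strings, as in the examples above. The `is_full_commit_hash` helper is hypothetical and not part of `boincai_hub`.

```python
import re

# Assumption: a full-length commit hash is 40 lowercase hexadecimal characters.
_FULL_HASH = re.compile(r"[0-9a-f]{40}")


def is_full_commit_hash(revision):
    """Return True if `revision` looks like a full-length commit hash."""
    return _FULL_HASH.fullmatch(revision) is not None


print(is_full_commit_hash("877b84a8f93f2d619faa2a6e514a32beef88ab0a"))  # True
print(is_full_commit_hash("877b84a"))  # False
```

A `False` result does not mean the revision is invalid, since branch names, tags and PR references are also accepted by `revision`.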

#### Construct a download URL

In case you want to construct the URL used to download a file from a repo, you can use [hf\_hub\_url()](https://huggingface.co/docs/huggingface_hub/v0.18.0.rc0/en/package_reference/file_download#huggingface_hub.hf_hub_url) which returns a URL. Note that it is used internally by [hf\_hub\_download()](https://huggingface.co/docs/huggingface_hub/v0.18.0.rc0/en/package_reference/file_download#huggingface_hub.hf_hub_download).

### Download an entire repository

[snapshot\_download()](https://huggingface.co/docs/huggingface_hub/v0.18.0.rc0/en/package_reference/file_download#huggingface_hub.snapshot_download) downloads an entire repository at a given revision. Internally, it uses [hf\_hub\_download()](https://huggingface.co/docs/huggingface_hub/v0.18.0.rc0/en/package_reference/file_download#huggingface_hub.hf_hub_download), which means all downloaded files are also cached on your local disk. Downloads are made concurrently to speed up the process.

To download a whole repository, just pass the `repo_id` and `repo_type`:

```
>>> from boincai_hub import snapshot_download
>>> snapshot_download(repo_id="lysandre/arxiv-nlp")
'/home/lysandre/.cache/boincai/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade'

# Or from a dataset
>>> snapshot_download(repo_id="google/fleurs", repo_type="dataset")
'/home/lysandre/.cache/boincai/hub/datasets--google--fleurs/snapshots/199e4ae37915137c555b1765c01477c216287d34'
```

[snapshot\_download()](https://huggingface.co/docs/huggingface_hub/v0.18.0.rc0/en/package_reference/file_download#huggingface_hub.snapshot_download) downloads the latest revision by default. If you want a specific repository revision, use the `revision` parameter:

```
>>> from boincai_hub import snapshot_download
>>> snapshot_download(repo_id="lysandre/arxiv-nlp", revision="refs/pr/1")
```

#### Filter files to download

[snapshot\_download()](https://huggingface.co/docs/huggingface_hub/v0.18.0.rc0/en/package_reference/file_download#huggingface_hub.snapshot_download) provides an easy way to download a repository. However, you don’t always want to download the entire content of a repository. For example, you might want to skip all `.bin` files if you know you’ll only use the `.safetensors` weights. You can do that using the `allow_patterns` and `ignore_patterns` parameters.

These parameters accept either a single pattern or a list of patterns. Patterns are Standard Wildcards (globbing patterns) as documented [here](https://tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm). The pattern matching is based on [`fnmatch`](https://docs.python.org/3/library/fnmatch.html).

For example, you can use `allow_patterns` to only download JSON configuration files:

```
>>> from boincai_hub import snapshot_download
>>> snapshot_download(repo_id="lysandre/arxiv-nlp", allow_patterns="*.json")
```

On the other hand, `ignore_patterns` can exclude certain files from being downloaded. The following example ignores the `.msgpack` and `.h5` file extensions:

```
>>> from boincai_hub import snapshot_download
>>> snapshot_download(repo_id="lysandre/arxiv-nlp", ignore_patterns=["*.msgpack", "*.h5"])
```

Finally, you can combine both to precisely filter your download. The following example downloads all JSON and Markdown files except `vocab.json`:

```
>>> from boincai_hub import snapshot_download
>>> snapshot_download(repo_id="gpt2", allow_patterns=["*.md", "*.json"], ignore_patterns="vocab.json")
```
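To preview which files a combination of patterns would select, you can reproduce the matching locally with `fnmatch`. The sketch below is an approximation of the filtering described above, not the library's own implementation: the `filter_files` helper and the list of file names are hypothetical.

```python
from fnmatch import fnmatch


def filter_files(filenames, allow_patterns=None, ignore_patterns=None):
    """Keep names that match any allow pattern and no ignore pattern."""
    # Accept either a single pattern or a list of patterns.
    if isinstance(allow_patterns, str):
        allow_patterns = [allow_patterns]
    if isinstance(ignore_patterns, str):
        ignore_patterns = [ignore_patterns]

    selected = []
    for name in filenames:
        if allow_patterns is not None and not any(fnmatch(name, p) for p in allow_patterns):
            continue
        if ignore_patterns is not None and any(fnmatch(name, p) for p in ignore_patterns):
            continue
        selected.append(name)
    return selected


# Hypothetical repository listing, used only for illustration.
repo_files = ["README.md", "config.json", "vocab.json", "model.safetensors", "onnx/config.json"]
selected = filter_files(
    repo_files,
    allow_patterns=["*.md", "*.json"],
    ignore_patterns="vocab.json",
)
print(selected)  # ['README.md', 'config.json', 'onnx/config.json']
```

Note that with `fnmatch`, `*` also matches path separators, which is why `*.json` selects `onnx/config.json` here.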

### Download file(s) to local folder

The recommended (and default) way to download files from the Hub is to use the [cache-system](https://huggingface.co/docs/huggingface_hub/guides/manage-cache). You can define your cache location by setting the `cache_dir` parameter (both in [hf\_hub\_download()](https://huggingface.co/docs/huggingface_hub/v0.18.0.rc0/en/package_reference/file_download#huggingface_hub.hf_hub_download) and [snapshot\_download()](https://huggingface.co/docs/huggingface_hub/v0.18.0.rc0/en/package_reference/file_download#huggingface_hub.snapshot_download)).

However, in some cases you want to download files and move them to a specific folder. This is useful to get a workflow closer to what `git` commands offer. You can do that using the `local_dir` and `local_dir_use_symlinks` parameters:

* `local_dir` must be a path to a folder on your system. The downloaded files will keep the same file structure as in the repo. For example if `filename="data/train.csv"` and `local_dir="path/to/folder"`, then the returned filepath will be `"path/to/folder/data/train.csv"`.
* `local_dir_use_symlinks` defines how the file must be saved in your local folder.
  * The default behavior (`"auto"`) is to duplicate small files (<5MB) and use symlinks for bigger files. Symlinks optimize both bandwidth and disk usage. However, manually editing a symlinked file might corrupt the cache, hence the duplication for small files. The 5MB threshold can be configured with the `HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD` environment variable.
  * If `local_dir_use_symlinks=True` is set, all files are symlinked to minimize disk usage. This is useful, for example, when downloading a huge dataset with thousands of small files.
  * Finally, if you don’t want symlinks at all, you can disable them (`local_dir_use_symlinks=False`). The cache directory will still be used to check whether the file is already cached. If already cached, the file is **duplicated** from the cache (i.e. saves bandwidth but increases disk usage). If the file is not already cached, it will be downloaded and moved directly to the local dir. This means that if you need to reuse it somewhere else later, it will be **re-downloaded**.
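The rules in the list above can be summarized as a small decision function. This is only an illustrative sketch, not the library's code: the `local_dir_strategy` helper is hypothetical, and it assumes the threshold is expressed in bytes with 5MB taken as `5 * 1024 * 1024`.

```python
import os
from pathlib import Path

# Assumption: the 5MB default is expressed in bytes.
DEFAULT_THRESHOLD = 5 * 1024 * 1024


def local_dir_strategy(file_size, local_dir_use_symlinks="auto"):
    """Return "copy" or "symlink" for a file of `file_size` bytes."""
    if local_dir_use_symlinks is True:
        return "symlink"
    if local_dir_use_symlinks is False:
        return "copy"
    # "auto": small files are duplicated, bigger files are symlinked.
    threshold = int(
        os.environ.get("HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD", DEFAULT_THRESHOLD)
    )
    return "copy" if file_size < threshold else "symlink"


# The repo file structure is preserved under `local_dir`.
expected_path = (Path("path/to/folder") / "data/train.csv").as_posix()
print(expected_path)  # path/to/folder/data/train.csv

print(local_dir_strategy(2_000))          # small file
print(local_dir_strategy(2_000_000_000))  # big file
```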

Here is a table that summarizes the different options to help you choose the parameters that best suit your use case.

| Parameters                                                                                       | File already cached |       Returned path       | Can read path? |                                           Can save to path?                                          |                   Optimized bandwidth                   |                    Optimized disk usage                   |
| ------------------------------------------------------------------------------------------------ | :-----------------: | :-----------------------: | :------------: | :--------------------------------------------------------------------------------------------------: | :-----------------------------------------------------: | :-------------------------------------------------------: |
| `local_dir=None`                                                                                 |                     |      symlink in cache     |        ✅       |                          <p>❌<br><em>(save would corrupt the cache)</em></p>                         |                            ✅                            |                             ✅                             |
| <p><code>local\_dir="path/to/folder"</code><br><code>local\_dir\_use\_symlinks="auto"</code></p> |                     | file or symlink in folder |        ✅       | <p>✅ <em>(for small files)</em><br>⚠️ <em>(for big files do not resolve path before saving)</em></p> |                            ✅                            |                             ✅                             |
| <p><code>local\_dir="path/to/folder"</code><br><code>local\_dir\_use\_symlinks=True</code></p>   |                     |     symlink in folder     |        ✅       |                       <p>⚠️<br><em>(do not resolve path before saving)</em></p>                      |                            ✅                            |                             ✅                             |
| <p><code>local\_dir="path/to/folder"</code><br><code>local\_dir\_use\_symlinks=False</code></p>  |          No         |       file in folder      |        ✅       |                                                   ✅                                                  | <p>❌<br><em>(if re-run, file is re-downloaded)</em></p> | <p>⚠️<br><em>(multiple copies if run in multiple folders)</em></p> |
| <p><code>local\_dir="path/to/folder"</code><br><code>local\_dir\_use\_symlinks=False</code></p>  |         Yes         |       file in folder      |        ✅       |                                                   ✅                                                  |   <p>⚠️<br><em>(file has to be cached first)</em></p>   |         <p>❌<br><em>(file is duplicated)</em></p>         |

**Note:** if you are on a Windows machine, you need to enable developer mode or run `boincai_hub` as admin to enable symlinks. Check out the [cache limitations](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#limitations) section for more details.

### Download from the CLI

You can use the `boincai-cli download` command from the terminal to directly download files from the Hub. Internally, it uses the same [hf\_hub\_download()](https://huggingface.co/docs/huggingface_hub/v0.18.0.rc0/en/package_reference/file_download#huggingface_hub.hf_hub_download) and [snapshot\_download()](https://huggingface.co/docs/huggingface_hub/v0.18.0.rc0/en/package_reference/file_download#huggingface_hub.snapshot_download) helpers described above and prints the returned path to the terminal:

```
>>> boincai-cli download gpt2 config.json
/home/wauplin/.cache/boincai/hub/models--gpt2/snapshots/11c5a3d5811f50298f278a704980280950aedb10/config.json
```

By default, the token saved locally (using `boincai-cli login`) will be used. If you want to authenticate explicitly, use the `--token` option:

```
>>> boincai-cli download gpt2 config.json --token=hf_****
/home/wauplin/.cache/boincai/hub/models--gpt2/snapshots/11c5a3d5811f50298f278a704980280950aedb10/config.json
```

You can download multiple files at once. In that case, a progress bar is displayed and the command returns the snapshot path in which the files are located:

```
>>> boincai-cli download gpt2 config.json model.safetensors
Fetching 2 files: 100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 23831.27it/s]
/home/wauplin/.cache/boincai/hub/models--gpt2/snapshots/11c5a3d5811f50298f278a704980280950aedb10
```

If you want to silence the progress bars and potential warnings, use the `--quiet` option. This can prove useful if you want to pass the output to another command in a script.

```
>>> boincai-cli download gpt2 config.json model.safetensors --quiet
/home/wauplin/.cache/boincai/hub/models--gpt2/snapshots/11c5a3d5811f50298f278a704980280950aedb10
```
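If the calling script is written in Python, the printed path can be captured with `subprocess`. The sketch below is a hedged illustration: both helpers are hypothetical, and it assumes that `boincai-cli` is on your `PATH` and that, with `--quiet`, the path is the only text written to standard output.

```python
import subprocess


def build_download_command(repo_id, *filenames, quiet=True):
    """Build the argument list for `boincai-cli download`."""
    command = ["boincai-cli", "download", repo_id, *filenames]
    if quiet:
        command.append("--quiet")
    return command


def download_and_get_path(repo_id, *filenames):
    """Run the CLI and return the printed path (needs the CLI and network access)."""
    result = subprocess.run(
        build_download_command(repo_id, *filenames),
        check=True,           # raise if the command fails
        capture_output=True,  # collect stdout instead of printing it
        text=True,
    )
    # Assumption: stdout holds only the path when --quiet is set.
    return result.stdout.strip()


print(build_download_command("gpt2", "config.json", "model.safetensors"))
```

Calling `download_and_get_path("gpt2", "config.json")` would then return the same path that the command prints in the terminal.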

By default, files are downloaded to the cache directory defined by the `HF_HOME` environment variable (or `~/.cache/boincai/hub` if not specified). You can override this by using the `--cache-dir` option:

```
>>> boincai-cli download gpt2 config.json --cache-dir=./cache
./cache/models--gpt2/snapshots/11c5a3d5811f50298f278a704980280950aedb10/config.json
```

If you want to download files to a local folder, without the cache directory structure, you can use `--local-dir`. Downloading to a local folder comes with its limitations which are listed in this [table](https://huggingface.co/docs/huggingface_hub/guides/download#download-files-to-local-folder).

```
>>> boincai-cli download gpt2 config.json --local-dir=./models/gpt2
./models/gpt2/config.json
```

There are more arguments you can specify to download from different repo types or revisions and to include/exclude files to download using glob patterns:

```
>>> boincai-cli download bigcode/the-stack --repo-type=dataset --revision=v1.2 --include="data/python/*" --exclude="*.json" --exclude="*.zip"
Fetching 206 files:   100%|████████████████████████████████████████████| 206/206 [02:31<2:31, ?it/s]
/home/wauplin/.cache/boincai/hub/datasets--bigcode--the-stack/snapshots/9ca8fa6acdbc8ce920a0cb58adcdafc495818ae7
```

For a full list of the arguments, you can run:

```
boincai-cli download --help
```

### Faster downloads

If you are running on a machine with high bandwidth, you can increase your download speed with [`hf_transfer`](https://github.com/huggingface/hf_transfer), a Rust-based library developed to speed up file transfers with the Hub. To enable it, install the package (`pip install hf_transfer`) and set `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable.

`hf_transfer` is a power user tool! It is tested and production-ready, but it lacks user-friendly features like progress bars or advanced error handling. For more details, please take a look at this [section](https://huggingface.co/docs/huggingface_hub/hf_transfer).
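The environment variable can also be set from Python. The following is a minimal sketch, assuming the variable is read when the library is imported, so it should run before `boincai_hub` is imported. The `enable_hf_transfer` helper is hypothetical.

```python
import importlib.util
import os


def enable_hf_transfer():
    """Opt in to `hf_transfer` if the package is installed; return True on success."""
    if importlib.util.find_spec("hf_transfer") is None:
        # Package missing: leave the environment untouched.
        return False
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    return True


# Call this before importing the Hub library (assumption stated above).
enabled = enable_hf_transfer()
print(enabled)
```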

