Cache
Manage the boincai_hub cache-system
Understand caching
The BOINC AI Hub cache-system is designed to be the central cache shared across libraries that depend on the Hub. It was updated in v0.8.0 to avoid re-downloading the same files between revisions.
The caching system is designed as follows:
The <CACHE_DIR> is usually your user’s home directory. However, it is customizable with the cache_dir argument on all methods, or by specifying either the HF_HOME or BOINC AI_HUB_CACHE environment variable.
Models, datasets and spaces share a common root. The folder name of each repository contains the repository type, the namespace (organization or username) if it exists, and the repository name.
It is within these folders that all files will now be downloaded from the Hub. Caching ensures that a file isn’t downloaded twice if it already exists and wasn’t updated; but if it was updated, and you’re asking for the latest file, then it will download the latest file (while keeping the previous file intact in case you need it again).
In order to achieve this, all folders contain the same skeleton: a refs folder, a blobs folder and a snapshots folder.
Each folder is designed to contain the following:
Refs
The refs folder contains files which indicate the latest revision of the given reference. For example, if we have previously fetched a file from the main branch of a repository, the refs folder will contain a file named main, which will itself contain the commit identifier of the current head.
If the latest commit of main has aaaaaa as identifier, then it will contain aaaaaa.
If that same branch gets updated with a new commit that has bbbbbb as an identifier, then re-downloading a file from that reference will update the refs/main file to contain bbbbbb.
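The behaviour of the refs folder described above can be sketched with plain files. This is a minimal illustration using only the standard library, not the library's actual implementation; the helper names write_ref and read_ref and the folder name my-repo are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def write_ref(repo_dir: Path, ref: str, commit_id: str) -> None:
    """Record that `ref` (e.g. "main") currently points at `commit_id`."""
    ref_file = repo_dir / "refs" / ref
    ref_file.parent.mkdir(parents=True, exist_ok=True)
    ref_file.write_text(commit_id)


def read_ref(repo_dir: Path, ref: str) -> Optional[str]:
    """Return the commit id stored for `ref`, or None if it was never fetched."""
    ref_file = repo_dir / "refs" / ref
    return ref_file.read_text().strip() if ref_file.is_file() else None


with tempfile.TemporaryDirectory() as tmp:
    repo_dir = Path(tmp) / "my-repo"       # hypothetical repository folder
    write_ref(repo_dir, "main", "aaaaaa")  # first download from main
    before = read_ref(repo_dir, "main")
    write_ref(repo_dir, "main", "bbbbbb")  # main moved to a new commit
    after = read_ref(repo_dir, "main")
    unknown = read_ref(repo_dir, "dev")    # a reference that was never fetched

print(before, after, unknown)  # aaaaaa bbbbbb None
```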
Blobs
The blobs folder contains the actual files that we have downloaded. The name of each file is its hash.
Snapshots
The snapshots folder contains symlinks to the blobs mentioned above. It is itself made up of several folders: one per known revision!
In the explanation above, we had initially fetched a file from the aaaaaa revision, before fetching a file from the bbbbbb revision. In this situation, we would now have two folders in the snapshots folder: aaaaaa and bbbbbb.
In each of these folders live symlinks that have the names of the files that we have downloaded. For example, if we had downloaded the README.md file at revision aaaaaa, we would find it at snapshots/aaaaaa/README.md inside the repository’s cache folder.
That README.md file is actually a symlink linking to the blob that has the hash of the file.
By creating the skeleton this way we open the mechanism to file sharing: if the same file was fetched in revision bbbbbb, it would have the same hash and the file would not need to be re-downloaded.
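The file-sharing mechanism can be illustrated with a small standalone sketch. It assumes SHA-256 of the file content as the blob name (the text above only says "hash") and uses a hypothetical helper store_file; it is not the library's own code, and it needs a platform where symlinks are available.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def store_file(repo_dir: Path, revision: str, filename: str, content: bytes) -> Path:
    """Store `content` once under blobs/ and link it from snapshots/<revision>/."""
    # Assumption: the blob is named after the SHA-256 of its content.
    blob_path = repo_dir / "blobs" / hashlib.sha256(content).hexdigest()
    blob_path.parent.mkdir(parents=True, exist_ok=True)
    if not blob_path.exists():  # identical content is written only once
        blob_path.write_bytes(content)

    pointer = repo_dir / "snapshots" / revision / filename
    pointer.parent.mkdir(parents=True, exist_ok=True)
    if not pointer.is_symlink():
        # A relative link keeps working if the whole cache folder is moved.
        os.symlink(os.path.relpath(blob_path, pointer.parent), pointer)
    return pointer


with tempfile.TemporaryDirectory() as tmp:
    repo_dir = Path(tmp) / "my-repo"  # hypothetical repository folder
    old = store_file(repo_dir, "aaaaaa", "README.md", b"# Hello\n")
    new = store_file(repo_dir, "bbbbbb", "README.md", b"# Hello\n")
    blob_count = len(list((repo_dir / "blobs").iterdir()))
    same_target = old.resolve() == new.resolve()

print(blob_count, same_target)  # 1 True
```

Both revisions expose a README.md, but only one blob exists on disk because the content, and therefore the hash, is identical.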
.no_exist (advanced)
In addition to the blobs, refs and snapshots folders, you might also find a .no_exist folder in your cache. This folder keeps track of files that you’ve tried to download once but that don’t exist on the Hub. Its structure is the same as the snapshots folder, with 1 subfolder per known revision, for example: .no_exist/aaaaaa/config_that_does_not_exist.json
Unlike the snapshots folder, files are simple empty files (no symlinks). In this example, the file "config_that_does_not_exist.json" does not exist on the Hub for the revision "aaaaaa". As it only stores empty files, this folder is negligible in terms of disk usage.
So now you might wonder, why is this information even relevant? In some cases, a framework tries to load optional files for a model. Saving the non-existence of optional files makes it faster to load a model as it saves 1 HTTP call per possible optional file. This is for example the case in transformers, where each tokenizer can support additional files. The first time you load the tokenizer on your machine, it will cache which optional files exist (and which don’t) to make the loading time faster for the next initializations.
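The saving described above can be sketched as follows. The function file_exists_on_hub and the remote_check callback are hypothetical stand-ins for the real HTTP call; the sketch only shows how an empty marker file avoids a second request.

```python
import tempfile
from pathlib import Path
from typing import Callable


def file_exists_on_hub(
    repo_dir: Path, revision: str, filename: str, remote_check: Callable[[], bool]
) -> bool:
    """Answer from the cache when possible; call `remote_check` only as a last resort."""
    if (repo_dir / "snapshots" / revision / filename).exists():
        return True  # already downloaded
    marker = repo_dir / ".no_exist" / revision / filename
    if marker.exists():
        return False  # known to be missing: no network call needed
    exists = remote_check()  # stands in for the HTTP request
    if not exists:
        marker.parent.mkdir(parents=True, exist_ok=True)
        marker.touch()  # empty file, takes no meaningful disk space
    return exists


calls = []


def fake_remote_check() -> bool:
    """Pretend the Hub answered that the file does not exist."""
    calls.append("http")
    return False


with tempfile.TemporaryDirectory() as tmp:
    repo_dir = Path(tmp) / "my-repo"  # hypothetical repository folder
    name = "config_that_does_not_exist.json"
    first = file_exists_on_hub(repo_dir, "aaaaaa", name, fake_remote_check)
    second = file_exists_on_hub(repo_dir, "aaaaaa", name, fake_remote_check)

print(first, second, len(calls))  # False False 1
```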
In practice
In practice, your cache should look like the following tree:
Limitations
In order to have an efficient cache-system, boincai_hub uses symlinks. However, symlinks are not supported on all machines. This is a known limitation, especially on Windows. When this is the case, boincai_hub does not use the blobs/ directory but stores the files directly in the snapshots/ directory instead. This workaround allows users to download and cache files from the Hub in exactly the same way. Tools to inspect and delete the cache (see below) are also supported. However, the cache-system is less efficient, as a single file might be downloaded several times if multiple revisions of the same repo are downloaded.
When symlinks are not supported, a warning message is displayed to the user to alert them they are using a degraded version of the cache-system. This warning can be disabled by setting the HF_HUB_DISABLE_SYMLINKS_WARNING environment variable to true.
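A rough sketch of the fallback follows, assuming symlink support is probed by trying to create one. The helpers symlinks_supported and save_file are hypothetical, and the real library may detect and handle this differently.

```python
import os
import tempfile
from pathlib import Path


def symlinks_supported(directory: Path) -> bool:
    """Try to create a throwaway symlink inside `directory`."""
    directory.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        target = Path(tmp) / "target"
        target.touch()
        try:
            os.symlink(target, Path(tmp) / "link")
        except (OSError, NotImplementedError):
            return False
    return True


def save_file(
    repo_dir: Path, revision: str, filename: str,
    blob_name: str, content: bytes, use_symlinks: bool,
) -> Path:
    """Write a downloaded file, with or without the blobs/ indirection."""
    pointer = repo_dir / "snapshots" / revision / filename
    pointer.parent.mkdir(parents=True, exist_ok=True)
    if not use_symlinks:
        # Degraded mode: the real file lives in snapshots/, blobs/ is unused.
        pointer.write_bytes(content)
        return pointer
    blob = repo_dir / "blobs" / blob_name
    blob.parent.mkdir(parents=True, exist_ok=True)
    if not blob.exists():
        blob.write_bytes(content)
    if not pointer.is_symlink():
        os.symlink(os.path.relpath(blob, pointer.parent), pointer)
    return pointer


with tempfile.TemporaryDirectory() as tmp:
    supported = symlinks_supported(Path(tmp) / "probe")  # platform dependent
    repo_dir = Path(tmp) / "my-repo"  # hypothetical repository folder
    # Force the degraded mode to show its cost: the same content is stored twice.
    for revision in ("aaaaaa", "bbbbbb"):
        save_file(repo_dir, revision, "README.md", "blob-1", b"# Hello\n", use_symlinks=False)
    copies = [p for p in (repo_dir / "snapshots").rglob("README.md") if not p.is_symlink()]
    real_copies = len(copies)
    has_blobs_dir = (repo_dir / "blobs").exists()

print(supported, real_copies, has_blobs_dir)
```

In degraded mode each revision holds its own full copy of README.md and no blobs/ directory is created, which is why the same file may be downloaded and stored several times.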
Caching assets
Assets in practice
In practice, your assets cache should look like the following tree:
Scan your cache
At the moment, cached files are never deleted from your local directory: when you download a new revision of a branch, previous files are kept in case you need them again. Therefore it can be useful to scan your cache directory in order to know which repos and revisions are taking the most disk space. boincai_hub provides a helper to do so that can be used via boincai-cli or in a Python script.
Scan cache from the terminal
The easiest way to scan your HF cache-system is to use the scan-cache command from the boincai-cli tool. This command scans the cache and prints a report with information like repo id, repo type, disk usage, refs and full local path.
The snippet below shows a scan report in a folder in which 4 models and 2 datasets are cached.
To get a more detailed report, use the --verbose option. For each repo, you get a list of all revisions that have been downloaded. As explained above, the files that don’t change between 2 revisions are shared thanks to the symlinks. This means that the size of the repo on disk is expected to be less than the sum of the size of each of its revisions. For example, here bert-base-cased has 2 revisions of 1.4G and 1.5G but the total disk usage is only 1.9G.
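The arithmetic behind this can be reproduced on a toy cache. The sketch below is an illustration with made-up sizes in bytes (1400 and 1500 standing in for 1.4G and 1.5G) and hypothetical helper names. It counts each underlying blob once, which is why the total is smaller than the sum of the revisions.

```python
import os
import tempfile
from pathlib import Path


def revision_size(snapshot_dir: Path) -> int:
    """Size of one revision: every file it exposes, symlinks followed."""
    return sum(p.stat().st_size for p in snapshot_dir.rglob("*") if p.is_file())


def repo_size_on_disk(repo_dir: Path) -> int:
    """Size of the repo on disk: each blob is counted once, however many revisions link to it."""
    unique_blobs = {p.resolve() for p in (repo_dir / "snapshots").rglob("*") if p.is_file()}
    return sum(blob.stat().st_size for blob in unique_blobs)


with tempfile.TemporaryDirectory() as tmp:
    repo_dir = Path(tmp) / "my-repo"  # hypothetical repository folder
    blobs = repo_dir / "blobs"
    blobs.mkdir(parents=True)
    for name, size in {"shared": 1000, "only_a": 400, "only_b": 500}.items():
        (blobs / name).write_bytes(b"x" * size)
    layout = {"aaaaaa": ["shared", "only_a"], "bbbbbb": ["shared", "only_b"]}
    for revision, blob_names in layout.items():
        snapshot = repo_dir / "snapshots" / revision
        snapshot.mkdir(parents=True)
        for blob_name in blob_names:
            os.symlink(blobs / blob_name, snapshot / f"{blob_name}.bin")

    size_a = revision_size(repo_dir / "snapshots" / "aaaaaa")
    size_b = revision_size(repo_dir / "snapshots" / "bbbbbb")
    total = repo_size_on_disk(repo_dir)

print(size_a, size_b, total)  # 1400 1500 1900
```

The 1000 bytes shared by both revisions are stored once, so the total is 1400 + 1500 - 1000 = 1900.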
Grep example
Since the output is in tabular format, you can combine it with any grep-like tool to filter the entries. Here is an example to filter only revisions from the “t5-small” model on a Unix-based machine.
boincai-cli scan-cache --verbose | grep "t5-small"
Scan cache from Python
The same helper can be used from a Python script to get a detailed report structured around 4 dataclasses.
Here is a simple usage example. See reference for details.
Clean your cache
Delete strategy
The strategy to delete revisions is the following:
The snapshot folder containing the revision symlinks is deleted.
Blob files that are targeted only by revisions to be deleted are deleted as well.
If a revision is linked to 1 or more refs, the references are deleted.
If all revisions from a repo are deleted, the entire cached repository is deleted.
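The four rules above can be sketched on a toy repository folder. This is a minimal illustration of the strategy as described, not the library's implementation; delete_revisions and build_toy_repo are hypothetical helpers.

```python
import os
import shutil
import tempfile
from pathlib import Path


def delete_revisions(repo_dir: Path, revisions: set) -> None:
    """Apply the delete strategy to one cached repository folder."""
    snapshots = repo_dir / "snapshots"
    doomed, kept = set(), set()  # blobs linked by deleted / remaining revisions
    for snapshot in snapshots.iterdir():
        targets = {p.resolve() for p in snapshot.rglob("*") if p.is_symlink()}
        (doomed if snapshot.name in revisions else kept).update(targets)

    for revision in revisions:  # 1. remove the snapshot folders
        shutil.rmtree(snapshots / revision, ignore_errors=True)
    for blob in doomed - kept:  # 2. remove blobs no remaining revision links to
        blob.unlink(missing_ok=True)
    refs_dir = repo_dir / "refs"
    if refs_dir.is_dir():  # 3. remove refs pointing at a deleted revision
        for ref_file in [p for p in refs_dir.rglob("*") if p.is_file()]:
            if ref_file.read_text().strip() in revisions:
                ref_file.unlink()
    if not any(snapshots.iterdir()):  # 4. nothing left: remove the whole repo
        shutil.rmtree(repo_dir)


def build_toy_repo(repo_dir: Path) -> None:
    """Two revisions sharing one blob, with refs/main pointing at bbbbbb."""
    layout = {"aaaaaa": ["shared", "only_a"], "bbbbbb": ["shared", "only_b"]}
    (repo_dir / "blobs").mkdir(parents=True)
    for revision, blob_names in layout.items():
        snapshot = repo_dir / "snapshots" / revision
        snapshot.mkdir(parents=True)
        for name in blob_names:
            blob = repo_dir / "blobs" / name
            blob.write_text(name)
            os.symlink(blob, snapshot / f"{name}.txt")
    (repo_dir / "refs").mkdir()
    (repo_dir / "refs" / "main").write_text("bbbbbb")


with tempfile.TemporaryDirectory() as tmp:
    repo_dir = Path(tmp) / "my-repo"  # hypothetical repository folder
    build_toy_repo(repo_dir)
    delete_revisions(repo_dir, {"aaaaaa"})
    blobs_left = sorted(p.name for p in (repo_dir / "blobs").iterdir())
    ref_kept = (repo_dir / "refs" / "main").exists()
    delete_revisions(repo_dir, {"bbbbbb"})
    repo_gone = not repo_dir.exists()

print(blobs_left, ref_kept, repo_gone)  # ['only_b', 'shared'] True True
```

Deleting aaaaaa removes only the blob that no other revision uses. Deleting the last revision removes the whole repository folder.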
Revision hashes are unique across all repositories. This means you don’t need to provide any repo_id or repo_type when removing revisions.
Clean cache from the terminal
The easiest way to delete some revisions from your HF cache-system is to use the delete-cache command from the boincai-cli tool. The command has two modes. By default, a TUI (Terminal User Interface) is displayed to the user to select which revisions to delete. This TUI is currently in beta as it has not been tested on all platforms. If the TUI doesn’t work on your machine, you can disable it using the --disable-tui flag.
Using the TUI
This is the default mode. To use it, you first need to install extra dependencies by running the following command:
Then run the command:
boincai-cli delete-cache
You should now see a list of revisions that you can select/deselect:
Instructions:
Press keyboard arrow keys <up> and <down> to move the cursor.
Press <space> to toggle (select/unselect) an item.
When a revision is selected, the first line is updated to show you how much space will be freed.
Press <enter> to confirm your selection.
If you want to cancel the operation and quit, you can select the first item (“None of the following”). If this item is selected, the delete process will be cancelled, no matter what other items are selected. Otherwise you can also press <ctrl+c> to quit the TUI.
Once you’ve selected the revisions you want to delete and pressed <enter>, a last confirmation message will be prompted. Press <enter> again and the deletion will be effective. If you want to cancel, enter n.
Without TUI
As mentioned above, the TUI mode is currently in beta and is optional. It may be the case that it doesn’t work on your machine or that you don’t find it convenient.
Another approach is to use the --disable-tui flag. The process is very similar as you will be asked to manually review the list of revisions to delete. However, this manual step will not take place in the terminal directly but in a temporary file generated on the fly that you can manually edit.
This file has all the instructions you need in the header. Open it in your favorite text editor. To select/deselect a revision, simply comment/uncomment it with a #. Once the manual review is done and the file is edited, you can save it. Go back to your terminal and press <enter>. By default it will compute how much space would be freed with the updated list of revisions. You can continue to edit the file or confirm with "y".
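How such a file might be read back can be sketched as follows. The file layout is an assumption made for illustration (one revision hash per line, an optional trailing comment, and uncommented lines meaning "delete"), and revisions_to_delete is a hypothetical helper. The real file generated by the CLI may differ.

```python
def revisions_to_delete(file_content: str) -> list:
    """Return the revision hashes left uncommented in the edited file."""
    selected = []
    for raw_line in file_content.splitlines():
        # Everything after the first '#' is a comment and is ignored.
        line = raw_line.split("#", 1)[0].strip()
        if line:
            selected.append(line)
    return selected


# Hypothetical file content after manual editing.
example = """\
# Instructions would normally appear here, in the header.
#aaaaaa  # commented out: this revision is kept
bbbbbb   # uncommented: this revision is selected for deletion
cccccc
"""

print(revisions_to_delete(example))  # ['bbbbbb', 'cccccc']
```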
Example of command file:
Clean cache from Python