Downloading files
huggingface_hub.hf_hub_download
( repo_id: str, filename: str, subfolder: typing.Optional[str] = None, repo_type: typing.Optional[str] = None, revision: typing.Optional[str] = None, endpoint: typing.Optional[str] = None, library_name: typing.Optional[str] = None, library_version: typing.Optional[str] = None, cache_dir: typing.Union[str, pathlib.Path, NoneType] = None, local_dir: typing.Union[str, pathlib.Path, NoneType] = None, local_dir_use_symlinks: typing.Union[bool, typing.Literal['auto']] = 'auto', user_agent: typing.Union[typing.Dict, str, NoneType] = None, force_download: bool = False, force_filename: typing.Optional[str] = None, proxies: typing.Optional[typing.Dict] = None, etag_timeout: float = 10, resume_download: bool = False, token: typing.Union[bool, str, NoneType] = None, local_files_only: bool = False, legacy_cache_layout: bool = False )
Parameters
repo_id (str) -- A user or an organization name and a repo name separated by a /.
filename (str) -- The name of the file in the repo.
subfolder (str, optional) -- An optional value corresponding to a folder inside the model repo.
repo_type (str, optional) -- Set to "dataset" or "space" if downloading from a dataset or space, None or "model" if downloading from a model. Default is None.
revision (str, optional) -- An optional Git revision id which can be a branch name, a tag, or a commit hash.
endpoint (str, optional) -- Hugging Face Hub base url. Will default to . Otherwise, one can set the HF_ENDPOINT environment variable.
library_name (str, optional) -- The name of the library to which the object corresponds.
library_version (str, optional) -- The version of the library.
cache_dir (str, Path, optional) -- Path to the folder where cached files are stored.
local_dir (str or Path, optional) -- If provided, the downloaded file will be placed under this directory, either as a symlink (default) or a regular file (see description for more details).
local_dir_use_symlinks ("auto" or bool, defaults to "auto") -- To be used with local_dir. If set to "auto", the cache directory will be used and the file will be either duplicated or symlinked to the local directory depending on its size. If set to True, a symlink will be created, no matter the file size. If set to False, the file will either be duplicated from cache (if it already exists there) or downloaded from the Hub and not cached. See description for more details.
user_agent (dict, str, optional) -- The user-agent info in the form of a dictionary or a string.
force_download (bool, optional, defaults to False) -- Whether the file should be downloaded even if it already exists in the local cache.
proxies (dict, optional) -- Dictionary mapping protocol to the URL of the proxy passed to requests.request.
etag_timeout (float, optional, defaults to 10) -- When fetching ETag, how many seconds to wait for the server to send data before giving up which is passed to requests.request.
resume_download (bool, optional, defaults to False) -- If True, resume a previously interrupted download.
token (str, bool, optional) -- A token to be used for the download.
If True, the token is read from the HuggingFace config folder.
If a string, it's used as the authentication token.
local_files_only (bool, optional, defaults to False) -- If True, avoid downloading the file and return the path to the local cached file if it exists.
legacy_cache_layout (bool, optional, defaults to False) -- If True, uses the legacy file cache layout i.e. just call then cached_download. This is deprecated as the new cache layout is more powerful.
Download a given file if it's not already present in the local cache.
The new cache file layout looks like this:
The cache directory contains one subfolder per repo_id (namespaced by repo type).
Inside each repo folder:
refs is a list of the latest known revision => commit_hash pairs
blobs contains the actual file blobs (identified by their git-sha or sha256, depending on whether they're LFS files or not)
snapshots contains one subfolder per commit, each "commit" contains the subset of the files that have been resolved at that particular commit. Each filename is a symlink to the blob at that particular commit.
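The layout described above can be reproduced on disk with a few lines of standard-library Python. This is only an illustrative sketch, not the library's code: the folder name models--some-user--some-model, the file name config.json, and the placeholder hashes are assumptions made up for the example.

```python
import os
import tempfile
from pathlib import Path


def build_demo_cache(cache_dir: Path) -> Path:
    """Create a toy cache tree with refs/, blobs/ and snapshots/ for one repo."""
    # Assumed folder naming: one folder per repo, prefixed by the repo type.
    repo_dir = cache_dir / "models--some-user--some-model"
    commit = "a" * 40   # placeholder commit hash
    blob_id = "b" * 40  # placeholder blob hash

    # blobs/ holds the actual file contents, named by their hash.
    blob = repo_dir / "blobs" / blob_id
    blob.parent.mkdir(parents=True)
    blob.write_text('{"hidden_size": 8}')

    # refs/ maps a revision name (here "main") to a commit hash.
    ref = repo_dir / "refs" / "main"
    ref.parent.mkdir(parents=True)
    ref.write_text(commit)

    # snapshots/<commit>/<filename> is a symlink pointing at the blob.
    pointer = repo_dir / "snapshots" / commit / "config.json"
    pointer.parent.mkdir(parents=True)
    os.symlink(os.path.relpath(blob, pointer.parent), pointer)
    return pointer


with tempfile.TemporaryDirectory() as tmp:
    pointer = build_demo_cache(Path(tmp))
    print(pointer.is_symlink(), pointer.resolve().parent.name)
```

Reading the file through the snapshot path returns the blob's content, which is why several revisions that share an unchanged file do not store it twice.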
If local_dir is provided, the file structure from the repo will be replicated in this location. You can configure how you want to move those files:
If local_dir_use_symlinks="auto" (default), files are downloaded and stored in the cache directory as blob files. Small files (<5MB) are duplicated in local_dir while a symlink is created for bigger files. The goal is to be able to manually edit and save small files without corrupting the cache while saving disk space for binary files. The 5MB threshold can be configured with the HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD environment variable.
If local_dir_use_symlinks=True, files are downloaded, stored in the cache directory and symlinked in local_dir. This is optimal in terms of disk usage but files must not be manually edited.
If local_dir_use_symlinks=False and the blob files exist in the cache directory, they are duplicated in the local dir. This means disk usage is not optimized.
Finally, if local_dir_use_symlinks=False and the blob files do not exist in the cache directory, then the files are downloaded and directly placed under local_dir. This means if you need to download them again later, they will be re-downloaded entirely.
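The four cases above amount to a small decision table. The following sketch restates them as a pure function; the function name, the returned labels, and the DEFAULT_THRESHOLD constant are hypothetical and chosen for illustration, and the code does not reproduce the library's internals.

```python
DEFAULT_THRESHOLD = 5 * 1024 * 1024  # the 5MB threshold mentioned above


def local_dir_strategy(use_symlinks, file_size, blob_in_cache,
                       threshold=DEFAULT_THRESHOLD):
    """Return how a file would end up in local_dir under the rules above."""
    if use_symlinks == "auto":
        # Small files are copied so they can be edited safely;
        # bigger files are symlinked to save disk space.
        return "copy_from_cache" if file_size < threshold else "symlink_to_cache"
    if use_symlinks is True:
        # Always symlink, whatever the file size.
        return "symlink_to_cache"
    if use_symlinks is False:
        # Copy when the blob is already cached, otherwise download
        # straight into local_dir without touching the cache.
        return "copy_from_cache" if blob_in_cache else "download_to_local_dir"
    raise ValueError(f"unsupported value: {use_symlinks!r}")


print(local_dir_strategy("auto", 1024, blob_in_cache=True))
print(local_dir_strategy(False, 1024, blob_in_cache=False))
```

Only the last case bypasses the cache entirely, which is why a later download of the same file starts from scratch.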
Raises the following errors:
huggingface_hub.hf_hub_url
( repo_id: str, filename: str, subfolder: typing.Optional[str] = None, repo_type: typing.Optional[str] = None, revision: typing.Optional[str] = None, endpoint: typing.Optional[str] = None )
Parameters
repo_id (str) -- A namespace (user or an organization) name and a repo name separated by a /.
filename (str) -- The name of the file in the repo.
subfolder (str, optional) -- An optional value corresponding to a folder inside the repo.
repo_type (str, optional) -- Set to "dataset" or "space" if downloading from a dataset or space, None or "model" if downloading from a model. Default is None.
revision (str, optional) -- An optional Git revision id which can be a branch name, a tag, or a commit hash.
Construct the URL of a file from the given information.
The resolved address can either be a huggingface.co-hosted url, or a link to Cloudfront (a Content Delivery Network, or CDN) for large files which are more than a few MBs.
Example:
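The original example did not survive on this page. As a stand-in, here is a standard-library sketch of how such a URL could be assembled from the parameters above. The base url https://huggingface.co, the resolve route, the datasets/ and spaces/ prefixes, the default revision "main", and the repo names are all assumptions made for illustration; the function sketch_file_url is hypothetical and is not the library's hf_hub_url.

```python
from urllib.parse import quote

ASSUMED_ENDPOINT = "https://huggingface.co"  # assumption: default base url
ASSUMED_PREFIXES = {None: "", "model": "", "dataset": "datasets/", "space": "spaces/"}


def sketch_file_url(repo_id, filename, subfolder=None, repo_type=None,
                    revision=None, endpoint=ASSUMED_ENDPOINT):
    """Assemble a file URL from repo coordinates (illustrative only)."""
    if repo_type not in ASSUMED_PREFIXES:
        raise ValueError(f"invalid repo type: {repo_type!r}")
    if subfolder:
        filename = f"{subfolder}/{filename}"
    # Assumption: the default branch is "main" when no revision is given.
    safe_revision = quote(revision or "main", safe="")
    safe_filename = quote(filename)  # keeps "/" so nested paths survive
    prefix = ASSUMED_PREFIXES[repo_type]
    return f"{endpoint}/{prefix}{repo_id}/resolve/{safe_revision}/{safe_filename}"


print(sketch_file_url("some-user/some-model", "config.json"))
```

The revision is percent-encoded with no safe characters so that a revision containing a slash stays a single path segment.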
Notes:
Cloudfront is replicated over the globe so downloads are way faster for the end user (and it also lowers our bandwidth costs).
Cloudfront aggressively caches files by default (default TTL is 24 hours), however this is not an issue here because we implement a git-based versioning system on huggingface.co, which means that we store the files on S3/Cloudfront in a content-addressable way (i.e., the file name is its hash). Using content-addressable filenames means cache can't ever be stale.
In terms of client-side caching from this library, we base our caching on the objects' entity tag (ETag), which is an identifier of a specific version of a resource [1]. An object's ETag is: its git-sha1 if stored in git, or its sha256 if stored in git-lfs.
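To make the two kinds of identifier concrete, the sketch below computes both for a byte string using hashlib. It relies on the standard git blob hashing scheme (SHA-1 over a "blob <size>" header, a NUL byte, and the content); the helper names are hypothetical, and whether the server formats its ETag header exactly this way is not something this example demonstrates.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """SHA-1 that git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def lfs_sha256(content: bytes) -> str:
    """SHA-256 of the raw content, as used to identify git-lfs objects."""
    return hashlib.sha256(content).hexdigest()


data = b"hello\n"
print(git_blob_sha1(data))
print(lfs_sha256(data))
```

Because both identifiers depend only on the bytes, any change to a file yields a new identifier, which is what makes content-addressed caching safe.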
References:
huggingface_hub.snapshot_download
( repo_id: str, repo_type: typing.Optional[str] = None, revision: typing.Optional[str] = None, endpoint: typing.Optional[str] = None, cache_dir: typing.Union[str, pathlib.Path, NoneType] = None, local_dir: typing.Union[str, pathlib.Path, NoneType] = None, local_dir_use_symlinks: typing.Union[bool, typing.Literal['auto']] = 'auto', library_name: typing.Optional[str] = None, library_version: typing.Optional[str] = None, user_agent: typing.Union[typing.Dict, str, NoneType] = None, proxies: typing.Optional[typing.Dict] = None, etag_timeout: float = 10, resume_download: bool = False, force_download: bool = False, token: typing.Union[str, bool, NoneType] = None, local_files_only: bool = False, allow_patterns: typing.Union[typing.List[str], str, NoneType] = None, ignore_patterns: typing.Union[typing.List[str], str, NoneType] = None, max_workers: int = 8, tqdm_class: typing.Optional[tqdm.asyncio.tqdm_asyncio] = None )
Parameters
repo_id (str) -- A user or an organization name and a repo name separated by a /.
repo_type (str, optional) -- Set to "dataset" or "space" if downloading from a dataset or space, None or "model" if downloading from a model. Default is None.
revision (str, optional) -- An optional Git revision id which can be a branch name, a tag, or a commit hash.
cache_dir (str, Path, optional) -- Path to the folder where cached files are stored.
local_dir (str or Path, optional) -- If provided, the downloaded files will be placed under this directory, either as symlinks (default) or regular files (see description for more details).
local_dir_use_symlinks ("auto" or bool, defaults to "auto") -- To be used with local_dir. If set to "auto", the cache directory will be used and the file will be either duplicated or symlinked to the local directory depending on its size. If set to True, a symlink will be created, no matter the file size. If set to False, the file will either be duplicated from cache (if it already exists there) or downloaded from the Hub and not cached. See description for more details.
library_name (str, optional) -- The name of the library to which the object corresponds.
library_version (str, optional) -- The version of the library.
user_agent (str, dict, optional) -- The user-agent info in the form of a dictionary or a string.
proxies (dict, optional) -- Dictionary mapping protocol to the URL of the proxy passed to requests.request.
etag_timeout (float, optional, defaults to 10) -- When fetching ETag, how many seconds to wait for the server to send data before giving up which is passed to requests.request.
resume_download (bool, optional, defaults to False) -- If True, resume a previously interrupted download.
force_download (bool, optional, defaults to False) -- Whether the file should be downloaded even if it already exists in the local cache.
token (str, bool, optional) -- A token to be used for the download.
If True, the token is read from the HuggingFace config folder.
If a string, it's used as the authentication token.
local_files_only (bool, optional, defaults to False) -- If True, avoid downloading the file and return the path to the local cached file if it exists.
allow_patterns (List[str] or str, optional) -- If provided, only files matching at least one pattern are downloaded.
ignore_patterns (List[str] or str, optional) -- If provided, files matching any of the patterns are not downloaded.
max_workers (int, optional) -- Number of concurrent threads to download files (1 thread = 1 file download). Defaults to 8.
tqdm_class (tqdm, optional) -- If provided, overwrites the default behavior for the progress bar. Passed argument must inherit from tqdm.auto.tqdm or at least mimic its behavior. Note that the tqdm_class is not passed to each individual download. Defaults to the custom HF progress bar that can be disabled by setting the HF_HUB_DISABLE_PROGRESS_BARS environment variable.
Download repo files.
Download a whole snapshot of a repo's files at the specified revision. This is useful when you want all files from a repo, because you don't know which ones you will need a priori. All files are nested inside a folder in order to keep their actual filename relative to that folder. You can also filter which files to download using allow_patterns and ignore_patterns.
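The combined effect of allow_patterns and ignore_patterns can be sketched with shell-style wildcards from the standard library. This is a minimal illustration that assumes fnmatch-style matching; the function filter_files and the file names are hypothetical, and the library's own matching rules may differ in details.

```python
from fnmatch import fnmatchcase


def filter_files(paths, allow_patterns=None, ignore_patterns=None):
    """Keep paths matching an allow pattern (if any) and no ignore pattern."""
    # Accept a single string as shorthand for a one-element list.
    if isinstance(allow_patterns, str):
        allow_patterns = [allow_patterns]
    if isinstance(ignore_patterns, str):
        ignore_patterns = [ignore_patterns]

    kept = []
    for path in paths:
        if allow_patterns is not None and not any(
            fnmatchcase(path, pattern) for pattern in allow_patterns
        ):
            continue  # matches none of the allowed patterns
        if ignore_patterns is not None and any(
            fnmatchcase(path, pattern) for pattern in ignore_patterns
        ):
            continue  # explicitly ignored
        kept.append(path)
    return kept


files = ["config.json", "model.safetensors", "logs/run1.txt", "README.md"]
print(filter_files(files, allow_patterns="*.json"))
print(filter_files(files, ignore_patterns=["logs/*", "*.md"]))
```

When both arguments are given, a file must pass the allow check first and then survive the ignore check.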
If local_dir is provided, the file structure from the repo will be replicated in this location. You can configure how you want to move those files:
If local_dir_use_symlinks="auto" (default), files are downloaded and stored in the cache directory as blob files. Small files (<5MB) are duplicated in local_dir while a symlink is created for bigger files. The goal is to be able to manually edit and save small files without corrupting the cache while saving disk space for binary files. The 5MB threshold can be configured with the HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD environment variable.
If local_dir_use_symlinks=True, files are downloaded, stored in the cache directory and symlinked in local_dir. This is optimal in terms of disk usage but files must not be manually edited.
If local_dir_use_symlinks=False and the blob files exist in the cache directory, they are duplicated in the local dir. This means disk usage is not optimized.
Finally, if local_dir_use_symlinks=False and the blob files do not exist in the cache directory, then the files are downloaded and directly placed under local_dir. This means if you need to download them again later, they will be re-downloaded entirely.
An alternative would be to clone the repo but this requires git and git-lfs to be installed and properly configured. It is also not possible to filter which files to download when cloning a repository using git.
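The max_workers parameter is described as one thread per file download. A generic way to express that pattern with the standard library is shown below. This is a sketch of the concurrency idea only: fetch_all and the stand-in fetch function are hypothetical, and nothing here performs a real download.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(filenames, fetch_one, max_workers=8):
    """Run fetch_one on every filename using at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() hands one file to each free thread and keeps input order.
        return list(pool.map(fetch_one, filenames))


def fake_fetch(name):
    # Stand-in for a real per-file download.
    return f"downloaded:{name}"


print(fetch_all(["config.json", "model.bin"], fake_fetch, max_workers=2))
```

Raising max_workers only helps while the network, rather than the disk or the server, is the bottleneck.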
Raises the following errors:
huggingface_hub.get_hf_file_metadata
( url: str, token: typing.Union[bool, str, NoneType] = None, proxies: typing.Optional[typing.Dict] = None, timeout: typing.Optional[float] = 10.0 )
Parameters
token (str or bool, optional) -- A token to be used for the download.
If True, the token is read from the HuggingFace config folder.
If False or None, no token is provided.
If a string, it's used as the authentication token.
proxies (dict, optional) -- Dictionary mapping protocol to the URL of the proxy passed to requests.request.
timeout (float, optional, defaults to 10) -- How many seconds to wait for the server to send metadata before giving up.
Fetch metadata of a file versioned on the Hub for a given url.
( commit_hash: typing.Optional[str], etag: typing.Optional[str], location: str, size: typing.Optional[int] )
Parameters
commit_hash (str, optional) -- The commit_hash related to the file.
etag (str, optional) -- Etag of the file on the server.
location (str) -- Location where to download the file. Can be a Hub url or not (CDN).
size (int, optional) -- Size of the file. In case of an LFS file, contains the size of the actual LFS file, not the pointer.
Data structure containing information about a file versioned on the Hub.
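A structure with these four fields can be modelled as a small dataclass. The sketch below is an illustration under stated assumptions, not the library's class: the name FileMetadataSketch, the helper needs_download, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileMetadataSketch:
    """Illustrative container mirroring the four documented fields."""
    commit_hash: Optional[str]
    etag: Optional[str]
    location: str
    size: Optional[int]


def needs_download(meta: FileMetadataSketch, cached_etag: Optional[str]) -> bool:
    """Hypothetical helper: fetch when the ETag is unknown or has changed."""
    return meta.etag is None or meta.etag != cached_etag


meta = FileMetadataSketch(
    commit_hash="a" * 40,                     # placeholder commit hash
    etag="b" * 40,                            # placeholder ETag
    location="https://example.invalid/file",  # placeholder location
    size=1024,
)
print(needs_download(meta, cached_etag="b" * 40))
```

Comparing the server's ETag with the one recorded at download time is enough to decide whether a cached copy is still current.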
The methods displayed above are designed to work with a caching system that prevents re-downloading files. The caching system was updated in v0.8.0 to become the central cache-system shared across libraries that depend on the Hub.
if token=True and the token cannot be found.
if ETag cannot be determined.
if some parameter value is invalid.
If the repository to download from cannot be found. This may be because it doesn't exist, or because it is set to private and you do not have access.
If the revision to download from cannot be found.
If the file to download cannot be found.
If network is disabled or unavailable and file is not found in cache.
endpoint (str, optional) -- Hugging Face Hub base url. Will default to . Otherwise, one can set the HF_ENDPOINT environment variable.
[1]
url (str) -- File url, for example returned by .
Returned by based on a URL.
Read the for a detailed presentation of caching at HF.