Builder classes

Builders

🤗 Datasets relies on two main classes during the dataset building process: DatasetBuilder and BuilderConfig.

class datasets.DatasetBuilder

(
cache_dir: typing.Optional[str] = None,
dataset_name: typing.Optional[str] = None,
config_name: typing.Optional[str] = None,
hash: typing.Optional[str] = None,
base_path: typing.Optional[str] = None,
info: typing.Optional[datasets.info.DatasetInfo] = None,
features: typing.Optional[datasets.features.features.Features] = None,
token: typing.Union[bool, str, NoneType] = None,
use_auth_token = 'deprecated',
repo_id: typing.Optional[str] = None,
data_files: typing.Union[str, list, dict, datasets.data_files.DataFilesDict, NoneType] = None,
data_dir: typing.Optional[str] = None,
storage_options: typing.Optional[dict] = None,
writer_batch_size: typing.Optional[int] = None,
name = 'deprecated',
**config_kwargs
)

Parameters

cache_dir (str, optional) — Directory to cache data. Defaults to "~/.cache/huggingface/datasets".
dataset_name (str, optional) — Name of the dataset, if different from the builder name. Useful for packaged builders like csv, imagefolder, audiofolder, etc. to reflect the difference between datasets that use the same packaged builder.
config_name (str, optional) — Name of the dataset configuration. It affects the data generated on disk. Different configurations will have their own subdirectories and versions. If not provided, the default configuration is used (if it exists).
Added in 2.3.0: the parameter name was renamed to config_name.
hash (str, optional) — Hash specific to the dataset code. Used to update the caching directory when the dataset loading script code is updated (to avoid reusing old data). The typical caching directory (defined in self._relative_data_dir) is name/version/hash/.
base_path (str, optional) — Base path for relative paths that are used to download files. This can be a remote URL.
features (Features, optional) — Features types to use with this dataset. It can be used to change the Features types of a dataset, for example.
token (str or bool, optional) — String or boolean to use as Bearer token for remote files on the Datasets Hub. If True, will get token from "~/.huggingface".
repo_id (str, optional) — ID of the dataset repository. Used to distinguish builders with the same name but not coming from the same namespace, for example “squad” and “lhoestq/squad” repo IDs. In the latter, the builder name would be “lhoestq___squad”.
data_files (str or Sequence or Mapping, optional) — Path(s) to source data file(s). For builders like “csv” or “json” that need the user to specify data files. They can be either local or remote files. For convenience, you can use a DataFilesDict.
data_dir (str, optional) — Path to directory containing source data file(s). Use only if data_files is not passed, in which case it is equivalent to passing os.path.join(data_dir, "**") as data_files. For builders that require manual download, it must be the path to the local directory containing the manually downloaded data.
storage_options (dict, optional) — Key/value pairs to be passed on to the dataset file-system backend, if any.
writer_batch_size (int, optional) — Batch size used by the ArrowWriter. It defines the number of samples that are kept in memory before writing them and also the length of the arrow chunks. None means that the ArrowWriter will use its default value.
name (str) — Configuration name for the dataset.
Deprecated in 2.3.0: use config_name instead.
**config_kwargs (additional keyword arguments) — Keyword arguments to be passed to the corresponding builder configuration class, set on the class attribute DatasetBuilder.BUILDER_CONFIG_CLASS. The builder configuration class is BuilderConfig or a subclass of it.
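
Two of the conventions in the parameter list can be illustrated with a short standard-library sketch: the repo_id to builder name mapping ("lhoestq/squad" becomes "lhoestq___squad") and the os.path.join(data_dir, "**") pattern that stands in for data_files. The helper names builder_name_from_repo_id and data_files_pattern are hypothetical and are not part of the library. The library resolves patterns through its own file-system layer, so Python's glob module is used here only as an analogy for what "**" covers.

```python
import glob
import os
import tempfile


def builder_name_from_repo_id(repo_id: str) -> str:
    """Hypothetical helper: apply the naming convention from the repo_id entry."""
    return repo_id.replace("/", "___")


def data_files_pattern(data_dir: str) -> str:
    """Hypothetical helper: the pattern described as equivalent to passing data_dir."""
    return os.path.join(data_dir, "**")


# Build a small directory tree to show what the "**" pattern covers.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "train"))
for rel in ("a.csv", os.path.join("train", "b.csv")):
    with open(os.path.join(root, rel), "w") as f:
        f.write("id\n0\n")

pattern = data_files_pattern(root)
# With recursive=True, "**" matches files at any depth under root.
matched = sorted(
    os.path.relpath(path, root)
    for path in glob.glob(pattern, recursive=True)
    if os.path.isfile(path)
)

print(builder_name_from_repo_id("lhoestq/squad"))
print(matched)
```

The point of the second half is that a data_dir argument picks up files in nested subdirectories as well as files at the top level.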

Abstract base class for all datasets.

DatasetBuilder has 3 key methods:

DatasetBuilder.info: Documents the dataset, including feature names, types, shapes, version, splits, citation, etc.
DatasetBuilder.download_and_prepare(): Downloads the source data and writes it to disk.
DatasetBuilder.as_dataset(): Generates a Dataset.

Some DatasetBuilders expose multiple variants of the dataset by defining a BuilderConfig subclass and accepting a config object (or name) on construction. Configurable datasets expose a pre-defined set of configurations in DatasetBuilder.builder_configs.

as_dataset
