Main classes
DatasetInfo
( description: str = <factory>, citation: str = <factory>, homepage: str = <factory>, license: str = <factory>, features: typing.Optional[datasets.features.features.Features] = None, post_processed: typing.Optional[datasets.info.PostProcessedInfo] = None, supervised_keys: typing.Optional[datasets.info.SupervisedKeysData] = None, task_templates: typing.Optional[typing.List[datasets.tasks.base.TaskTemplate]] = None, builder_name: typing.Optional[str] = None, dataset_name: typing.Optional[str] = None, config_name: typing.Optional[str] = None, version: typing.Union[str, datasets.utils.version.Version, NoneType] = None, splits: typing.Optional[dict] = None, download_checksums: typing.Optional[dict] = None, download_size: typing.Optional[int] = None, post_processing_size: typing.Optional[int] = None, dataset_size: typing.Optional[int] = None, size_in_bytes: typing.Optional[int] = None )
Parameters
description (str): A description of the dataset.
citation (str): A BibTeX citation of the dataset.
homepage (str): A URL to the official homepage for the dataset.
license (str): The dataset’s license. It can be the name of the license or a paragraph containing the terms of the license.
features (Features, optional): The features used to specify the dataset’s column types.
post_processed (PostProcessedInfo, optional): Information regarding the resources of a possible post-processing of a dataset. For example, it can contain the information of an index.
supervised_keys (SupervisedKeysData, optional): Specifies the input feature and the label for supervised learning if applicable for the dataset (legacy from TFDS).
builder_name (str, optional): The name of the GeneratorBasedBuilder subclass used to create the dataset. Usually matched to the corresponding script name. It is also the snake_case version of the dataset builder class name.
config_name (str, optional): The name of the configuration derived from BuilderConfig.
version (str or Version, optional): The version of the dataset.
splits (dict, optional): The mapping between split name and metadata.
download_checksums (dict, optional): The mapping between the URL to download the dataset’s checksums and corresponding metadata.
download_size (int, optional): The size of the files to download to generate the dataset, in bytes.
post_processing_size (int, optional): Size of the dataset in bytes after post-processing, if any.
dataset_size (int, optional): The combined size in bytes of the Arrow tables for all splits.
size_in_bytes (int, optional): The combined size in bytes of all files associated with the dataset (downloaded files + Arrow files).
task_templates (List[TaskTemplate], optional): The task templates to prepare the dataset for during training and evaluation. Each template casts the dataset’s Features to standardized column names and types as detailed in datasets.tasks.
**config_kwargs (additional keyword arguments): Keyword arguments to be passed to the BuilderConfig and used in the DatasetBuilder.
Information about a dataset.
DatasetInfo documents a dataset, including its name, version, and features. See the constructor arguments and properties for a full list.
Not all fields are known on construction and may be updated later.
from_directory
( dataset_info_dir: str, fs = 'deprecated', storage_options: typing.Optional[dict] = None )
Parameters
dataset_info_dir (str): The directory containing the metadata file. This should be the root directory of a specific dataset version.
fs (fsspec.spec.AbstractFileSystem, optional): Instance of the remote filesystem used to download the files from.
Deprecated in 2.9.0: fs was deprecated in version 2.9.0 and will be removed in 3.0.0. Please use storage_options instead, e.g. storage_options=fs.storage_options.
storage_options (dict, optional): Key/value pairs to be passed on to the file-system backend, if any.
Added in 2.9.0
This will overwrite all previous metadata.
write_to_directory
( dataset_info_dir, pretty_print = False, fs = 'deprecated', storage_options: typing.Optional[dict] = None )
Parameters
dataset_info_dir (str): Destination directory.
pretty_print (bool, defaults to False): If True, the JSON will be pretty-printed with an indent level of 4.
fs (fsspec.spec.AbstractFileSystem, optional): Instance of the remote filesystem used to download the files from.
Deprecated in 2.9.0: fs was deprecated in version 2.9.0 and will be removed in 3.0.0. Please use storage_options instead, e.g. storage_options=fs.storage_options.
storage_options (dict, optional): Key/value pairs to be passed on to the file-system backend, if any.
Added in 2.9.0
Write DatasetInfo and license (if present) as JSON files to dataset_info_dir.
Dataset
( arrow_table: Table, info: typing.Optional[datasets.info.DatasetInfo] = None, split: typing.Optional[datasets.splits.NamedSplit] = None, indices_table: typing.Optional[datasets.table.Table] = None, fingerprint: typing.Optional[str] = None )
A Dataset backed by an Arrow table.
add_column
( name: str, column: typing.Union[list, np.array], new_fingerprint: str )
Parameters
name (str): Column name.
column (list or np.array): Column data to be added.
Add column to Dataset.
Added in 1.7
add_item
( item: dict, new_fingerprint: str )
Parameters
item (dict): Item data to be added.
Add item to Dataset.
Added in 1.7
from_file
( filename: str, info: typing.Optional[datasets.info.DatasetInfo] = None, split: typing.Optional[datasets.splits.NamedSplit] = None, indices_filename: typing.Optional[str] = None, in_memory: bool = False )
Parameters
filename (str): File name of the dataset.
info (DatasetInfo, optional): Dataset information, like description, citation, etc.
split (NamedSplit, optional): Name of the dataset split.
indices_filename (str, optional): File names of the indices.
in_memory (bool, defaults to False): Whether to copy the data in-memory.
Instantiate a Dataset backed by an Arrow table at filename.
from_buffer
( buffer: Buffer, info: typing.Optional[datasets.info.DatasetInfo] = None, split: typing.Optional[datasets.splits.NamedSplit] = None, indices_buffer: typing.Optional[pyarrow.lib.Buffer] = None )
Parameters
buffer (pyarrow.Buffer): Arrow buffer.
info (DatasetInfo, optional): Dataset information, like description, citation, etc.
split (NamedSplit, optional): Name of the dataset split.
indices_buffer (pyarrow.Buffer, optional): Indices Arrow buffer.
Instantiate a Dataset backed by an Arrow buffer.
from_pandas
( df: DataFrame, features: typing.Optional[datasets.features.features.Features] = None, info: typing.Optional[datasets.info.DatasetInfo] = None, split: typing.Optional[datasets.splits.NamedSplit] = None, preserve_index: typing.Optional[bool] = None )
Parameters
df (pandas.DataFrame): Dataframe that contains the dataset.
info (DatasetInfo, optional): Dataset information, like description, citation, etc.
split (NamedSplit, optional): Name of the dataset split.
preserve_index (bool, optional): Whether to store the index as an additional column in the resulting Dataset. The default of None will store the index as a column, except for RangeIndex which is stored as metadata only. Use preserve_index=True to force it to be stored as a column.
The column types in the resulting Arrow Table are inferred from the dtypes of the pandas.Series in the DataFrame. In the case of non-object Series, the NumPy dtype is translated to its Arrow equivalent. In the case of object, we need to guess the datatype by looking at the Python objects in this Series.
Be aware that Series of the object dtype don’t carry enough information to always lead to a meaningful Arrow type. In the case that we cannot infer a type, e.g. because the DataFrame is of length 0 or the Series only contains None/nan objects, the type is set to null. This behavior can be avoided by constructing explicit features and passing it to this function.
from_dict
( mapping: dict, features: typing.Optional[datasets.features.features.Features] = None, info: typing.Optional[datasets.info.DatasetInfo] = None, split: typing.Optional[datasets.splits.NamedSplit] = None )
Parameters
mapping (Mapping): Mapping of strings to Arrays or Python lists.
info (DatasetInfo, optional): Dataset information, like description, citation, etc.
split (NamedSplit, optional): Name of the dataset split.
from_generator
( generator: typing.Callable, features: typing.Optional[datasets.features.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, gen_kwargs: typing.Optional[dict] = None, num_proc: typing.Optional[int] = None, **kwargs )
Parameters
generator (Callable): A generator function that yields examples.
cache_dir (str, optional, defaults to "~/.cache/huggingface/datasets"): Directory to cache data.
keep_in_memory (bool, defaults to False): Whether to copy the data in-memory.
gen_kwargs (dict, optional): Keyword arguments to be passed to the generator callable. You can define a sharded dataset by passing the list of shards in gen_kwargs.
num_proc (int, optional, defaults to None): Number of processes when downloading and generating the dataset locally. This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default.
Added in 2.7.0
**kwargs (additional keyword arguments): Keyword arguments to be passed to GeneratorConfig.
Create a Dataset from a generator.
data
( )
The Apache Arrow table backing the dataset.
cache_files
( )
The cache files containing the Apache Arrow table backing the dataset.
num_columns
( )
Number of columns in the dataset.
num_rows
( )
Number of rows in the dataset (same as len(dataset)).
column_names
( )
Names of the columns in the dataset.
shape
( )
Shape of the dataset (number of rows, number of columns).
unique
( column: str ) → list
Parameters
column (str): Column name.
Returns
list: List of unique elements in the given column.
Return a list of the unique elements in a column.
This is implemented in the low-level backend and, as such, is very fast.
flatten
Parameters
new_fingerprint (str, optional): The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.
Returns
A copy of the dataset with flattened columns.
Flatten the table. Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.
cast
Parameters
batch_size (int, defaults to 1000): Number of examples per batch provided to cast. If batch_size <= 0 or batch_size == None, the full dataset is provided as a single batch to cast.
keep_in_memory (bool, defaults to False): Whether to copy the data in-memory.
load_from_cache_file (bool, defaults to True if caching is enabled): If a cache file storing the current computation from function can be identified, use it instead of recomputing.
cache_file_name (str, optional, defaults to None): Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.
num_proc (int, optional, defaults to None): Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
Returns
A copy of the dataset with cast features.
Cast the dataset to a new set of features.
cast_column
( column: str, feature: typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.Sequence, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image], new_fingerprint: typing.Optional[str] = None )
Parameters
column (str): Column name.
feature (FeatureType): Target feature.
new_fingerprint (str, optional): The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.
Cast column to feature for decoding.
remove_columns
Parameters
column_names (Union[str, List[str]]): Name of the column(s) to remove.
new_fingerprint (str, optional): The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.
Returns
A copy of the dataset object without the columns to remove.
Remove one or several column(s) in the dataset and the features associated with them.
rename_column
Parameters
original_column_name (str): Name of the column to rename.
new_column_name (str): New name for the column.
new_fingerprint (str, optional): The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.
Returns
A copy of the dataset with a renamed column.
Rename a column in the dataset, and move the features associated with the original column under the new column name.
rename_columns
Parameters
column_mapping (Dict[str, str]): A mapping of columns to rename to their new names.
new_fingerprint (str, optional): The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.
Returns
A copy of the dataset with renamed columns.
Rename several columns in the dataset, and move the features associated with the original columns under the new column names.
select_columns
Parameters
column_names (Union[str, List[str]]): Name of the column(s) to keep.
new_fingerprint (str, optional): The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.
Returns
A copy of the dataset object which only consists of selected columns.
Select one or several column(s) in the dataset and the features associated with them.
class_encode_column
( column: str, include_nulls: bool = False )
Parameters
column (str): Name of the column to encode.
include_nulls (bool, defaults to False): Whether to include null values in the class labels. If True, the null values will be encoded as the "None" class label.
Added in 1.14.2
Cast the given column as ClassLabel and update the table.
__len__
( )
Number of rows in the dataset.
__iter__
( )
Iterate through the examples.
iter
( batch_size: int, drop_last_batch: bool = False )
Parameters
batch_size (int): Size of each batch to yield.
drop_last_batch (bool, defaults to False): Whether a last batch smaller than batch_size should be dropped.
Iterate through the batches of size batch_size.
If a format is set with set_format, rows will be returned in the selected format.
formatted_as
( type: typing.Optional[str] = None, columns: typing.Optional[typing.List] = None, output_all_columns: bool = False, **format_kwargs )
Parameters
type (str, optional): Output type selected in [None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']. None means __getitem__ returns python objects (default).
columns (List[str], optional): Columns to format in the output. None means __getitem__ returns all columns (default).
output_all_columns (bool, defaults to False): Keep un-formatted columns as well in the output (as python objects).
**format_kwargs (additional keyword arguments): Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
To be used in a with statement. Set __getitem__ return format (type and columns).
set_format
( type: typing.Optional[str] = None, columns: typing.Optional[typing.List] = None, output_all_columns: bool = False, **format_kwargs )
Parameters
type (str, optional): Output type selected in [None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']. None means __getitem__ returns python objects (default).
columns (List[str], optional): Columns to format in the output. None means __getitem__ returns all columns (default).
output_all_columns (bool, defaults to False): Keep un-formatted columns as well in the output (as python objects).
**format_kwargs (additional keyword arguments): Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
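If map is called on the dataset after set_format and adds a new column, the list of formatted columns gets updated: the new column is formatted too, so the formatted columns become all columns minus the previously unformatted ones.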
set_transform
( transform: typing.Optional[typing.Callable], columns: typing.Optional[typing.List] = None, output_all_columns: bool = False )
Parameters
columns (List[str], optional): Columns to format in the output. If specified, then the input batch of the transform only contains those columns.
output_all_columns (bool, defaults to False): Keep un-formatted columns as well in the output (as python objects). If set to True, then the other un-formatted columns are kept with the output of the transform.
reset_format
( )
Reset __getitem__ return format to python objects and all columns.
Same as self.set_format().
with_format
( type: typing.Optional[str] = None, columns: typing.Optional[typing.List] = None, output_all_columns: bool = False, **format_kwargs )
Parameters
type (str, optional): Output type selected in [None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']. None means __getitem__ returns python objects (default).
columns (List[str], optional): Columns to format in the output. None means __getitem__ returns all columns (default).
output_all_columns (bool, defaults to False): Keep un-formatted columns as well in the output (as python objects).
**format_kwargs (additional keyword arguments): Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
Set __getitem__ return format (type and columns). The data formatting is applied on-the-fly. The format type (for example “numpy”) is used to format batches when using __getitem__.
with_transform
( transform: typing.Optional[typing.Callable], columns: typing.Optional[typing.List] = None, output_all_columns: bool = False )
Parameters
columns (List[str], optional): Columns to format in the output. If specified, then the input batch of the transform only contains those columns.
output_all_columns (bool, defaults to False): Keep un-formatted columns as well in the output (as python objects). If set to True, then the other un-formatted columns are kept with the output of the transform.
Set __getitem__ return format using this transform. The transform is applied on-the-fly on batches when __getitem__ is called.
__getitem__
( key )
Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools).
cleanup_cache_files
( ) → int
Returns
int: Number of removed files.
Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one.
Be careful when running this command that no other process is currently using other cache files.
map
( function: typing.Optional[typing.Callable] = None, with_indices: bool = False, with_rank: bool = False, input_columns: typing.Union[str, typing.List[str], NoneType] = None, batched: bool = False, batch_size: typing.Optional[int] = 1000, drop_last_batch: bool = False, remove_columns: typing.Union[str, typing.List[str], NoneType] = None, keep_in_memory: bool = False, load_from_cache_file: typing.Optional[bool] = None, cache_file_name: typing.Optional[str] = None, writer_batch_size: typing.Optional[int] = 1000, features: typing.Optional[datasets.features.features.Features] = None, disable_nullable: bool = False, fn_kwargs: typing.Optional[dict] = None, num_proc: typing.Optional[int] = None, suffix_template: str = '_{rank:05d}_of_{num_proc:05d}', new_fingerprint: typing.Optional[str] = None, desc: typing.Optional[str] = None )
Parameters
function (Callable): Function with one of the following signatures:
function(example: Dict[str, Any]) -> Dict[str, Any] if batched=False and with_indices=False and with_rank=False
function(example: Dict[str, Any], *extra_args) -> Dict[str, Any] if batched=False and with_indices=True and/or with_rank=True (one extra arg for each)
function(batch: Dict[str, List]) -> Dict[str, List] if batched=True and with_indices=False and with_rank=False
function(batch: Dict[str, List], *extra_args) -> Dict[str, List] if batched=True and with_indices=True and/or with_rank=True (one extra arg for each)
For advanced usage, the function can also return a pyarrow.Table. Moreover, if your function returns nothing (None), then map will run your function and return the dataset unchanged. If no function is provided, it defaults to the identity function: lambda x: x.
with_indices (bool, defaults to False): Provide example indices to function. Note that in this case the signature of function should be def function(example, idx[, rank]): ...
with_rank (bool, defaults to False): Provide process rank to function. Note that in this case the signature of function should be def function(example[, idx], rank): ...
input_columns (Optional[Union[str, List[str]]], defaults to None): The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
batched (bool, defaults to False): Provide batch of examples to function.
batch_size (int, optional, defaults to 1000): Number of examples per batch provided to function if batched=True. If batch_size <= 0 or batch_size == None, provide the full dataset as a single batch to function.
drop_last_batch (bool, defaults to False): Whether a last batch smaller than the batch_size should be dropped instead of being processed by the function.
remove_columns (Optional[Union[str, List[str]]], defaults to None): Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.
keep_in_memory (bool, defaults to False): Keep the dataset in memory instead of writing it to a cache file.
load_from_cache_file (Optional[bool], defaults to True if caching is enabled): If a cache file storing the current computation from function can be identified, use it instead of recomputing.
cache_file_name (str, optional, defaults to None): Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.
writer_batch_size (int, defaults to 1000): Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running map.
features (Optional[datasets.Features], defaults to None): Use a specific Features to store the cache file instead of the automatically generated one.
disable_nullable (bool, defaults to False): Disallow null values in the table.
fn_kwargs (Dict, optional, defaults to None): Keyword arguments to be passed to function.
num_proc (int, optional, defaults to None): Max number of processes when generating cache. Already cached shards are loaded sequentially.
suffix_template (str): If cache_file_name is specified, then this suffix will be added at the end of the base name of each. Defaults to "_{rank:05d}_of_{num_proc:05d}". For example, if cache_file_name is "processed.arrow", then for rank=1 and num_proc=4, the resulting file would be "processed_00001_of_00004.arrow" for the default suffix.
new_fingerprint (str, optional, defaults to None): The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.
desc (str, optional, defaults to None): Meaningful description to be displayed alongside the progress bar while mapping examples.
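The suffix_template naming rule described in the parameter list above can be reproduced with plain string formatting. The helper shard_cache_name below is hypothetical and only mirrors the documented example; it is not the library's own implementation.

```python
import os


def shard_cache_name(cache_file_name, rank, num_proc,
                     suffix_template="_{rank:05d}_of_{num_proc:05d}"):
    # Hypothetical helper: insert the formatted suffix before the extension.
    base, ext = os.path.splitext(cache_file_name)
    return base + suffix_template.format(rank=rank, num_proc=num_proc) + ext


print(shard_cache_name("processed.arrow", rank=1, num_proc=4))
```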
Apply a function to all the examples in the table (individually or in batches) and update the table. If your function returns a column that already exists, then it overwrites it.
You can specify whether the function should be batched or not with the batched parameter:
If batched is False, then the function takes 1 example in and should return 1 example. An example is a dictionary, e.g. {"text": "Hello there !"}.
If batched is True and batch_size is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. A batch is a dictionary, e.g. a batch of 1 example is {"text": ["Hello there !"]}.
If batched is True and batch_size is n > 1, then the function takes a batch of n examples as input and can return a batch with n examples, or with an arbitrary number of examples. Note that the last batch may have fewer than n examples. A batch is a dictionary, e.g. a batch of n examples is {"text": ["Hello there !"] * n}.
filter
( function: typing.Optional[typing.Callable] = None, with_indices = False, input_columns: typing.Union[str, typing.List[str], NoneType] = None, batched: bool = False, batch_size: typing.Optional[int] = 1000, keep_in_memory: bool = False, load_from_cache_file: typing.Optional[bool] = None, cache_file_name: typing.Optional[str] = None, writer_batch_size: typing.Optional[int] = 1000, fn_kwargs: typing.Optional[dict] = None, num_proc: typing.Optional[int] = None, suffix_template: str = '_{rank:05d}_of_{num_proc:05d}', new_fingerprint: typing.Optional[str] = None, desc: typing.Optional[str] = None )
Parameters
function (Callable): Callable with one of the following signatures:
function(example: Dict[str, Any]) -> bool if with_indices=False, batched=False
function(example: Dict[str, Any], indices: int) -> bool if with_indices=True, batched=False
function(example: Dict[str, List]) -> List[bool] if with_indices=False, batched=True
function(example: Dict[str, List], indices: List[int]) -> List[bool] if with_indices=True, batched=True
If no function is provided, defaults to an always True function: lambda x: True.
with_indices (bool, defaults to False): Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ...
input_columns (str or List[str], optional): The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
batched (bool, defaults to False): Provide batch of examples to function.
batch_size (int, optional, defaults to 1000): Number of examples per batch provided to function if batched = True. If batched = False, one example per batch is passed to function. If batch_size <= 0 or batch_size == None, provide the full dataset as a single batch to function.
keep_in_memory (bool, defaults to False): Keep the dataset in memory instead of writing it to a cache file.
load_from_cache_file (Optional[bool], defaults to True if caching is enabled): If a cache file storing the current computation from function can be identified, use it instead of recomputing.
cache_file_name (str, optional): Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.
writer_batch_size (int, defaults to 1000): Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running map.
fn_kwargs (dict, optional): Keyword arguments to be passed to function.
num_proc (int, optional): Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
suffix_template (str): If cache_file_name is specified, then this suffix will be added at the end of the base name of each. For example, if cache_file_name is "processed.arrow", then for rank = 1 and num_proc = 4, the resulting file would be "processed_00001_of_00004.arrow" for the default suffix (default _{rank:05d}_of_{num_proc:05d}).
new_fingerprint (str, optional): The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.
desc (str, optional, defaults to None): Meaningful description to be displayed alongside the progress bar while filtering examples.
Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function.
select
( indices: typing.Iterable, keep_in_memory: bool = False, indices_cache_file_name: typing.Optional[str] = None, writer_batch_size: typing.Optional[int] = 1000, new_fingerprint: typing.Optional[str] = None )
Parameters
indices (range, list, iterable, ndarray or Series): Range, list or 1D-array of integer indices for indexing. If the indices correspond to a contiguous range, the Arrow table is simply sliced. However, passing a list of indices that are not contiguous creates an indices mapping, which is much less efficient, but still faster than recreating an Arrow table made of the requested rows.
keep_in_memory (bool, defaults to False): Keep the indices mapping in memory instead of writing it to a cache file.
indices_cache_file_name (str, optional, defaults to None): Provide the name of a path for the cache file. It is used to store the indices mapping instead of the automatically generated cache file name.
writer_batch_size (int, defaults to 1000): Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running map.
new_fingerprint (str, optional, defaults to None): The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint and the transform arguments.
Create a new dataset with rows selected following the list/array of indices.
sort
( column_names: typing.Union[str, typing.Sequence[str]]reverse: typing.Union[bool, typing.Sequence[bool]] = Falsekind = 'deprecated'null_placement: str = 'at_end'keep_in_memory: bool = Falseload_from_cache_file: typing.Optional[bool] = Noneindices_cache_file_name: typing.Optional[str] = Nonewriter_batch_size: typing.Optional[int] = 1000new_fingerprint: typing.Optional[str] = None )
Parameters
column_names (Union[str, Sequence[str]]
) — Column name(s) to sort by.
reverse (Union[bool, Sequence[bool]]
, defaults to False
) — If True
, sort by descending order rather than ascending. If a single bool is provided, the value is applied to the sorting of all column names. Otherwise a list of bools with the same length and order as column_names must be provided.
kind (str
, optional) — Pandas algorithm for sorting selected in {quicksort, mergesort, heapsort, stable}
, The default is quicksort
. Note that both stable
and mergesort
use timsort
under the covers and, in general, the actual implementation will vary with data type. The mergesort
option is retained for backwards compatibility.
Deprecated in 2.8.0
kind
was deprecated in version 2.10.0 and will be removed in 3.0.0.
null_placement (str
, defaults to at_end
) — Put None
values at the beginning if at_start
or first
or at the end if at_end
or last
Added in 1.14.2
keep_in_memory (bool
, defaults to False
) — Keep the sorted indices in memory instead of writing it to a cache file.
load_from_cache_file (Optional[bool]
, defaults to True
if caching is enabled) — If a cache file storing the sorted indices can be identified, use it instead of recomputing.
indices_cache_file_name (str
, optional, defaults to None
) — Provide the name of a path for the cache file. It is used to store the sorted indices instead of the automatically generated cache file name.
writer_batch_size (int
, defaults to 1000
) — Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory.
new_fingerprint (str
, optional, defaults to None
) — The new fingerprint of the dataset after transform. If None
, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Create a new dataset sorted according to a single or multiple columns.
Example:
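The ordering rules described in the parameters (one reverse flag per column, None values placed according to null_placement) can be sketched in plain Python on a list of dicts. This is only an illustration of the semantics: sort_rows is a hypothetical helper, not the library's implementation, and it assumes None values stay at the chosen end even for descending columns.

```python
def sort_rows(rows, column_names, reverse=False, null_placement="at_end"):
    """Sort a list of dict rows by one or several columns (illustrative sketch)."""
    if isinstance(column_names, str):
        column_names = [column_names]
    if isinstance(reverse, bool):
        # A single bool applies to every sort column.
        reverse = [reverse] * len(column_names)
    if len(reverse) != len(column_names):
        raise ValueError("reverse must have the same length as column_names")
    nulls_last = null_placement in ("at_end", "last")

    result = list(rows)
    # Stable sorts applied from the least to the most significant column.
    for name, descending in reversed(list(zip(column_names, reverse))):
        non_null = [row for row in result if row[name] is not None]
        nulls = [row for row in result if row[name] is None]
        non_null.sort(key=lambda row: row[name], reverse=descending)
        result = non_null + nulls if nulls_last else nulls + non_null
    return result


rows = [
    {"label": 1, "text": "b"},
    {"label": 0, "text": "a"},
    {"label": 1, "text": "a"},
    {"label": None, "text": "c"},
]
# Descending by label, then ascending by text within equal labels.
by_label_then_text = sort_rows(rows, ["label", "text"], reverse=[True, False])
```

Passing a single bool for reverse applies it to every column, which mirrors the behaviour described for the reverse parameter.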
shuffle
( seed: typing.Optional[int] = Nonegenerator: typing.Optional[numpy.random._generator.Generator] = Nonekeep_in_memory: bool = Falseload_from_cache_file: typing.Optional[bool] = Noneindices_cache_file_name: typing.Optional[str] = Nonewriter_batch_size: typing.Optional[int] = 1000new_fingerprint: typing.Optional[str] = None )
Parameters
seed (int
, optional) — A seed to initialize the default BitGenerator if generator=None
. If None
, then fresh, unpredictable entropy will be pulled from the OS. If an int
or array_like[ints]
is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state.
generator (numpy.random.Generator
, optional) — Numpy random Generator to use to compute the permutation of the dataset rows. If generator=None
(default), uses np.random.default_rng
(the default BitGenerator (PCG64) of NumPy).
keep_in_memory (bool
, defaults to False
) — Keep the shuffled indices in memory instead of writing them to a cache file.
load_from_cache_file (Optional[bool]
, defaults to True
if caching is enabled) — If a cache file storing the shuffled indices can be identified, use it instead of recomputing.
indices_cache_file_name (str
, optional) — Provide the name of a path for the cache file. It is used to store the shuffled indices instead of the automatically generated cache file name.
writer_batch_size (int
, defaults to 1000
) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running map
.
new_fingerprint (str
, optional, defaults to None
) — The new fingerprint of the dataset after transform. If None
, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Create a new Dataset where the rows are shuffled.
Currently shuffling uses numpy random generators. You can either supply a NumPy BitGenerator to use, or a seed to initiate NumPy’s default random generator (PCG64).
This may take a lot of time depending on the size of your dataset though:
It only shuffles the shards order and adds a shuffle buffer to your dataset, which keeps the speed of your dataset optimal:
Example:
train_test_split
( test_size: typing.Union[float, int, NoneType] = Nonetrain_size: typing.Union[float, int, NoneType] = Noneshuffle: bool = Truestratify_by_column: typing.Optional[str] = Noneseed: typing.Optional[int] = Nonegenerator: typing.Optional[numpy.random._generator.Generator] = Nonekeep_in_memory: bool = Falseload_from_cache_file: typing.Optional[bool] = Nonetrain_indices_cache_file_name: typing.Optional[str] = Nonetest_indices_cache_file_name: typing.Optional[str] = Nonewriter_batch_size: typing.Optional[int] = 1000train_new_fingerprint: typing.Optional[str] = Nonetest_new_fingerprint: typing.Optional[str] = None )
Parameters
test_size (float
or int
, optional) — Size of the test split. If float
, should be between 0.0
and 1.0
and represent the proportion of the dataset to include in the test split. If int
, represents the absolute number of test samples. If None
, the value is set to the complement of the train size. If train_size
is also None
, it will be set to 0.25
.
train_size (float
or int
, optional) — Size of the train split. If float
, should be between 0.0
and 1.0
and represent the proportion of the dataset to include in the train split. If int
, represents the absolute number of train samples. If None
, the value is automatically set to the complement of the test size.
shuffle (bool
, optional, defaults to True
) — Whether or not to shuffle the data before splitting.
stratify_by_column (str
, optional, defaults to None
) — The column name of labels to be used to perform stratified split of data.
seed (int
, optional) — A seed to initialize the default BitGenerator if generator=None
. If None
, then fresh, unpredictable entropy will be pulled from the OS. If an int
or array_like[ints]
is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state.
generator (numpy.random.Generator
, optional) — Numpy random Generator to use to compute the permutation of the dataset rows. If generator=None
(default), uses np.random.default_rng
(the default BitGenerator (PCG64) of NumPy).
keep_in_memory (bool
, defaults to False
) — Keep the split indices in memory instead of writing them to a cache file.
load_from_cache_file (Optional[bool]
, defaults to True
if caching is enabled) — If a cache file storing the splits indices can be identified, use it instead of recomputing.
train_indices_cache_file_name (str
, optional) — Provide the name of a path for the cache file. It is used to store the train split indices instead of the automatically generated cache file name.
test_indices_cache_file_name (str
, optional) — Provide the name of a path for the cache file. It is used to store the test split indices instead of the automatically generated cache file name.
writer_batch_size (int
, defaults to 1000
) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running map
.
train_new_fingerprint (str
, optional, defaults to None
) — The new fingerprint of the train set after transform. If None
, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
test_new_fingerprint (str
, optional, defaults to None
) — The new fingerprint of the test set after transform. If None
, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
This method is similar to scikit-learn train_test_split
.
Example:
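How test_size and train_size interact can be sketched as a small size-resolution function. This is a hedged illustration, not the library code: resolve_split_sizes is a hypothetical name, and the rounding (ceiling for the test split, floor for the train split) is an assumption borrowed from the scikit-learn convention that this method is said to resemble.

```python
import math


def resolve_split_sizes(num_rows, test_size=None, train_size=None):
    """Return (n_train, n_test) row counts from float/int/None sizes (sketch)."""
    if test_size is None and train_size is None:
        # Documented default when neither size is given.
        test_size = 0.25

    def to_count(size, rounding):
        if size is None:
            return None
        if isinstance(size, float):
            if not 0.0 < size < 1.0:
                raise ValueError("float sizes must be between 0.0 and 1.0")
            return rounding(size * num_rows)
        return size  # an int is an absolute number of rows

    n_test = to_count(test_size, math.ceil)     # assumed rounding
    n_train = to_count(train_size, math.floor)  # assumed rounding
    if n_test is None:
        n_test = num_rows - n_train   # complement of the train size
    if n_train is None:
        n_train = num_rows - n_test   # complement of the test size
    if n_train + n_test > num_rows:
        raise ValueError("train and test sizes exceed the number of rows")
    return n_train, n_test


default_sizes = resolve_split_sizes(100)                 # 25% goes to the test split
fractional = resolve_split_sizes(10, test_size=0.25)
absolute_train = resolve_split_sizes(10, train_size=8)
```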
shard
( num_shards: intindex: intcontiguous: bool = Falsekeep_in_memory: bool = Falseindices_cache_file_name: typing.Optional[str] = Nonewriter_batch_size: typing.Optional[int] = 1000 )
Parameters
num_shards (int
) — How many shards to split the dataset into.
index (int
) — Which shard to select and return.
contiguous (bool
, defaults to False
) — Whether to select contiguous blocks of indices for shards.
keep_in_memory (bool
, defaults to False
) — Keep the dataset in memory instead of writing it to a cache file.
indices_cache_file_name (str
, optional) — Provide the name of a path for the cache file. It is used to store the indices of each shard instead of the automatically generated cache file name.
writer_batch_size (int
, defaults to 1000
) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running map
.
Return the index
-th shard of the dataset split into num_shards
pieces.
This shards deterministically. dset.shard(n, i)
will contain all elements of dset whose index mod n = i
.
dset.shard(n, i, contiguous=True)
will instead split dset into contiguous chunks, so it can be easily concatenated back together after processing. If len(dset) % n == l
, then the first l
shards will have length (len(dset) // n) + 1
, and the remaining shards will have length (len(dset) // n)
. datasets.concatenate_datasets([dset.shard(n, i, contiguous=True) for i in range(n)])
will return a dataset with the same order as the original.
Be sure to shard before using any randomizing operator (such as shuffle
). It is best if the shard operator is used early in the dataset pipeline.
Example:
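The two sharding modes described above come down to simple index arithmetic, which the following plain-Python sketch illustrates. shard_indices is a hypothetical helper that returns the row indices a given shard would contain; it follows the rules stated in the text and is not taken from the library.

```python
def shard_indices(num_rows, num_shards, index, contiguous=False):
    """Row indices that end up in shard `index` out of `num_shards` (sketch)."""
    if not 0 <= index < num_shards:
        raise ValueError("index must be in [0, num_shards - 1]")
    if contiguous:
        # The first `mod` shards get one extra row each.
        div, mod = divmod(num_rows, num_shards)
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        return list(range(start, end))
    # Default mode: every row whose position modulo num_shards equals index.
    return list(range(index, num_rows, num_shards))


strided = shard_indices(10, 3, 0)
contiguous_shards = [shard_indices(10, 3, i, contiguous=True) for i in range(3)]
# Concatenating contiguous shards in order restores the original row order.
restored = [row for shard in contiguous_shards for row in shard]
```

With 10 rows and 3 shards, 10 % 3 == 1, so only the first contiguous shard has (10 // 3) + 1 rows.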
to_tf_dataset
( batch_size: typing.Optional[int] = Nonecolumns: typing.Union[str, typing.List[str], NoneType] = Noneshuffle: bool = Falsecollate_fn: typing.Optional[typing.Callable] = Nonedrop_remainder: bool = Falsecollate_fn_args: typing.Union[typing.Dict[str, typing.Any], NoneType] = Nonelabel_cols: typing.Union[str, typing.List[str], NoneType] = Noneprefetch: bool = Truenum_workers: int = 0num_test_batches: int = 20 )
Parameters
batch_size (int
, optional) — Size of batches to load from the dataset. Defaults to None
, which implies that the dataset won’t be batched, but the returned dataset can be batched later with tf_dataset.batch(batch_size)
.
columns (List[str]
or str
, optional) — Dataset column(s) to load in the tf.data.Dataset
. Column names that are created by the collate_fn
and that do not exist in the original dataset can be used.
shuffle (bool
, defaults to False
) — Shuffle the dataset order when loading. Recommended True
for training, False
for validation/evaluation.
drop_remainder (bool
, defaults to False
) — Drop the last incomplete batch when loading. Ensures that all batches yielded by the dataset will have the same length on the batch dimension.
collate_fn (Callable
, optional) — A function or callable object (such as a DataCollator
) that will collate lists of samples into a batch.
collate_fn_args (Dict
, optional) — An optional dict
of keyword arguments to be passed to the collate_fn
.
label_cols (List[str]
or str
, defaults to None
) — Dataset column(s) to load as labels. Note that many models compute loss internally rather than letting Keras do it, in which case passing the labels here is optional, as long as they’re in the input columns
.
prefetch (bool
, defaults to True
) — Whether to run the dataloader in a separate thread and maintain a small buffer of batches for training. Improves performance by allowing data to be loaded in the background while the model is training.
num_workers (int
, defaults to 0
) — Number of workers to use for loading the dataset. Only supported on Python versions >= 3.8.
num_test_batches (int
, defaults to 20
) — Number of batches to use to infer the output signature of the dataset. The higher this number, the more accurate the signature will be, but the longer it will take to create the dataset.
Create a tf.data.Dataset
from the underlying Dataset. This tf.data.Dataset
will load and collate batches from the Dataset, and is suitable for passing to methods like model.fit()
or model.predict()
. The dataset will yield dicts
for both inputs and labels unless the dict
would contain only a single key, in which case a raw tf.Tensor
is yielded instead.
Example:
push_to_hub
( repo_id: strconfig_name: str = 'default'split: typing.Optional[str] = Noneprivate: typing.Optional[bool] = Falsetoken: typing.Optional[str] = Nonebranch: typing.Optional[str] = Nonemax_shard_size: typing.Union[str, int, NoneType] = Nonenum_shards: typing.Optional[int] = Noneembed_external_files: bool = True )
Parameters
repo_id (str
) — The ID of the repository to push to in the following format: <user>/<dataset_name>
or <org>/<dataset_name>
. Also accepts <dataset_name>
, which will default to the namespace of the logged-in user.
config_name (str
, defaults to “default”) — The configuration name of a dataset.
split (str
, optional) — The name of the split that will be given to that dataset. Defaults to self.split
.
private (bool
, optional, defaults to False
) — Whether the dataset repository should be set to private or not. Only affects repository creation: a repository that already exists will not be affected by that parameter.
token (str
, optional) — An optional authentication token for the BOINC AI Hub. If no token is passed, will default to the token saved locally when logging in with boincai-cli login
. Will raise an error if no token is passed and the user is not logged-in.
branch (str
, optional) — The git branch on which to push the dataset. This defaults to the default branch as specified in your repository, which defaults to "main"
.
max_shard_size (int
or str
, optional, defaults to "500MB"
) — The maximum size of the dataset shards to be uploaded to the hub. If expressed as a string, needs to be digits followed by a unit (like "5MB"
).
num_shards (int
, optional) — Number of shards to write. By default the number of shards depends on max_shard_size
.
Added in 2.8.0
embed_external_files (bool
, defaults to True
) — Whether to embed file bytes in the shards. In particular, this will do the following before the push for the fields of type:
Pushes the dataset to the hub as a Parquet dataset. The dataset is pushed using HTTP requests and does not require git or git-lfs to be installed.
Example:
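The relationship between max_shard_size and the number of shards can be sketched as follows. Both helpers are hypothetical and the details are assumptions rather than the library's behaviour: units are taken as decimal ("MB" = 10**6 bytes) or binary ("MiB" = 2**20 bytes), and the shard count is the dataset size divided by the shard size, rounded up.

```python
import math
import re

# Assumed unit table; the library may accept a different set of units.
_UNITS = {
    "KB": 10**3, "MB": 10**6, "GB": 10**9,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30,
}


def parse_shard_size(size):
    """Convert 500_000_000 or '500MB' into a number of bytes (sketch)."""
    if isinstance(size, int):
        return size
    match = re.fullmatch(r"(\d+)\s*([A-Za-z]+)", size.strip())
    if match is None or match.group(2).upper() not in _UNITS:
        raise ValueError(f"expected digits followed by a unit, got {size!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def estimate_num_shards(dataset_nbytes, max_shard_size="500MB"):
    """Smallest shard count keeping every shard under max_shard_size (assumed rule)."""
    return max(1, math.ceil(dataset_nbytes / parse_shard_size(max_shard_size)))


shards_for_1_2_gb = estimate_num_shards(1_200_000_000)
```

Passing num_shards explicitly would bypass this kind of estimate, as the num_shards parameter description indicates.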
save_to_disk
( dataset_path: typing.Union[str, bytes, os.PathLike]fs = 'deprecated'max_shard_size: typing.Union[str, int, NoneType] = Nonenum_shards: typing.Optional[int] = Nonenum_proc: typing.Optional[int] = Nonestorage_options: typing.Optional[dict] = None )
Parameters
dataset_path (str
) — Path (e.g. dataset/train
) or remote URI (e.g. s3://my-bucket/dataset/train
) of the dataset directory where the dataset will be saved to.
fs (fsspec.spec.AbstractFileSystem
, optional) — Instance of the remote filesystem where the dataset will be saved to.
Deprecated in 2.8.0
fs
was deprecated in version 2.8.0 and will be removed in 3.0.0. Please use storage_options
instead, e.g. storage_options=fs.storage_options
max_shard_size (int
or str
, optional, defaults to "500MB"
) — The maximum size of the dataset shards to be saved to the filesystem. If expressed as a string, needs to be digits followed by a unit (like "50MB"
).
num_shards (int
, optional) — Number of shards to write. By default the number of shards depends on max_shard_size
and num_proc
.
Added in 2.8.0
num_proc (int
, optional) — Number of processes when downloading and generating the dataset locally. Multiprocessing is disabled by default.
Added in 2.8.0
storage_options (dict
, optional) — Key/value pairs to be passed on to the file-system backend, if any.
Added in 2.8.0
Saves a dataset to a dataset directory, or in a filesystem using any implementation of fsspec.spec.AbstractFileSystem
.
All the Image() and Audio() data are stored in the Arrow files. If you want to store paths or URLs, please use the Value("string") type.
Example:
load_from_disk
Parameters
dataset_path (str
) — Path (e.g. "dataset/train"
) or remote URI (e.g. "s3://my-bucket/dataset/train"
) of the dataset directory where the dataset will be loaded from.
fs (fsspec.spec.AbstractFileSystem
, optional) — Instance of the remote filesystem where the dataset will be loaded from.
Deprecated in 2.8.0
fs
was deprecated in version 2.8.0 and will be removed in 3.0.0. Please use storage_options
instead, e.g. storage_options=fs.storage_options
storage_options (dict
, optional) — Key/value pairs to be passed on to the file-system backend, if any.
Added in 2.8.0
Returns
If dataset_path
is a path of a dataset directory, the dataset requested.
If dataset_path
is a path of a dataset dict directory, a datasets.DatasetDict
with each split.
Loads a dataset that was previously saved using save_to_disk
from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem
.
Example:
flatten_indices
( keep_in_memory: bool = Falsecache_file_name: typing.Optional[str] = Nonewriter_batch_size: typing.Optional[int] = 1000features: typing.Optional[datasets.features.features.Features] = Nonedisable_nullable: bool = Falsenum_proc: typing.Optional[int] = Nonenew_fingerprint: typing.Optional[str] = None )
Parameters
keep_in_memory (bool
, defaults to False
) — Keep the dataset in memory instead of writing it to a cache file.
cache_file_name (str
, optional, defaults to None
) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.
writer_batch_size (int
, defaults to 1000
) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running map
.
disable_nullable (bool
, defaults to False
) — Disallow null values in the table.
num_proc (int
, optional, defaults to None
) — Max number of processes when generating cache. Already cached shards are loaded sequentially.
new_fingerprint (str
, optional, defaults to None
) — The new fingerprint of the dataset after transform. If None
, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Create and cache a new Dataset by flattening the indices mapping.
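As a rough mental model, an indices mapping is a level of indirection between a row position and the underlying table, and flattening rewrites the rows so that the indirection is no longer needed. The sketch below uses plain lists and hypothetical helper names; it illustrates the idea only and says nothing about how the library stores data.

```python
def read_row(table, indices, i):
    """Read row i of a view; goes through the indices mapping when there is one."""
    return table[indices[i]] if indices is not None else table[i]


def flatten(table, indices):
    """Rewrite the selected rows contiguously and drop the indices mapping."""
    return [table[j] for j in indices], None


table = ["a", "b", "c", "d"]
indices = [2, 0, 3]  # e.g. the result of a select, sort or shuffle

flat_table, flat_indices = flatten(table, indices)
# Both views expose the same rows, but the flattened one reads sequentially.
same_rows = all(
    read_row(table, indices, i) == read_row(flat_table, flat_indices, i)
    for i in range(len(indices))
)
```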
to_csv
( path_or_buf: typing.Union[str, bytes, os.PathLike, typing.BinaryIO]batch_size: typing.Optional[int] = Nonenum_proc: typing.Optional[int] = None**to_csv_kwargs ) → int
Parameters
path_or_buf (PathLike
or FileOrBuffer
) — Either a path to a file or a BinaryIO.
batch_size (int
, optional) — Size of the batch to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE
.
num_proc (int
, optional) — Number of processes for multiprocessing. By default it doesn’t use multiprocessing. batch_size
in this case defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE
but feel free to make it 5x or 10x of the default value if you have sufficient compute power.
Changed in 2.10.0
Now, index
defaults to False
if not specified.
If you would like to write the index, pass index=True
and also set a name for the index column by passing index_label
.
Returns
int
The number of characters or bytes written.
Exports the dataset to CSV.
Example:
to_pandas
( batch_size: typing.Optional[int] = Nonebatched: bool = False )
Parameters
batched (bool
) — Set to True
to return a generator that yields the dataset as batches of batch_size
rows. Defaults to False
(returns the whole dataset at once).
batch_size (int
, optional) — The size (number of rows) of the batches if batched
is True
. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE
.
Returns the dataset as a pandas.DataFrame
. Can also return a generator for large datasets.
Example:
to_dict
( batch_size: typing.Optional[int] = Nonebatched = 'deprecated' )
Parameters
batched (bool
) — Set to True
to return a generator that yields the dataset as batches of batch_size
rows. Defaults to False
(returns the whole dataset at once).
Deprecated in 2.11.0
Use .iter(batch_size=batch_size)
followed by .to_dict()
on the individual batches instead.
batch_size (int
, optional) — The size (number of rows) of the batches if batched
is True
. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE
.
Returns the dataset as a Python dict. Can also return a generator for large datasets.
Example:
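The batched behaviour described for batched and batch_size amounts to slicing every column in steps of batch_size rows. The sketch below shows that idea on a plain dict of lists; iter_batches is a hypothetical stand-in and does not call the library.

```python
def iter_batches(columns, batch_size):
    """Yield dicts of column slices, each holding at most batch_size rows (sketch)."""
    num_rows = len(next(iter(columns.values()))) if columns else 0
    for start in range(0, num_rows, batch_size):
        yield {
            name: values[start:start + batch_size]
            for name, values in columns.items()
        }


data = {"text": ["a", "b", "c"], "label": [0, 1, 0]}
batches = list(iter_batches(data, batch_size=2))
```

The last batch may be shorter than batch_size when the number of rows is not a multiple of it.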
to_json
( path_or_buf: typing.Union[str, bytes, os.PathLike, typing.BinaryIO]batch_size: typing.Optional[int] = Nonenum_proc: typing.Optional[int] = None**to_json_kwargs ) → int
Parameters
path_or_buf (PathLike
or FileOrBuffer
) — Either a path to a file or a BinaryIO.
batch_size (int
, optional) — Size of the batch to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE
.
num_proc (int
, optional) — Number of processes for multiprocessing. By default it doesn’t use multiprocessing. batch_size
in this case defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE
but feel free to make it 5x or 10x of the default value if you have sufficient compute power.
Changed in 2.11.0
Now, index
defaults to False
if orient
is "split"
or "table"
.
If you would like to write the index, pass index=True
.
Returns
int
The number of characters or bytes written.
Export the dataset to JSON Lines or JSON.
Example:
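For readers unfamiliar with JSON Lines, the format is one JSON object per line. The sketch below writes rows that way with the standard library and returns the number of characters written, in the spirit of the return value described above. write_json_lines is a hypothetical helper, not the method itself.

```python
import io
import json


def write_json_lines(rows, buf):
    """Write one JSON object per line to a text buffer; return characters written."""
    written = 0
    for row in rows:
        # TextIOBase.write returns the number of characters written.
        written += buf.write(json.dumps(row) + "\n")
    return written


buffer = io.StringIO()
num_chars = write_json_lines([{"a": 1}, {"a": 2}], buffer)
```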
to_parquet
( path_or_buf: typing.Union[str, bytes, os.PathLike, typing.BinaryIO]batch_size: typing.Optional[int] = None**parquet_writer_kwargs ) → int
Parameters
path_or_buf (PathLike
or FileOrBuffer
) — Either a path to a file or a BinaryIO.
batch_size (int
, optional) — Size of the batch to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE
.
**parquet_writer_kwargs (additional keyword arguments) — Parameters to pass to PyArrow’s pyarrow.parquet.ParquetWriter
.
Returns
int
The number of characters or bytes written.
Exports the dataset to Parquet.
Example:
to_sql
( name: strcon: typing.Union[str, ForwardRef('sqlalchemy.engine.Connection'), ForwardRef('sqlalchemy.engine.Engine'), ForwardRef('sqlite3.Connection')]batch_size: typing.Optional[int] = None**sql_writer_kwargs ) → int
Parameters
name (str
) — Name of SQL table.
batch_size (int
, optional) — Size of the batch to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE
.
Changed in 2.11.0
Now, index
defaults to False
if not specified.
If you would like to write the index, pass index=True
and also set a name for the index column by passing index_label
.
Returns
int
The number of records written.
Exports the dataset to a SQL database.
Example:
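The idea of writing rows to a SQL table in batches and returning the number of records written can be sketched with the standard-library sqlite3 module. write_rows_to_sql is a hypothetical helper that only mirrors the name, con and batch_size parameters described above; the table layout and the lack of column types are simplifications for illustration.

```python
import sqlite3


def write_rows_to_sql(name, con, columns, rows, batch_size=1000):
    """Insert rows into table `name` batch by batch; return records written."""
    column_list = ", ".join(f'"{column}"' for column in columns)
    placeholders = ", ".join("?" for _ in columns)
    # SQLite accepts untyped columns, which keeps the sketch short.
    con.execute(f'CREATE TABLE IF NOT EXISTS "{name}" ({column_list})')
    written = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        con.executemany(
            f'INSERT INTO "{name}" ({column_list}) VALUES ({placeholders})',
            batch,
        )
        written += len(batch)
    con.commit()
    return written


con = sqlite3.connect(":memory:")
num_records = write_rows_to_sql(
    "data", con, ["text", "label"], [("a", 0), ("b", 1), ("c", 0)], batch_size=2
)
```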
to_iterable_dataset
( num_shards: typing.Optional[int] = 1 )
Parameters
Contrary to map-style datasets, iterable datasets are lazy and can only be iterated over (e.g. using a for loop). Since they are read sequentially in training loops, iterable datasets are much faster than map-style datasets. All the transformations applied to iterable datasets like filtering or processing are done on-the-fly when you start iterating over the dataset.
To get the best speed performance, make sure your dataset doesn’t have an indices mapping. If this is the case, the data are not read contiguously, which can be slow sometimes. You can use ds = ds.flatten_indices()
to write your dataset in contiguous chunks of data and have optimal speed before switching to an iterable dataset.
Example:
Basic usage:
With lazy filtering and processing:
With sharding to enable efficient shuffling:
With a PyTorch DataLoader:
With a PyTorch DataLoader and shuffling:
In a distributed setup like PyTorch DDP with a PyTorch DataLoader and shuffling:
With shuffling and multiple epochs:
Feel free to also use `IterableDataset.set_epoch()` when using a PyTorch DataLoader or in distributed setups.
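The approximate shuffling used with sharded iterable datasets (shuffle the shard order, then draw examples through a fixed-size shuffle buffer) can be sketched with plain generators. This is an illustration under assumptions: shuffled_examples is a hypothetical function, shards are modelled as lists, and Python's random module stands in for the NumPy generators the library is said to use.

```python
import random


def shuffled_examples(shards, buffer_size, seed):
    """Lazily yield examples: shuffled shard order plus a shuffle buffer (sketch)."""
    rng = random.Random(seed)
    shard_order = list(range(len(shards)))
    rng.shuffle(shard_order)  # only the order of the shards is shuffled up front

    buffer = []
    for shard_index in shard_order:
        for example in shards[shard_index]:  # each shard is read sequentially
            if len(buffer) < buffer_size:
                buffer.append(example)
                continue
            # Emit a random buffered example and replace it with the new one.
            slot = rng.randrange(buffer_size)
            yield buffer[slot]
            buffer[slot] = example
    rng.shuffle(buffer)
    yield from buffer  # flush whatever is left at the end


shards = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
epoch = list(shuffled_examples(shards, buffer_size=4, seed=0))
```

Using a different seed per epoch (for example seed + epoch) would give a different order each time, which is the kind of effect an epoch setter provides.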
add_faiss_index
( column: strindex_name: typing.Optional[str] = Nonedevice: typing.Optional[int] = Nonestring_factory: typing.Optional[str] = Nonemetric_type: typing.Optional[int] = Nonecustom_index: typing.Optional[ForwardRef('faiss.Index')] = Nonebatch_size: int = 1000train_size: typing.Optional[int] = Nonefaiss_verbose: bool = Falsedtype = <class 'numpy.float32'> )
Parameters
column (str
) — The column of the vectors to add to the index.
device (Union[int, List[int]]
, optional) — If positive integer, this is the index of the GPU to use. If negative integer, use all GPUs. If a list of positive integers is passed in, run only on those GPUs. By default it uses the CPU.
string_factory (str
, optional) — This is passed to the index factory of Faiss to create the index. Default index class is IndexFlat
.
metric_type (int
, optional) — Type of metric. Ex: faiss.METRIC_INNER_PRODUCT
or faiss.METRIC_L2
.
custom_index (faiss.Index
, optional) — Custom Faiss index that you already have instantiated and configured for your needs.
batch_size (int
) — Size of the batch to use while adding vectors to the FaissIndex
. Default value is 1000
.
Added in 2.4.0
train_size (int
, optional) — If the index needs a training step, specifies how many vectors will be used to train the index.
faiss_verbose (bool
, defaults to False
) — Enable the verbosity of the Faiss index.
dtype (data-type
) — The dtype of the numpy arrays that are indexed. Default is np.float32
.
Add a dense index using Faiss for fast retrieval. By default the index is done over the vectors of the specified column. You can specify device
if you want to run it on GPU (device
must be the GPU index). You can find more information about Faiss here:
Example:
add_faiss_index_from_external_arrays
( external_arrays: arrayindex_name: strdevice: typing.Optional[int] = Nonestring_factory: typing.Optional[str] = Nonemetric_type: typing.Optional[int] = Nonecustom_index: typing.Optional[ForwardRef('faiss.Index')] = Nonebatch_size: int = 1000train_size: typing.Optional[int] = Nonefaiss_verbose: bool = Falsedtype = <class 'numpy.float32'> )
Parameters
external_arrays (np.array
) — If you want to use arrays from outside the lib for the index, you can set external_arrays
. It will use external_arrays
to create the Faiss index instead of the arrays in the given column
.
device (Union[int, List[int]]
, optional) — If positive integer, this is the index of the GPU to use. If negative integer, use all GPUs. If a list of positive integers is passed in, run only on those GPUs. By default it uses the CPU.
string_factory (str
, optional) — This is passed to the index factory of Faiss to create the index. Default index class is IndexFlat
.
metric_type (int
, optional) — Type of metric. Ex: faiss.METRIC_INNER_PRODUCT
or faiss.METRIC_L2
.
custom_index (faiss.Index
, optional) — Custom Faiss index that you already have instantiated and configured for your needs.
batch_size (int
, optional) — Size of the batch to use while adding vectors to the FaissIndex. Default value is 1000.
Added in 2.4.0
train_size (int
, optional) — If the index needs a training step, specifies how many vectors will be used to train the index.
faiss_verbose (bool
, defaults to False) — Enable the verbosity of the Faiss index.
dtype (numpy.dtype
) — The dtype of the numpy arrays that are indexed. Default is np.float32.
Add a dense index using Faiss for fast retrieval. The index is created using the vectors of external_arrays
. You can specify device
if you want to run it on GPU (device
must be the GPU index). You can find more information about Faiss here:
save_faiss_index
( index_name: strfile: typing.Union[str, pathlib.PurePath]storage_options: typing.Optional[typing.Dict] = None )
Parameters
index_name (str
) — The index_name/identifier of the index. This is the index_name that is used to call .get_nearest
or .search
.
file (str
) — The path to the serialized faiss index on disk or remote URI (e.g. "s3://my-bucket/index.faiss"
).
storage_options (dict
, optional) — Key/value pairs to be passed on to the file-system backend, if any.
Added in 2.11.0
Save a FaissIndex on disk.
load_faiss_index
( index_name: strfile: typing.Union[str, pathlib.PurePath]device: typing.Union[int, typing.List[int], NoneType] = Nonestorage_options: typing.Optional[typing.Dict] = None )
Parameters
index_name (str
) — The index_name/identifier of the index. This is the index_name that is used to call .get_nearest
or .search
.
file (str
) — The path to the serialized faiss index on disk or remote URI (e.g. "s3://my-bucket/index.faiss"
).
device (Union[int, List[int]]
, optional) — If positive integer, this is the index of the GPU to use. If negative integer, use all GPUs. If a list of positive integers is passed in, run only on those GPUs. By default it uses the CPU.
storage_options (dict
, optional) — Key/value pairs to be passed on to the file-system backend, if any.
Added in 2.11.0
Load a FaissIndex from disk.
If you want to do additional configurations, you can have access to the faiss index object by doing .get_index(index_name).faiss_index
to make it fit your needs.
add_elasticsearch_index
( column: strindex_name: typing.Optional[str] = Nonehost: typing.Optional[str] = Noneport: typing.Optional[int] = Nonees_client: typing.Optional[ForwardRef('elasticsearch.Elasticsearch')] = Nonees_index_name: typing.Optional[str] = Nonees_index_config: typing.Optional[dict] = None )
Parameters
column (str
) — The column of the documents to add to the index.
host (str
, optional, defaults to localhost
) — Host of where ElasticSearch is running.
port (int
, optional, defaults to 9200
) — Port of where ElasticSearch is running.
es_client (elasticsearch.Elasticsearch
, optional) — The elasticsearch client used to create the index if host and port are None
.
es_index_name (str
, optional) — The elasticsearch index name used to create the index.
es_index_config (dict
, optional) — The configuration of the elasticsearch index. Default config is:
Add a text index using ElasticSearch for fast retrieval. This is done in-place.
Example:
load_elasticsearch_index
( index_name: stres_index_name: strhost: typing.Optional[str] = Noneport: typing.Optional[int] = Nonees_client: typing.Optional[ForwardRef('Elasticsearch')] = Nonees_index_config: typing.Optional[dict] = None )
Parameters
index_name (str
) — The index_name
/identifier of the index. This is the index name that is used to call get_nearest
or search
.
es_index_name (str
) — The name of elasticsearch index to load.
host (str
, optional, defaults to localhost
) — Host of where ElasticSearch is running.
port (int
, optional, defaults to 9200
) — Port of where ElasticSearch is running.
es_client (elasticsearch.Elasticsearch
, optional) — The elasticsearch client used to create the index if host and port are None
.
es_index_config (dict
, optional) — The configuration of the elasticsearch index. Default config is:
Load an existing text index using ElasticSearch for fast retrieval.
list_indexes
( )
List the index_name
/identifiers of all the attached indexes.
get_index
( index_name: str )
Parameters
index_name (str
) — Index name.
Return the attached index identified by index_name
.
drop_index
( index_name: str )
Parameters
index_name (str
) — The index_name
/identifier of the index.
Drop the index with the specified index_name.
search
( index_name: strquery: typing.Union[str, <built-in function array>]k: int = 10**kwargs ) → (scores, indices)
Parameters
index_name (str
) — The name/identifier of the index.
query (Union[str, np.ndarray]
) — The query as a string if index_name
is a text index or as a numpy array if index_name
is a vector index.
k (int
) — The number of examples to retrieve.
Returns
(scores, indices)
A tuple of (scores, indices)
where:
scores (List[float]
): the retrieval scores from either FAISS (IndexFlatL2
by default) or ElasticSearch of the retrieved examples
indices (List[int]
): the indices of the retrieved examples
Find the nearest examples indices in the dataset to the query.
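What a flat L2 vector index returns for a single query can be sketched as a brute-force search over all vectors. This is a plain-Python illustration, not Faiss: search_flat_l2 is a hypothetical function, and it assumes scores are squared L2 distances with smaller values meaning closer matches.

```python
def search_flat_l2(vectors, query, k=10):
    """Return (scores, indices) of the k vectors closest to query (sketch)."""
    scored = sorted(
        (sum((a - b) ** 2 for a, b in zip(vector, query)), position)
        for position, vector in enumerate(vectors)
    )
    top = scored[:k]
    scores = [distance for distance, _ in top]
    indices = [position for _, position in top]
    return scores, indices


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
scores, indices = search_flat_l2(vectors, query=[0.9, 0.1], k=2)
```

The returned indices are row positions, which is what makes it possible to look the corresponding examples up in the dataset afterwards.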
search_batch
( index_name: strqueries: typing.Union[typing.List[str], <built-in function array>]k: int = 10**kwargs ) → (total_scores, total_indices)
Parameters
index_name (str
) — The index_name
/identifier of the index.
queries (Union[List[str], np.ndarray]
) — The queries as a list of strings if index_name
is a text index or as a numpy array if index_name
is a vector index.
k (int
) — The number of examples to retrieve per query.
Returns
(total_scores, total_indices)
A tuple of (total_scores, total_indices)
where:
total_scores (List[List[float]]
): the retrieval scores from either FAISS (IndexFlatL2
by default) or ElasticSearch of the retrieved examples per query
total_indices (List[List[int]]
): the indices of the retrieved examples per query
Find the indices of the nearest examples in the dataset for each query.
get_nearest_examples
( index_name: strquery: typing.Union[str, <built-in function array>]k: int = 10**kwargs ) → (scores, examples)
Parameters
index_name (str
) — The index_name/identifier of the index.
query (Union[str, np.ndarray]
) — The query as a string if index_name
is a text index or as a numpy array if index_name
is a vector index.
k (int
) — The number of examples to retrieve.
Returns
(scores, examples)
A tuple of (scores, examples)
where:
scores (List[float]
): the retrieval scores from either FAISS (IndexFlatL2
by default) or ElasticSearch of the retrieved examples
examples (dict
): the retrieved examples
Find the nearest examples in the dataset to the query.
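The examples part of the result is a single dict rather than a list of rows. The sketch below shows that gathering step, assuming a column-oriented table (a dict mapping column names to lists) and indices that have already been computed. gather_examples is a hypothetical helper, not a library function.

```python
from typing import Any, Dict, List


def gather_examples(
    table: Dict[str, List[Any]], indices: List[int]
) -> Dict[str, List[Any]]:
    """Pick the rows at `indices` from a column-oriented table."""
    return {name: [values[i] for i in indices] for name, values in table.items()}


# Hypothetical table and retrieval result.
table = {"text": ["a cat", "a dog", "a bird"], "label": [0, 1, 0]}
examples = gather_examples(table, indices=[2, 0])
```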
get_nearest_examples_batch
( index_name: str, queries: typing.Union[typing.List[str], <built-in function array>], k: int = 10, **kwargs ) → (total_scores, total_examples)
Parameters
index_name (str
) — The index_name
/identifier of the index.
queries (Union[List[str], np.ndarray]
) — The queries as a list of strings if index_name
is a text index or as a numpy array if index_name
is a vector index.
k (int
) — The number of examples to retrieve per query.
Returns
(total_scores, total_examples)
A tuple of (total_scores, total_examples)
where:
total_scores (List[List[float]]
): the retrieval scores from either FAISS (IndexFlatL2
by default) or ElasticSearch of the retrieved examples per query
total_examples (List[dict]
): the retrieved examples per query
Find the nearest examples in the dataset to each query.
info
( )
split
( )
builder_name
( )
citation
( )
config_name
( )
dataset_size
( )
description
( )
download_checksums
( )
download_size
( )
features
( )
homepage
( )
license
( )
size_in_bytes
( )
supervised_keys
( )
version
( )
from_csv
( path_or_paths: typing.Union[str, bytes, os.PathLike, typing.List[typing.Union[str, bytes, os.PathLike]]], split: typing.Optional[datasets.splits.NamedSplit] = None, features: typing.Optional[datasets.features.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, num_proc: typing.Optional[int] = None, **kwargs )
Parameters
path_or_paths (path-like
or list of path-like
) — Path(s) of the CSV file(s).
cache_dir (str
, optional, defaults to "~/.cache/huggingface/datasets"
) — Directory to cache data.
keep_in_memory (bool
, defaults to False
) — Whether to copy the data in-memory.
num_proc (int
, optional, defaults to None
) — Number of processes when downloading and generating the dataset locally. This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default.
Added in 2.8.0
**kwargs (additional keyword arguments) — Keyword arguments to be passed to pandas.read_csv
.
Create Dataset from CSV file(s).
from_json
( path_or_paths: typing.Union[str, bytes, os.PathLike, typing.List[typing.Union[str, bytes, os.PathLike]]], split: typing.Optional[datasets.splits.NamedSplit] = None, features: typing.Optional[datasets.features.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, field: typing.Optional[str] = None, num_proc: typing.Optional[int] = None, **kwargs )
Parameters
path_or_paths (path-like
or list of path-like
) — Path(s) of the JSON or JSON Lines file(s).
cache_dir (str
, optional, defaults to "~/.cache/huggingface/datasets"
) — Directory to cache data.
keep_in_memory (bool
, defaults to False
) — Whether to copy the data in-memory.
field (str
, optional) — Field name of the JSON file in which the dataset is contained.
num_proc (int
, optional, defaults to None
) — Number of processes when downloading and generating the dataset locally. This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default.
Added in 2.8.0
**kwargs (additional keyword arguments) — Keyword arguments to be passed to JsonConfig
.
Create Dataset from JSON or JSON Lines file(s).
from_parquet
( path_or_paths: typing.Union[str, bytes, os.PathLike, typing.List[typing.Union[str, bytes, os.PathLike]]], split: typing.Optional[datasets.splits.NamedSplit] = None, features: typing.Optional[datasets.features.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, columns: typing.Optional[typing.List[str]] = None, num_proc: typing.Optional[int] = None, **kwargs )
Parameters
path_or_paths (path-like
or list of path-like
) — Path(s) of the Parquet file(s).
split (NamedSplit
, optional) — Split name to be assigned to the dataset.
features (Features
, optional) — Dataset features.
cache_dir (str
, optional, defaults to "~/.cache/huggingface/datasets"
) — Directory to cache data.
keep_in_memory (bool
, defaults to False
) — Whether to copy the data in-memory.
columns (List[str]
, optional) — If not None
, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
num_proc (int
, optional, defaults to None
) — Number of processes when downloading and generating the dataset locally. This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default.
Added in 2.8.0
**kwargs (additional keyword arguments) — Keyword arguments to be passed to ParquetConfig
.
Create Dataset from Parquet file(s).
from_text
( path_or_paths: typing.Union[str, bytes, os.PathLike, typing.List[typing.Union[str, bytes, os.PathLike]]], split: typing.Optional[datasets.splits.NamedSplit] = None, features: typing.Optional[datasets.features.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, num_proc: typing.Optional[int] = None, **kwargs )
Parameters
path_or_paths (path-like
or list of path-like
) — Path(s) of the text file(s).
split (NamedSplit
, optional) — Split name to be assigned to the dataset.
features (Features
, optional) — Dataset features.
cache_dir (str
, optional, defaults to "~/.cache/huggingface/datasets"
) — Directory to cache data.
keep_in_memory (bool
, defaults to False
) — Whether to copy the data in-memory.
num_proc (int
, optional, defaults to None
) — Number of processes when downloading and generating the dataset locally. This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default.
Added in 2.8.0
**kwargs (additional keyword arguments) — Keyword arguments to be passed to TextConfig
.
Create Dataset from text file(s).
from_sql
( sql: typing.Union[str, ForwardRef('sqlalchemy.sql.Selectable')], con: typing.Union[str, ForwardRef('sqlalchemy.engine.Connection'), ForwardRef('sqlalchemy.engine.Engine'), ForwardRef('sqlite3.Connection')], features: typing.Optional[datasets.features.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, **kwargs )
Parameters
sql (str
or sqlalchemy.sql.Selectable
) — SQL query to be executed or a table name.
cache_dir (str
, optional, defaults to "~/.cache/huggingface/datasets"
) — Directory to cache data.
keep_in_memory (bool
, defaults to False
) — Whether to copy the data in-memory.
**kwargs (additional keyword arguments) — Keyword arguments to be passed to SqlConfig
.
Create Dataset from SQL query or database table.
The returned dataset can only be cached if con
is specified as a URI string.
prepare_for_task
( task: typing.Union[str, datasets.tasks.base.TaskTemplate], id: int = 0 )
Parameters
task (Union[str, TaskTemplate]
) — The task to prepare the dataset for during training and evaluation. If str
, supported tasks include:
"text-classification"
"question-answering"
id (int
, defaults to 0
) — The id required to unambiguously identify the task template when multiple task templates of the same type are supported.
Casts datasets.DatasetInfo.features
according to a task-specific schema. Intended for single-use only, so all task templates are removed from datasets.DatasetInfo.task_templates
after casting.
align_labels_with_mapping
( label2id: typing.Dict, label_column: str )
Parameters
label2id (dict
) — The label name to ID mapping to align the dataset with.
label_column (str
) — The column name of labels to align on.
Align the dataset’s label ID and label name mapping to match an input label2id
mapping. This is useful when you want to ensure that a model’s predicted labels are aligned with the dataset. The alignment is done using the lowercase label names.
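The alignment can be pictured as a remapping of integer IDs through lowercase label names. The sketch below assumes the dataset stores labels as integer IDs plus a list of names indexed by ID. align_label_ids and the sample labels are hypothetical and only illustrate the idea.

```python
from typing import Dict, List


def align_label_ids(
    label_ids: List[int], names: List[str], label2id: Dict[str, int]
) -> List[int]:
    """Remap label IDs so that label names match `label2id` (case-insensitive)."""
    # Names are compared in lowercase, as stated in the description above.
    target = {name.lower(): idx for name, idx in label2id.items()}
    old_to_new = {old_id: target[name.lower()] for old_id, name in enumerate(names)}
    return [old_to_new[i] for i in label_ids]


# Dataset side: ID 0 -> "entailment", 1 -> "neutral", 2 -> "contradiction".
names = ["entailment", "neutral", "contradiction"]
# Model side: a different ordering and different casing.
label2id = {"CONTRADICTION": 0, "NEUTRAL": 1, "ENTAILMENT": 2}
aligned = align_label_ids([0, 1, 2, 2], names, label2id)
```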
datasets.concatenate_datasets
( dsets: typing.List[~DatasetType], info: typing.Optional[datasets.info.DatasetInfo] = None, split: typing.Optional[datasets.splits.NamedSplit] = None, axis: int = 0 )
Parameters
dsets (List[datasets.Dataset]
) — List of Datasets to concatenate.
info (DatasetInfo
, optional) — Dataset information, like description, citation, etc.
split (NamedSplit
, optional) — Name of the dataset split.
axis ({0, 1}
, defaults to 0
) — Axis to concatenate over, where 0
means over rows (vertically) and 1
means over columns (horizontally).
Added in 1.6.0
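The effect of axis can be sketched on plain column-oriented tables (dicts mapping column names to lists). The concatenate helper below is hypothetical. It assumes that tables share the same columns for axis=0, and that they have the same number of rows and distinct column names for axis=1.

```python
from typing import Any, Dict, List

Table = Dict[str, List[Any]]


def concatenate(tables: List[Table], axis: int = 0) -> Table:
    """Concatenate column-oriented tables over rows (0) or columns (1)."""
    if axis == 0:
        # Vertical: same columns, rows appended one table after another.
        return {
            name: [value for table in tables for value in table[name]]
            for name in tables[0]
        }
    if axis == 1:
        # Horizontal: same number of rows, columns placed side by side.
        merged: Table = {}
        for table in tables:
            merged.update(table)
        return merged
    raise ValueError("axis must be 0 or 1")


rows = concatenate([{"text": ["a"]}, {"text": ["b", "c"]}], axis=0)
columns = concatenate([{"text": ["a", "b"]}, {"label": [0, 1]}], axis=1)
```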
datasets.interleave_datasets
Parameters
datasets (List[Dataset]
or List[IterableDataset]
) — List of datasets to interleave.
probabilities (List[float]
, optional, defaults to None
) — If specified, the new dataset is constructed by sampling examples from one source at a time according to these probabilities.
seed (int
, optional, defaults to None
) — The random seed used to choose a source for each example.
Added in 2.4.0
Added in 2.4.0
stopping_strategy (str
, defaults to first_exhausted
) — Two strategies are currently supported: first_exhausted
and all_exhausted
. The default, first_exhausted
, is an undersampling strategy, i.e. the dataset construction stops as soon as one dataset has run out of samples. If the strategy is all_exhausted
, an oversampling strategy is used, i.e. the dataset construction stops as soon as every sample of every dataset has been added at least once. Note that if the strategy is all_exhausted
, the interleaved dataset can become very large:
with no probabilities, the resulting dataset will have max_length_datasets*nb_dataset
samples.
with given probabilities, the resulting dataset will have more samples if some datasets have a very low probability of being visited.
Returns
Return type depends on the input datasets
parameter. Dataset
if the input is a list of Dataset
, IterableDataset
if the input is a list of IterableDataset
.
Interleave several datasets (sources) into a single dataset. The new dataset is constructed by alternating between the sources to get the examples.
If probabilities
is None
(default) the new dataset is constructed by cycling between each source to get the examples.
If probabilities
is not None
, the new dataset is constructed by getting examples from a random source at a time according to the provided probabilities.
The resulting dataset ends when one of the source datasets runs out of examples, except when stopping_strategy
is all_exhausted
, in which case the resulting dataset ends when all datasets have run out of examples at least once.
Note for iterable datasets:
In a distributed setup or in PyTorch DataLoader workers, the stopping strategy is applied per process. Therefore the “first_exhausted” strategy on a sharded iterable dataset can generate fewer samples in total (up to 1 missing sample per subdataset per worker).
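The two stopping strategies can be sketched for the case without probabilities, where sources are visited in a fixed cycle. The interleave helper below is hypothetical and works on plain lists. It covers neither random sampling nor iterable datasets, and it is not the library's implementation.

```python
from typing import Any, List, Sequence


def interleave(
    sources: Sequence[Sequence[Any]], stopping_strategy: str = "first_exhausted"
) -> List[Any]:
    """Cycle through the sources, taking one example from each per round."""
    lengths = [len(source) for source in sources]
    if stopping_strategy == "first_exhausted":
        rounds = min(lengths)  # undersampling: stop with the shortest source
    elif stopping_strategy == "all_exhausted":
        rounds = max(lengths)  # oversampling: shorter sources start over
    else:
        raise ValueError(f"unknown stopping_strategy: {stopping_strategy}")
    result = []
    for round_index in range(rounds):
        for source in sources:
            result.append(source[round_index % len(source)])
    return result


d1, d2, d3 = [0, 1, 2], [10, 11, 12, 13], [20, 21, 22, 23, 24]
undersampled = interleave([d1, d2, d3])
oversampled = interleave([d1, d2, d3], stopping_strategy="all_exhausted")
```

With all_exhausted the result has max(lengths) * number of sources examples, which matches the size estimate given for the no-probabilities case.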
datasets.distributed.split_dataset_by_node
Parameters
rank (int
) — Rank of the current node.
world_size (int
) — Total number of nodes.
Returns
The dataset to be used on the node at rank rank
.
Split a dataset for the node at rank rank
in a pool of nodes of size world_size
.
For map-style datasets:
Each node is assigned a chunk of data, e.g. rank 0 is given the first chunk of the dataset. To maximize data loading throughput, chunks are made of contiguous data on disk if possible.
For iterable datasets:
If the dataset has a number of shards that is a multiple of world_size
(i.e. if dataset.n_shards % world_size == 0
), then the shards are evenly assigned across the nodes, which is the most efficient option. Otherwise, each node keeps 1 example out of world_size
, skipping the other examples.
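The splitting rules above can be sketched in pure Python. Both helpers are hypothetical: contiguous_chunk shows one way to cut rows into contiguous chunks for map-style data, and iterable_share models a sharded stream as a list of lists. The order in which whole shards are handed to nodes is an assumption of this sketch.

```python
from typing import List, Sequence


def contiguous_chunk(num_rows: int, rank: int, world_size: int) -> range:
    """Row range for `rank` when rows are cut into contiguous chunks."""
    base, extra = divmod(num_rows, world_size)
    # The first `extra` ranks each take one additional row.
    start = rank * base + min(rank, extra)
    return range(start, start + base + (1 if rank < extra else 0))


def iterable_share(
    shards: Sequence[Sequence[int]], rank: int, world_size: int
) -> List[int]:
    """Examples seen by `rank` for a sharded, stream-style dataset."""
    if len(shards) % world_size == 0:
        # Whole shards per node (strided assignment is an assumption here).
        return [x for shard in shards[rank::world_size] for x in shard]
    # Fallback: each node reads everything but keeps 1 example out of world_size.
    stream = [x for shard in shards for x in shard]
    return stream[rank::world_size]


shards = [[0, 1], [2, 3], [4, 5], [6, 7]]
even_split = [iterable_share(shards, rank, 2) for rank in range(2)]
uneven_split = [iterable_share(shards, rank, 3) for rank in range(3)]
```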
datasets.enable_caching
( )
When applying transforms on a dataset, the data are stored in cache files. The caching mechanism makes it possible to reload an existing cache file if it has already been computed.
Reloading a dataset is possible since the cache files are named using the dataset fingerprint, which is updated after each transform.
If caching is disabled, the library will no longer reload cached dataset files when applying transforms to the datasets. More precisely, if caching is disabled:
cache files are always recreated
cache files are written to a temporary directory that is deleted when the session closes
cache files are named using a random hash instead of the dataset fingerprint
datasets.disable_caching
( )
When applying transforms on a dataset, the data are stored in cache files. The caching mechanism makes it possible to reload an existing cache file if it has already been computed.
Reloading a dataset is possible since the cache files are named using the dataset fingerprint, which is updated after each transform.
If caching is disabled, the library will no longer reload cached dataset files when applying transforms to the datasets. More precisely, if caching is disabled:
cache files are always recreated
cache files are written to a temporary directory that is deleted when the session closes
cache files are named using a random hash instead of the dataset fingerprint
datasets.is_caching_enabled
( )
When applying transforms on a dataset, the data are stored in cache files. The caching mechanism makes it possible to reload an existing cache file if it has already been computed.
Reloading a dataset is possible since the cache files are named using the dataset fingerprint, which is updated after each transform.
If caching is disabled, the library will no longer reload cached dataset files when applying transforms to the datasets. More precisely, if caching is disabled:
cache files are always recreated
cache files are written to a temporary directory that is deleted when the session closes
cache files are named using a random hash instead of the dataset fingerprint
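The fingerprint-based naming described for the caching functions can be sketched with an in-memory stand-in for the cache directory. TinyCache, its file-name pattern, and the way the fingerprint is derived are all hypothetical. The sketch only shows why a repeated transform is reloaded when caching is enabled and recomputed when it is disabled.

```python
import hashlib
import uuid
from typing import Callable, Dict, List, Tuple


class TinyCache:
    """In-memory stand-in for a directory of cache files."""

    def __init__(self) -> None:
        self.enabled = True
        self.files: Dict[str, List[int]] = {}
        self.computations = 0

    def apply(
        self, fingerprint: str, name: str, data: List[int], fn: Callable[[int], int]
    ) -> Tuple[List[int], str]:
        # The fingerprint is updated after each transform.
        digest = hashlib.sha256(f"{fingerprint}:{name}".encode()).hexdigest()
        new_fingerprint = digest[:16]
        if self.enabled:
            file_name = f"cache-{new_fingerprint}"
            if file_name in self.files:
                # Same fingerprint: reload instead of recomputing.
                return self.files[file_name], new_fingerprint
        else:
            # Random name, so the file can never be found again.
            file_name = f"cache-{uuid.uuid4().hex}"
        self.computations += 1
        self.files[file_name] = [fn(x) for x in data]
        return self.files[file_name], new_fingerprint


cache = TinyCache()
for _ in range(2):
    doubled, _fingerprint = cache.apply("raw", "double", [1, 2, 3], lambda x: x * 2)
runs_with_caching = cache.computations

cache.enabled = False
for _ in range(2):
    cache.apply("raw", "double", [1, 2, 3], lambda x: x * 2)
runs_without_caching = cache.computations - runs_with_caching
```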
Dictionary with split names as keys (‘train’, ‘test’ for example), and Dataset
objects as values. It also has dataset transform methods like map or filter, to process all the splits at once.
( )
A dictionary (dict of str: datasets.Dataset) with dataset transforms methods (map, filter, etc.)
data
( )
The Apache Arrow tables backing each split.
cache_files
( )
The cache files containing the Apache Arrow table backing each split.
num_columns
( )
Number of columns in each split of the dataset.
num_rows
( )
Number of rows in each split of the dataset.
column_names
( )
Names of the columns in each split of the dataset.
shape
( )
Shape of each split of the dataset (number of rows, number of columns).
unique
( column: str ) → Dict[str, list]
Parameters
Returns
Dict[str, list]
Dictionary of unique elements in the given column.
Return a list of the unique elements in a column for each split.
This is implemented in the low-level backend and is therefore very fast.
cleanup_cache_files
( )
Clean up all cache files in the dataset cache directory, except for the currently used cache file if there is one. Be careful when running this command that no other process is currently using other cache files.
map
( function: typing.Optional[typing.Callable] = None, with_indices: bool = False, with_rank: bool = False, input_columns: typing.Union[str, typing.List[str], NoneType] = None, batched: bool = False, batch_size: typing.Optional[int] = 1000, drop_last_batch: bool = False, remove_columns: typing.Union[str, typing.List[str], NoneType] = None, keep_in_memory: bool = False, load_from_cache_file: typing.Optional[bool] = None, cache_file_names: typing.Union[typing.Dict[str, typing.Optional[str]], NoneType] = None, writer_batch_size: typing.Optional[int] = 1000, features: typing.Optional[datasets.features.features.Features] = None, disable_nullable: bool = False, fn_kwargs: typing.Optional[dict] = None, num_proc: typing.Optional[int] = None, desc: typing.Optional[str] = None )
Parameters
function (callable
) — With one of the following signatures:
function(example: Dict[str, Any]) -> Dict[str, Any]
if batched=False
and with_indices=False
function(example: Dict[str, Any], indices: int) -> Dict[str, Any]
if batched=False
and with_indices=True
function(batch: Dict[str, List]) -> Dict[str, List]
if batched=True
and with_indices=False
function(batch: Dict[str, List], indices: List[int]) -> Dict[str, List]
if batched=True
and with_indices=True
For advanced usage, the function can also return a pyarrow.Table
. Moreover if your function returns nothing (None
), then map
will run your function and return the dataset unchanged.
with_indices (bool
, defaults to False
) — Provide example indices to function
. Note that in this case the signature of function
should be def function(example, idx): ...
.
with_rank (bool
, defaults to False
) — Provide process rank to function
. Note that in this case the signature of function
should be def function(example[, idx], rank): ...
.
input_columns ([Union[str, List[str]]]
, optional, defaults to None
) — The columns to be passed into function
as positional arguments. If None
, a dict mapping to all formatted columns is passed as one argument.
batched (bool
, defaults to False
) — Provide batch of examples to function
.
batch_size (int
, optional, defaults to 1000
) — Number of examples per batch provided to function
if batched=True
. If batch_size <= 0
or batch_size == None
, the full dataset is provided as a single batch to function
.
drop_last_batch (bool
, defaults to False
) — Whether a last batch smaller than the batch_size should be dropped instead of being processed by the function.
remove_columns ([Union[str, List[str]]]
, optional, defaults to None
) — Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function
, i.e. if function
is adding columns with names in remove_columns
, these columns will be kept.
keep_in_memory (bool
, defaults to False
) — Keep the dataset in memory instead of writing it to a cache file.
load_from_cache_file (Optional[bool]
, defaults to True
if caching is enabled) — If a cache file storing the current computation from function
can be identified, use it instead of recomputing.
cache_file_names ([Dict[str, str]]
, optional, defaults to None
) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name. You have to provide one cache_file_name
per dataset in the dataset dictionary.
writer_batch_size (int
, defaults to 1000
) — Number of rows per write operation for the cache file writer. The default is a good trade-off between memory usage and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running map
.
disable_nullable (bool
, defaults to False
) — Disallow null values in the table.
fn_kwargs (Dict
, optional, defaults to None
) — Keyword arguments to be passed to function
num_proc (int
, optional, defaults to None
) — Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
desc (str
, optional, defaults to None
) — Meaningful description to be displayed alongside the progress bar while mapping examples.
Apply a function to all the elements in the table (individually or in batches) and update the table (if function does update examples). The transformation is applied to all the datasets of the dataset dictionary.
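The interplay of batched, with_indices and remove_columns can be sketched on plain column-oriented tables. map_table and map_splits are hypothetical helpers, not the library's code. They assume that function always returns a dict and that a batched function returns as many rows as it receives.

```python
from typing import Any, Callable, Dict, List, Optional

Table = Dict[str, List[Any]]


def map_table(
    table: Table,
    function: Callable[..., Dict[str, Any]],
    with_indices: bool = False,
    batched: bool = False,
    batch_size: Optional[int] = 1000,
    remove_columns: Optional[List[str]] = None,
) -> Table:
    """Apply `function` row by row or batch by batch to one table."""
    num_rows = len(next(iter(table.values()), []))
    if batch_size is None or batch_size <= 0:
        batch_size = max(num_rows, 1)  # the whole table as a single batch
    step = batch_size if batched else 1
    removed = set(remove_columns or [])
    output: Table = {}
    for start in range(0, num_rows, step):
        indices = list(range(start, min(start + step, num_rows)))
        batch = {name: [values[i] for i in indices] for name, values in table.items()}
        if batched:
            args = (batch, indices) if with_indices else (batch,)
        else:
            example = {name: values[0] for name, values in batch.items()}
            args = (example, indices[0]) if with_indices else (example,)
        result = function(*args)
        if not batched:
            result = {name: [value] for name, value in result.items()}
        # Columns are removed first, so columns re-added by `function` are kept.
        kept = {name: values for name, values in batch.items() if name not in removed}
        kept.update(result)
        for name, values in kept.items():
            output.setdefault(name, []).extend(values)
    return output


def map_splits(splits: Dict[str, Table], function, **kwargs) -> Dict[str, Table]:
    """Apply the same transformation to every split."""
    return {name: map_table(table, function, **kwargs) for name, table in splits.items()}


splits = {"train": {"text": ["a", "bb"]}, "test": {"text": ["ccc"]}}
with_length = map_splits(splits, lambda ex: {"length": len(ex["text"])})
upper = map_splits(
    splits,
    lambda batch: {"text": [t.upper() for t in batch["text"]]},
    batched=True,
    remove_columns=["text"],
)
```

In the second call the text column is listed in remove_columns but is also returned by the function, so it is kept with its new values.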
filter
( function, with_indices = False, input_columns: typing.Union[str, typing.List[str], NoneType] = None, batched: bool = False, batch_size: typing.Optional[int] = 1000, keep_in_memory: bool = False, load_from_cache_file: typing.Optional[bool] = None, cache_file_names: typing.Union[typing.Dict[str, typing.Optional[str]], NoneType] = None, writer_batch_size: typing.Optional[int] = 1000, fn_kwargs: typing.Optional[dict] = None, num_proc: typing.Optional[int] = None, desc: typing.Optional[str] = None )
Parameters
function (callable
) — With one of the following signatures:
function(example: Dict[str, Any]) -> bool
if with_indices=False, batched=False
function(example: Dict[str, Any], indices: int) -> bool
if with_indices=True, batched=False
function(example: Dict[str, List]) -> List[bool]
if with_indices=False, batched=True
function(example: Dict[str, List], indices: List[int]) -> List[bool]
if with_indices=True, batched=True
with_indices (bool
, defaults to False
) — Provide example indices to function
. Note that in this case the signature of function
should be def function(example, idx): ...
.
input_columns ([Union[str, List[str]]]
, optional, defaults to None
) — The columns to be passed into function
as positional arguments. If None
, a dict mapping to all formatted columns is passed as one argument.
batched (bool
, defaults to False
) — Provide batch of examples to function
.
batch_size (int
, optional, defaults to 1000
) — Number of examples per batch provided to function
if batched=True
. If batch_size <= 0
or batch_size == None
, the full dataset is provided as a single batch to function
.
keep_in_memory (bool
, defaults to False
) — Keep the dataset in memory instead of writing it to a cache file.
load_from_cache_file (Optional[bool]
, defaults to True
if caching is enabled) — If a cache file storing the current computation from function
can be identified, use it instead of recomputing.
cache_file_names ([Dict[str, str]]
, optional, defaults to None
) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name. You have to provide one cache_file_name
per dataset in the dataset dictionary.
writer_batch_size (int
, defaults to 1000
) — Number of rows per write operation for the cache file writer. The default is a good trade-off between memory usage and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running map
.
fn_kwargs (Dict
, optional, defaults to None
) — Keyword arguments to be passed to function
num_proc (int
, optional, defaults to None
) — Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
desc (str
, optional, defaults to None
) — Meaningful description to be displayed alongside the progress bar while filtering examples.
Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes the examples for which the filter function returns True. The transformation is applied to all the datasets of the dataset dictionary.
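A filter boils down to computing one boolean per row and keeping the rows marked True. The sketch below assumes column-oriented tables. filter_table is a hypothetical helper and, for simplicity, treats the whole table as a single batch when batched=True.

```python
from typing import Any, Callable, Dict, List

Table = Dict[str, List[Any]]


def filter_table(table: Table, function: Callable[..., Any], batched: bool = False) -> Table:
    """Keep only the rows for which `function` returns True."""
    num_rows = len(next(iter(table.values()), []))
    if batched:
        # A batched function returns one bool per row of the batch.
        mask = list(function(table))
    else:
        mask = [
            bool(function({name: values[i] for name, values in table.items()}))
            for i in range(num_rows)
        ]
    return {
        name: [value for value, keep in zip(values, mask) if keep]
        for name, values in table.items()
    }


splits = {"train": {"label": [0, 1, 1]}, "test": {"label": [1, 0]}}
positives = {
    name: filter_table(table, lambda ex: ex["label"] == 1)
    for name, table in splits.items()
}
```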
sort
( column_names: typing.Union[str, typing.Sequence[str]], reverse: typing.Union[bool, typing.Sequence[bool]] = False, kind = 'deprecated', null_placement: str = 'at_end', keep_in_memory: bool = False, load_from_cache_file: typing.Optional[bool] = None, indices_cache_file_names: typing.Union[typing.Dict[str, typing.Optional[str]], NoneType] = None, writer_batch_size: typing.Optional[int] = 1000 )
Parameters
column_names (Union[str, Sequence[str]]
) — Column name(s) to sort by.
reverse (Union[bool, Sequence[bool]]
, defaults to False
) — If True
, sort by descending order rather than ascending. If a single bool is provided, the value is applied to the sorting of all column names. Otherwise a list of bools with the same length and order as column_names must be provided.
kind (str
, optional) — Pandas algorithm for sorting selected in {quicksort, mergesort, heapsort, stable}
. The default is quicksort
. Note that both stable
and mergesort
use timsort under the covers and, in general, the actual implementation will vary with data type. The mergesort
option is retained for backwards compatibility.
Deprecated in 2.8.0
kind
was deprecated in version 2.10.0 and will be removed in 3.0.0.
null_placement (str
, defaults to at_end
) — Put None
values at the beginning if at_start
or first
or at the end if at_end
or last
keep_in_memory (bool
, defaults to False
) — Keep the sorted indices in memory instead of writing them to a cache file.
load_from_cache_file (Optional[bool]
, defaults to True
if caching is enabled) — If a cache file storing the sorted indices can be identified, use it instead of recomputing.
indices_cache_file_names ([Dict[str, str]]
, optional, defaults to None
) — Provide the name of a path for the cache file. It is used to store the indices mapping instead of the automatically generated cache file name. You have to provide one cache_file_name
per dataset in the dataset dictionary.
writer_batch_size (int
, defaults to 1000
) — Number of rows per write operation for the cache file writer. A higher value gives smaller cache files; a lower value consumes less temporary memory.
Create a new dataset sorted according to a single or multiple columns.
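Sorting by several columns with per-column reverse flags and a null_placement option can be sketched with repeated stable sorts. sorted_indices is a hypothetical helper that returns row indices for a column-oriented table. It assumes that None values are placed according to null_placement regardless of reverse, and it is not the library's implementation.

```python
from typing import Any, Dict, List, Sequence, Union


def sorted_indices(
    table: Dict[str, List[Any]],
    column_names: Union[str, Sequence[str]],
    reverse: Union[bool, Sequence[bool]] = False,
    null_placement: str = "at_end",
) -> List[int]:
    """Return row indices ordered by one or several columns."""
    names = [column_names] if isinstance(column_names, str) else list(column_names)
    flags = [reverse] * len(names) if isinstance(reverse, bool) else list(reverse)
    if len(flags) != len(names):
        raise ValueError("reverse must match column_names in length")
    order = list(range(len(table[names[0]])))
    # Stable sorts applied from the last key to the first give a multi-key sort.
    for name, descending in reversed(list(zip(names, flags))):
        column = table[name]
        nulls = [i for i in order if column[i] is None]
        values = sorted(
            (i for i in order if column[i] is not None),
            key=lambda i: column[i],
            reverse=descending,
        )
        if null_placement in ("at_start", "first"):
            order = nulls + values
        else:
            order = values + nulls
    return order


table = {"a": [2, 1, None, 1], "b": ["x", "z", "y", "a"]}
by_a_then_b_desc = sorted_indices(table, ["a", "b"], reverse=[False, True])
nulls_first = sorted_indices(table, ["a", "b"], reverse=[False, True], null_placement="at_start")
```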
shuffle
( seeds: typing.Union[int, typing.Dict[str, typing.Optional[int]], NoneType] = None, seed: typing.Optional[int] = None, generators: typing.Union[typing.Dict[str, numpy.random._generator.Generator], NoneType] = None, keep_in_memory: bool = False, load_from_cache_file: typing.Optional[bool] = None, indices_cache_file_names: typing.Union[typing.Dict[str, typing.Optional[str]], NoneType] = None, writer_batch_size: typing.Optional[int] = 1000 )
Parameters
seeds (Dict[str, int]
or int
, optional) — A seed to initialize the default BitGenerator if generator=None
. If None
, then fresh, unpredictable entropy will be pulled from the OS. If an int
or array_like[ints]
is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state. You can provide one seed
per dataset in the dataset dictionary.
seed (int
, optional) — A seed to initialize the default BitGenerator if generator=None
. Alias for seeds (a ValueError
is raised if both are provided).
generators (Dict[str, np.random.Generator]
, optional) — NumPy random Generator to use to compute the permutation of the dataset rows. If generator=None
(default), uses np.random.default_rng
(the default BitGenerator (PCG64) of NumPy). You have to provide one generator
per dataset in the dataset dictionary.
keep_in_memory (bool
, defaults to False
) — Keep the dataset in memory instead of writing it to a cache file.
load_from_cache_file (Optional[bool]
, defaults to True
if caching is enabled) — If a cache file storing the shuffled indices can be identified, use it instead of recomputing.
indices_cache_file_names (Dict[str, str]
, optional) — Provide the name of a path for the cache file. It is used to store the indices mappings instead of the automatically generated cache file name. You have to provide one cache_file_name
per dataset in the dataset dictionary.
writer_batch_size (int
, defaults to 1000
) — Number of rows per write operation for the cache file writer. The default is a good trade-off between memory usage and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running map
.
Create a new Dataset where the rows are shuffled.
The transformation is applied to all the datasets of the dataset dictionary.
Shuffling currently uses NumPy random generators. You can supply either a NumPy BitGenerator to use or a seed to initialize NumPy’s default random generator (PCG64).
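How seed and seeds relate across splits can be sketched as follows. The sketch deliberately swaps NumPy generators for the standard library's random.Random, so the permutations differ from those the library produces. shuffled_indices is a hypothetical helper.

```python
import random
from typing import Dict, List, Optional, Union


def shuffled_indices(
    split_sizes: Dict[str, int],
    seeds: Union[int, Dict[str, Optional[int]], None] = None,
    seed: Optional[int] = None,
) -> Dict[str, List[int]]:
    """Return one row permutation per split."""
    if seed is not None and seeds is not None:
        raise ValueError("Provide either `seed` or `seeds`, not both")
    if seeds is None:
        seeds = seed  # `seed` is an alias for `seeds`
    if not isinstance(seeds, dict):
        # A single seed (or None) is reused for every split.
        seeds = {split: seeds for split in split_sizes}
    permutations = {}
    for split, size in split_sizes.items():
        indices = list(range(size))
        # A seed of None draws fresh entropy from the operating system.
        random.Random(seeds.get(split)).shuffle(indices)
        permutations[split] = indices
    return permutations


sizes = {"train": 5, "test": 3}
first = shuffled_indices(sizes, seed=42)
second = shuffled_indices(sizes, seeds={"train": 42, "test": 42})
```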
set_format
( type: typing.Optional[str] = None, columns: typing.Optional[typing.List] = None, output_all_columns: bool = False, **format_kwargs )
Parameters
type (str
, optional) — Output type selected in [None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']
. None
means __getitem__
returns python objects (default).
columns (List[str]
, optional) — Columns to format in the output. None
means __getitem__
returns all columns (default).
output_all_columns (bool
, defaults to False) — Keep un-formatted columns as well in the output (as python objects).
**format_kwargs (additional keyword arguments) — Keyword arguments passed to the convert function like np.array
, torch.tensor
or tensorflow.ragged.constant
.
Set __getitem__
return format (type and columns). The format is set for every dataset in the dataset dictionary.
It is possible to call map
after calling set_format
. Since map
may add new columns, the list of formatted columns is updated accordingly. If you apply map
on a dataset to add a new column, that column will be formatted:
new formatted columns = (all columns - previously unformatted columns)
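The rule above is plain set arithmetic. The sketch below spells it out with a hypothetical helper and hypothetical column names.

```python
from typing import List


def formatted_columns_after_map(
    formatted_before: List[str], columns_before: List[str], columns_after: List[str]
) -> List[str]:
    """Columns that are formatted once `map` has added new columns."""
    unformatted_before = set(columns_before) - set(formatted_before)
    # new formatted columns = all columns - previously unformatted columns
    return [name for name in columns_after if name not in unformatted_before]


formatted = formatted_columns_after_map(
    formatted_before=["input_ids"],
    columns_before=["text", "input_ids"],
    columns_after=["text", "input_ids", "attention_mask"],
)
```

The column added by map (attention_mask here) joins the formatted columns, while text stays unformatted.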
reset_format
( )
Reset __getitem__
return format to python objects and all columns. The transformation is applied to all the datasets of the dataset dictionary.
Same as self.set_format()
formatted_as
( type: typing.Optional[str] = None, columns: typing.Optional[typing.List] = None, output_all_columns: bool = False, **format_kwargs )
Parameters
type (str
, optional) — Output type selected in [None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']
. None
means __getitem__
returns python objects (default).
columns (List[str]
, optional) — Columns to format in the output. None
means __getitem__
returns all columns (default).
output_all_columns (bool
, defaults to False) — Keep un-formatted columns as well in the output (as python objects).
**format_kwargs (additional keyword arguments) — Keyword arguments passed to the convert function like np.array
, torch.tensor
or tensorflow.ragged.constant
.
To be used in a with
statement. Set __getitem__
return format (type and columns). The transformation is applied to all the datasets of the dataset dictionary.
with_format
( type: typing.Optional[str] = None, columns: typing.Optional[typing.List] = None, output_all_columns: bool = False, **format_kwargs )
Parameters
type (str
, optional) — Output type selected in [None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']
. None
means __getitem__
returns python objects (default).
columns (List[str]
, optional) — Columns to format in the output. None
means __getitem__
returns all columns (default).
output_all_columns (bool
, defaults to False
) — Keep un-formatted columns as well in the output (as python objects).
**format_kwargs (additional keyword arguments) — Keyword arguments passed to the convert function like np.array
, torch.tensor
or tensorflow.ragged.constant
.
Set __getitem__
return format (type and columns). The data formatting is applied on-the-fly. The format type
(for example “numpy”) is used to format batches when using __getitem__
. The format is set for every dataset in the dataset dictionary.
with_transform
( transform: typing.Optional[typing.Callable], columns: typing.Optional[typing.List] = None, output_all_columns: bool = False )
Parameters
columns (List[str]
, optional) — Columns to format in the output. If specified, then the input batch of the transform only contains those columns.
output_all_columns (bool
, defaults to False) — Keep un-formatted columns as well in the output (as python objects). If set to True
, then the other un-formatted columns are kept with the output of the transform.
Set __getitem__
return format using this transform. The transform is applied on-the-fly on batches when __getitem__
is called. The transform is set for every dataset in the dataset dictionary.
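On-the-fly transformation means that the stored data stays untouched and the transform runs each time rows are read. The LazyTable class below is a hypothetical stand-in for a single split and is not the library's implementation. It assumes rows are accessed with a slice.

```python
from typing import Any, Callable, Dict, List, Optional

Batch = Dict[str, List[Any]]


class LazyTable:
    """Column-oriented table whose rows are transformed only when accessed."""

    def __init__(
        self, columns: Batch, transform: Optional[Callable[[Batch], Batch]] = None
    ) -> None:
        self.columns = columns
        self.transform = transform
        self.calls = 0

    def with_transform(self, transform: Callable[[Batch], Batch]) -> "LazyTable":
        # Returns a new object; the stored data is shared and left untouched.
        return LazyTable(self.columns, transform)

    def __getitem__(self, rows: slice) -> Batch:
        batch = {name: values[rows] for name, values in self.columns.items()}
        if self.transform is None:
            return batch
        self.calls += 1  # the transform runs on every access
        return self.transform(batch)


table = LazyTable({"text": ["a", "b", "c"]})
upper = table.with_transform(lambda batch: {"text": [t.upper() for t in batch["text"]]})
first_two = upper[0:2]
```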
flatten
( max_depth = 16 )
Flatten the Apache Arrow Table of each split (nested features are flattened). Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.
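Flattening can be sketched on a single example represented as a nested dict. flatten_example is a hypothetical helper. The parent.child naming of the resulting columns and the one-level-per-pass reading of max_depth are assumptions of this sketch.

```python
from typing import Any, Dict


def flatten_example(example: Dict[str, Any], max_depth: int = 16) -> Dict[str, Any]:
    """Turn nested dict fields into top-level `parent.child` columns."""
    flat = dict(example)
    for _ in range(max_depth):
        if not any(isinstance(value, dict) for value in flat.values()):
            break  # nothing left to flatten
        expanded: Dict[str, Any] = {}
        for name, value in flat.items():
            if isinstance(value, dict):
                # One column per struct field; other columns are left unchanged.
                for field, inner in value.items():
                    expanded[f"{name}.{field}"] = inner
            else:
                expanded[name] = value
        flat = expanded
    return flat


nested = {"id": 1, "answers": {"text": ["x"], "meta": {"start": 3}}}
flat = flatten_example(nested)
one_level = flatten_example(nested, max_depth=1)
```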
cast
( features: Features )
Parameters
Cast the dataset to a new set of features. The transformation is applied to all the datasets of the dataset dictionary.
cast_column
( column: str, feature )
Parameters
column (str
) — Column name.
feature (Feature
) — Target feature.
Cast column to feature for decoding.
remove_columns
( column_names: typing.Union[str, typing.List[str]] )
Parameters
column_names (Union[str, List[str]]
) — Name of the column(s) to remove.
Remove one or several column(s) from each split in the dataset and the features associated with the column(s).
The transformation is applied to all the splits of the dataset dictionary.
rename_column
( original_column_name: str, new_column_name: str )
Parameters
original_column_name (str
) — Name of the column to rename.
new_column_name (str
) — New name for the column.
Rename a column in the dataset and move the features associated to the original column under the new column name. The transformation is applied to all the datasets of the dataset dictionary.
This method takes care of moving the original features under the new column name.
It also doesn’t copy the data to a new dataset, which makes it much faster than renaming the column with map.
rename_columns
Parameters
column_mapping (Dict[str, str]
) — A mapping of columns to rename to their new names.
Returns
A copy of the dataset with renamed columns.
Rename several columns in the dataset, and move the features associated to the original columns under the new column names. The transformation is applied to all the datasets of the dataset dictionary.
select_columns
( column_names: typing.Union[str, typing.List[str]] )
Parameters
column_names (Union[str, List[str]]
) — Name of the column(s) to keep.
Select one or several column(s) from each split in the dataset and the features associated to the column(s).
The transformation is applied to all the splits of the dataset dictionary.
class_encode_column
( column: str, include_nulls: bool = False )
Parameters
column (str
) — The name of the column to cast.
include_nulls (bool
, defaults to False
) — Whether to include null values in the class labels. If True
, the null values will be encoded as the "None"
class label.
Added in 1.14.2
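As a rough illustration of what class-encoding a column means, the sketch below maps string values to integer class ids in plain Python. It is not the library's code: `class_encode` is a hypothetical helper, and the sorted ordering of the class names is an assumption made for the example. It shows the effect of include_nulls: when it is False, nulls stay null; when it is True, they are encoded as the "None" class label.

```python
def class_encode(values, include_nulls=False):
    """Return (class names, encoded ids) for a list of raw column values."""
    # Collect the distinct labels; nulls only become a label if requested.
    names = sorted({str(v) for v in values if include_nulls or v is not None})
    index = {name: i for i, name in enumerate(names)}
    encoded = [
        index[str(v)] if (include_nulls or v is not None) else None
        for v in values
    ]
    return names, encoded


column = ["pos", "neg", None, "pos"]
names, ids = class_encode(column)
names_with_null, ids_with_null = class_encode(column, include_nulls=True)
```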
push_to_hub
( repo_id, config_name: str = 'default', private: typing.Optional[bool] = False, token: typing.Optional[str] = None, branch: NoneType = None, max_shard_size: typing.Union[str, int, NoneType] = None, num_shards: typing.Union[typing.Dict[str, int], NoneType] = None, embed_external_files: bool = True )
Parameters
repo_id (str
) — The ID of the repository to push to in the following format: <user>/<dataset_name>
or <org>/<dataset_name>
. Also accepts <dataset_name>
, which will default to the namespace of the logged-in user.
private (bool
, optional) — Whether the dataset repository should be set to private or not. Only affects repository creation: a repository that already exists will not be affected by that parameter.
config_name (str
) — Configuration name of a dataset. Defaults to “default”.
token (str
, optional) — An optional authentication token for the BOINC AI Hub. If no token is passed, will default to the token saved locally when logging in with boincai-cli login
. Will raise an error if no token is passed and the user is not logged-in.
branch (str
, optional) — The git branch on which to push the dataset.
max_shard_size (int
or str
, optional, defaults to "500MB"
) — The maximum size of the dataset shards to be uploaded to the hub. If expressed as a string, needs to be digits followed by a unit (like "500MB"
or "1GB"
).
num_shards (Dict[str, int]
, optional) — Number of shards to write. By default the number of shards depends on max_shard_size
. Use a dictionary to define a different num_shards for each split.
Added in 2.8.0
embed_external_files (bool
, defaults to True
) — Whether to embed file bytes in the shards. In particular, this will do the following before the push for the fields of type:
Each dataset split will be pushed independently. The pushed dataset will keep the original split names.
save_to_disk
( dataset_dict_path: typing.Union[str, bytes, os.PathLike], fs = 'deprecated', max_shard_size: typing.Union[str, int, NoneType] = None, num_shards: typing.Union[typing.Dict[str, int], NoneType] = None, num_proc: typing.Optional[int] = None, storage_options: typing.Optional[dict] = None )
Parameters
dataset_dict_path (str
) — Path (e.g. dataset/train
) or remote URI (e.g. s3://my-bucket/dataset/train
) of the dataset dict directory where the dataset dict will be saved to.
fs (fsspec.spec.AbstractFileSystem
, optional) — Instance of the remote filesystem where the dataset will be saved to.
Deprecated in 2.8.0
fs
was deprecated in version 2.8.0 and will be removed in 3.0.0. Please use storage_options
instead, e.g. storage_options=fs.storage_options
max_shard_size (int
or str
, optional, defaults to "500MB"
) — The maximum size of the dataset shards to be written to the filesystem. If expressed as a string, it needs to be digits followed by a unit (like "50MB"
).
num_shards (Dict[str, int]
, optional) — Number of shards to write. By default the number of shards depends on max_shard_size
and num_proc
. You need to provide the number of shards for each dataset in the dataset dictionary. Use a dictionary to define a different num_shards for each split.
Added in 2.8.0
num_proc (int
, optional, default None
) — Number of processes to use when saving the dataset. Multiprocessing is disabled by default.
Added in 2.8.0
storage_options (dict
, optional) — Key/value pairs to be passed on to the file-system backend, if any.
Added in 2.8.0
Saves a dataset dict to a filesystem using fsspec.spec.AbstractFileSystem
.
All the Image() and Audio() data are stored in the arrow files. If you want to store paths or urls, please use the Value("string") type.
load_from_disk
( dataset_dict_path: typing.Union[str, bytes, os.PathLike], fs = 'deprecated', keep_in_memory: typing.Optional[bool] = None, storage_options: typing.Optional[dict] = None )
Parameters
dataset_dict_path (str
) — Path (e.g. "dataset/train"
) or remote URI (e.g. "s3://my-bucket/dataset/train"
) of the dataset dict directory where the dataset dict will be loaded from.
fs (fsspec.spec.AbstractFileSystem
, optional) — Instance of the remote filesystem where the dataset will be loaded from.
Deprecated in 2.8.0
fs
was deprecated in version 2.8.0 and will be removed in 3.0.0. Please use storage_options
instead, e.g. storage_options=fs.storage_options
storage_options (dict
, optional) — Key/value pairs to be passed on to the file-system backend, if any.
Added in 2.8.0
Load a dataset that was previously saved using save_to_disk
from a filesystem using fsspec.spec.AbstractFileSystem
.
from_csv
( path_or_paths: typing.Dict[str, typing.Union[str, bytes, os.PathLike]], features: typing.Optional[datasets.features.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, **kwargs )
Parameters
path_or_paths (dict
of path-like) — Path(s) of the CSV file(s).
cache_dir (str, optional, defaults to "~/.cache/boincai/datasets"
) — Directory to cache data.
keep_in_memory (bool
, defaults to False
) — Whether to copy the data in-memory.
**kwargs (additional keyword arguments) — Keyword arguments to be passed to pandas.read_csv
.
from_json
( path_or_paths: typing.Dict[str, typing.Union[str, bytes, os.PathLike]], features: typing.Optional[datasets.features.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, **kwargs )
Parameters
path_or_paths (dict
of path-like) — Path(s) of the JSON Lines file(s).
cache_dir (str, optional, defaults to "~/.cache/boincai/datasets"
) — Directory to cache data.
keep_in_memory (bool
, defaults to False
) — Whether to copy the data in-memory.
**kwargs (additional keyword arguments) — Keyword arguments to be passed to JsonConfig
.
from_parquet
( path_or_paths: typing.Dict[str, typing.Union[str, bytes, os.PathLike]], features: typing.Optional[datasets.features.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, columns: typing.Optional[typing.List[str]] = None, **kwargs )
Parameters
path_or_paths (dict
of path-like) — Path(s) of the Parquet file(s).
cache_dir (str
, optional, defaults to "~/.cache/boincai/datasets"
) — Directory to cache data.
keep_in_memory (bool
, defaults to False
) — Whether to copy the data in-memory.
columns (List[str]
, optional) — If not None
, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’.
**kwargs (additional keyword arguments) — Keyword arguments to be passed to ParquetConfig
.
from_text
( path_or_paths: typing.Dict[str, typing.Union[str, bytes, os.PathLike]], features: typing.Optional[datasets.features.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, **kwargs )
Parameters
path_or_paths (dict
of path-like) — Path(s) of the text file(s).
cache_dir (str
, optional, defaults to "~/.cache/boincai/datasets"
) — Directory to cache data.
keep_in_memory (bool
, defaults to False
) — Whether to copy the data in-memory.
**kwargs (additional keyword arguments) — Keyword arguments to be passed to TextConfig
.
prepare_for_task
( task: typing.Union[str, datasets.tasks.base.TaskTemplate], id: int = 0 )
Parameters
task (Union[str, TaskTemplate]
) — The task to prepare the dataset for during training and evaluation. If str
, supported tasks include:
"text-classification"
"question-answering"
id (int
, defaults to 0
) — The id required to unambiguously identify the task template when multiple task templates of the same type are supported.
Casts datasets.DatasetInfo.features
according to a task-specific schema. Intended for single-use only, so all task templates are removed from datasets.DatasetInfo.task_templates
after casting.
( ex_iterable: _BaseExamplesIterable, info: typing.Optional[datasets.info.DatasetInfo] = None, split: typing.Optional[datasets.splits.NamedSplit] = None, formatting: typing.Optional[datasets.iterable_dataset.FormattingConfig] = None, shuffling: typing.Optional[datasets.iterable_dataset.ShufflingConfig] = None, distributed: typing.Optional[datasets.iterable_dataset.DistributedConfig] = None, token_per_repo_id: typing.Union[typing.Dict[str, typing.Union[str, bool, NoneType]], NoneType] = None, format_type = 'deprecated' )
A Dataset backed by an iterable.
from_generator
( generator: typing.Callable, features: typing.Optional[datasets.features.features.Features] = None, gen_kwargs: typing.Optional[dict] = None ) → IterableDataset
Parameters
generator (Callable
) — A generator function that yields
examples.
features (Features
, optional) — Dataset features.
gen_kwargs (dict
, optional) — Keyword arguments to be passed to the generator
callable. You can define a sharded iterable dataset by passing the list of shards in gen_kwargs
. This can be used to improve shuffling and when iterating over the dataset with multiple workers.
Returns
IterableDataset
Create an Iterable Dataset from a generator.
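To make the role of a list of shards in gen_kwargs more concrete, here is a minimal pure-Python sketch. It does not call the library: `generate_examples` and `split_gen_kwargs` are hypothetical helpers, the shard names stand in for files, and the round-robin assignment of shards to workers is an assumption chosen for the illustration.

```python
def generate_examples(shards):
    """Hypothetical generator function: yields two rows per shard."""
    for shard in shards:
        for row in range(2):
            yield {"shard": shard, "row": row}


def split_gen_kwargs(gen_kwargs, num_workers):
    """Give each worker a disjoint slice of every list-valued argument."""
    return [
        {
            key: value[worker::num_workers] if isinstance(value, list) else value
            for key, value in gen_kwargs.items()
        }
        for worker in range(num_workers)
    ]


gen_kwargs = {"shards": ["s0", "s1", "s2", "s3"]}
per_worker = split_gen_kwargs(gen_kwargs, num_workers=2)
# Each worker runs the same generator on its own subset of shards.
worker_rows = [list(generate_examples(**kwargs)) for kwargs in per_worker]
```

Because the shards are passed as a list rather than hard-coded inside the generator, they can be reordered for shuffling or divided among workers without touching the generator itself.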
remove_columns
( column_names: typing.Union[str, typing.List[str]] ) → IterableDataset
Parameters
column_names (Union[str, List[str]]
) — Name of the column(s) to remove.
Returns
IterableDataset
A copy of the dataset object without the columns to remove.
Remove one or several column(s) in the dataset and the features associated to them. The removal is done on-the-fly on the examples when iterating over the dataset.
select_columns
( column_names: typing.Union[str, typing.List[str]] ) → IterableDataset
Parameters
column_names (Union[str, List[str]]
) — Name of the column(s) to select.
Returns
IterableDataset
A copy of the dataset object with selected columns.
Select one or several column(s) in the dataset and the features associated to them. The selection is done on-the-fly on the examples when iterating over the dataset.
cast_column
( column: str, feature: typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.Sequence, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image] ) → IterableDataset
Parameters
column (str
) — Column name.
feature (Feature
) — Target feature.
Returns
IterableDataset
Cast column to feature for decoding.
cast
( features: Features ) → IterableDataset
Parameters
Returns
IterableDataset
A copy of the dataset with cast features.
Cast the dataset to a new set of features.
__iter__
( )
iter
( batch_size: int, drop_last_batch: bool = False )
Parameters
batch_size (int
) — Size of each batch to yield.
drop_last_batch (bool
, defaults to False) — Whether a last batch smaller than batch_size should be dropped.
Iterate through the batches of size batch_size.
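The effect of batch_size and drop_last_batch can be sketched in plain Python. This is only an illustration of the batching behaviour, not the library's code: `iter_batches` and `to_columns` are hypothetical helpers, and batches are represented as dictionaries of lists.

```python
def to_columns(rows):
    """Turn a list of examples into one batch (dict of lists)."""
    return {key: [row[key] for row in rows] for key in rows[0]}


def iter_batches(examples, batch_size, drop_last_batch=False):
    """Yield batches of `batch_size` examples from any iterable of examples."""
    pending = []
    for example in examples:
        pending.append(example)
        if len(pending) == batch_size:
            yield to_columns(pending)
            pending = []
    # A smaller final batch is only emitted when it is not dropped.
    if pending and not drop_last_batch:
        yield to_columns(pending)


rows = [{"id": i} for i in range(5)]
kept = list(iter_batches(rows, batch_size=2))
dropped = list(iter_batches(rows, batch_size=2, drop_last_batch=True))
```

With five examples and a batch size of two, `kept` ends with a one-element batch, while `dropped` contains only the two full batches.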
map
( function: typing.Optional[typing.Callable] = None, with_indices: bool = False, input_columns: typing.Union[str, typing.List[str], NoneType] = None, batched: bool = False, batch_size: typing.Optional[int] = 1000, drop_last_batch: bool = False, remove_columns: typing.Union[str, typing.List[str], NoneType] = None, features: typing.Optional[datasets.features.features.Features] = None, fn_kwargs: typing.Optional[dict] = None )
Parameters
function (Callable
, optional, defaults to None
) — Function applied on-the-fly on the examples when you iterate on the dataset. It must have one of the following signatures:
function(example: Dict[str, Any]) -> Dict[str, Any]
if batched=False
and with_indices=False
function(example: Dict[str, Any], idx: int) -> Dict[str, Any]
if batched=False
and with_indices=True
function(batch: Dict[str, List]) -> Dict[str, List]
if batched=True
and with_indices=False
function(batch: Dict[str, List], indices: List[int]) -> Dict[str, List]
if batched=True
and with_indices=True
For advanced usage, the function can also return a pyarrow.Table
. Moreover if your function returns nothing (None
), then map
will run your function and return the dataset unchanged. If no function is provided, default to identity function: lambda x: x
.
with_indices (bool
, defaults to False
) — Provide example indices to function
. Note that in this case the signature of function
should be def function(example, idx[, rank]): ...
.
input_columns (Optional[Union[str, List[str]]]
, defaults to None
) — The columns to be passed into function
as positional arguments. If None
, a dict mapping to all formatted columns is passed as one argument.
batched (bool
, defaults to False
) — Provide batch of examples to function
.
batch_size (int
, optional, defaults to 1000
) — Number of examples per batch provided to function
if batched=True
. If batch_size <= 0
or batch_size == None
, the full dataset is provided as a single batch to function
.
drop_last_batch (bool
, defaults to False
) — Whether a last batch smaller than the batch_size should be dropped instead of being processed by the function.
remove_columns (List[str]
, optional, defaults to None
) — Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function
, i.e. if function
is adding columns with names in remove_columns
, these columns will be kept.
features (Features
, optional, defaults to None
) — Feature types of the resulting dataset.
fn_kwargs (Dict
, optional, default None
) — Keyword arguments to be passed to function
.
Apply a function to all the examples in the iterable dataset (individually or in batches) and update them. If your function returns a column that already exists, then it overwrites it. The function is applied on-the-fly on the examples when iterating over the dataset.
You can specify whether the function should be batched or not with the batched
parameter:
If batched is False
, then the function takes 1 example in and should return 1 example. An example is a dictionary, e.g. {"text": "Hello there !"}
.
If batched is True
and batch_size
is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. A batch is a dictionary, e.g. a batch of 1 example is {"text": ["Hello there !"]}.
If batched is True
and batch_size
is n
> 1, then the function takes a batch of n
examples as input and can return a batch with n
examples, or with an arbitrary number of examples. Note that the last batch may have fewer than n
examples. A batch is a dictionary, e.g. a batch of n
examples is {"text": ["Hello there !"] * n}
.
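The difference between the non-batched and batched modes described above can be sketched with two small generators. These are hypothetical helpers (`map_examples`, `map_batches`) written in plain Python to show the shape of the inputs and outputs, not the library's implementation; in particular, the batched sketch keeps only the columns returned by the function.

```python
from itertools import islice


def map_examples(examples, function):
    """batched=False: one example in, one example out; returned keys overwrite."""
    for example in examples:
        yield {**example, **function(example)}


def map_batches(examples, function, batch_size):
    """batched=True: a batch (dict of lists) in, a batch of any length out."""
    iterator = iter(examples)
    while chunk := list(islice(iterator, batch_size)):
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        out = function(batch)
        n_out = len(next(iter(out.values()), []))
        for i in range(n_out):
            yield {key: values[i] for key, values in out.items()}


data = [{"text": "Hello there !"}, {"text": "General Kenobi"}]

# One example in, one example out: a new column is added to each row.
lengths = list(map_examples(data, lambda ex: {"n_chars": len(ex["text"])}))

# A batch of 1 example in, several examples out: each text is split into words.
words = list(
    map_batches(
        data,
        lambda b: {"word": [w for t in b["text"] for w in t.split()]},
        batch_size=1,
    )
)
```

Both generators are lazy: nothing is computed until the result is iterated, which mirrors the on-the-fly behaviour the text describes.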
rename_column
( original_column_name: str, new_column_name: str ) → IterableDataset
Parameters
original_column_name (str
) — Name of the column to rename.
new_column_name (str
) — New name for the column.
Returns
IterableDataset
A copy of the dataset with a renamed column.
Rename a column in the dataset, and move the features associated to the original column under the new column name.
filter
( function: typing.Optional[typing.Callable] = None, with_indices = False, input_columns: typing.Union[str, typing.List[str], NoneType] = None, batched: bool = False, batch_size: typing.Optional[int] = 1000, fn_kwargs: typing.Optional[dict] = None )
Parameters
function (Callable
) — Callable with one of the following signatures:
function(example: Dict[str, Any]) -> bool
if with_indices=False, batched=False
function(example: Dict[str, Any], indices: int) -> bool
if with_indices=True, batched=False
function(example: Dict[str, List]) -> List[bool]
if with_indices=False, batched=True
function(example: Dict[str, List], indices: List[int]) -> List[bool]
if with_indices=True, batched=True
If no function is provided, defaults to an always True function: lambda x: True
.
with_indices (bool
, defaults to False
) — Provide example indices to function
. Note that in this case the signature of function
should be def function(example, idx): ...
.
input_columns (str
or List[str]
, optional) — The columns to be passed into function
as positional arguments. If None
, a dict mapping to all formatted columns is passed as one argument.
batched (bool
, defaults to False
) — Provide batch of examples to function
.
batch_size (int
, optional, default 1000
) — Number of examples per batch provided to function
if batched=True
.
fn_kwargs (Dict
, optional, default None
) — Keyword arguments to be passed to function
.
Apply a filter function to all the elements so that the dataset only includes the examples for which the filter function returns True. The filtering is done on-the-fly when iterating over the dataset.
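A short pure-Python sketch can show what filtering on-the-fly means in practice. `lazy_filter` and `stream` are hypothetical names, and the `seen` list exists only to record which source elements have actually been read; none of this is the library's code.

```python
def lazy_filter(examples, function=None, with_indices=False):
    """Yield only the examples for which `function` returns True."""
    if function is None:
        function = lambda *args: True  # default: keep everything
    for idx, example in enumerate(examples):
        keep = function(example, idx) if with_indices else function(example)
        if keep:
            yield example


seen = []  # records which source elements were consumed


def stream():
    for i in range(6):
        seen.append(i)
        yield {"id": i}


evens = lazy_filter(stream(), lambda ex: ex["id"] % 2 == 0)
seen_before = list(seen)       # creating the filter reads nothing
first = next(evens)
seen_after_first = list(seen)  # only the first element has been read
rest = list(evens)
```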
shuffle
( seed = None, generator: typing.Optional[numpy.random._generator.Generator] = None, buffer_size: int = 1000 )
Parameters
seed (int
, optional, defaults to None
) — Random seed that will be used to shuffle the dataset. It is used to sample from the shuffle buffer and also to shuffle the data shards.
generator (numpy.random.Generator
, optional) — NumPy random Generator to use to compute the permutation of the dataset rows. If generator=None
(default), uses np.random.default_rng
(the default BitGenerator (PCG64) of NumPy).
buffer_size (int
, defaults to 1000
) — Size of the buffer.
Randomly shuffles the elements of this dataset.
This dataset fills a buffer with buffer_size
elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.
For instance, if your dataset contains 10,000 elements but buffer_size
is set to 1000, then shuffle
will initially select a random element from only the first 1000 elements in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1000 element buffer.
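The buffer mechanics described above can be sketched in a few lines of Python. This is a minimal illustration rather than the library's implementation: `buffer_shuffle` is a hypothetical helper, and it uses the standard `random` module where the library relies on a NumPy generator.

```python
import random


def buffer_shuffle(examples, buffer_size, seed=None):
    """Approximate shuffle of a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for example in examples:
        if len(buffer) < buffer_size:
            # Fill the buffer with the first `buffer_size` elements.
            buffer.append(example)
            continue
        # Emit a random buffered element and put the new one in its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = example
    # The source is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffer_shuffle(range(10_000), buffer_size=1000, seed=42))
```

Every element appears exactly once, but the first output can only come from the first 1000 inputs, so elements cannot move arbitrarily far forward. That is why a buffer at least as large as the dataset is needed for a uniform shuffle.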
skip
( n )
Parameters
n (int
) — Number of elements to skip.
take
( n )
Parameters
n (int
) — Number of elements to take.
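For a stream of examples, skipping and taking the first n elements behave like slicing an iterator. The sketch below uses `itertools.islice` to illustrate this; `skip`, `take` and `stream` are hypothetical stand-alone functions here, not the methods of the class.

```python
from itertools import islice


def skip(examples, n):
    """Lazily drop the first n elements of an iterable."""
    return islice(examples, n, None)


def take(examples, n):
    """Lazily keep only the first n elements of an iterable."""
    return islice(examples, n)


def stream():
    for i in range(10):
        yield {"id": i}


head = list(take(stream(), 3))
tail = list(skip(stream(), 8))
# Combining both gives a simple "page" of the stream.
page = list(take(skip(stream(), 3), 2))
```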
info
( )
split
( )
builder_name
( )
citation
( )
config_name
( )
dataset_size
( )
description
( )
download_checksums
( )
download_size
( )
features
( )
homepage
( )
license
( )
size_in_bytes
( )
supervised_keys
( )
version
( )
Dictionary with split names as keys (‘train’, ‘test’ for example), and IterableDataset
objects as values.
( )
map
( function: typing.Optional[typing.Callable] = None, with_indices: bool = False, input_columns: typing.Union[str, typing.List[str], NoneType] = None, batched: bool = False, batch_size: int = 1000, drop_last_batch: bool = False, remove_columns: typing.Union[str, typing.List[str], NoneType] = None, fn_kwargs: typing.Optional[dict] = None )
Parameters
function (Callable
, optional, defaults to None
) — Function applied on-the-fly on the examples when you iterate on the dataset. It must have one of the following signatures:
function(example: Dict[str, Any]) -> Dict[str, Any]
if batched=False
and with_indices=False
function(example: Dict[str, Any], idx: int) -> Dict[str, Any]
if batched=False
and with_indices=True
function(batch: Dict[str, List]) -> Dict[str, List]
if batched=True
and with_indices=False
function(batch: Dict[str, List], indices: List[int]) -> Dict[str, List]
if batched=True
and with_indices=True
For advanced usage, the function can also return a pyarrow.Table
. Moreover if your function returns nothing (None
), then map
will run your function and return the dataset unchanged. If no function is provided, default to identity function: lambda x: x
.
with_indices (bool
, defaults to False
) — Provide example indices to function
. Note that in this case the signature of function
should be def function(example, idx[, rank]): ...
.
input_columns (Union[str, List[str]]
, optional, defaults to None
) — The columns to be passed into function
as positional arguments. If None
, a dict mapping to all formatted columns is passed as one argument.
batched (bool
, defaults to False
) — Provide batch of examples to function
.
batch_size (int
, optional, defaults to 1000
) — Number of examples per batch provided to function
if batched=True
.
drop_last_batch (bool
, defaults to False
) — Whether a last batch smaller than the batch_size
should be dropped instead of being processed by the function.
remove_columns (List[str]
, optional, defaults to None
) — Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function
, i.e. if function
is adding columns with names in remove_columns
, these columns will be kept.
fn_kwargs (Dict
, optional, defaults to None
) — Keyword arguments to be passed to function
.
Apply a function to all the examples in the iterable dataset (individually or in batches) and update them. If your function returns a column that already exists, then it overwrites it. The function is applied on-the-fly on the examples when iterating over the dataset. The transformation is applied to all the datasets of the dataset dictionary.
You can specify whether the function should be batched or not with the batched
parameter:
If batched is False
, then the function takes 1 example in and should return 1 example. An example is a dictionary, e.g. {"text": "Hello there !"}
.
If batched is True
and batch_size
is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. A batch is a dictionary, e.g. a batch of 1 example is {"text": ["Hello there !"]}
.
If batched is True
and batch_size
is n
> 1, then the function takes a batch of n
examples as input and can return a batch with n
examples, or with an arbitrary number of examples. Note that the last batch may have fewer than n
examples. A batch is a dictionary, e.g. a batch of n
examples is {"text": ["Hello there !"] * n}
.
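Applying a transformation to every dataset of the dictionary amounts to wrapping each split separately while keeping the split names. A minimal sketch, assuming each split is a plain iterable of example dictionaries and using the hypothetical helper `map_splits`:

```python
def map_splits(splits, function):
    """Return a new dict with the same split names, each lazily mapped."""

    def mapped(examples):
        for example in examples:
            # Keys returned by the function overwrite existing columns.
            yield {**example, **function(example)}

    return {name: mapped(examples) for name, examples in splits.items()}


splits = {
    "train": iter([{"text": "Hello there !"}, {"text": "OK"}]),
    "test": iter([{"text": "Bye"}]),
}
result = map_splits(splits, lambda ex: {"text": ex["text"].lower()})
materialised = {name: list(rows) for name, rows in result.items()}
```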
filter
( function: typing.Optional[typing.Callable] = None, with_indices = False, input_columns: typing.Union[str, typing.List[str], NoneType] = None, batched: bool = False, batch_size: typing.Optional[int] = 1000, fn_kwargs: typing.Optional[dict] = None )
Parameters
function (Callable
) — Callable with one of the following signatures:
function(example: Dict[str, Any]) -> bool
if with_indices=False, batched=False
function(example: Dict[str, Any], indices: int) -> bool
if with_indices=True, batched=False
function(example: Dict[str, List]) -> List[bool]
if with_indices=False, batched=True
function(example: Dict[str, List], indices: List[int]) -> List[bool]
if with_indices=True, batched=True
If no function is provided, defaults to an always True function: lambda x: True
.
with_indices (bool
, defaults to False
) — Provide example indices to function
. Note that in this case the signature of function
should be def function(example, idx): ...
.
input_columns (str
or List[str]
, optional) — The columns to be passed into function
as positional arguments. If None
, a dict mapping to all formatted columns is passed as one argument.
batched (bool
, defaults to False
) — Provide batch of examples to function
.
batch_size (int
, optional, defaults to 1000
) — Number of examples per batch provided to function
if batched=True
.
fn_kwargs (Dict
, optional, defaults to None
) — Keyword arguments to be passed to function
.
Apply a filter function to all the elements so that the dataset only includes the examples for which the filter function returns True. The filtering is done on-the-fly when iterating over the dataset. The filtering is applied to all the datasets of the dataset dictionary.
shuffle
( seed = None, generator: typing.Optional[numpy.random._generator.Generator] = None, buffer_size: int = 1000 )
Parameters
seed (int
, optional, defaults to None
) — Random seed that will be used to shuffle the dataset. It is used to sample from the shuffle buffer and also to shuffle the data shards.
generator (numpy.random.Generator
, optional) — NumPy random Generator to use to compute the permutation of the dataset rows. If generator=None
(default), uses np.random.default_rng
(the default BitGenerator (PCG64) of NumPy).
buffer_size (int
, defaults to 1000
) — Size of the buffer.
Randomly shuffles the elements of this dataset. The shuffling is applied to all the datasets of the dataset dictionary.
This dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.
For instance, if your dataset contains 10,000 elements but buffer_size
is set to 1000, then shuffle
will initially select a random element from only the first 1000 elements in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1000 element buffer.
with_format
( type: typing.Optional[str] = None )
Parameters
type (str
, optional, defaults to None
) — If set to “torch”, the returned dataset will be a subclass of torch.utils.data.IterableDataset
to be used in a DataLoader
.
Return a dataset with the specified format. This method only supports the “torch” format for now. The format is set to all the datasets of the dataset dictionary.
cast
Parameters
features (Features
) — New features to cast the dataset to. The name of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. string
<-> ClassLabel
you should use map
to update the Dataset.
Returns
A copy of the dataset with cast features.
Cast the dataset to a new set of features. The type casting is applied to all the datasets of the dataset dictionary.
cast_column
( column: str, feature: typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.Sequence, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image] )
Parameters
column (str
) — Column name.
feature (Feature
) — Target feature.
Cast column to feature for decoding. The type casting is applied to all the datasets of the dataset dictionary.
remove_columns
Parameters
column_names (Union[str, List[str]]
) — Name of the column(s) to remove.
Returns
A copy of the dataset object without the columns to remove.
Remove one or several column(s) in the dataset and the features associated to them. The removal is done on-the-fly on the examples when iterating over the dataset. The removal is applied to all the datasets of the dataset dictionary.
rename_column
Parameters
original_column_name (str
) — Name of the column to rename.
new_column_name (str
) — New name for the column.
Returns
A copy of the dataset with a renamed column.
Rename a column in the dataset, and move the features associated to the original column under the new column name. The renaming is applied to all the datasets of the dataset dictionary.
rename_columns
Parameters
column_mapping (Dict[str, str]
) — A mapping of columns to rename to their new names.
Returns
A copy of the dataset with renamed columns.
Rename several columns in the dataset, and move the features associated to the original columns under the new column names. The renaming is applied to all the datasets of the dataset dictionary.
select_columns
Parameters
column_names (Union[str, List[str]]
) — Name of the column(s) to keep.
Returns
A copy of the dataset object with only selected columns.
Select one or several column(s) in the dataset and the features associated to them. The selection is done on-the-fly on the examples when iterating over the dataset. The selection is applied to all the datasets of the dataset dictionary.
( *args, **kwargs )
A special dictionary that defines the internal structure of a dataset.
Instantiated with a dictionary of type dict[str, FieldType]
, where keys are the desired column names, and values are the type of that column.
FieldType
can be one of the following:
a python dict
which specifies that the field is a nested field containing a mapping of sub-fields to sub-fields features. It’s possible to have nested fields of nested fields in an arbitrary manner.
copy
( )
decode_batch
( batch: dict, token_per_repo_id: typing.Union[typing.Dict[str, typing.Union[str, bool, NoneType]], NoneType] = None )
Parameters
batch (dict[str, list[Any]]
) — Dataset batch data.
token_per_repo_id (dict
, optional) — To access and decode audio or image files from private repositories on the Hub, you can pass a dictionary repo_id (str) -> token (bool or str)
.
Decode batch with custom feature decoding.
decode_column
( column: list, column_name: str )
Parameters
column (list[Any]
) — Dataset column data.
column_name (str
) — Dataset column name.
Decode column with custom feature decoding.
decode_example
( example: dict, token_per_repo_id: typing.Union[typing.Dict[str, typing.Union[str, bool, NoneType]], NoneType] = None )
Parameters
example (dict[str, Any]
) — Dataset row data.
token_per_repo_id (dict
, optional) — To access and decode audio or image files from private repositories on the Hub, you can pass a dictionary repo_id (str) -> token (bool or str)
.
Decode example with custom feature decoding.
encode_batch
( batch )
Parameters
batch (dict[str, list[Any]]
) — Data in a Dataset batch.
Encode batch into a format for Arrow.
encode_column
( column, column_name: str )
Parameters
column (list[Any]
) — Data in a Dataset column.
column_name (str
) — Dataset column name.
Encode column into a format for Arrow.
encode_example
( example )
Parameters
example (dict[str, Any]
) — Data in a Dataset row.
Encode example into a format for Arrow.
flatten
Returns
The flattened features.
Flatten the features. Every dictionary column is removed and is replaced by all the subfields it contains. The new fields are named by concatenating the name of the original column and the subfield name like this: <original>.<subfield>
.
If a column contains nested dictionaries, then all the lower-level subfields names are also concatenated to form new columns: <original>.<subfield>.<subsubfield>
, etc.
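The naming rule for flattened fields can be sketched with a recursive function over plain dictionaries. This is an illustration of the `<original>.<subfield>` convention only: `flatten_fields` is a hypothetical helper and the field types are represented as plain strings rather than feature objects.

```python
def flatten_fields(features, prefix=""):
    """Replace every nested dict by its subfields, joining names with dots."""
    flat = {}
    for name, field in features.items():
        full_name = f"{prefix}.{name}" if prefix else name
        if isinstance(field, dict):
            # Recurse so that deeper levels are concatenated as well.
            flat.update(flatten_fields(field, full_name))
        else:
            flat[full_name] = field
    return flat


nested = {
    "id": "string",
    "answers": {
        "text": "string",
        "span": {"start": "int32", "end": "int32"},
    },
}
flat = flatten_fields(nested)
```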
from_arrow_schema
( pa_schema: Schema )
Parameters
pa_schema (pyarrow.Schema
) — Arrow Schema.
from_dict
( dic ) → Features
Parameters
dic (dict[str, Any]) — Python dictionary.
Returns
Features
Construct [Features] from dict.
Regenerate the nested feature object from a deserialized dict. We use the _type key to infer the dataclass name of the feature FieldType.
It allows for a convenient constructor syntax to define features from deserialized JSON dictionaries. This function is used in particular when deserializing a [DatasetInfo] that was dumped to a JSON object. This acts as an analogue to [Features.from_arrow_schema] and handles the recursive field-by-field instantiation, but doesn’t require any mapping to/from pyarrow, except for the fact that it takes advantage of the mapping of pyarrow primitive dtypes that [Value] automatically performs.
Example:
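The idea of rebuilding feature objects from the `_type` key can be sketched without the library. The two dataclasses and the `FEATURE_TYPES` registry below are hypothetical stand-ins, not the real feature classes.

```python
from dataclasses import dataclass


@dataclass
class Value:
    dtype: str


@dataclass
class ClassLabel:
    names: list


# Hypothetical registry mapping the "_type" key to a dataclass.
FEATURE_TYPES = {"Value": Value, "ClassLabel": ClassLabel}


def generate_from_dict(obj):
    """Recursively rebuild feature objects from a deserialized dict."""
    if isinstance(obj, list):
        return [generate_from_dict(item) for item in obj]
    if isinstance(obj, dict):
        if "_type" in obj:
            # The "_type" key names the dataclass; the other keys are its fields.
            kwargs = {k: v for k, v in obj.items() if k != "_type"}
            return FEATURE_TYPES[obj["_type"]](**kwargs)
        return {key: generate_from_dict(value) for key, value in obj.items()}
    return obj


serialized = {
    "text": {"_type": "Value", "dtype": "string"},
    "meta": {"label": {"_type": "ClassLabel", "names": ["neg", "pos"]}},
}
features = generate_from_dict(serialized)
print(features)
```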
reorder_fields_as
( other: Features )
Parameters
other ([Features]) — The other [Features] to align with.
Reorder Features fields to match the field order of other [Features].
The order of the fields is important since it matters for the underlying arrow data. Reordering the fields makes the underlying arrow data types match.
Example:
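As an illustration of what aligning field order means, the sketch below reorders a plain nested dictionary to follow another one. `reorder_like` is a hypothetical helper written for this example, not the library method.

```python
def reorder_like(source, target):
    """Return `source` with its keys (recursively) in the order used by `target`."""
    if isinstance(source, dict) and isinstance(target, dict):
        if set(source) != set(target):
            raise ValueError("Both mappings must have the same field names")
        return {key: reorder_like(source[key], target[key]) for key in target}
    return source


a = {"label": "int64", "pos": {"y": "float32", "x": "float32"}}
b = {"pos": {"x": "float32", "y": "float32"}, "label": "int64"}
reordered = reorder_like(a, b)
print(list(reordered), list(reordered["pos"]))
```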
( feature: typing.Any, length: int = -1, id: typing.Optional[str] = None )
Parameters
length (int
) — Length of the sequence.
Construct a list of features from a single type or a dict of types. Mostly here for compatibility with tfds.
( num_classes: dataclasses.InitVar[typing.Optional[int]] = None, names: typing.List[str] = None, names_file: dataclasses.InitVar[typing.Optional[str]] = None, id: typing.Optional[str] = None )
Parameters
num_classes (int
, optional) — Number of classes. All labels must be < num_classes
.
names (list
of str
, optional) — String names for the integer classes. The order in which the names are provided is kept.
names_file (str
, optional) — Path to a file with names for the integer classes, one per line.
Feature type for integer class labels.
There are 3 ways to define a ClassLabel
, which correspond to the 3 arguments:
num_classes
: Create 0 to (num_classes-1) labels.
names
: List of label strings.
names_file
: File containing the list of labels.
Under the hood the labels are stored as integers. You can use negative integers to represent unknown/missing labels.
Example:
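The mapping between integers and class names can be sketched with a small stand-in class. `TinyClassLabel` is illustrative only: it covers `num_classes` and `names` (not `names_file`), and naming unnamed classes `"0"`, `"1"`, ... is an assumption of this sketch.

```python
class TinyClassLabel:
    """Illustrative stand-in, not the library class."""

    def __init__(self, num_classes=None, names=None):
        if names is None:
            if num_classes is None:
                raise ValueError("Provide num_classes or names")
            # Assumption: unnamed classes are called "0", "1", ...
            names = [str(i) for i in range(num_classes)]
        self.names = list(names)  # the given order is kept
        self.num_classes = len(self.names)
        self._str2int = {name: i for i, name in enumerate(self.names)}

    def str2int(self, name):
        return self._str2int[name]

    def int2str(self, value):
        # Negative integers stand for unknown/missing labels and have no name.
        if not 0 <= value < self.num_classes:
            raise ValueError(f"Invalid integer class label {value}")
        return self.names[value]


label = TinyClassLabel(names=["bad", "ok", "good"])
print(label.str2int("ok"), label.int2str(2))
```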
cast_storage
( storage: typing.Union[pyarrow.lib.StringArray, pyarrow.lib.IntegerArray] ) → pa.Int64Array
Parameters
storage (Union[pa.StringArray, pa.IntegerArray]
) — PyArrow array to cast.
Returns
pa.Int64Array
Array in the ClassLabel
arrow storage type.
Cast an Arrow array to the ClassLabel
arrow storage type. The Arrow types that can be converted to the ClassLabel
pyarrow storage type are:
pa.string()
pa.int()
int2str
( values: typing.Union[int, collections.abc.Iterable] )
Conversion integer
=> class name string
.
Regarding unknown/missing labels: passing negative integers raises ValueError
.
str2int
( values: typing.Union[str, collections.abc.Iterable] )
Conversion class name string
=> integer
.
( dtype: str, id: typing.Optional[str] = None )
The Value
dtypes are as follows:
null
bool
int8
int16
int32
int64
uint8
uint16
uint32
uint64
float16
float32
(alias float)
float64
(alias double)
time32[(s|ms)]
time64[(us|ns)]
timestamp[(s|ms|us|ns)]
timestamp[(s|ms|us|ns), tz=(tzstring)]
date32
date64
duration[(s|ms|us|ns)]
decimal128(precision, scale)
decimal256(precision, scale)
binary
large_binary
string
large_string
( languages: typing.List[str], id: typing.Optional[str] = None )
Parameters
languages (dict
) — A dictionary for each example mapping string language codes to string translations.
FeatureConnector
for translations with fixed languages per example. Here for compatibility with tfds.
flatten
( )
Flatten the Translation feature into a dictionary.
( languages: typing.Optional[typing.List] = None, num_languages: typing.Optional[int] = None, id: typing.Optional[str] = None ) → language or translation (variable-length 1D tf.Tensor of tf.string)
Parameters
languages (dict
) — A dictionary for each example mapping string language codes to one or more string translations. The languages present may vary from example to example.
Returns
language
or translation
(variable-length 1D tf.Tensor
of tf.string
)
Language codes sorted in ascending order or plain text translations, sorted to align with language codes.
FeatureConnector
for translations with variable languages per example. Here for compatibility with tfds.
Example:
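The "sorted language codes with aligned translations" output can be sketched in plain Python. `encode_variable_translation` is a hypothetical helper; breaking ties by translation text within one language is an assumption of this sketch.

```python
def encode_variable_translation(example):
    """Turn {lang: text or [texts]} into aligned, sorted language/translation lists."""
    pairs = []
    for lang, text in example.items():
        texts = [text] if isinstance(text, str) else list(text)
        pairs.extend((lang, t) for t in texts)
    pairs.sort()  # by language code, then by translation text
    return {
        "language": [lang for lang, _ in pairs],
        "translation": [text for _, text in pairs],
    }


encoded = encode_variable_translation(
    {"en": "the cat", "fr": ["le chat", "la chatte"], "de": "die katze"}
)
print(encoded)
```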
flatten
( )
Flatten the TranslationVariableLanguages feature into a dictionary.
( shape: tuple, dtype: str, id: typing.Optional[str] = None )
Parameters
shape (tuple
) — The size of each dimension.
dtype (str
) — The value of the data type.
Create a two-dimensional array.
( shape: tuple, dtype: str, id: typing.Optional[str] = None )
Parameters
shape (tuple
) — The size of each dimension.
dtype (str
) — The value of the data type.
Create a three-dimensional array.
( shape: tuple, dtype: str, id: typing.Optional[str] = None )
Parameters
shape (tuple
) — The size of each dimension.
dtype (str
) — The value of the data type.
Create a four-dimensional array.
( shape: tuple, dtype: str, id: typing.Optional[str] = None )
Parameters
shape (tuple
) — The size of each dimension.
dtype (str
) — The value of the data type.
Create a five-dimensional array.
( sampling_rate: typing.Optional[int] = None, mono: bool = True, decode: bool = True, id: typing.Optional[str] = None )
Parameters
sampling_rate (int
, optional) — Target sampling rate. If None
, the native sampling rate is used.
mono (bool
, defaults to True
) — Whether to convert the audio signal to mono by averaging samples across channels.
decode (bool
, defaults to True
) — Whether to decode the audio data. If False
, returns the underlying dictionary in the format {"path": audio_path, "bytes": audio_bytes}
.
Audio Feature
to extract audio data from an audio file.
Input: The Audio feature accepts as input:
A str
: Absolute path to the audio file (i.e. random access is allowed).
A dict
with the keys:
path
: String with relative path of the audio file to the archive file.
bytes
: Bytes content of the audio file.
This is useful for archived files with sequential access.
A dict
with the keys:
path
: String with relative path of the audio file to the archive file.
array
: Array containing the audio sample
sampling_rate
: Integer corresponding to the sampling rate of the audio sample.
This is useful for archived files with sequential access.
cast_storage
( storage: typing.Union[pyarrow.lib.StringArray, pyarrow.lib.StructArray] ) → pa.StructArray
Parameters
storage (Union[pa.StringArray, pa.StructArray]
) — PyArrow array to cast.
Returns
pa.StructArray
Array in the Audio arrow storage type, that is pa.struct({"bytes": pa.binary(), "path": pa.string()})
Cast an Arrow array to the Audio arrow storage type. The Arrow types that can be converted to the Audio pyarrow storage type are:
pa.string()
- it must contain the “path” data
pa.binary()
- it must contain the audio bytes
pa.struct({"bytes": pa.binary()})
pa.struct({"path": pa.string()})
pa.struct({"bytes": pa.binary(), "path": pa.string()})
- order doesn’t matter
decode_example
( value: dict, token_per_repo_id: typing.Union[typing.Dict[str, typing.Union[str, bool, NoneType]], NoneType] = None ) → dict
Parameters
value (dict
) — A dictionary with keys:
path
: String with relative audio file path.
bytes
: Bytes of the audio file.
token_per_repo_id (dict
, optional) — To access and decode audio files from private repositories on the Hub, you can pass a dictionary repo_id (str
) -> token (bool
or str
)
Returns
dict
Decode example audio file into audio data.
embed_storage
( storage: StructArray ) → pa.StructArray
Parameters
storage (pa.StructArray
) — PyArrow array to embed.
Returns
pa.StructArray
Array in the Audio arrow storage type, that is pa.struct({"bytes": pa.binary(), "path": pa.string()})
.
Embed audio files into the Arrow array.
encode_example
( value: typing.Union[str, bytes, dict] ) → dict
Parameters
value (str
or dict
) — Data passed as input to Audio feature.
Returns
dict
Encode example into a format for Arrow.
flatten
( )
If in the decodable state, raise an error, otherwise flatten the feature into a dictionary.
( decode: bool = True, id: typing.Optional[str] = None )
Parameters
decode (bool
, defaults to True
) — Whether to decode the image data. If False
, returns the underlying dictionary in the format {"path": image_path, "bytes": image_bytes}
.
Image Feature
to read image data from an image file.
Input: The Image feature accepts as input:
A str
: Absolute path to the image file (i.e. random access is allowed).
A dict
with the keys:
path
: String with relative path of the image file to the archive file.
bytes
: Bytes of the image file.
This is useful for archived files with sequential access.
An np.ndarray
: NumPy array representing an image.
A PIL.Image.Image
: PIL image object.
cast_storage
( storage: typing.Union[pyarrow.lib.StringArray, pyarrow.lib.StructArray, pyarrow.lib.ListArray] ) → pa.StructArray
Parameters
storage (Union[pa.StringArray, pa.StructArray, pa.ListArray]
) — PyArrow array to cast.
Returns
pa.StructArray
Array in the Image arrow storage type, that is pa.struct({"bytes": pa.binary(), "path": pa.string()})
.
Cast an Arrow array to the Image arrow storage type. The Arrow types that can be converted to the Image pyarrow storage type are:
pa.string()
- it must contain the “path” data
pa.binary()
- it must contain the image bytes
pa.struct({"bytes": pa.binary()})
pa.struct({"path": pa.string()})
pa.struct({"bytes": pa.binary(), "path": pa.string()})
- order doesn’t matter
pa.list(*)
- it must contain the image array data
decode_example
( value: dict, token_per_repo_id = None )
Parameters
value (str
or dict
) — A string with the absolute image file path, a dictionary with keys:
path
: String with absolute or relative image file path.
bytes
: The bytes of the image file.
token_per_repo_id (dict
, optional) — To access and decode image files from private repositories on the Hub, you can pass a dictionary repo_id (str
) -> token (bool
or str
).
Decode example image file into image data.
embed_storage
( storage: StructArray ) → pa.StructArray
Parameters
storage (pa.StructArray
) — PyArrow array to embed.
Returns
pa.StructArray
Array in the Image arrow storage type, that is pa.struct({"bytes": pa.binary(), "path": pa.string()})
.
Embed image files into the Arrow array.
encode_example
( value: typing.Union[str, bytes, dict, numpy.ndarray, ForwardRef('PIL.Image.Image')] )
Parameters
value (str
, np.ndarray
, PIL.Image.Image
or dict
) — Data passed as input to Image feature.
Encode example into a format for Arrow.
flatten
( )
If in the decodable state, return the feature itself, otherwise flatten the feature into a dictionary.
( description: str, citation: str, features: Features, inputs_description: str = <factory>, homepage: str = <factory>, license: str = <factory>, codebase_urls: typing.List[str] = <factory>, reference_urls: typing.List[str] = <factory>, streamable: bool = False, format: typing.Optional[str] = None, metric_name: typing.Optional[str] = None, config_name: typing.Optional[str] = None, experiment_id: typing.Optional[str] = None )
Information about a metric.
MetricInfo
documents a metric, including its name, version, and features. See the constructor arguments and properties for a full list.
Note: Not all fields are known on construction and may be updated later.
from_directory
( metric_info_dir )
Create MetricInfo from the JSON file in metric_info_dir
.
write_to_directory
( metric_info_dir, pretty_print = False )
Write MetricInfo as JSON to metric_info_dir. Also save the license separately in LICENSE. If pretty_print is True, the JSON will be pretty-printed with an indent level of 4.
( config_name: typing.Optional[str] = None, keep_in_memory: bool = False, cache_dir: typing.Optional[str] = None, num_process: int = 1, process_id: int = 0, seed: typing.Optional[int] = None, experiment_id: typing.Optional[str] = None, max_concurrent_cache_files: int = 10000, timeout: typing.Union[int, float] = 100, **kwargs )
Parameters
config_name (str) — This is used to define a hash specific to a metric computation script and prevents the metric’s data from being overridden when the metric loading script is modified.
keep_in_memory (bool) — Keep all predictions and references in memory. Not possible in distributed settings.
cache_dir (str) — Path to a directory in which temporary prediction/references data will be stored. The data directory should be located on a shared file-system in distributed setups.
num_process (int) — Specify the total number of nodes in a distributed setting. This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).
process_id (int) — Specify the id of the current process in a distributed setup (between 0 and num_process-1). This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).
experiment_id (str) — A specific experiment id. This is used if several distributed evaluations share the same file system. This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).
max_concurrent_cache_files (int) — Max number of concurrent metric cache files (default 10000).
timeout (Union[int, float]) — Timeout in seconds for distributed setting synchronization.
A Metric is the base class and common API for all metrics.
Deprecated in 2.5.0
add
( prediction = None, reference = None, **kwargs )
Parameters
prediction (list/array/tensor, optional) — Predictions.
reference (list/array/tensor, optional) — References.
Add one prediction and reference for the metric’s stack.
add_batch
( predictions = None, references = None, **kwargs )
Parameters
predictions (list/array/tensor, optional) — Predictions.
references (list/array/tensor, optional) — References.
Add a batch of predictions and references for the metric’s stack.
compute
( predictions = None, references = None, **kwargs )
Parameters
predictions (list/array/tensor, optional) — Predictions.
references (list/array/tensor, optional) — References.
**kwargs (optional) — Keyword arguments that will be forwarded to the metrics _compute
method (see details in the docstring).
Compute the metrics.
Usage of positional arguments is not allowed to prevent mistakes.
Example:
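The add / add_batch / compute flow can be illustrated with a small accumulator. `TinyAccuracy` is a hypothetical stand-in, not the deprecated Metric class; it only mirrors the documented keyword-only calling convention and keeps everything in memory.

```python
class TinyAccuracy:
    """Illustrative accumulator, not the deprecated Metric class itself."""

    def __init__(self):
        self._predictions, self._references = [], []

    def add(self, *, prediction, reference):
        self._predictions.append(prediction)
        self._references.append(reference)

    def add_batch(self, *, predictions, references):
        self._predictions.extend(predictions)
        self._references.extend(references)

    def compute(self, *, predictions=None, references=None):
        # Keyword-only arguments mirror the "no positional arguments" rule.
        if predictions is not None and references is not None:
            self.add_batch(predictions=predictions, references=references)
        pairs = list(zip(self._predictions, self._references))
        self._predictions, self._references = [], []  # reset the stack
        return {"accuracy": sum(p == r for p, r in pairs) / len(pairs)}


metric = TinyAccuracy()
metric.add(prediction=1, reference=1)
metric.add_batch(predictions=[0, 1], references=[1, 1])
result = metric.compute()
print(result)
```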
download_and_prepare
( download_config: typing.Optional[datasets.download.download_config.DownloadConfig] = None, dl_manager: typing.Optional[datasets.download.download_manager.DownloadManager] = None )
Parameters
Downloads and prepares dataset for reading.
( *args, **kwargs )
Parameters
anon (bool, defaults to False) — Whether to use anonymous connection (public buckets only). If False, uses the key/secret given, or boto’s credential resolver (client_kwargs, environment, variables, config files, EC2 IAM server, in that order).
key (str
) — If not anonymous, use this access key ID, if specified.
secret (str
) — If not anonymous, use this secret access key, if specified.
token (str
) — If not anonymous, use this security token, if specified.
use_ssl (bool
, defaults to True
) — Whether to use SSL in connections to S3; may be faster without, but insecure. If use_ssl
is also set in client_kwargs
, the value set in client_kwargs
will take priority.
s3_additional_kwargs (dict
) — Parameters that are used when calling S3 API methods. Typically used for things like ServerSideEncryption.
client_kwargs (dict
) — Parameters for the botocore client.
requester_pays (bool
, defaults to False
) — Whether RequesterPays
buckets are supported.
default_block_size (int) — If given, the default block size value used for open() when no specific value is passed. The built-in default is 5MB.
default_fill_cache (bool
, defaults to True
) — Whether to use cache filling with open by default. Refer to S3File.open
.
default_cache_type (str
, defaults to bytes
) — If given, the default cache_type
value used for open()
. Set to none
if no caching is desired. See fsspec’s documentation for other available cache_type
values.
version_aware (bool, defaults to False) — Whether to support bucket versioning. If enabled, this requires the user to have the necessary IAM permissions for dealing with versioned objects.
cache_regions (bool
, defaults to False
) — Whether to cache bucket regions. Whenever a new bucket is used, it will first find out which region it belongs to and then use the client for that region.
asynchronous (bool
, defaults to False
) — Whether this instance is to be used from inside coroutines.
config_kwargs (dict) — Parameters passed to botocore.client.Config.
**kwargs — Other parameters for core session.
session (aiobotocore.session.AioSession) — Session to be used for all connections. This session will be used in place of creating a new session inside S3FileSystem. For example: aiobotocore.session.AioSession(profile='test_user').
skip_instance_cache (bool
) — Control reuse of instances. Passed on to fsspec
.
use_listings_cache (bool
) — Control reuse of directory listings. Passed on to fsspec
.
listings_expiry_time (int
or float
) — Control reuse of directory listings. Passed on to fsspec
.
max_paths (int
) — Control reuse of directory listings. Passed on to fsspec
.
Users can use this class to access S3 as if it were a file system. It exposes a filesystem-like API (ls, cp, open, etc.) on top of S3 storage. Provide credentials either explicitly (key=
, secret=
) or with boto’s credential methods. See botocore documentation for more information. If no credentials are available, use anon=True
.
Examples:
Listing files from public S3 bucket.
Listing files from private S3 bucket using aws_access_key_id
and aws_secret_access_key
.
Using S3Filesystem
with botocore.session.Session
and custom aws_profile
.
datasets.filesystems.extract_path_from_uri
( dataset_path: str )
Parameters
dataset_path (str
) — Path (e.g. dataset/train
) or remote uri (e.g. s3://my-bucket/dataset/train
) of the dataset directory.
Preprocesses dataset_path
and removes remote filesystem (e.g. removing s3://
).
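The prefix removal described here can be sketched in a few lines. `strip_protocol` is a hypothetical helper that mimics the documented behavior; it is not the library function.

```python
def strip_protocol(dataset_path: str) -> str:
    """Drop a leading '<protocol>://' prefix, if there is one."""
    if "://" in dataset_path:
        return dataset_path.split("://", 1)[1]
    return dataset_path


print(strip_protocol("s3://my-bucket/dataset/train"))
print(strip_protocol("dataset/train"))
```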
datasets.filesystems.is_remote_filesystem
( fs: AbstractFileSystem )
Parameters
Validates if filesystem has remote protocol.
( )
Hasher that accepts python objects as inputs.
Create DatasetInfo from the JSON file in dataset_info_dir.
This function updates all the dynamically generated fields (num_examples, hash, time of creation, …) of the DatasetInfo.
The base class Dataset implements a Dataset backed by an Apache Arrow table.
features (Features, optional) — Dataset features.
Convert pandas.DataFrame to a pyarrow.Table to create a Dataset.
features (Features, optional) — Dataset features.
Convert dict to a pyarrow.Table to create a Dataset.
features (Features, optional) — Dataset features.
Number of rows in the dataset (same as Dataset.__len__()).
column (str) — Column name (list all the column names with column_names).
( new_fingerprint: typing.Optional[str] = None, max_depth = 16 ) → Dataset
( features: Features, batch_size: typing.Optional[int] = 1000, keep_in_memory: bool = False, load_from_cache_file: typing.Optional[bool] = None, cache_file_name: typing.Optional[str] = None, writer_batch_size: typing.Optional[int] = 1000, num_proc: typing.Optional[int] = None ) → Dataset
features (Features) — New features to cast the dataset to. The name of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. str <-> ClassLabel, you should use map() to update the Dataset.
writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running map().
( column_names: typing.Union[str, typing.List[str]], new_fingerprint: typing.Optional[str] = None ) → Dataset
You can also remove a column using map() with remove_columns but the present method is in-place (doesn’t copy the data to a new dataset) and is thus faster.
( original_column_name: str, new_column_name: str, new_fingerprint: typing.Optional[str] = None ) → Dataset
( column_mapping: typing.Dict[str, str], new_fingerprint: typing.Optional[str] = None ) → Dataset
( column_names: typing.Union[str, typing.List[str]], new_fingerprint: typing.Optional[str] = None ) → Dataset
column (str) — The name of the column to cast (list all the column names with column_names).
Casts the given column as ClassLabel and updates the table.
If a formatting is set with set_format(), rows will be returned with the selected format.
Set __getitem__ return format (type and columns). The data formatting is applied on-the-fly. The format type (for example “numpy”) is used to format batches when using __getitem__. It’s also possible to use custom transforms for formatting using set_transform().
It is possible to call map() after calling set_format. Since map may add new columns, then the list of formatted columns
transform (Callable, optional) — User-defined formatting transform, replaces the format defined by set_format(). A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in __getitem__.
Set __getitem__ return format using this transform. The transform is applied on-the-fly on batches when __getitem__ is called. As with set_format(), this can be reset using reset_format().
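The on-the-fly behavior can be pictured with a toy container whose `__getitem__` applies the transform at read time and leaves the stored columns untouched. `TinyDataset` is a hypothetical sketch, not the library class, and it only handles slice access.

```python
class TinyDataset:
    """Toy column store; the transform runs only when rows are read."""

    def __init__(self, columns):
        self._columns = columns  # dict of column name -> list of values
        self._transform = None

    def set_transform(self, transform):
        self._transform = transform

    def reset_format(self):
        self._transform = None

    def __getitem__(self, key):
        # Build a batch (dict of lists) for the requested slice.
        batch = {name: values[key] for name, values in self._columns.items()}
        return self._transform(batch) if self._transform else batch


ds = TinyDataset({"text": ["a", "b", "c"]})
ds.set_transform(lambda batch: {"text": [t.upper() for t in batch["text"]]})
transformed = ds[0:2]
ds.reset_format()
raw = ds[0:2]
print(transformed, raw)
```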
It’s also possible to use custom transforms for formatting using with_transform().
Contrary to set_format(), with_format returns a new Dataset object.
transform (Callable, optional) — User-defined formatting transform, replaces the format defined by set_format(). A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in __getitem__.
As with set_format(), this can be reset using reset_format().
Contrary to set_transform(), with_transform returns a new Dataset object.
Shuffling takes the list of indices [0:len(my_dataset)] and shuffles it to create an indices mapping. However, as soon as your Dataset has an indices mapping, the speed can become 10x slower. This is because there is an extra step to get the row index to read using the indices mapping, and most importantly, you aren’t reading contiguous chunks of data anymore. To restore the speed, you’d need to rewrite the entire dataset on your disk again using flatten_indices(), which removes the indices mapping.
In this case, we recommend switching to an IterableDataset and leveraging its fast approximate shuffling method IterableDataset.shuffle().
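To picture why an indices mapping adds a lookup step, here is a small pure-Python sketch. The list `rows` stands in for data stored contiguously on disk; none of these names come from the library.

```python
import random

rows = ["a", "b", "c", "d", "e"]      # stands in for contiguous data on disk
indices = list(range(len(rows)))      # [0:len(my_dataset)]
random.Random(42).shuffle(indices)    # the indices mapping


def get_row(i):
    # Extra indirection: position i -> indices[i] -> stored row.
    return rows[indices[i]]


shuffled_view = [get_row(i) for i in range(len(rows))]

# Rewriting the data in the new order removes the need for the mapping.
rewritten = list(shuffled_view)
print(shuffled_view)
```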
Return a dictionary (DatasetDict) with two random train and test subsets (train and test Dataset splits). Splits are created from the dataset according to test_size, train_size and shuffle.
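A rough sketch of how such a split can be derived from row indices is shown below. `split_indices` is hypothetical; the rounding of a fractional test_size and the choice of which rows land in each subset are assumptions of this sketch, not the library's exact rules.

```python
import math
import random


def split_indices(n_rows, test_size=0.25, shuffle=True, seed=None):
    """Return {"train": [...], "test": [...]} row indices for n_rows rows."""
    # Assumption: a float test_size is a fraction, rounded up to a row count.
    n_test = math.ceil(n_rows * test_size) if isinstance(test_size, float) else test_size
    indices = list(range(n_rows))
    if shuffle:
        random.Random(seed).shuffle(indices)
    n_train = n_rows - n_test
    return {"train": indices[:n_train], "test": indices[n_train:]}


splits = split_indices(10, test_size=0.2, seed=0)
print(len(splits["train"]), len(splits["test"]))
```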
Audio and Image: remove local path information and embed file content in the Parquet files.
The resulting Parquet files are self-contained by default. If your dataset contains Image or Audio data, the Parquet files will store the bytes of your images or audio files. You can disable this by setting embed_external_files to False.
For Image and Audio data:
( dataset_path: str, fs = 'deprecated', keep_in_memory: typing.Optional[bool] = None, storage_options: typing.Optional[dict] = None ) → Dataset or DatasetDict
keep_in_memory (bool, defaults to None) — Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See more details in the improve performance section.
Dataset or DatasetDict
features (Optional[datasets.Features], defaults to None) — Use a specific Features to store the cache file instead of the automatically generated one.
**to_csv_kwargs (additional keyword arguments) — Parameters to pass to pandas.DataFrame.to_csv.
**to_json_kwargs (additional keyword arguments) — Parameters to pass to pandas.DataFrame.to_json.
con (str or sqlite3.Connection or sqlalchemy.engine.Connection) — A URI string or a SQLite3/SQLAlchemy connection object used to write to a database.
**sql_writer_kwargs (additional keyword arguments) — Parameters to pass to pandas.DataFrame.to_sql.
num_shards (int, defaults to 1) — Number of shards to define when instantiating the iterable dataset. This is especially useful for big datasets to be able to shuffle properly, and also to enable fast parallel loading using a PyTorch DataLoader or in distributed setups for example. Shards are defined using Dataset.shard(): it simply slices the data without writing anything on disk.
Get an IterableDataset from a map-style Dataset. This is equivalent to loading a dataset in streaming mode with load_dataset(), but much faster since the data is streamed from local files.
Still, it is possible to shuffle an iterable dataset using IterableDataset.shuffle(). This is a fast approximate shuffling that works best if you have multiple shards and if you specify a buffer size that is big enough.
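Buffer-based approximate shuffling can be sketched as a generator: fill a fixed-size buffer, then repeatedly emit a random buffered element and replace it with the next incoming one. `buffer_shuffle` is a hypothetical illustration of the general technique, not the library's implementation.

```python
import random


def buffer_shuffle(iterable, buffer_size, seed=None):
    """Approximately shuffle a stream using a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        i = rng.randrange(buffer_size)
        yield buffer[i]          # emit a random buffered element
        buffer[i] = item         # and put the new element in its place
    rng.shuffle(buffer)
    yield from buffer            # drain what is left at the end


shuffled = list(buffer_shuffle(range(10), buffer_size=4, seed=0))
print(shuffled)
```

A larger buffer mixes elements from further apart in the stream; with a buffer of size 1 the order is unchanged.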
index_name (str, optional) — The index_name/identifier of the index. This is the index_name that is used to call get_nearest_examples() or search(). By default it corresponds to column.
For
index_name (str) — The index_name/identifier of the index. This is the index_name that is used to call get_nearest_examples() or search().
For
index_name (str, optional) — The index_name/identifier of the index. This is the index name that is used to call get_nearest_examples() or search(). By default it corresponds to column.
DatasetInfo object containing all the metadata in the dataset.
NamedSplit object corresponding to a named dataset split.
split (NamedSplit, optional) — Split name to be assigned to the dataset.
features (Features, optional) — Dataset features.
split (NamedSplit, optional) — Split name to be assigned to the dataset.
features (Features, optional) — Dataset features.
con (str or sqlite3.Connection or sqlalchemy.engine.Connection) — A URI string used to instantiate a database connection or a SQLite3/SQLAlchemy connection object.
features (Features, optional) — Dataset features.
If TaskTemplate, must be one of the task templates in datasets.tasks.
Prepare a dataset for the given task by casting the dataset’s Features to standardized column names and types as detailed in datasets.tasks.
Converts a list of Dataset with the same schema into a single Dataset.
( datasets: typing.List[~DatasetType], probabilities: typing.Optional[typing.List[float]] = None, seed: typing.Optional[int] = None, info: typing.Optional[datasets.info.DatasetInfo] = None, split: typing.Optional[datasets.splits.NamedSplit] = None, stopping_strategy: typing.Literal['first_exhausted', 'all_exhausted'] = 'first_exhausted' ) → Dataset or IterableDataset
info (DatasetInfo, optional) — Dataset information, like description, citation, etc.
split (NamedSplit, optional) — Name of the dataset split.
Dataset or IterableDataset
You can use this function on a list of Dataset objects, or on a list of IterableDataset objects.
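For intuition about the default stopping strategy, the sketch below alternates between plain Python lists and stops as soon as the shortest one runs out. `interleave_first_exhausted` is a hypothetical helper; it only covers the case without probabilities and with stopping_strategy left at 'first_exhausted'.

```python
def interleave_first_exhausted(datasets):
    """Alternate between datasets, stopping once the shortest one runs out."""
    result = []
    for rows in zip(*datasets):  # zip stops at the shortest input
        result.extend(rows)
    return result


d1, d2, d3 = [0, 1, 2], [10, 11, 12, 13], [20, 21, 22, 23, 24]
mixed = interleave_first_exhausted([d1, d2, d3])
print(mixed)
```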
( dataset: DatasetType, rank: int, world_size: int ) → Dataset or IterableDataset
dataset (Dataset or IterableDataset) — The dataset to split by node.
Dataset or IterableDataset
use save_to_disk() to save a transformed dataset or it will be deleted when session closes
caching doesn’t affect load_dataset(). If you want to regenerate a dataset from scratch you should use the download_mode parameter in load_dataset().
use save_to_disk() to save a transformed dataset or it will be deleted when session closes
caching doesn’t affect load_dataset(). If you want to regenerate a dataset from scratch you should use the download_mode parameter in load_dataset().
use save_to_disk() to save a transformed dataset or it will be deleted when session closes
caching doesn’t affect load_dataset(). If you want to regenerate a dataset from scratch you should use the download_mode parameter in load_dataset().
Number of rows in each split of the dataset (same as Dataset.__len__()).
column (str) — Column name (list all the column names with column_names).
features (datasets.Features, optional, defaults to None) — Use a specific Features to store the cache file instead of the automatically generated one.
It’s also possible to use custom transforms for formatting using with_transform().
Contrary to set_format(), with_format returns a new DatasetDict object with new Dataset objects.
transform (Callable, optional) — User-defined formatting transform, replaces the format defined by set_format(). A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in __getitem__.
As with set_format(), this can be reset using reset_format().
Contrary to set_transform(), with_transform returns a new DatasetDict object with new Dataset objects.
features (Features) — New features to cast the dataset to. The name and order of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. string <-> ClassLabel, you should use map() to update the Dataset.
You can also remove a column using map() with feature but cast is in-place (doesn’t copy the data to a new dataset) and is thus faster.
You can also remove a column using map() with remove_columns but the present method is in-place (doesn’t copy the data to a new dataset) and is thus faster.
You can also rename a column using map() with remove_columns but the present method:
( column_mapping: typing.Dict[str, str] ) → DatasetDict
Casts the given column as ClassLabel and updates the tables.
Audio and Image: remove local path information and embed file content in the Parquet files.
Pushes the DatasetDict to the hub as a Parquet dataset. The DatasetDict is pushed using HTTP requests and needs neither git nor git-lfs installed.
The resulting Parquet files are self-contained by default: if your dataset contains Image or Audio data, the Parquet files will store the bytes of your images or audio files. You can disable this by setting embed_external_files to False.
For Image and Audio data:
keep_in_memory (bool, defaults to None) — Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See more details in the improve performance section.
features (Features, optional) — Dataset features.
Create DatasetDict from CSV file(s).
features (Features, optional) — Dataset features.
Create DatasetDict from JSON Lines file(s).
features (Features, optional) — Dataset features.
Create DatasetDict from Parquet file(s).
features (Features, optional) — Dataset features.
Create DatasetDict from text file(s).
If TaskTemplate, must be one of the task templates in datasets.tasks.
Prepare a dataset for the given task by casting the dataset’s Features to standardized column names and types as detailed in datasets.tasks.
The base class IterableDataset implements an iterable Dataset backed by python generators.
features (Features) — New features to cast the dataset to. The name of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. string <-> ClassLabel, you should use map() to update the Dataset.
If the dataset is made of several shards, it also shuffles the order of the shards. However, if the order has been fixed by using skip() or take(), then the order of the shards is kept unchanged.
Create a new IterableDataset that skips the first n elements.
Create a new IterableDataset with only the first n elements.
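On any Python iterable, the same two operations can be sketched with `itertools.islice`. The `skip` and `take` functions below are stand-alone helpers for illustration, not the library methods.

```python
from itertools import islice


def skip(iterable, n):
    """Lazily drop the first n elements."""
    return islice(iterable, n, None)


def take(iterable, n):
    """Lazily keep only the first n elements."""
    return islice(iterable, n)


examples = ({"id": i} for i in range(10))
window = list(take(skip(examples, 3), 2))
print(window)
```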
DatasetInfo object containing all the metadata in the dataset.
NamedSplit object corresponding to a named dataset split.
If the dataset is made of several shards, it also shuffles the order of the shards. However, if the order has been fixed by using skip() or take(), then the order of the shards is kept unchanged.
( features: Features ) →
( column_names: typing.Union[str, typing.List[str]] ) →
( original_column_name: str, new_column_name: str ) →
( column_mapping: typing.Dict[str, str] ) →
( column_names: typing.Union[str, typing.List[str]] ) →
a Value feature specifies a single typed value, e.g. int64 or string.
a ClassLabel feature specifies a field with a predefined set of classes which can have labels associated to them and will be stored as integers in the dataset.
a python list or a Sequence specifies that the field contains a list of objects. The python list or Sequence should be provided with a single sub-feature as an example of the feature type hosted in this list.
A Sequence with an internal dictionary feature will be automatically converted into a dictionary of lists. This behavior is implemented to have a compatibility layer with the TensorFlow Datasets library but may be unwanted in some cases. If you don’t want this behavior, you can use a python list instead of the Sequence.
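The list-of-dictionaries to dictionary-of-lists conversion can be shown on plain Python data. `list_of_dicts_to_dict_of_lists` is a hypothetical helper, and the field names are made up for the example.

```python
def list_of_dicts_to_dict_of_lists(rows):
    """Turn [{"k": v1}, {"k": v2}] into {"k": [v1, v2]}."""
    keys = rows[0].keys() if rows else []
    return {key: [row[key] for row in rows] for key in keys}


answers = [{"text": "Paris", "start": 12}, {"text": "paris", "start": 40}]
converted = list_of_dicts_to_dict_of_lists(answers)
print(converted)
```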
an Array2D, Array3D, Array4D or Array5D feature for multidimensional arrays.
an Audio feature to store the absolute path to an audio file or a dictionary with the relative path to an audio file (“path” key) and its bytes content (“bytes” key). This feature extracts the audio data.
an Image feature to store the absolute path to an image file, an np.ndarray object, a PIL.Image.Image object or a dictionary with the relative path to an image file (“path” key) and its bytes content (“bytes” key). This feature extracts the image data.
Translation and TranslationVariableLanguages, the two features specific to Machine Translation.
Make a deep copy of Features.
( max_depth = 16 ) → Features
Construct Features from Arrow Schema. It also checks the schema metadata for BOINC AI Datasets features. Non-nullable fields are not supported and set to nullable.
The base class Metric implements a Metric backed by one or several Dataset.
seed (int, optional) — If specified, this will temporarily set numpy’s random seed when compute() is run.
Use the new library 🌍 Evaluate instead:
download_config (DownloadConfig, optional) — Specific download configuration parameters.
dl_manager (DownloadManager, optional) — Specific download manager to use.
datasets.filesystems.S3FileSystem is a subclass of s3fs.S3FileSystem.
Loading dataset from S3 using S3Filesystem and load_from_disk().
Saving dataset to S3 using S3Filesystem and save_to_disk().
fs (fsspec.spec.AbstractFileSystem) — An abstract super-class for pythonic file-systems, e.g. fsspec.filesystem('file') or datasets.filesystems.S3FileSystem.