Explore dataset statistics
Last updated
Datasets Server provides a /statistics endpoint for fetching basic statistics precomputed for a requested dataset. This gives you quick insight into how the data is distributed.
Currently, statistics are computed only for .
The /statistics endpoint requires three query parameters:
dataset: the dataset name, for example glue
config: the configuration name, for example cola
split: the split name, for example train
Let's get some stats for the glue dataset, cola config, train split:
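A minimal sketch of such a request, using only the Python standard library. The base URL below is an assumption (the public Datasets Server host is not stated in this text), and the helper names build_statistics_url and fetch_statistics are hypothetical, introduced here for illustration only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: host of the public Datasets Server instance (not given in the text).
BASE_URL = "https://datasets-server.huggingface.co"


def build_statistics_url(dataset, config, split, base_url=BASE_URL):
    """Build the /statistics URL from the three required query parameters."""
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{base_url}/statistics?{query}"


def fetch_statistics(dataset, config, split):
    """Send the GET request and decode the JSON body (requires network access)."""
    with urlopen(build_statistics_url(dataset, config, split)) as response:
        return json.load(response)


url = build_statistics_url("glue", "cola", "train")
print(url)
```

Calling fetch_statistics("glue", "cola", "train") would then return the decoded response as a Python dictionary, provided the assumed host is reachable.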
The response JSON contains two keys:
num_examples - number of samples in a split
column_type in the response can be one of the following values:
float - for float dtypes
int - for integer dtypes
string_label - for string dtypes, if the string column has at most 30 unique values in the given split
string_text - for string dtypes, if the string column has more than 30 unique values in the given split
The following measures are returned for class_label columns:
number and proportion of null values
number and proportion of values with no label
number of unique values (excluding null and no label)
value counts for each label (excluding null and no label)
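The label-oriented measures listed above (null values, values with no label, unique values, per-label counts) could be computed roughly as follows. This is only an illustrative sketch, not the server's implementation: the use of None for null, the value -1 for "no label", the output key names, and the function name class_label_summary are all assumptions.

```python
from collections import Counter

# Assumption: examples without a label are encoded as -1, nulls as None.
NO_LABEL = -1


def class_label_summary(values, names):
    """Summarize an integer-encoded categorical column; `names` maps id -> label name."""
    total = len(values)
    null_count = sum(v is None for v in values)
    no_label_count = sum(v == NO_LABEL for v in values)
    # Keep only real labels: neither null nor "no label".
    labelled = [v for v in values if v is not None and v != NO_LABEL]
    frequencies = Counter(names[v] for v in labelled)
    return {
        "null_count": null_count,
        "null_proportion": null_count / total,
        "no_label_count": no_label_count,
        "no_label_proportion": no_label_count / total,
        "n_unique": len(frequencies),
        "frequencies": dict(frequencies),
    }


summary = class_label_summary([0, 1, 1, None, -1], ["unacceptable", "acceptable"])
print(summary)
```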
The following measures are returned for float data types:
minimum, maximum, mean, and standard deviation values
number and proportion of null values
histogram with 10 bins
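To make these measures concrete, here is a rough sketch of how they could be computed for a small column. It is an illustration under stated assumptions, not the server's code: nulls are represented as None, the sample standard deviation is used (the text does not say which variant applies), the bins are equal-width, and the function name and output keys are hypothetical.

```python
import statistics


def float_column_summary(values, n_bins=10):
    """Compute summary measures and an equal-width histogram for a float column."""
    present = [v for v in values if v is not None]
    null_count = len(values) - len(present)
    lo, hi = min(present), max(present)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for v in present:
        # The maximum value is clamped into the last bin.
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return {
        "min": lo,
        "max": hi,
        "mean": statistics.mean(present),
        # Assumption: sample standard deviation; 0.0 for a single value.
        "std": statistics.stdev(present) if len(present) > 1 else 0.0,
        "null_count": null_count,
        "null_proportion": null_count / len(values),
        "histogram": counts,
    }


summary = float_column_summary([0.0, 2.5, 5.0, 7.5, 10.0, None])
print(summary)
```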
The following measures are returned for integer data types:
minimum, maximum, mean, and standard deviation values
number and proportion of null values
histogram with at most 10 bins
If a string column has at most 30 unique values within the requested split, it is treated as a category. The following measures are returned:
number and proportion of null values
number of unique values (excluding null)
value counts for each label (excluding null)
If a string column has more than 30 unique values within the requested split, it is treated as a column of texts, and the response contains statistics over text lengths. The following measures are computed:
minimum, maximum, mean, and standard deviation of text lengths
number and proportion of null values
histogram of text lengths with 10 bins
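The 30-unique-value rule for string columns can be sketched as follows. The threshold comes from the text; everything else is an assumption made for illustration: nulls are represented as None and ignored when counting unique values, text length is measured in characters, and the function names are hypothetical.

```python
import statistics

# Threshold from the text: at most 30 unique values means a category.
MAX_LABEL_VALUES = 30


def classify_string_column(values, max_label_values=MAX_LABEL_VALUES):
    """Return "string_label" or "string_text" based on the number of unique values."""
    # Assumption: nulls (None) do not count as a unique value.
    unique_values = {v for v in values if v is not None}
    return "string_label" if len(unique_values) <= max_label_values else "string_text"


def text_length_summary(values):
    """Summarize text lengths (in characters) of the non-null values."""
    lengths = [len(v) for v in values if v is not None]
    return {"min": min(lengths), "max": max(lengths), "mean": statistics.mean(lengths)}


few = ["yes", "no", None, "yes"]
many = [f"sentence number {i}" for i in range(31)]

print(classify_string_column(few))
print(classify_string_column(many))
print(text_length_summary(many))
```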
statistics - list of dictionaries with statistics for each column. Each dictionary has three keys: column_name, column_type, and column_statistics. The content of column_statistics depends on the column type; the measures returned for each type are listed in this document.
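As a small illustration of this structure, the sketch below walks a response and maps each column name to its type. The sample dictionary is made up for the example (its values are not real statistics, and the column_statistics entries are left empty as placeholders); only the key names num_examples, statistics, column_name, column_type, and column_statistics come from the text.

```python
def summarize_columns(response):
    """Map each column name to its column type."""
    return {col["column_name"]: col["column_type"] for col in response["statistics"]}


# Hypothetical response, shaped like the description above; values are illustrative.
sample_response = {
    "num_examples": 3,
    "statistics": [
        {"column_name": "idx", "column_type": "int", "column_statistics": {}},
        {"column_name": "label", "column_type": "class_label", "column_statistics": {}},
        {"column_name": "sentence", "column_type": "string_text", "column_statistics": {}},
    ],
}

columns = summarize_columns(sample_response)
print(sample_response["num_examples"], columns)
```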
Currently, statistics are supported for strings, float and integer numbers, and the special ClassLabel feature type of the datasets library.
class_label - for the ClassLabel feature
This type represents categorical data encoded as a ClassLabel feature. The following measures are computed: