Explore dataset statistics

Explore statistics over split data

Datasets Server provides a /statistics endpoint that returns basic statistics precomputed for a requested dataset. These give you quick insight into how the data is distributed.

Currently, statistics are computed only for datasets with Parquet exports.

The /statistics endpoint requires three query parameters:

  • dataset: the dataset name, for example glue

  • config: the configuration name, for example cola

  • split: the split name, for example train
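As a minimal sketch of how the three parameters combine into a request URL, the helper below uses only the standard library. The function name build_statistics_url is hypothetical, and the base URL is the one used in the request example in this guide.

```python
from urllib.parse import urlencode

# Base URL as used in the request example in this guide.
BASE_URL = "https://datasets-server.boincai.com/statistics"


def build_statistics_url(dataset: str, config: str, split: str) -> str:
    """Return the /statistics URL for one dataset, config and split."""
    # urlencode percent-escapes characters that are unsafe in a query string,
    # for example the "/" in a namespaced dataset name.
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{BASE_URL}?{query}"


url = build_statistics_url("glue", "cola", "train")
print(url)
```

Building the query string this way avoids hand-escaping dataset names that contain special characters.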

Let’s get some statistics for the train split of the cola config of the glue dataset:


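import requests

# API_TOKEN is a placeholder: substitute your own access token.
API_TOKEN = "YOUR_API_TOKEN"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.boincai.com/statistics?dataset=glue&config=cola&split=train"

def query():
    # Send the GET request and decode the JSON body.
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()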

The response JSON contains two keys:

  • num_examples - the number of samples in the split

  • statistics - a list of dictionaries, one per column. Each dictionary has three keys: column_name, column_type, and column_statistics. The content of column_statistics depends on the column type; see Response structure by data type for details.
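The two top-level keys can be walked as in the sketch below. The sample dictionary is hand-written for illustration, not a real response: its values are made up, and column_statistics is left empty because its content depends on the column type. The helper name list_columns is hypothetical.

```python
def list_columns(data: dict) -> list[tuple[str, str]]:
    """Return (column_name, column_type) pairs from a /statistics response."""
    return [(col["column_name"], col["column_type"]) for col in data["statistics"]]


# Hand-written stand-in for a response; values are invented for illustration.
sample = {
    "num_examples": 3,
    "statistics": [
        {"column_name": "label", "column_type": "class_label", "column_statistics": {}},
        {"column_name": "sentence", "column_type": "string_text", "column_statistics": {}},
    ],
}

print(sample["num_examples"])
for name, col_type in list_columns(sample):
    print(f"{name}: {col_type}")
```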


Response structure by data type

Currently, statistics are supported for strings, float and integer numbers, and the special datasets.ClassLabel feature type of the datasets library.

column_type in the response can be one of the following values:

  • class_label - for datasets.ClassLabel feature

  • float - for float dtypes

  • int - for integer dtypes

  • string_label - for string dtypes, if the string column has 30 or fewer unique values in the given split

  • string_text - for string dtypes, if the string column has more than 30 unique values in the given split
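The rule that separates the two string types can be written as a one-line check. This is a sketch of the rule as stated above, not the server's implementation; the names MAX_STRING_LABELS and string_column_type are hypothetical.

```python
# Threshold stated in the list above: up to 30 unique values is a category.
MAX_STRING_LABELS = 30


def string_column_type(n_unique: int) -> str:
    """Map the number of unique values in a string column to its column_type."""
    return "string_label" if n_unique <= MAX_STRING_LABELS else "string_text"


print(string_column_type(30))
print(string_column_type(31))
```

Because the count is taken per split, the same column could in principle be reported as string_label in one split and string_text in another.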

class_label

This type represents categorical data encoded as a ClassLabel feature. The following measures are computed:

  • number and proportion of null values

  • number and proportion of values with no label

  • number of unique values (excluding null and no label)

  • value counts for each label (excluding null and no label)
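To make these measures concrete, the sketch below computes them locally for a small list of encoded labels. It rests on assumptions not stated in this text: null values are represented as None, "no label" is represented by the sentinel -1, and the output key names are illustrative rather than the endpoint's actual field names.

```python
from collections import Counter

NO_LABEL = -1  # assumption: sentinel value marking "no label"


def class_label_stats(values, names):
    """Compute the measures listed above for a list of encoded labels."""
    n = len(values)
    nulls = sum(v is None for v in values)
    no_label = sum(v == NO_LABEL for v in values)
    # Value counts exclude both nulls and "no label" entries.
    counts = Counter(names[v] for v in values if v is not None and v != NO_LABEL)
    return {
        "null_count": nulls,
        "null_proportion": nulls / n if n else 0.0,
        "no_label_count": no_label,
        "no_label_proportion": no_label / n if n else 0.0,
        "n_unique": len(counts),
        "frequencies": dict(counts),
    }


stats = class_label_stats([0, 1, 1, None, -1], ["unacceptable", "acceptable"])
print(stats)
```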


float

The following measures are returned for float data types:

  • minimum, maximum, mean, and standard deviation values

  • number and proportion of null values

  • histogram with 10 bins
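The following sketch shows one way such measures could be computed for a list of floats, with None standing in for null. It is an illustration under assumptions: the key names are made up, the bins are equal-width, and the sample standard deviation is used because the text does not say whether the endpoint uses the sample or the population formula.

```python
import statistics


def float_stats(values, n_bins=10):
    """Compute summary measures and an equal-width histogram for floats."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("column contains only null values")
    nulls = len(values) - len(present)
    lo, hi = min(present), max(present)
    width = (hi - lo) / n_bins
    hist = [0] * n_bins
    for v in present:
        # The maximum lands on the last edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1) if width else 0
        hist[idx] += 1
    return {
        "min": lo,
        "max": hi,
        "mean": statistics.fmean(present),
        "std": statistics.stdev(present) if len(present) > 1 else 0.0,
        "null_count": nulls,
        "null_proportion": nulls / len(values),
        "histogram": hist,
        "bin_edges": [lo + i * width for i in range(n_bins + 1)],
    }


result = float_stats([0.0, 1.0, 2.0, None, 10.0])
print(result["histogram"])
```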


int

The following measures are returned for integer data types:

  • minimum, maximum, mean, and standard deviation values

  • number and proportion of null values

  • histogram with at most 10 bins
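The text does not describe how the number of bins is chosen. One plausible reading, offered here purely as an assumption, is that a column spanning fewer than 10 distinct integers gets one bin per integer. The helper int_bin_count is hypothetical and only illustrates that reading.

```python
def int_bin_count(minimum: int, maximum: int, max_bins: int = 10) -> int:
    """Number of histogram bins under the one-bin-per-integer assumption."""
    # An integer range [minimum, maximum] holds maximum - minimum + 1 values,
    # so there is no point in using more bins than that.
    return min(max_bins, maximum - minimum + 1)


print(int_bin_count(0, 3))
print(int_bin_count(0, 100))
```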


string_label

If a string column has 30 or fewer unique values within the requested split, it is treated as a category. The following measures are returned:

  • number and proportion of null values

  • number of unique values (excluding null)

  • value counts for each label (excluding null)


string_text

If a string column has more than 30 unique values within the requested split, it is treated as a text column, and the response contains statistics over text lengths. The following measures are computed:

  • minimum, maximum, mean, and standard deviation of text lengths

  • number and proportion of null values

  • histogram of text lengths with 10 bins
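The key step for this type is that statistics are taken over the lengths of the texts rather than the texts themselves. The sketch below illustrates that step, assuming length means the number of characters (the text does not specify the unit), that null is represented as None, and that the sample standard deviation is used. The key names are illustrative.

```python
import statistics


def text_length_stats(texts):
    """Summarize the lengths of non-null texts in a column."""
    # Null entries are counted separately and excluded from the lengths.
    lengths = [len(t) for t in texts if t is not None]
    nulls = len(texts) - len(lengths)
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.fmean(lengths),
        "std": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        "null_count": nulls,
        "null_proportion": nulls / len(texts),
    }


lengths_summary = text_length_stats(["a", "abc", None, "abcde"])
print(lengths_summary)
```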

