Explore dataset statistics

Explore statistics over split data

Datasets Server provides a /statistics endpoint that returns basic statistics precomputed for a requested dataset split. It gives you quick insight into how the data is distributed.

Currently, statistics are computed only for datasets with Parquet exports.

The /statistics endpoint requires three query parameters:

  • dataset: the dataset name, for example glue

  • config: the configuration name, for example cola

  • split: the split name, for example train
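As a minimal sketch, the three query parameters above can be assembled into a request URL with the standard library. The helper name build_statistics_url is hypothetical; the base URL is the one used in the example in this guide.

```python
from urllib.parse import urlencode

# Base URL taken from the example in this guide.
BASE_URL = "https://datasets-server.boincai.com/statistics"

def build_statistics_url(dataset: str, config: str, split: str) -> str:
    """Return the /statistics URL for one dataset, config, and split."""
    # urlencode escapes characters such as "/" in names like "user/dataset".
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{BASE_URL}?{query}"

url = build_statistics_url("glue", "cola", "train")
print(url)
```

Building the query string this way keeps dataset names that contain a slash or other special characters valid in the URL.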

Let’s get some stats for the glue dataset, cola config, train split:

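import requests

# API_TOKEN must hold your access token; the value below is a placeholder.
API_TOKEN = "<your access token>"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.boincai.com/statistics?dataset=glue&config=cola&split=train"

def query():
    # Send the GET request and decode the JSON body of the response.
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()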

The response JSON contains two keys:

  • num_examples - number of samples in a split

  • statistics - a list of dictionaries, one per column. Each dictionary has three keys: column_name, column_type, and column_statistics. The content of column_statistics depends on the column type; see Response structure by data type for more details
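To make the structure concrete, here is a small sketch that walks a response shaped like the one described above. The sample_response dictionary is hand-written for illustration and is not real API output; column_statistics is left empty because its content depends on the column type. The helper name summarize_columns is hypothetical.

```python
# Hypothetical response, hand-written to mirror the two top-level keys
# described above; the values are illustrative, not real API output.
sample_response = {
    "num_examples": 100,
    "statistics": [
        {"column_name": "label", "column_type": "class_label", "column_statistics": {}},
        {"column_name": "sentence", "column_type": "string_text", "column_statistics": {}},
    ],
}

def summarize_columns(response: dict) -> dict:
    """Map each column name to its column_type."""
    return {
        column["column_name"]: column["column_type"]
        for column in response["statistics"]
    }

column_types = summarize_columns(sample_response)
print(sample_response["num_examples"], column_types)
```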

Response structure by data type

Currently, statistics are supported for strings, float and integer numbers, and the special datasets.ClassLabel feature type of the datasets library.

column_type in the response can be one of the following values:

  • class_label - for the datasets.ClassLabel feature

  • float - for float dtypes

  • int - for integer dtypes

  • string_label - for string dtypes, if the string column has 30 or fewer unique values in a given split

  • string_text - for string dtypes, if the string column has more than 30 unique values in a given split
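The rule that separates the two string types can be sketched as a simple threshold check. This only illustrates the rule stated above and is not the server's implementation; the function name is hypothetical, and excluding nulls from the unique count is an assumption.

```python
# Threshold stated in the list above: 30 or fewer unique values means a category.
MAX_LABEL_UNIQUE_VALUES = 30

def string_column_type(values: list) -> str:
    """Return "string_label" or "string_text" for a string column (None = null)."""
    # Assumption: null values do not count as unique values.
    unique_values = {value for value in values if value is not None}
    if len(unique_values) <= MAX_LABEL_UNIQUE_VALUES:
        return "string_label"
    return "string_text"

print(string_column_type(["yes", "no", "yes", None]))
```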

class_label

This type represents categorical data encoded as a ClassLabel feature. The following measures are computed:

  • number and proportion of null values

  • number and proportion of values with no label

  • number of unique values (excluding null and no label)

  • value counts for each label (excluding null and no label)
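The measures listed above could be computed along the following lines. This is a sketch under stated assumptions, not the server's code: it assumes None marks a null value and -1 marks a value with no label, and the keys of the returned dictionary are hypothetical names chosen for readability.

```python
from collections import Counter

# Assumption: None marks a null value and -1 marks "no label".
NO_LABEL = -1

def class_label_measures(values: list, names: list) -> dict:
    """Summarize a column of integer class ids; names[i] is the label for id i."""
    total = len(values)
    null_count = sum(1 for v in values if v is None)
    no_label_count = sum(1 for v in values if v == NO_LABEL)
    # Unique values and value counts exclude both nulls and "no label".
    labeled = [v for v in values if v is not None and v != NO_LABEL]
    counts = Counter(names[v] for v in labeled)
    return {
        "null_count": null_count,
        "null_proportion": null_count / total if total else 0.0,
        "no_label_count": no_label_count,
        "no_label_proportion": no_label_count / total if total else 0.0,
        "n_unique": len(counts),
        "value_counts": dict(counts),
    }

measures = class_label_measures(
    [0, 1, 1, -1, None, 1, 0, 1], ["unacceptable", "acceptable"]
)
print(measures)
```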

float

The following measures are returned for float data types:

  • minimum, maximum, mean, and standard deviation values

  • number and proportion of null values

  • histogram with 10 bins
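For intuition, the float measures above can be reproduced locally with the standard library. This sketch makes several assumptions that the text does not confirm: None marks a null value, the standard deviation is the sample standard deviation, and the 10 bins have equal width between the minimum and the maximum. The function and key names are hypothetical.

```python
import statistics

def float_measures(values, n_bins=10):
    """Summarize a float column; None marks a null value.

    Assumes at least two non-null values, since stdev needs them.
    """
    present = [v for v in values if v is not None]
    null_count = len(values) - len(present)
    lo, hi = min(present), max(present)
    width = (hi - lo) / n_bins
    histogram = [0] * n_bins
    for v in present:
        # Clamp so the maximum lands in the last bin; a zero width
        # (all values equal) puts everything in the first bin.
        index = min(int((v - lo) / width), n_bins - 1) if width else 0
        histogram[index] += 1
    return {
        "min": lo,
        "max": hi,
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),  # sample standard deviation (assumption)
        "null_count": null_count,
        "null_proportion": null_count / len(values),
        "histogram": histogram,
    }

float_stats = float_measures([float(i) for i in range(11)] + [None])
print(float_stats["histogram"])
```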

int

The following measures are returned for integer data types:

  • minimum, maximum, mean, and standard deviation values

  • number and proportion of null values

  • histogram with 10 or fewer bins
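One way an integer histogram can end up with fewer than 10 bins is when the column spans fewer than 10 distinct integers, so each bin covers a whole number of integers. The sketch below illustrates that reading; it is an assumption about the binning, not a description of the server's algorithm, and the function and key names are hypothetical.

```python
import math

def int_histogram(values, max_bins=10):
    """Bin an integer column into at most max_bins bins; None marks a null value."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    span = hi - lo + 1                      # count of integers in [lo, hi]
    bin_size = math.ceil(span / max_bins)   # integers covered by each bin
    n_bins = math.ceil(span / bin_size)     # never exceeds max_bins
    counts = [0] * n_bins
    for v in present:
        counts[(v - lo) // bin_size] += 1
    bin_starts = [lo + i * bin_size for i in range(n_bins)]
    return {"bin_starts": bin_starts, "counts": counts}

narrow = int_histogram([1, 2, 2, 3, None])
wide = int_histogram(list(range(25)))
print(len(narrow["counts"]), len(wide["counts"]))
```

With this scheme a column holding only the values 1 to 3 gets three bins, while a wider range is capped at 10.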

string_label

If a string column has 30 or fewer unique values within the requested split, it is treated as a category. The following measures are returned:

  • number and proportion of null values

  • number of unique values (excluding null)

  • value counts for each label (excluding null)

string_text

If a string column has more than 30 unique values within the requested split, it is treated as a text column, and the response contains statistics over text lengths. The following measures are computed:

  • minimum, maximum, mean, and standard deviation of text lengths

  • number and proportion of null values

  • histogram of text lengths with 10 bins
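The key point for this type is that the statistics describe text lengths, not the texts themselves. The sketch below shows that step only and leaves out the histogram. It assumes length is counted in characters with len() and that None marks a null value; neither detail is confirmed by the text above, and the function and key names are hypothetical.

```python
import statistics

def text_length_measures(texts):
    """Summarize the lengths of a text column; None marks a null value.

    Assumes at least two non-null texts, since stdev needs them.
    """
    # Assumption: length means the number of characters.
    lengths = [len(text) for text in texts if text is not None]
    null_count = len(texts) - len(lengths)
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "std": statistics.stdev(lengths),  # sample standard deviation (assumption)
        "null_count": null_count,
        "null_proportion": null_count / len(texts),
    }

length_stats = text_length_measures(["a", "abc", None, "abcde"])
print(length_stats)
```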
