Explore dataset statistics

Explore statistics over split data

Datasets Server provides a /statistics endpoint that returns basic statistics precomputed for a requested dataset. This gives you quick insight into how the data is distributed.

Currently, statistics are computed only for datasets with Parquet exports.

The /statistics endpoint requires three query parameters:

dataset: the dataset name, for example glue
config: the configuration name, for example cola
split: the split name, for example train
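
The three parameters can also be assembled into a request URL programmatically. The following is a minimal sketch using only the standard library; the helper name build_statistics_url is hypothetical, and the base URL is the one used in the request example in this guide.

```python
from urllib.parse import urlencode

# Base URL taken from the request example in this guide.
BASE_URL = "https://datasets-server.boincai.com/statistics"


def build_statistics_url(dataset: str, config: str, split: str) -> str:
    """Return the /statistics URL for one dataset split (hypothetical helper)."""
    # urlencode escapes special characters such as "/" in namespaced dataset names.
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{BASE_URL}?{query}"


print(build_statistics_url("glue", "cola", "train"))
```

Building the query string with urlencode rather than by hand keeps the URL valid when a dataset or config name contains characters that need escaping.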

Let’s get some stats for the train split of the cola config of the glue dataset:

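import requests

API_TOKEN = "<your API token>"  # placeholder: set this to your own access token
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.boincai.com/statistics?dataset=glue&config=cola&split=train"

def query():
    # Request the precomputed statistics and decode the JSON response
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()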

The response JSON contains two keys:

num_examples - number of samples in the split
statistics - list of dictionaries, one per column; each dictionary has three keys: column_name, column_type, and column_statistics. The content of column_statistics depends on the column type; see Response structure by data type for details

{
  "num_examples": 8551,
  "statistics": [
    {
      "column_name": "idx",
      "column_type": "int",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 0,
        "max": 8550,
        "mean": 4275,
        "median": 4275,
        "std": 2468.60541,
        "histogram": {
          "hist": [
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            847
          ],
          "bin_edges": [
            0,
            856,
            1712,
            2568,
            3424,
            4280,
            5136,
            5992,
            6848,
            7704,
            8550
          ]
        }
      }
    },
    {
      "column_name": "label",
      "column_type": "class_label",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "no_label_count": 0,
        "no_label_proportion": 0,
        "n_unique": 2,
        "frequencies": {
          "unacceptable": 2528,
          "acceptable": 6023
        }
      }
    },
    {
      "column_name": "sentence",
      "column_type": "string_text",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 6,
        "max": 231,
        "mean": 40.70074,
        "median": 37,
        "std": 19.14431,
        "histogram": {
          "hist": [
            2260,
            4512,
            1262,
            380,
            102,
            26,
            6,
            1,
            1,
            1
          ],
          "bin_edges": [
            6,
            29,
            52,
            75,
            98,
            121,
            144,
            167,
            190,
            213,
            231
          ]
        }
      }
    }
  ]
}
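
Once the response is parsed, each entry of statistics can be handled according to its column_type. The sketch below lists every column with its type and null count. The function name list_columns is hypothetical, and the sample dictionary is a trimmed copy of the response above, not a full server response.

```python
def list_columns(response: dict) -> list[tuple[str, str, int]]:
    """Return (column_name, column_type, nan_count) for every column."""
    return [
        (
            column["column_name"],
            column["column_type"],
            column["column_statistics"]["nan_count"],
        )
        for column in response["statistics"]
    ]


# Trimmed copy of the response shown above (histograms and other fields omitted).
sample = {
    "num_examples": 8551,
    "statistics": [
        {"column_name": "idx", "column_type": "int",
         "column_statistics": {"nan_count": 0}},
        {"column_name": "label", "column_type": "class_label",
         "column_statistics": {"nan_count": 0}},
        {"column_name": "sentence", "column_type": "string_text",
         "column_statistics": {"nan_count": 0}},
    ],
}

for name, column_type, nan_count in list_columns(sample):
    print(f"{name}: {column_type}, {nan_count} nulls out of {sample['num_examples']}")
```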

Response structure by data type

Currently, statistics are supported for strings, float and integer numbers, and the special datasets.ClassLabel feature type of the datasets library.

column_type in the response can be one of the following values:

class_label - for datasets.ClassLabel feature
float - for float dtypes
int - for integer dtypes
string_label - for string dtypes, if a string column has 30 or fewer unique values in a given split
string_text - for string dtypes, if a string column has more than 30 unique values in a given split
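
The rule that separates the two string types can be sketched as follows. This illustrates the threshold as described here and is not the server's implementation; the function name is hypothetical, and excluding nulls from the unique count is an assumption.

```python
MAX_LABEL_UNIQUE_VALUES = 30  # threshold stated in this guide


def string_column_type(values: list) -> str:
    """Classify a string column as "string_label" or "string_text"."""
    # Assumption: nulls (None) do not count as a unique value.
    n_unique = len({v for v in values if v is not None})
    return "string_label" if n_unique <= MAX_LABEL_UNIQUE_VALUES else "string_text"


print(string_column_type(["yes", "no", "yes", None]))
print(string_column_type([f"sentence {i}" for i in range(31)]))
```

Because the count is taken per split, the same column could in principle be reported as string_label in a small split and string_text in a larger one.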

class_label

This type represents categorical data encoded as a ClassLabel feature. The following measures are computed:

number and proportion of null values
number and proportion of values with no label
number of unique values (excluding null and no label)
value counts for each label (excluding null and no label)
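
To make these measures concrete, here is a minimal sketch that derives them from raw encoded values. It assumes nulls are represented as None and values with no label as -1; both conventions, the function name, and the input data are illustrative assumptions rather than documented server behavior.

```python
from collections import Counter

NO_LABEL = -1  # assumed encoding for "no label"


def class_label_statistics(values: list, names: list[str]) -> dict:
    """Compute class_label measures from integer-encoded labels."""
    nan_count = sum(v is None for v in values)
    no_label_count = sum(v == NO_LABEL for v in values)
    # Count only real labels: skip nulls and the no-label marker.
    counts = Counter(v for v in values if v is not None and v != NO_LABEL)
    return {
        "nan_count": nan_count,
        "nan_proportion": nan_count / len(values),
        "no_label_count": no_label_count,
        "no_label_proportion": no_label_count / len(values),
        "n_unique": len(counts),
        "frequencies": {names[label]: count for label, count in counts.items()},
    }


stats = class_label_statistics([0, 1, 1, None, -1, 1], ["unacceptable", "acceptable"])
print(stats["frequencies"])
```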

float

The following measures are returned for float data types:

minimum, maximum, mean, median, and standard deviation values
number and proportion of null values
histogram with 10 bins
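
The sketch below shows one way these measures could be computed for a float column, using 10 equal-width bins. It is an illustration under stated assumptions: nulls are represented as None, at least two non-null values are present, and the sample standard deviation (statistics.stdev) is used because it reproduces the std value reported for the idx column in the response above. None of this is confirmed server behavior, and the function name is hypothetical.

```python
import statistics


def float_statistics(values: list, n_bins: int = 10) -> dict:
    """Compute summary measures and an equal-width histogram."""
    present = [v for v in values if v is not None]
    nan_count = len(values) - len(present)
    lo, hi = min(present), max(present)
    width = (hi - lo) / n_bins
    hist = [0] * n_bins
    for v in present:
        # Clamp the index so that the maximum value falls into the last bin.
        index = min(int((v - lo) / width), n_bins - 1) if width > 0 else 0
        hist[index] += 1
    return {
        "nan_count": nan_count,
        "nan_proportion": nan_count / len(values),
        "min": lo,
        "max": hi,
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),  # sample standard deviation (assumption)
        "histogram": {
            "hist": hist,
            "bin_edges": [lo + i * width for i in range(n_bins)] + [hi],
        },
    }


# Illustrative input: 0.0, 0.5, ..., 10.0 plus one null.
result = float_statistics([i / 2 for i in range(21)] + [None])
print(result["histogram"]["hist"])
```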

int

The following measures are returned for integer data types:

minimum, maximum, mean, median, and standard deviation values
number and proportion of null values
histogram with at most 10 bins
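
An integer histogram can have fewer than 10 bins because bin edges are themselves integers. The following sketch uses a binning rule inferred from the sample response above (bin width rounded up to a whole number, last bin closed at the maximum). It reproduces the bin_edges and hist of the idx column, but the rule is an inference, not documented behavior, and both function names are hypothetical.

```python
import math
from bisect import bisect_right


def int_bin_edges(min_value: int, max_value: int, n_bins: int = 10) -> list[int]:
    """Return integer bin edges covering [min_value, max_value]."""
    # Inferred rule: round the bin width up so that at most n_bins bins are needed.
    bin_size = math.ceil((max_value - min_value + 1) / n_bins)
    edges = list(range(min_value, max_value + 1, bin_size))
    return edges + [max_value]  # the last bin is closed at the maximum


def int_histogram(values, edges: list[int]) -> list[int]:
    """Count values per bin; every bin except the last is half-open."""
    hist = [0] * (len(edges) - 1)
    for v in values:
        # bisect_right locates the bin by its left edge; clamp so the
        # maximum value is counted in the last bin.
        hist[min(bisect_right(edges, v) - 1, len(hist) - 1)] += 1
    return hist


edges = int_bin_edges(0, 8550)
print(edges)
print(int_histogram(range(8551), edges))
```

Under this rule a column whose values span only 0 to 3 gets four bins rather than ten, since int_bin_edges(0, 3) yields the edges 0, 1, 2, 3, 3.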

string_label

If a string column has 30 or fewer unique values within the requested split, it is treated as a categorical column. The following measures are returned:

number and proportion of null values
number of unique values (excluding null)
value counts for each label (excluding null)

string_text

If a string column has more than 30 unique values within the requested split, it is treated as a text column, and the response contains statistics over text lengths. The following measures are computed:

minimum, maximum, mean, median, and standard deviation of text lengths
number and proportion of null values
histogram of text lengths with 10 bins
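
For text columns the statistics describe the lengths of the strings, not the strings themselves. A minimal sketch, assuming length means the number of characters and that nulls (None) are skipped; the function name and input data are hypothetical.

```python
import statistics


def text_length_statistics(texts: list) -> dict:
    """Summarize the character lengths of the non-null texts."""
    # Assumption: length is measured in characters, as returned by len().
    lengths = [len(t) for t in texts if t is not None]
    return {
        "nan_count": len(texts) - len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


length_stats = text_length_statistics(["a", "bbb", "cc", None])
print(length_stats)
```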


Last updated 1 year ago