Explore dataset statistics


Last updated 1 year ago

Explore statistics over split data

Datasets Server provides a /statistics endpoint for fetching basic statistics precomputed for a requested dataset. This gives you quick insight into how the data is distributed.

Currently, statistics are computed only for datasets with Parquet exports.

The /statistics endpoint requires three query parameters:

  • dataset: the dataset name, for example glue

  • config: the configuration name, for example cola

  • split: the split name, for example train

Let’s get some stats for the glue dataset, cola config, train split:


import requests

# Placeholder: replace with your own access token.
API_TOKEN = "YOUR_API_TOKEN"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.boincai.com/statistics?dataset=glue&config=cola&split=train"

def query():
    # Send the GET request and decode the JSON body.
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()
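
The query string in API_URL is just the three required parameters joined together. As a minimal sketch, assuming you want to target other datasets without editing the URL by hand, a small helper can assemble it with the standard library. The name build_statistics_url is hypothetical and not part of any library; the helper only builds the string and sends no request.

```python
from urllib.parse import urlencode

# Base endpoint taken from the example above.
STATISTICS_ENDPOINT = "https://datasets-server.boincai.com/statistics"


def build_statistics_url(dataset: str, config: str, split: str) -> str:
    """Return the /statistics URL for one dataset, config, and split.

    Hypothetical helper for illustration only.
    """
    params = {"dataset": dataset, "config": config, "split": split}
    # urlencode escapes characters such as "/" in namespaced dataset names.
    return f"{STATISTICS_ENDPOINT}?{urlencode(params)}"


url = build_statistics_url("glue", "cola", "train")
print(url)
```

For glue, cola, and train this yields the same URL as the hard-coded API_URL.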

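The response JSON contains two keys:

  • num_examples - number of samples in a split

  • statistics - list of dictionaries of statistics per each column; each dictionary has three keys: column_name, column_type, and column_statistics. The content of column_statistics depends on the column type; see Response structure by data type for more details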


{
  "num_examples": 8551,
  "statistics": [
    {
      "column_name": "idx",
      "column_type": "int",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 0,
        "max": 8550,
        "mean": 4275,
        "median": 4275,
        "std": 2468.60541,
        "histogram": {
          "hist": [
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            847
          ],
          "bin_edges": [
            0,
            856,
            1712,
            2568,
            3424,
            4280,
            5136,
            5992,
            6848,
            7704,
            8550
          ]
        }
      }
    },
    {
      "column_name": "label",
      "column_type": "class_label",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "no_label_count": 0,
        "no_label_proportion": 0,
        "n_unique": 2,
        "frequencies": {
          "unacceptable": 2528,
          "acceptable": 6023
        }
      }
    },
    {
      "column_name": "sentence",
      "column_type": "string_text",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 6,
        "max": 231,
        "mean": 40.70074,
        "median": 37,
        "std": 19.14431,
        "histogram": {
          "hist": [
            2260,
            4512,
            1262,
            380,
            102,
            26,
            6,
            1,
            1,
            1
          ],
          "bin_edges": [
            6,
            29,
            52,
            75,
            98,
            121,
            144,
            167,
            190,
            213,
            231
          ]
        }
      }
    }
  ]
}
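
One quick way to read this response is to check that the per-column counts add up. The sketch below is an illustration, not part of the API: it uses a trimmed copy of the response above, and check_counts is a hypothetical helper. It assumes that histogram bins and label frequencies together cover every non-null, labeled value.

```python
# Trimmed copy of the response above: only the fields the check needs.
data = {
    "num_examples": 8551,
    "statistics": [
        {
            "column_name": "idx",
            "column_type": "int",
            "column_statistics": {
                "nan_count": 0,
                "histogram": {"hist": [856] * 9 + [847]},
            },
        },
        {
            "column_name": "label",
            "column_type": "class_label",
            "column_statistics": {
                "nan_count": 0,
                "no_label_count": 0,
                "frequencies": {"unacceptable": 2528, "acceptable": 6023},
            },
        },
        {
            "column_name": "sentence",
            "column_type": "string_text",
            "column_statistics": {
                "nan_count": 0,
                "histogram": {"hist": [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1]},
            },
        },
    ],
}


def check_counts(response: dict) -> dict:
    """Map each column name to True if its counts sum to num_examples."""
    results = {}
    for column in response["statistics"]:
        stats = column["column_statistics"]
        # Null values (and unlabeled values, if reported) are counted separately.
        total = stats["nan_count"] + stats.get("no_label_count", 0)
        if "histogram" in stats:
            total += sum(stats["histogram"]["hist"])
        else:
            total += sum(stats["frequencies"].values())
        results[column["column_name"]] = total == response["num_examples"]
    return results


print(check_counts(data))  # {'idx': True, 'label': True, 'sentence': True}
```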

Response structure by data type

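Currently, statistics are supported for strings, float and integer numbers, and the special datasets.ClassLabel feature type of the datasets library.

The column_type in the response can be one of the following values:

  • class_label - for the datasets.ClassLabel feature

  • float - for float dtypes

  • int - for integer dtypes

  • string_label - for string dtypes, if a string column has 30 or fewer unique values in a given split

  • string_text - for string dtypes, if a string column has more than 30 unique values in a given split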

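class_label

This type represents categorical data encoded as a ClassLabel feature. The following measures are computed: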

  • number and proportion of null values

  • number and proportion of values with no label

  • number of unique values (excluding null and no label)

  • value counts for each label (excluding null and no label)
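
To make these measures concrete, here is a minimal sketch that computes them from a list of integer label ids. It is an illustration, not the server's code: class_label_stats is a hypothetical helper, and it assumes that null values appear as None and that a value with no label is encoded as -1.

```python
from collections import Counter

NO_LABEL = -1  # assumption: how a value with no label is encoded


def class_label_stats(values, names):
    """Compute class_label-style measures for a list of integer label ids.

    `values` may contain None (null) and NO_LABEL; `names` maps ids to names.
    """
    n = len(values)
    nan_count = sum(v is None for v in values)
    no_label_count = sum(v == NO_LABEL for v in values)
    # Keep only real labels for the unique count and the frequencies.
    labeled = [v for v in values if v is not None and v != NO_LABEL]
    counts = Counter(names[v] for v in labeled)
    return {
        "nan_count": nan_count,
        "nan_proportion": nan_count / n if n else 0.0,
        "no_label_count": no_label_count,
        "no_label_proportion": no_label_count / n if n else 0.0,
        "n_unique": len(counts),
        "frequencies": dict(counts),
    }


# Label names in this order are an assumption made for the example.
stats = class_label_stats([0, 1, 1, None, -1], ["unacceptable", "acceptable"])
print(stats)
```

The returned dictionary uses the same key names as the label column in the glue response shown earlier on this page.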


float

The following measures are returned for float data types:

  • minimum, maximum, mean, and standard deviation values

  • number and proportion of null values

  • histogram with 10 bins


int

The following measures are returned for integer data types:

  • minimum, maximum, mean, and standard deviation values

  • number and proportion of null values

  • histogram with at most 10 bins
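
The example response earlier on this page is consistent with bins of integer width: for idx (0 to 8550) the edges step by 856, which is 8551 / 10 rounded up. The sketch below reproduces that binning. The rule is inferred from the example values rather than taken from the server's implementation, and int_histogram is a hypothetical helper.

```python
import math


def int_histogram(values, max_bins=10):
    """Bin integers into at most `max_bins` bins of equal integer width."""
    lo, hi = min(values), max(values)
    # Smallest integer width that covers lo..hi with at most max_bins bins.
    width = math.ceil((hi - lo + 1) / max_bins)
    left_edges = list(range(lo, hi + 1, width))
    hist = [0] * len(left_edges)
    for v in values:
        hist[(v - lo) // width] += 1
    # The final edge is the maximum value itself.
    return hist, left_edges + [hi]


hist, bin_edges = int_histogram(range(8551))
print(hist)       # nine bins of 856 values, then one of 847
print(bin_edges)  # 0, 856, ..., 7704, 8550
```

With a narrow range, such as the values 1, 2, and 3, the same rule yields fewer than 10 bins.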


string_label

If a string column has 30 or fewer unique values within the requested split, it is considered to be a category. The following measures are returned:

  • number and proportion of null values

  • number of unique values (excluding null)

  • value counts for each label (excluding null)
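
The 30-value rule can be sketched in a few lines. This is an illustration only: string_column_stats is a hypothetical helper, and it assumes null values appear as None and are excluded before unique values are counted.

```python
from collections import Counter

MAX_LABEL_VALUES = 30  # threshold stated above


def string_column_stats(values):
    """Classify a string column and count its values, ignoring None."""
    counts = Counter(v for v in values if v is not None)
    if len(counts) <= MAX_LABEL_VALUES:
        column_type = "string_label"
    else:
        column_type = "string_text"
    return column_type, dict(counts)


column_type, frequencies = string_column_stats(["cat", "dog", None, "cat"])
print(column_type, frequencies)  # string_label {'cat': 2, 'dog': 1}
```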


string_text

If a string column has more than 30 unique values within the requested split, it is considered to be a column containing texts, and the response contains statistics over text lengths. The following measures are computed:

  • minimum, maximum, mean, and standard deviation of text lengths

  • number and proportion of null values

  • histogram of text lengths with 10 bins
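
A length histogram is easiest to read as cumulative shares. The sketch below uses the sentence column from the glue response earlier on this page; cumulative_shares is a hypothetical helper. It shows that about 79% of sentences fall in the first two bins, that is, are shorter than 52 characters if bins are assumed to be closed on the left and open on the right.

```python
from itertools import accumulate

# Histogram of sentence lengths from the glue/cola/train response.
hist = [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1]
bin_edges = [6, 29, 52, 75, 98, 121, 144, 167, 190, 213, 231]


def cumulative_shares(hist):
    """Return, for each bin, the share of values in that bin or an earlier one."""
    total = sum(hist)
    return [running / total for running in accumulate(hist)]


shares = cumulative_shares(hist)
# Pair each share with the right edge of its bin.
for right_edge, share in zip(bin_edges[1:], shares):
    print(f"up to bin edge {right_edge}: {share:.1%}")
```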

