# Explore dataset statistics

## Explore statistics over split data

Datasets Server provides a `/statistics` endpoint for fetching some basic statistics precomputed for a requested dataset. This will get you a quick insight on how the data is distributed.

Currently, statistics are computed only for [datasets with Parquet exports](https://huggingface.co/docs/datasets-server/parquet).

The `/statistics` endpoint requires three query parameters:

* `dataset`: the dataset name, for example `glue`
* `config`: the configuration name, for example `cola`
* `split`: the split name, for example `train`

Let’s get some stats for `glue` dataset, `cola` config, `train` split:

PythonJavaScriptcURLCopied

```
import requests
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.boincai.com/statistics?dataset=glue&config=cola&split=train"
def query():
    response = requests.get(API_URL, headers=headers)
    return response.json()
data = query()
```

The response JSON contains two keys:

* `num_examples` - number of samples in a split
* `statistics` - list of dictionaries of statistics per each column, each dictionary has three keys: `column_name`, `column_type`, and `column_statistics`. Content of `column_statistics` depends on a column type, see [Response structure by data types](https://huggingface.co/docs/datasets-server/statistics#response-structure-by-data-type) for more details

Copied

```
{
  "num_examples": 8551,
  "statistics": [
    {
      "column_name": "idx",
      "column_type": "int",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 0,
        "max": 8550,
        "mean": 4275,
        "median": 4275,
        "std": 2468.60541,
        "histogram": {
          "hist": [
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            856,
            847
          ],
          "bin_edges": [
            0,
            856,
            1712,
            2568,
            3424,
            4280,
            5136,
            5992,
            6848,
            7704,
            8550
          ]
        }
      }
    },
    {
      "column_name": "label",
      "column_type": "class_label",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "no_label_count": 0,
        "no_label_proportion": 0,
        "n_unique": 2,
        "frequencies": {
          "unacceptable": 2528,
          "acceptable": 6023
        }
      }
    },
    {
      "column_name": "sentence",
      "column_type": "string_text",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 6,
        "max": 231,
        "mean": 40.70074,
        "median": 37,
        "std": 19.14431,
        "histogram": {
          "hist": [
            2260,
            4512,
            1262,
            380,
            102,
            26,
            6,
            1,
            1,
            1
          ],
          "bin_edges": [
            6,
            29,
            52,
            75,
            98,
            121,
            144,
            167,
            190,
            213,
            231
          ]
        }
      }
    }
  ]
}
```

### Response structure by data type

Currently, statistics are supported for strings, float and integer numbers, and the special [`datasets.ClassLabel`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.ClassLabel) feature type of the [`datasets`](https://huggingface.co/docs/datasets/) library.

`column_type` in response can be one of the following values:

* `class_label` - for [`datasets.ClassLabel`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.ClassLabel) feature
* `float` - for float dtypes
* `int` - for integer dtypes
* `string_label` - for string dtypes, if there are less than or equal to 30 unique values in a string column in a given split
* `string_text` - for string dtypes, if there are more than 30 unique values in a string column in a given split

#### class\_label

This type represents categorical data encoded as [`ClassLabel`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.ClassLabel) feature. The following measures are computed:

* number and proportion of `null` values
* number and proportion of values with no label
* number of unique values (excluding `null` and `no label`)
* value counts for each label (excluding `null` and `no label`)

<details>

<summary>Example</summary>

```
```

</details>

#### float

The following measures are returned for float data types:

* minimum, maximum, mean, and standard deviation values
* number and proportion of `null` values
* histogram with 10 bins

<details>

<summary>Example</summary>

```
```

</details>

#### int

The following measures are returned for integer data types:

* minimum, maximum, mean, and standard deviation values
* number and proportion of `null` values
* histogram with less than or equal to 10 bins

<details>

<summary>Example</summary>

```
```

</details>

#### string\_label

If string column has less than or equal to 30 unique values within the requested split, it is considered to be a category. The following measures are returned:

* number and proportion of `null` values
* number of unique values (excluding `null`)
* value counts for each label (excluding `null`)

<details>

<summary>Example</summary>

```
```

</details>

#### string\_text

If string column has more than 30 unique values within the requested split, it is considered to be a column containing texts and response contains statistics over text lengths. The following measures are computed:

* minimum, maximum, mean, and standard deviation of text lengths
* number and proportion of `null` values
* histogram of text lengths with 10 bins

<details>

<summary>Example</summary>

```
```

</details>

[←Get the number of rows and the bytes size](https://huggingface.co/docs/datasets-server/size)[Overview→](https://huggingface.co/docs/datasets-server/parquet_process)[Explore statistics over split data](https://huggingface.co/docs/datasets-server/statistics#explore-statistics-over-split-data)[Response structure by data type](https://huggingface.co/docs/datasets-server/statistics#response-structure-by-data-type)[class\_label](https://huggingface.co/docs/datasets-server/statistics#classlabel)[float](https://huggingface.co/docs/datasets-server/statistics#float)[int](https://huggingface.co/docs/datasets-server/statistics#int)[string\_label](https://huggingface.co/docs/datasets-server/statistics#stringlabel)[string\_text](https://huggingface.co/docs/datasets-server/statistics#stringtext)
