Explore dataset statistics
Explore statistics over split data
Datasets Server provides a /statistics endpoint for fetching some basic statistics precomputed for a requested dataset. This gives you quick insight into how the data is distributed.
Currently, statistics are computed only for datasets with Parquet exports.
The /statistics endpoint requires three query parameters:
dataset: the dataset name, for example glue
config: the configuration name, for example cola
split: the split name, for example train
Let’s get some stats for the glue dataset, cola config, train split:
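import requests

API_TOKEN = "YOUR_API_TOKEN"  # placeholder: replace with your own access token
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.boincai.com/statistics?dataset=glue&config=cola&split=train"

def query():
    # Send the GET request and return the parsed JSON body
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()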
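The hard-coded URL above can also be assembled from the three query parameters. The following is a minimal sketch using only the standard library; the helper name build_statistics_url is hypothetical and not part of any library, and the base URL is the one from the example above.

```python
from urllib.parse import urlencode

# Base URL taken from the example above.
BASE_URL = "https://datasets-server.boincai.com/statistics"


def build_statistics_url(dataset: str, config: str, split: str) -> str:
    """Return the /statistics URL for one split (hypothetical helper)."""
    # urlencode escapes characters such as "/" that can appear in dataset names.
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{BASE_URL}?{query}"


url = build_statistics_url("glue", "cola", "train")
print(url)
```

If you already use requests, passing the same dictionary through the params argument of requests.get should give a similarly encoded URL.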
The response JSON contains two keys:
num_examples - number of samples in a split
statistics - list of dictionaries of statistics, one per column. Each dictionary has three keys: column_name, column_type, and column_statistics. The content of column_statistics depends on the column type; see Response structure by data type for more details.
{
  "num_examples": 8551,
  "statistics": [
    {
      "column_name": "idx",
      "column_type": "int",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 0,
        "max": 8550,
        "mean": 4275,
        "median": 4275,
        "std": 2468.60541,
        "histogram": {
          "hist": [856, 856, 856, 856, 856, 856, 856, 856, 856, 847],
          "bin_edges": [0, 856, 1712, 2568, 3424, 4280, 5136, 5992, 6848, 7704, 8550]
        }
      }
    },
    {
      "column_name": "label",
      "column_type": "class_label",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "no_label_count": 0,
        "no_label_proportion": 0,
        "n_unique": 2,
        "frequencies": {
          "unacceptable": 2528,
          "acceptable": 6023
        }
      }
    },
    {
      "column_name": "sentence",
      "column_type": "string_text",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 6,
        "max": 231,
        "mean": 40.70074,
        "median": 37,
        "std": 19.14431,
        "histogram": {
          "hist": [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1],
          "bin_edges": [6, 29, 52, 75, 98, 121, 144, 167, 190, 213, 231]
        }
      }
    }
  ]
}
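To illustrate how such a response might be consumed, the sketch below walks the statistics list and builds one summary line per column. The data dictionary is a trimmed, hand-copied subset of the response above (histograms and some fields omitted) so that the snippet runs without a network call; summarize is a hypothetical helper, not part of any library.

```python
# Trimmed copy of the response above; histograms are left out for brevity.
data = {
    "num_examples": 8551,
    "statistics": [
        {"column_name": "idx", "column_type": "int",
         "column_statistics": {"nan_count": 0, "min": 0, "max": 8550}},
        {"column_name": "label", "column_type": "class_label",
         "column_statistics": {"nan_count": 0, "n_unique": 2,
                               "frequencies": {"unacceptable": 2528,
                                               "acceptable": 6023}}},
        {"column_name": "sentence", "column_type": "string_text",
         "column_statistics": {"nan_count": 0, "min": 6, "max": 231}},
    ],
}


def summarize(response: dict) -> list:
    """Return one human-readable line per column (hypothetical helper)."""
    lines = []
    for column in response["statistics"]:
        stats = column["column_statistics"]
        if "frequencies" in stats:
            # Categorical columns: report the most frequent label.
            top = max(stats["frequencies"], key=stats["frequencies"].get)
            detail = f"{stats['n_unique']} unique values, most frequent: {top}"
        else:
            # Numerical and text columns: report the value (or length) range.
            detail = f"range {stats['min']} to {stats['max']}"
        lines.append(f"{column['column_name']} ({column['column_type']}): {detail}")
    return lines


for line in summarize(data):
    print(line)
```

Branching on the presence of the frequencies key, rather than on column_type, is a design choice of this sketch: it treats any column that reports value counts the same way.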
Response structure by data type
Currently, statistics are supported for strings, float and integer numbers, and the special datasets.ClassLabel feature type of the datasets library.
column_type in the response can be one of the following values:
class_label - for the datasets.ClassLabel feature
float - for float dtypes
int - for integer dtypes
string_label - for string dtypes, if there are 30 or fewer unique values in a string column in a given split
string_text - for string dtypes, if there are more than 30 unique values in a string column in a given split
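The rule that separates string_label from string_text can be sketched as follows. This is an illustration of the threshold described above, not the server's implementation; the function name string_column_type is hypothetical, and excluding null values from the unique count is an assumption.

```python
MAX_LABEL_UNIQUE_VALUES = 30  # threshold stated in the list above


def string_column_type(values) -> str:
    """Classify a string column by its number of unique values (sketch)."""
    # Assumption: null (None) values do not count as a unique value.
    unique = {v for v in values if v is not None}
    if len(unique) <= MAX_LABEL_UNIQUE_VALUES:
        return "string_label"
    return "string_text"


print(string_column_type(["yes", "no", "yes", None]))
print(string_column_type([f"sentence {i}" for i in range(100)]))
```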
class_label
This type represents categorical data encoded as a ClassLabel feature. The following measures are computed:
number and proportion of null values
number and proportion of values with no label
number of unique values (excluding null and no label)
value counts for each label (excluding null and no label)
float
The following measures are returned for float data types:
minimum, maximum, mean, and standard deviation values
number and proportion of null values
histogram with 10 bins
int
The following measures are returned for integer data types:
minimum, maximum, mean, median, and standard deviation values
number and proportion of null values
histogram with at most 10 bins
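The bin edges in the example response for the glue/cola train split are consistent with an integer bin width of ceil((max - min + 1) / 10), with the final edge set to the maximum. The sketch below reproduces that pattern. It is inferred from the example response only and is not confirmed to be how the server computes histograms; int_bin_edges is a hypothetical name, and the function assumes maximum is greater than minimum.

```python
import math


def int_bin_edges(minimum: int, maximum: int, n_bins: int = 10) -> list:
    """Return integer bin edges covering [minimum, maximum] (sketch)."""
    # Number of distinct integers in the closed range [minimum, maximum].
    n_values = maximum - minimum + 1
    # Integer bin width, rounded up so that n_bins bins cover the whole range.
    width = math.ceil(n_values / n_bins)
    edges = list(range(minimum, maximum, width))
    edges.append(maximum)  # the final edge is the maximum itself
    return edges


print(int_bin_edges(0, 8550))  # the idx column of the example
print(int_bin_edges(6, 231))   # text lengths of the sentence column
```

For narrow ranges this function can return fewer than 10 bins, which matches the "at most 10 bins" wording but is otherwise an untested guess about the server's behavior.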
string_label
If a string column has 30 or fewer unique values within the requested split, it is considered to be a category. The following measures are returned:
number and proportion of null values
number of unique values (excluding null)
value counts for each label (excluding null)
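As a rough illustration, the measures listed above could be computed locally like this. The key names (nan_count, nan_proportion, n_unique, frequencies) are borrowed from the class_label entry of the example response and are assumed, not confirmed, to be the same for string_label columns; string_label_stats is a hypothetical helper.

```python
from collections import Counter


def string_label_stats(values) -> dict:
    """Compute category-style statistics for a list of strings (sketch)."""
    non_null = [v for v in values if v is not None]
    nan_count = len(values) - len(non_null)
    # Value counts per label; null values are excluded.
    frequencies = Counter(non_null)
    return {
        "nan_count": nan_count,
        "nan_proportion": nan_count / len(values),
        "n_unique": len(frequencies),
        "frequencies": dict(frequencies),
    }


print(string_label_stats(["a", "b", "a", None]))
```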
string_text
If a string column has more than 30 unique values within the requested split, it is considered to be a column containing texts, and the response contains statistics over text lengths. The following measures are computed:
minimum, maximum, mean, median, and standard deviation of text lengths
number and proportion of null values
histogram of text lengths with 10 bins
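A minimal sketch of computing the same kind of text-length statistics locally, using only the standard library, is shown below. The function name text_length_stats is hypothetical. Using the sample standard deviation (statistics.stdev) is an assumption: it matches the std reported for the idx column in the example response (the integers 0 to 8550), but the server's exact method is not stated here. The histogram is omitted.

```python
import statistics


def text_length_stats(texts) -> dict:
    """Compute length statistics for a list of strings (sketch).

    Needs at least two non-null texts, because the sample standard
    deviation is undefined for fewer data points.
    """
    # Text lengths in characters; null values are skipped and counted apart.
    lengths = [len(t) for t in texts if t is not None]
    nan_count = len(texts) - len(lengths)
    return {
        "nan_count": nan_count,
        "nan_proportion": nan_count / len(texts),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "std": statistics.stdev(lengths),
    }


print(text_length_stats(["ab", "abcd", None, "abcdef"]))
```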