Explore dataset statistics
Explore statistics over split data
Datasets Server provides a /statistics endpoint for fetching basic statistics precomputed for a requested dataset. These give you quick insight into how the data is distributed.
Currently, statistics are computed only for datasets with Parquet exports.
The /statistics endpoint requires three query parameters:
dataset: the dataset name, for example glue
config: the configuration name, for example cola
split: the split name, for example train
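As a minimal sketch, the three parameters can be assembled into a request URL with the standard library. The helper name build_statistics_url is hypothetical and only used for illustration; the base URL is the one used in this page's example.

```python
from urllib.parse import urlencode

# Base URL taken from the example request on this page.
BASE_URL = "https://datasets-server.boincai.com/statistics"


def build_statistics_url(dataset: str, config: str, split: str) -> str:
    """Return the /statistics URL for one dataset split (hypothetical helper)."""
    # urlencode escapes special characters and keeps the insertion order.
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{BASE_URL}?{query}"


url = build_statistics_url("glue", "cola", "train")
print(url)
```

Encoding the parameters this way keeps the URL valid when a dataset name contains characters such as a slash.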
Let's get some stats for the glue dataset, cola config, train split:
import requests

# API_TOKEN must hold your access token; define it before running.
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.boincai.com/statistics?dataset=glue&config=cola&split=train"

def query():
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()
The response JSON contains two keys:
num_examples - number of samples in a split
statistics - list of dictionaries of statistics, one per column; each dictionary has three keys: column_name, column_type, and column_statistics. The content of column_statistics depends on the column type; see Response structure by data type for more details.
{
  "num_examples": 8551,
  "statistics": [
    {
      "column_name": "idx",
      "column_type": "int",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 0,
        "max": 8550,
        "mean": 4275,
        "median": 4275,
        "std": 2468.60541,
        "histogram": {
          "hist": [856, 856, 856, 856, 856, 856, 856, 856, 856, 847],
          "bin_edges": [0, 856, 1712, 2568, 3424, 4280, 5136, 5992, 6848, 7704, 8550]
        }
      }
    },
    {
      "column_name": "label",
      "column_type": "class_label",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "no_label_count": 0,
        "no_label_proportion": 0,
        "n_unique": 2,
        "frequencies": {
          "unacceptable": 2528,
          "acceptable": 6023
        }
      }
    },
    {
      "column_name": "sentence",
      "column_type": "string_text",
      "column_statistics": {
        "nan_count": 0,
        "nan_proportion": 0,
        "min": 6,
        "max": 231,
        "mean": 40.70074,
        "median": 37,
        "std": 19.14431,
        "histogram": {
          "hist": [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1],
          "bin_edges": [6, 29, 52, 75, 98, 121, 144, 167, 190, 213, 231]
        }
      }
    }
  ]
}
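To illustrate how the two top-level keys fit together, the following sketch walks the statistics list and maps each column to its type. The data dictionary is a trimmed copy of the sample response above (histograms and some fields omitted), and column_types is a hypothetical helper, not part of any library.

```python
# Trimmed copy of the sample response above; only a few fields are kept.
data = {
    "num_examples": 8551,
    "statistics": [
        {"column_name": "idx", "column_type": "int",
         "column_statistics": {"nan_count": 0, "min": 0, "max": 8550}},
        {"column_name": "label", "column_type": "class_label",
         "column_statistics": {"nan_count": 0, "n_unique": 2}},
        {"column_name": "sentence", "column_type": "string_text",
         "column_statistics": {"nan_count": 0, "min": 6, "max": 231}},
    ],
}


def column_types(response: dict) -> dict:
    """Map each column name to its reported column_type (hypothetical helper)."""
    return {col["column_name"]: col["column_type"] for col in response["statistics"]}


types = column_types(data)
print(f"{data['num_examples']} examples")
for name, kind in types.items():
    print(f"{name}: {kind}")
```

Branching on column_type before reading column_statistics is a reasonable pattern, since the available keys differ by type.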
Response structure by data type
Currently, statistics are supported for strings, float and integer numbers, and the special datasets.ClassLabel feature type of the datasets library.
column_type in the response can be one of the following values:
class_label - for the datasets.ClassLabel feature
float - for float dtypes
int - for integer dtypes
string_label - for string dtypes, if there are 30 or fewer unique values in a string column in a given split
string_text - for string dtypes, if there are more than 30 unique values in a string column in a given split
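The rule separating string_label from string_text can be sketched locally as below. This is an illustration of the 30-value threshold described above, not the server's actual implementation; the function name is hypothetical, and ignoring None when counting unique values is an assumption.

```python
MAX_LABEL_VALUES = 30  # threshold stated in the text above


def string_column_type(values) -> str:
    """Classify a string column as 'string_label' or 'string_text'.

    Assumption: null (None) entries are not counted as a unique value.
    """
    unique_values = {v for v in values if v is not None}
    if len(unique_values) <= MAX_LABEL_VALUES:
        return "string_label"
    return "string_text"


few = ["yes", "no", "yes", None]
many = [f"sentence {i}" for i in range(31)]
print(string_column_type(few))
print(string_column_type(many))
```

Because the count is taken per split, the same column could in principle be reported with different types in different splits.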
class_label
This type represents categorical data encoded as a ClassLabel feature. The following measures are computed:
number and proportion of null values
number and proportion of values with no label
number of unique values (excluding null and no label)
value counts for each label (excluding null and no label)
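As a small worked example, the value counts in frequencies can be turned into per-label shares. The column_statistics dictionary below copies the label column from the sample response on this page; label_shares is a hypothetical helper.

```python
# Values copied from the "label" column in the sample response on this page.
column_statistics = {
    "nan_count": 0,
    "no_label_count": 0,
    "n_unique": 2,
    "frequencies": {"unacceptable": 2528, "acceptable": 6023},
}


def label_shares(stats: dict) -> dict:
    """Return each label's share of the labelled values (hypothetical helper)."""
    freqs = stats["frequencies"]
    total = sum(freqs.values())
    if total == 0:
        # No labelled values at all: avoid dividing by zero.
        return {label: 0.0 for label in freqs}
    return {label: count / total for label, count in freqs.items()}


shares = label_shares(column_statistics)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Since the counts exclude null and no-label values, the shares are relative to labelled values only, not to num_examples.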
float
The following measures are returned for float data types:
minimum, maximum, mean, and standard deviation values
number and proportion of null values
histogram with 10 bins
int
The following measures are returned for integer data types:
minimum, maximum, mean, median, and standard deviation values
number and proportion of null values
histogram with 10 or fewer bins
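The histogram object pairs a list of counts (hist) with a list of bin boundaries (bin_edges) that is one element longer. The sketch below combines them into (lower edge, upper edge, count) rows, using the idx column from the sample response on this page; histogram_bins is a hypothetical helper, and whether each bin includes its upper edge is not stated here, so the code makes no assumption about it.

```python
# Histogram copied from the "idx" column in the sample response on this page.
histogram = {
    "hist": [856, 856, 856, 856, 856, 856, 856, 856, 856, 847],
    "bin_edges": [0, 856, 1712, 2568, 3424, 4280, 5136, 5992, 6848, 7704, 8550],
}


def histogram_bins(hist: dict) -> list:
    """Pair consecutive bin edges with their counts (hypothetical helper)."""
    edges = hist["bin_edges"]
    # Consecutive edges (edges[i], edges[i + 1]) delimit bin i.
    return list(zip(edges[:-1], edges[1:], hist["hist"]))


bins = histogram_bins(histogram)
for low, high, count in bins:
    print(f"{low:>5} to {high:>5}: {count}")
```

For this column the counts add up to num_examples, which is expected when nan_count is 0.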
string_label
If a string column has 30 or fewer unique values within the requested split, it is considered to be a category. The following measures are returned:
number and proportion of null values
number of unique values (excluding null)
value counts for each label (excluding null)
string_text
If a string column has more than 30 unique values within the requested split, it is considered to be a column containing texts, and the response contains statistics over text lengths. The following measures are computed:
minimum, maximum, mean, median, and standard deviation of text lengths
number and proportion of null values
histogram of text lengths with 10 bins
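To make "statistics over text lengths" concrete, here is a rough local approximation using Python's statistics module. It is a sketch under assumptions, not the server's implementation: text length is taken as the character count from len(), and the standard deviation is the sample (n - 1) form. The function name is hypothetical.

```python
import statistics


def text_length_summary(texts: list) -> dict:
    """Summarise text lengths, ignoring null (None) entries.

    Assumptions: length is the character count, and std is the sample
    standard deviation. Needs at least two non-null texts.
    """
    lengths = [len(t) for t in texts if t is not None]
    nan_count = len(texts) - len(lengths)
    return {
        "nan_count": nan_count,
        "nan_proportion": nan_count / len(texts),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "std": statistics.stdev(lengths),
    }


summary = text_length_summary(["abc", "hello", None, "a"])
print(summary)
```

A local summary like this can serve as a sanity check on a small sample, but the precomputed values from the endpoint remain the reference.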