Quickstart

Last updated 1 year ago

In this quickstart, you’ll learn how to use the Datasets Server’s REST API to:

  • Check whether a dataset on the Hub is functional.
  • Return the configurations and splits of a dataset.
  • Preview the first 100 rows of a dataset.
  • Download slices of rows of a dataset.
  • Search for a word in a dataset.
  • Access the dataset as Parquet files.
  • Get the size of a dataset.

Each feature is served through an endpoint summarized in the table below:

Endpoint    Method  Description                                              Query parameters
/is-valid   GET     Check whether a specific dataset is valid.               dataset: name of the dataset
/splits     GET     Get the list of configurations and splits of a dataset.  dataset: name of the dataset
/first-rows GET     Get the first rows of a dataset split.                   dataset: name of the dataset; config: name of the config; split: name of the split
/rows       GET     Get a slice of rows of a dataset split.                  dataset: name of the dataset; config: name of the config; split: name of the split; offset: offset of the slice; length: length of the slice (maximum 100)
/search     GET     Search text in a dataset split.                          dataset: name of the dataset; config: name of the config; split: name of the split; query: text to search for
/parquet    GET     Get the list of Parquet files of a dataset.              dataset: name of the dataset
/size       GET     Get the size of a dataset.                               dataset: name of the dataset

There is no installation or setup required to use Datasets Server.

The base URL of the REST API is:

https://datasets-server.boincai.com
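
Every endpoint is reached by appending its path and query parameters to this base URL. As a minimal sketch, the hypothetical helper `build_url` below (an illustration, not part of any official client) assembles such URLs with the standard library. It only builds the string and sends no request.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.boincai.com"

def build_url(endpoint, **params):
    """Join the base URL, an endpoint path and URL-encoded query parameters."""
    # build_url is a hypothetical helper for illustration, not an official API.
    return f"{BASE_URL}/{endpoint.lstrip('/')}?{urlencode(params)}"

print(build_url("is-valid", dataset="rotten_tomatoes"))
print(build_url("rows", dataset="rotten_tomatoes", config="default",
                split="train", offset=150, length=10))
```

The resulting string can be passed to requests.get in the same way as the API_URL constants used throughout this page.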

Gated datasets

For gated datasets, you’ll need to provide your user token in the headers of your request. Otherwise, you’ll get an error message asking you to retry with authentication.

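import requests
# Placeholder: replace with your own user token (found in your user settings).
API_TOKEN = "YOUR_USER_TOKEN"
# The token is sent as a Bearer token in the Authorization header.
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.boincai.com/is-valid?dataset=mozilla-foundation/common_voice_10_0"
def query():
    response = requests.get(API_URL, headers=headers)
    return response.json()
data = query()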

You’ll see the following error if you’re trying to access a gated dataset without providing your user token:

print(data)
{'error': 'The dataset does not exist, or is not accessible without authentication (private or gated). Please check the spelling of the dataset name or retry with authentication.'}
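
Because the error arrives as a JSON body rather than as a Python exception, it can help to check for it before using the result. The sketch below assumes that error responses are JSON objects carrying an "error" key, as in the example above. `get_error` is a hypothetical helper name.

```python
def get_error(payload):
    """Return the error message from a response body, or None if there is none."""
    # Assumption: error responses are JSON objects with an "error" key,
    # as in the example shown above.
    if isinstance(payload, dict):
        return payload.get("error")
    return None

sample = {
    "error": "The dataset does not exist, or is not accessible without "
             "authentication (private or gated). Please check the spelling "
             "of the dataset name or retry with authentication."
}

message = get_error(sample)
if message is not None:
    print("Request failed:", message)
```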

Check dataset validity

import requests
API_URL = "https://datasets-server.boincai.com/is-valid?dataset=rotten_tomatoes"
def query():
    response = requests.get(API_URL)
    return response.json()
data = query()

This returns whether the dataset provides a preview (see /first-rows), the viewer (see /rows) and the search (see /search):

{ "preview": true, "viewer": true, "search": true }
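
One way to act on this response is to list which features are switched on before calling the matching endpoints. This is a small sketch over the three flags shown above. `available_features` is a hypothetical helper name.

```python
def available_features(validity):
    """List the features whose flag is true in an /is-valid response."""
    # Only the three keys shown in the example response are inspected.
    return [name for name in ("preview", "viewer", "search") if validity.get(name)]

validity = {"preview": True, "viewer": True, "search": True}
print(available_features(validity))
```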

List configurations and splits

The /splits endpoint returns a JSON list of the splits in a dataset:

import requests
API_URL = "https://datasets-server.boincai.com/splits?dataset=rotten_tomatoes"
def query():
    response = requests.get(API_URL)
    return response.json()
data = query()

This returns the available configuration and splits in the dataset:

{
  "splits": [
    { "dataset": "rotten_tomatoes", "config": "default", "split": "train" },
    {
      "dataset": "rotten_tomatoes",
      "config": "default",
      "split": "validation"
    },
    { "dataset": "rotten_tomatoes", "config": "default", "split": "test" }
  ],
  "pending": [],
  "failed": []
}
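
Since each entry pairs a configuration with a split, a common follow-up is to group the split names by configuration. The sketch below does this on a copy of the response above. `splits_by_config` is a hypothetical helper name.

```python
from collections import defaultdict

def splits_by_config(response):
    """Map each configuration name to the list of its split names."""
    grouped = defaultdict(list)
    for item in response["splits"]:
        grouped[item["config"]].append(item["split"])
    return dict(grouped)

# Copy of the /splits response shown above.
response = {
    "splits": [
        {"dataset": "rotten_tomatoes", "config": "default", "split": "train"},
        {"dataset": "rotten_tomatoes", "config": "default", "split": "validation"},
        {"dataset": "rotten_tomatoes", "config": "default", "split": "test"},
    ],
    "pending": [],
    "failed": [],
}

print(splits_by_config(response))
```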

Preview a dataset

The /first-rows endpoint returns a JSON list of the first 100 rows of a dataset. It also returns the types of data features (“columns” data types). You should specify the dataset name, configuration name (you can find out the configuration name from the /splits endpoint), and split name of the dataset you’d like to preview:

import requests
API_URL = "https://datasets-server.boincai.com/first-rows?dataset=rotten_tomatoes&config=default&split=train"
def query():
    response = requests.get(API_URL)
    return response.json()
data = query()

This returns the first 100 rows of the dataset:

{
  "dataset": "rotten_tomatoes",
  "config": "default",
  "split": "train",
  "features": [
    {
      "feature_idx": 0,
      "name": "text",
      "type": { "dtype": "string", "_type": "Value" }
    },
    {
      "feature_idx": 1,
      "name": "label",
      "type": { "names": ["neg", "pos"], "_type": "ClassLabel" }
    }
  ],
  "rows": [
    {
      "row_idx": 0,
      "row": {
        "text": "the rock is destined to be the 21st century's new \" conan \" and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .",
        "label": 1
      },
      "truncated_cells": []
    },
    {
      "row_idx": 1,
      "row": {
        "text": "the gorgeously elaborate continuation of \" the lord of the rings \" trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .",
        "label": 1
      },
      "truncated_cells": []
    },
    ...,
    ...
  ]
}
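
The "features" list makes it possible to interpret the raw values in "rows". For instance, the integer label can be mapped back to its class name through the ClassLabel "names" list. The sketch below assumes the response shape shown above. The row text is a short made-up stand-in, and `label_names` and `decode_rows` are hypothetical helper names.

```python
def label_names(response, column="label"):
    """Return the class names of a ClassLabel column, or None if it is not one."""
    for feature in response["features"]:
        if feature["name"] == column and feature["type"].get("_type") == "ClassLabel":
            return feature["type"]["names"]
    return None

def decode_rows(response, column="label"):
    """Return the rows as plain dicts, with class indices replaced by names."""
    names = label_names(response, column)
    decoded = []
    for item in response["rows"]:
        row = dict(item["row"])  # copy so the response is left untouched
        if names is not None:
            row[column] = names[row[column]]
        decoded.append(row)
    return decoded

# Same structure as the response above; the text is illustrative only.
response = {
    "features": [
        {"feature_idx": 0, "name": "text", "type": {"dtype": "string", "_type": "Value"}},
        {"feature_idx": 1, "name": "label", "type": {"names": ["neg", "pos"], "_type": "ClassLabel"}},
    ],
    "rows": [
        {"row_idx": 0, "row": {"text": "an illustrative review", "label": 1}, "truncated_cells": []},
    ],
}

print(decode_rows(response))
```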

Download slices of a dataset

The /rows endpoint returns a JSON list of a slice of rows of a dataset at any given location (offset). It also returns the types of data features (“columns” data types). You should specify the dataset name, configuration name (you can find out the configuration name from the /splits endpoint), the split name and the offset and length of the slice you’d like to download:

import requests
API_URL = "https://datasets-server.boincai.com/rows?dataset=rotten_tomatoes&config=default&split=train&offset=150&length=10"
def query():
    response = requests.get(API_URL)
    return response.json()
data = query()

You can download slices of 100 rows maximum at a time.

The response looks like:

{
  "features": [
    {
      "feature_idx": 0,
      "name": "text",
      "type": { "dtype": "string", "_type": "Value" }
    },
    {
      "feature_idx": 1,
      "name": "label",
      "type": { "names": ["neg", "pos"], "_type": "ClassLabel" }
    }
  ],
  "rows": [
    {
      "row_idx": 150,
      "row": {
        "text": "enormously likable , partly because it is aware of its own grasp of the absurd .",
        "label": 1
      },
      "truncated_cells": []
    },
    {
      "row_idx": 151,
      "row": {
        "text": "here's a british flick gleefully unconcerned with plausibility , yet just as determined to entertain you .",
        "label": 1
      },
      "truncated_cells": []
    },
    ...,
    ...
  ],
  "num_rows_total": 8530,
  "num_rows_per_page": 100
}
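
Because each request returns at most 100 rows, reading a whole split means issuing several requests with increasing offsets. As a sketch, the (offset, length) pairs can be derived from "num_rows_total". `page_ranges` is a hypothetical helper name, and no request is sent here.

```python
def page_ranges(num_rows_total, page_size=100):
    """Yield (offset, length) pairs covering all rows, at most page_size each."""
    for offset in range(0, num_rows_total, page_size):
        yield offset, min(page_size, num_rows_total - offset)

# 8530 is the num_rows_total value from the response above.
ranges = list(page_ranges(8530))
print(len(ranges), "requests needed")
print("first:", ranges[0], "last:", ranges[-1])
```

Each pair can then be used as the offset and length query parameters of a /rows request.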

Search text in a dataset

The /search endpoint returns a JSON list of a slice of rows of a dataset that match a text query. The text is searched in the columns of type string, even if the values are nested in a dictionary. It also returns the types of data features (“columns” data types). The response format is the same as the /rows endpoint. You should specify the dataset name, configuration name (you can find out the configuration name from the /splits endpoint), the split name and the search query you’d like to find in the text columns:

import requests
API_URL = "https://datasets-server.boincai.com/search?dataset=rotten_tomatoes&config=default&split=train&query=cat"
def query():
    response = requests.get(API_URL)
    return response.json()
data = query()

You can get slices of 100 rows maximum at a time, and you can ask for other slices using the offset and length parameters, as for the /rows endpoint.

The response looks like:

{
  "features": [
    {
      "feature_idx": 0,
      "name": "text",
      "type": { "dtype": "string", "_type": "Value" }
    },
    {
      "feature_idx": 1,
      "name": "label",
      "type": { "dtype": "int64", "_type": "Value" }
    }
  ],
  "rows": [
    {
      "row_idx": 9,
      "row": {
        "text": "take care of my cat offers a refreshingly different slice of asian cinema .",
        "label": 1
      },
      "truncated_cells": []
    },
    {
      "row_idx": 472,
      "row": {
        "text": "[ \" take care of my cat \" ] is an honestly nice little film that takes us on an examination of young adult life in urban south korea through the hearts and minds of the five principals .",
        "label": 1
      },
      "truncated_cells": []
    },
    ...,
    ...
  ],
  "num_rows_total": 12,
  "num_rows_per_page": 100
}
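
Queries that contain spaces or punctuation need to be URL-encoded, and further result pages are requested with the offset and length parameters mentioned above. The sketch below builds such a URL with the standard library. `search_url` is a hypothetical helper name, and no request is sent.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://datasets-server.boincai.com"

def search_url(dataset, config, split, query, offset=0, length=100):
    """Build a /search URL for one page of results (length is capped at 100)."""
    params = {
        "dataset": dataset,
        "config": config,
        "split": split,
        "query": query,
        "offset": offset,
        "length": min(length, 100),
    }
    # quote_via=quote encodes spaces as %20 rather than "+".
    return f"{BASE_URL}/search?{urlencode(params, quote_via=quote)}"

print(search_url("rotten_tomatoes", "default", "train", "take care"))
```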

Access Parquet files

import requests
API_URL = "https://datasets-server.boincai.com/parquet?dataset=rotten_tomatoes"
def query():
    response = requests.get(API_URL)
    return response.json()
data = query()

This returns a URL to the Parquet file for each split:

{
  "parquet_files": [
    {
      "dataset": "rotten_tomatoes",
      "config": "default",
      "split": "test",
      "url": "https://boincai.com/datasets/rotten_tomatoes/resolve/refs%2Fconvert%2Fparquet/default/test/0000.parquet",
      "filename": "0000.parquet",
      "size": 92206
    },
    {
      "dataset": "rotten_tomatoes",
      "config": "default",
      "split": "train",
      "url": "https://boincai.com/datasets/rotten_tomatoes/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet",
      "filename": "0000.parquet",
      "size": 698845
    },
    {
      "dataset": "rotten_tomatoes",
      "config": "default",
      "split": "validation",
      "url": "https://boincai.com/datasets/rotten_tomatoes/resolve/refs%2Fconvert%2Fparquet/default/validation/0000.parquet",
      "filename": "0000.parquet",
      "size": 90001
    }
  ],
  "pending": [],
  "failed": [],
  "partial": false
}
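
To work with one split, its file URLs can be picked out of "parquet_files". The sketch below runs on two entries copied from the response above. `parquet_urls` is a hypothetical helper name.

```python
def parquet_urls(response, split, config="default"):
    """Return the Parquet file URLs of one configuration and split."""
    return [
        entry["url"]
        for entry in response["parquet_files"]
        if entry["config"] == config and entry["split"] == split
    ]

# Two entries copied from the /parquet response shown above.
response = {
    "parquet_files": [
        {
            "dataset": "rotten_tomatoes", "config": "default", "split": "test",
            "url": "https://boincai.com/datasets/rotten_tomatoes/resolve/refs%2Fconvert%2Fparquet/default/test/0000.parquet",
            "filename": "0000.parquet", "size": 92206,
        },
        {
            "dataset": "rotten_tomatoes", "config": "default", "split": "train",
            "url": "https://boincai.com/datasets/rotten_tomatoes/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet",
            "filename": "0000.parquet", "size": 698845,
        },
    ],
}

print(parquet_urls(response, "train"))
```

Assuming pandas and a Parquet engine such as pyarrow are installed, each returned URL can be passed to pandas.read_parquet to load the file as a DataFrame.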

Get the size of the dataset

The /size endpoint returns a JSON object with the size (number of rows and size in bytes) of the dataset, and of every configuration and split:

import requests
API_URL = "https://datasets-server.boincai.com/size?dataset=rotten_tomatoes"
def query():
    response = requests.get(API_URL)
    return response.json()
data = query()

This returns the size of the dataset, and of each of its configurations and splits:

{
  "size": {
    "dataset": {
      "dataset": "rotten_tomatoes",
      "num_bytes_original_files": 487770,
      "num_bytes_parquet_files": 881052,
      "num_bytes_memory": 1345449,
      "num_rows": 10662
    },
    "configs": [
      {
        "dataset": "rotten_tomatoes",
        "config": "default",
        "num_bytes_original_files": 487770,
        "num_bytes_parquet_files": 881052,
        "num_bytes_memory": 1345449,
        "num_rows": 10662,
        "num_columns": 2
      }
    ],
    "splits": [
      {
        "dataset": "rotten_tomatoes",
        "config": "default",
        "split": "train",
        "num_bytes_parquet_files": 698845,
        "num_bytes_memory": 1074806,
        "num_rows": 8530,
        "num_columns": 2
      },
      {
        "dataset": "rotten_tomatoes",
        "config": "default",
        "split": "validation",
        "num_bytes_parquet_files": 90001,
        "num_bytes_memory": 134675,
        "num_rows": 1066,
        "num_columns": 2
      },
      {
        "dataset": "rotten_tomatoes",
        "config": "default",
        "split": "test",
        "num_bytes_parquet_files": 92206,
        "num_bytes_memory": 135968,
        "num_rows": 1066,
        "num_columns": 2
      }
    ]
  },
  "pending": [],
  "failed": [],
  "partial": false
}
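
The per-split row counts can be combined with the dataset total, for example to see what share of the rows each split holds. The sketch below runs on a trimmed copy of the response above, keeping only the fields it uses. `split_fractions` is a hypothetical helper name.

```python
def split_fractions(response):
    """Map each split name to its share of the dataset's total rows."""
    total = response["size"]["dataset"]["num_rows"]
    return {
        entry["split"]: entry["num_rows"] / total
        for entry in response["size"]["splits"]
    }

# Trimmed copy of the /size response shown above.
response = {
    "size": {
        "dataset": {"dataset": "rotten_tomatoes", "num_rows": 10662},
        "splits": [
            {"config": "default", "split": "train", "num_rows": 8530},
            {"config": "default", "split": "validation", "num_rows": 1066},
            {"config": "default", "split": "test", "num_rows": 1066},
        ],
    },
}

for name, fraction in split_fractions(response).items():
    print(f"{name}: {fraction:.1%}")
```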

Sign up for a BOINC AI account if you don’t already have one! While you can use Datasets Server without a BOINC AI account, you won’t be able to access gated datasets like CommonVoice and ImageNet without providing a user token, which you can find in your user settings.

Feel free to try out the API in Postman, ReDoc, or RapidAPI. This quickstart shows how to query the endpoints programmatically.

To check whether a specific dataset is valid, for example Rotten Tomatoes, use the /is-valid endpoint.

Datasets Server converts every public dataset on the Hub to the Parquet format. The /parquet endpoint returns a JSON list of the Parquet URLs for a dataset.