Quickstart
In this quickstart, you’ll learn how to use the Datasets Server’s REST API to:
- Check whether a dataset on the Hub is functional.
- Return the configurations and splits of a dataset.
- Preview the first 100 rows of a dataset.
- Download slices of rows of a dataset.
- Search for a word in a dataset.
- Access the dataset as Parquet files.
Each feature is served through an endpoint summarized in the table below:
| Endpoint | Method | Description | Query parameters |
|---|---|---|---|
| /first-rows | GET | Get the first rows of a dataset split. | dataset: name of the dataset; config: name of the config; split: name of the split |
| /rows | GET | Get a slice of rows of a dataset split. | dataset: name of the dataset; config: name of the config; split: name of the split; offset: offset of the slice; length: length of the slice (maximum 100) |
| /search | GET | Search text in a dataset split. | dataset: name of the dataset; config: name of the config; split: name of the split; query: text to search for |
There is no installation or setup required to use Datasets Server.
Sign up for a BOINC AI account if you don't already have one. You can use Datasets Server without an account, but gated datasets such as CommonVoice and ImageNet are only accessible if you provide a user token, which you can find in your user settings.
Feel free to try out the API in Postman, ReDoc or RapidAPI. This quickstart will show you how to query the endpoints programmatically.
The base URL of the REST API is:
Gated datasets
For gated datasets, you’ll need to provide your user token in the headers of your request. Otherwise, you’ll get an error message asking you to retry with authentication.
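As a minimal sketch of passing the token in the request headers, the snippet below builds an authenticated request with Python's standard library. The base URL, the token value, and the dataset name are hypothetical placeholders, and sending the token as a `Bearer` value in the `Authorization` header is an assumption, since this page does not name the header.

```python
import urllib.request

# Hypothetical placeholders: substitute the real base URL and your own user token.
BASE_URL = "https://datasets-server.example.com"
API_TOKEN = "YOUR_USER_TOKEN"


def authorized_request(url: str, token: str) -> urllib.request.Request:
    """Build a GET request that carries the user token in its headers."""
    # Assumption: the token is sent as a Bearer token in the Authorization header.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


# "some_gated_dataset" is a made-up dataset name used only for illustration.
request = authorized_request(f"{BASE_URL}/is-valid?dataset=some_gated_dataset", API_TOKEN)
```

Passing `request` to `urllib.request.urlopen` would send it; the sketch stops short of that so it can be read and run without network access.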
You’ll see the following error if you’re trying to access a gated dataset without providing your user token:
Check dataset validity
To check whether a specific dataset is valid, for example, Rotten Tomatoes, use the /is-valid endpoint:
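A query to /is-valid might be sketched as follows. The base URL is a hypothetical placeholder, and both the `dataset` query parameter and the `rotten_tomatoes` identifier are assumptions chosen to match the parameter naming used by the other endpoints; `fetch_json` is an illustrative helper, not part of any library.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholder: substitute the real base URL of the REST API.
BASE_URL = "https://datasets-server.example.com"


def is_valid_url(dataset: str) -> str:
    """Build the /is-valid URL for a dataset (the `dataset` parameter is assumed)."""
    return f"{BASE_URL}/is-valid?{urllib.parse.urlencode({'dataset': dataset})}"


def fetch_json(url: str) -> dict:
    """Send a GET request and decode the JSON body of the response."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


url = is_valid_url("rotten_tomatoes")
```

Calling `fetch_json(url)` against the real base URL would return the validity flags as a Python dictionary.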
The response indicates whether the dataset supports the preview (see /first-rows), the viewer (see /rows), and search (see /search):
List configurations and splits
The /splits endpoint returns a JSON list of the splits in a dataset:
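Once the list of splits is retrieved, it is often convenient to group split names by configuration. The sketch below assumes that each entry in the returned list carries `config` and `split` keys, mirroring the query parameter names of the other endpoints; the base URL is a hypothetical placeholder and the sample entries are made up for illustration, not actual server output.

```python
from collections import defaultdict
from urllib.parse import urlencode

# Hypothetical placeholder: substitute the real base URL of the REST API.
BASE_URL = "https://datasets-server.example.com"


def splits_url(dataset: str) -> str:
    """Build the /splits URL for a dataset (the `dataset` parameter is assumed)."""
    return f"{BASE_URL}/splits?{urlencode({'dataset': dataset})}"


def group_by_config(entries: list) -> dict:
    """Map each configuration name to the list of its split names."""
    grouped = defaultdict(list)
    for entry in entries:
        # Assumed entry shape: {"config": ..., "split": ...}
        grouped[entry["config"]].append(entry["split"])
    return dict(grouped)


# Made-up entries for illustration only.
sample = [
    {"config": "default", "split": "train"},
    {"config": "default", "split": "test"},
]
print(group_by_config(sample))  # {'default': ['train', 'test']}
```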
This returns the available configurations and splits in the dataset:
Preview a dataset
The /first-rows endpoint returns a JSON list of the first 100 rows of a dataset split, along with the data types of its features (columns). Specify the dataset name, the configuration name (which you can find with the /splits endpoint), and the split name of the dataset you’d like to preview:
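The three parameters can be combined into a request URL as in the sketch below. The base URL is a hypothetical placeholder, and the values `rotten_tomatoes`, `default`, and `train` are assumed examples rather than names confirmed by this page.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: substitute the real base URL of the REST API.
BASE_URL = "https://datasets-server.example.com"


def first_rows_url(dataset: str, config: str, split: str) -> str:
    """Build the /first-rows URL from the dataset, config, and split names."""
    params = {"dataset": dataset, "config": config, "split": split}
    return f"{BASE_URL}/first-rows?{urlencode(params)}"


# Assumed example values for the three parameters.
url = first_rows_url("rotten_tomatoes", "default", "train")
```

Using `urlencode` rather than string concatenation keeps names that contain special characters, such as a slash in a namespaced dataset name, correctly escaped.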
This returns the first 100 rows of the dataset:
Download slices of a dataset
The /rows endpoint returns a JSON list of a slice of rows of a dataset split, starting at any given location (offset), along with the data types of its features (columns). Specify the dataset name, the configuration name (which you can find with the /splits endpoint), the split name, and the offset and length of the slice you’d like to download:
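Because the slice length is capped at 100, reading a longer range means issuing several requests with increasing offsets. The sketch below generates the corresponding /rows URLs; the base URL is a hypothetical placeholder, `slice_urls` is an illustrative helper, and the dataset, config, and split values are assumed examples.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: substitute the real base URL of the REST API.
BASE_URL = "https://datasets-server.example.com"
MAX_LENGTH = 100  # maximum slice length accepted by /rows


def slice_urls(dataset: str, config: str, split: str, start: int, total: int):
    """Yield /rows URLs covering `total` rows from `start`, at most 100 rows each."""
    offset = start
    end = start + total
    while offset < end:
        length = min(MAX_LENGTH, end - offset)
        params = {
            "dataset": dataset,
            "config": config,
            "split": split,
            "offset": offset,
            "length": length,
        }
        yield f"{BASE_URL}/rows?{urlencode(params)}"
        offset += length


# Rows 150..399 are covered by three requests of 100, 100, and 50 rows.
urls = list(slice_urls("rotten_tomatoes", "default", "train", start=150, total=250))
```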
You can download slices of 100 rows maximum at a time.
The response looks like:
Search text in a dataset
The /search endpoint returns a JSON list of a slice of rows of a dataset split that match a text query. The text is searched in the columns of type string, even if the values are nested in a dictionary. The response has the same format as the /rows endpoint, including the data types of the features (columns). Specify the dataset name, the configuration name (which you can find with the /splits endpoint), the split name, and the search query you’d like to find in the text columns:
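A search request can be sketched in the same way, with the query text URL-encoded. The base URL is a hypothetical placeholder, the example values are assumptions, and the default `offset` and `length` chosen here are illustrative choices for this helper rather than documented server defaults.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: substitute the real base URL of the REST API.
BASE_URL = "https://datasets-server.example.com"


def search_url(dataset: str, config: str, split: str, query: str,
               offset: int = 0, length: int = 100) -> str:
    """Build a /search URL; `length` is limited to 100 rows per request."""
    if not 1 <= length <= 100:
        raise ValueError("length must be between 1 and 100")
    params = {
        "dataset": dataset,
        "config": config,
        "split": split,
        "query": query,
        "offset": offset,
        "length": length,
    }
    # urlencode escapes spaces and other special characters in the query text.
    return f"{BASE_URL}/search?{urlencode(params)}"


url = search_url("rotten_tomatoes", "default", "train", "great movie")
```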
You can get slices of 100 rows maximum at a time, and you can ask for other slices using the offset and length parameters, as for the /rows endpoint.
The response looks like:
Access Parquet files
Datasets Server converts every public dataset on the Hub to the Parquet format. The /parquet endpoint returns a JSON list of the Parquet URLs for a dataset:
This returns a URL to the Parquet file for each split:
Get the size of the dataset
The /size endpoint returns a JSON object with the size (number of rows and size in bytes) of the dataset as a whole, and of every configuration and split:
This returns the size of the dataset, as well as the size of each configuration and split: