List Parquet files

List Parquet files

Datasets can be published in any format (CSV, JSONL, directories of images, etc.) to the Hub, and they are easily accessed with the 🌍 Datasets library. For a more performant experience (especially when it comes to large datasets), Datasets Server automatically converts every public dataset to the Parquet format. The Parquet files are published to the Hub on a specific refs/convert/parquet branch (like this amazon_polarity branch for example) that lives in parallel to the main branch.

In order for Datasets Server to generate a Parquet version of a dataset, the dataset must be public.

This guide shows you how to use Datasets Server’s /parquet endpoint to retrieve a list of a dataset’s files converted to Parquet. Feel free to also try it out with Postman, RapidAPI, or ReDoc.

The /parquet endpoint accepts the dataset name as its query parameter:

PythonJavaScriptcURLCopied

import requests
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.boincai.com/parquet?dataset=duorc"
def query():
    response = requests.get(API_URL, headers=headers)
    return response.json()
data = query()

The endpoint response is a JSON containing a list of the dataset’s files in the Parquet format. For example, the duorc dataset has six Parquet files, which corresponds to the test, train and validation splits of its two configurations, ParaphraseRC and SelfRC (see the List splits and configurations guide for more details about splits and configurations).

The endpoint also gives the filename and size of each file:

Copied

{
  "parquet_files": [
    {
      "dataset": "duorc",
      "config": "ParaphraseRC",
      "split": "test",
      "url": "https://boincai.com/datasets/duorc/resolve/refs%2Fconvert%2Fparquet/ParaphraseRC/test/0000.parquet",
      "filename": "0000.parquet",
      "size": 6136590
    },
    {
      "dataset": "duorc",
      "config": "ParaphraseRC",
      "split": "train",
      "url": "https://boincai.com/datasets/duorc/resolve/refs%2Fconvert%2Fparquet/ParaphraseRC/train/0000.parquet",
      "filename": "0000.parquet",
      "size": 26005667
    },
    {
      "dataset": "duorc",
      "config": "ParaphraseRC",
      "split": "validation",
      "url": "https://boincai.com/datasets/duorc/resolve/refs%2Fconvert%2Fparquet/ParaphraseRC/validation/0000.parquet",
      "filename": "0000.parquet",
      "size": 5566867
    },
    {
      "dataset": "duorc",
      "config": "SelfRC",
      "split": "test",
      "url": "https://boincai.com/datasets/duorc/resolve/refs%2Fconvert%2Fparquet/SelfRC/test/0000.parquet",
      "filename": "0000.parquet",
      "size": 3035735
    },
    {
      "dataset": "duorc",
      "config": "SelfRC",
      "split": "train",
      "url": "https://boincai.com/datasets/duorc/resolve/refs%2Fconvert%2Fparquet/SelfRC/train/0000.parquet",
      "filename": "0000.parquet",
      "size": 14851719
    },
    {
      "dataset": "duorc",
      "config": "SelfRC",
      "split": "validation",
      "url": "https://boincai.com/datasets/duorc/resolve/refs%2Fconvert%2Fparquet/SelfRC/validation/0000.parquet",
      "filename": "0000.parquet",
      "size": 3114389
    }
  ]
}

Sharded Parquet files

Big datasets are partitioned into Parquet files (shards) of about 500MB each. The filename contains the name of the dataset, the split, the shard index, and the total number of shards (dataset-name-train-0000-of-0004.parquet). For a given split, the elements in the list are sorted by their shard index, in ascending order. For example, the train split of the amazon_polarity dataset is partitioned into 4 shards:

Copied

{
  "parquet_files": [
    {
      "dataset": "amazon_polarity",
      "config": "amazon_polarity",
      "split": "test",
      "url": "https://boincai.com/datasets/amazon_polarity/resolve/refs%2Fconvert%2Fparquet/amazon_polarity/test/0000.parquet",
      "filename": "0000.parquet",
      "size": 117422359
    },
    {
      "dataset": "amazon_polarity",
      "config": "amazon_polarity",
      "split": "train",
      "url": "https://boincai.com/datasets/amazon_polarity/resolve/refs%2Fconvert%2Fparquet/amazon_polarity/train/0000.parquet",
      "filename": "0000.parquet",
      "size": 320281121
    },
    {
      "dataset": "amazon_polarity",
      "config": "amazon_polarity",
      "split": "train",
      "url": "https://boincai.com/datasets/amazon_polarity/resolve/refs%2Fconvert%2Fparquet/amazon_polarity/train/0001.parquet",
      "filename": "0001.parquet",
      "size": 320627716
    },
    {
      "dataset": "amazon_polarity",
      "config": "amazon_polarity",
      "split": "train",
      "url": "https://boincai.com/datasets/amazon_polarity/resolve/refs%2Fconvert%2Fparquet/amazon_polarity/train/0002.parquet",
      "filename": "0002.parquet",
      "size": 320587882
    },
    {
      "dataset": "amazon_polarity",
      "config": "amazon_polarity",
      "split": "train",
      "url": "https://boincai.com/datasets/amazon_polarity/resolve/refs%2Fconvert%2Fparquet/amazon_polarity/train/0003.parquet",
      "filename": "0003.parquet",
      "size": 66515954
    }
  ],
  "pending": [],
  "failed": []
}

To read and query the Parquet files, take a look at the Query datasets from Datasets Server guide.

Partially converted datasets

The Parquet version can be partial if the dataset is not already in Parquet format or if it is bigger than 5GB.

In that case the Parquet files are generated up to 5GB and placed in a split directory prefixed with “partial”, e.g. “partial-train” instead of “train”.

Parquet-native datasets

When the dataset is already in Parquet format, the data are not converted and the files in refs/convert/parquet are links to the original files. This rule suffers an exception to ensure the Datasets Server API to stay fast: if the row group size of the original Parquet files is too big, new Parquet files are generated.

Using the BOINC AI Hub API

For convenience, you can directly use the BOINC AIHub /api/parquet endpoint which returns the list of Parquet URLs:

PythonJavaScriptcURLCopied

import requests
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://boincai.com/api/datasets/duorc/parquet"
def query():
    response = requests.get(API_URL, headers=headers)
    return response.json()
urls = query()

The endpoint response is a JSON containing a list of the dataset’s files URLs in the Parquet format for each split and configuration. For example, the duorc dataset has one Parquet file for the train split of the “ParaphraseRC” configuration (see the List splits and configurations guide for more details about splits and configurations).

Copied

{
  "ParaphraseRC": {
    "test": [
      "https://boincai.com/api/datasets/duorc/parquet/ParaphraseRC/test/0.parquet"
    ],
    "train": [
      "https://boincai.com/api/datasets/duorc/parquet/ParaphraseRC/train/0.parquet"
    ],
    "validation": [
      "https://boincai.com/api/datasets/duorc/parquet/ParaphraseRC/validation/0.parquet"
    ]
  },
  "SelfRC": {
    "test": [
      "https://boincai.com/api/datasets/duorc/parquet/SelfRC/test/0.parquet"
    ],
    "train": [
      "https://boincai.com/api/datasets/duorc/parquet/SelfRC/train/0.parquet"
    ],
    "validation": [
      "https://boincai.com/api/datasets/duorc/parquet/SelfRC/validation/0.parquet"
    ]
  }
}

Optionally you can specify which configuration name to return, as well as which split:

PythonJavaScriptcURLCopied

import requests
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://boincai.com/api/datasets/duorc/parquet/ParaphraseRC/train"
def query():
    response = requests.get(API_URL, headers=headers)
    return response.json()
urls = query()

Copied

[
  "https://boincai.com/api/datasets/duorc/parquet/ParaphraseRC/train/0.parquet"
]

Each parquet file can also be accessed using its shard index: https://boincai.com/api/datasets/duorc/parquet/ParaphraseRC/train/0.parquet redirects to https://boincai.com/datasets/duorc/resolve/refs%2Fconvert%2Fparquet/ParaphraseRC/train/0000.parquet for example.

Last updated