Search text in a dataset

Datasets Server provides a /search endpoint for searching words in a dataset.

Currently, only datasets with Parquet exports are supported, because the export lets Datasets Server index the contents and run the search without downloading the whole dataset.

This guide shows you how to use Datasets Server’s /search endpoint to search for a query string. Feel free to also try it out with ReDoc.

The text is searched in the columns of type string, even if the values are nested in a dictionary.
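To picture which values are in scope, here is a minimal client-side sketch that walks a row and yields every string value, including those nested in dictionaries. The helper name iter_strings and the sample row are made up for illustration; this is not the server's implementation, and it only follows dictionaries because that is the only nesting the text above mentions.

```python
def iter_strings(value, path=""):
    """Hypothetical helper: yield (dotted_path, text) for every string value."""
    if isinstance(value, str):
        yield path, value
    elif isinstance(value, dict):
        # Recurse into nested dictionaries, extending the dotted path.
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else str(key)
            yield from iter_strings(child, child_path)
    # Any other type (numbers, None, ...) is not text and is skipped.


# Hand-written row, not taken from a real dataset.
row = {"title": "A dog story", "meta": {"plot": "A dog runs home.", "year": 1999}}
for path, text in iter_strings(row):
    print(path, "->", text)
```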

The /search endpoint accepts six query parameters:

  • dataset: the dataset name, for example glue or mozilla-foundation/common_voice_10_0

  • config: the configuration name, for example cola

  • split: the split name, for example train

  • query: the text to search

  • offset: the offset of the slice, for example 150

  • length: the length of the slice, for example 10 (maximum: 100)
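The parameters above can be assembled into a request URL programmatically. The following is a minimal sketch: build_search_url is a hypothetical helper, the base URL is the one used in this guide's request example, and the only validation is the documented maximum of 100 for length.

```python
from urllib.parse import urlencode

# Base URL as used in this guide's request example.
SEARCH_URL = "https://datasets-server.boincai.com/search"
MAX_LENGTH = 100  # documented maximum for the `length` parameter


def build_search_url(dataset, config, split, query, offset, length):
    """Hypothetical helper: assemble a /search URL from the six parameters."""
    if length > MAX_LENGTH:
        raise ValueError(f"length must be at most {MAX_LENGTH}, got {length}")
    params = {
        "dataset": dataset,
        "config": config,
        "split": split,
        "query": query,
        "offset": offset,
        "length": length,
    }
    # urlencode percent-encodes special characters such as "/" or spaces.
    return f"{SEARCH_URL}?{urlencode(params)}"


print(build_search_url("duorc", "SelfRC", "train", "dog", 150, 2))
```

Building the query string with urlencode avoids hand-escaping dataset names that contain a slash, such as mozilla-foundation/common_voice_10_0.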

For example, let’s search for the text "dog" in the train split of the SelfRC configuration of the duorc dataset, restricting the results to the slice 150-151:

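import requests
# Replace the placeholder with your own access token.
API_TOKEN = "YOUR_API_TOKEN"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.boincai.com/search?dataset=duorc&config=SelfRC&split=train&query=dog&offset=150&length=2"
def query():
    # Send the GET request and decode the JSON body of the response.
    response = requests.get(API_URL, headers=headers)
    return response.json()
data = query()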

The endpoint response is a JSON object containing two keys (same format as /rows):

  • The features of a dataset, including the column’s name and data type.

  • The slice of rows of a dataset and the content contained in each column of a specific row.
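As a rough sketch of reading such a response, the snippet below pulls out the column names and the rows. It assumes the two keys are named features and rows, that each feature entry carries a name, and that each row entry wraps its values under row next to a row_idx; these field names are assumptions based on the /rows format and should be checked against a real response. The sample payload is hand-written, not real API output.

```python
def split_response(data):
    """Hypothetical helper: return (column_names, rows) from a response dict."""
    # Assumed shape: every feature entry has a "name" field.
    columns = [feature["name"] for feature in data["features"]]
    # Assumed shape: every row entry has "row_idx" and the values under "row".
    rows = [(entry["row_idx"], entry["row"]) for entry in data["rows"]]
    return columns, rows


# Hand-written stand-in payload, not real API output.
sample = {
    "features": [
        {"feature_idx": 0, "name": "title", "type": {"dtype": "string", "_type": "Value"}},
        {"feature_idx": 1, "name": "plot", "type": {"dtype": "string", "_type": "Value"}},
    ],
    "rows": [
        {
            "row_idx": 150,
            "row": {"title": "placeholder title", "plot": "placeholder text about a dog"},
            "truncated_cells": [],
        },
    ],
}

columns, rows = split_response(sample)
print(columns)
for row_idx, row in rows:
    print(row_idx, row["title"])
```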

The rows are ordered by the row index, and the text strings matching the query are not highlighted.
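Because matches are not highlighted in the response, a client can mark them itself. This is a minimal sketch with a hypothetical highlight helper; it treats the query as a literal, case-insensitive substring, which is an assumption and may not mirror how the server matches text.

```python
import re


def highlight(text, query, marker="**"):
    """Hypothetical helper: wrap each occurrence of `query` in `marker`."""
    # re.escape makes the query a literal string rather than a regex pattern.
    pattern = re.compile(re.escape(query), flags=re.IGNORECASE)
    return pattern.sub(lambda match: f"{marker}{match.group(0)}{marker}", text)


print(highlight("The Dog chased another dog.", "dog"))
```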

For example, here are the features and the slice 150-151 of matching rows of the duorc/SelfRC train split for the query dog:

Truncated responses

Unlike /first-rows, there is currently no truncation in /search. The truncated_cells field is still there but is always empty.
