Search text in a dataset

Datasets Server provides a /search endpoint for searching words in a dataset.

Currently, only datasets with Parquet exports are supported, so that Datasets Server can index the contents and run the search without downloading the whole dataset.

This guide shows you how to use Datasets Server’s /search endpoint to search for a query string. Feel free to also try it out with ReDoc.

The search covers every column of type string, including string values nested in a dictionary.

The /search endpoint accepts six query parameters:

  • dataset: the dataset name, for example glue or mozilla-foundation/common_voice_10_0

  • config: the configuration name, for example cola

  • split: the split name, for example train

  • query: the text to search

  • offset: the offset of the slice, for example 150

  • length: the length of the slice, for example 10 (maximum: 100)
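As a minimal sketch, the six parameters above can be assembled into a request URL with the standard library. The helper name build_search_url and the MAX_LENGTH constant are hypothetical names chosen for illustration; the base URL is the one used in this guide, and the only validation performed is the maximum length of 100 stated above.

```python
from urllib.parse import urlencode

# Base URL as used in this guide's example.
SEARCH_URL = "https://datasets-server.boincai.com/search"
# Maximum slice length documented for the `length` parameter.
MAX_LENGTH = 100


def build_search_url(dataset, config, split, query, offset, length):
    """Hypothetical helper: build a /search URL from the six query parameters."""
    if length > MAX_LENGTH:
        raise ValueError(f"length must not exceed {MAX_LENGTH}")
    params = {
        "dataset": dataset,
        "config": config,
        "split": split,
        "query": query,
        "offset": offset,
        "length": length,
    }
    # urlencode percent-encodes values, e.g. the "/" in namespaced dataset names.
    return f"{SEARCH_URL}?{urlencode(params)}"


url = build_search_url("duorc", "SelfRC", "train", "dog", offset=150, length=2)
print(url)
```

Building the query string with urlencode rather than by hand keeps multi-word queries and dataset names containing a slash correctly escaped.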

For example, let’s search for the text "dog" in the train split of the SelfRC configuration of the duorc dataset, restricting the results to the slice 150-151:

import requests

API_TOKEN = "YOUR_API_TOKEN"  # placeholder: replace with your own access token
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.boincai.com/search?dataset=duorc&config=SelfRC&split=train&query=dog&offset=150&length=2"

def query():
    # Send the GET request and decode the JSON body of the response
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()

The endpoint response is a JSON object containing two keys (same format as /rows):

  • The features of a dataset, including the column’s name and data type.

  • The slice of rows of a dataset and the content contained in each column of a specific row.

The rows are ordered by the row index, and the text strings matching the query are not highlighted.
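A short sketch of how the two keys might be read once the response has been decoded. It assumes the keys are named features and rows, that each features entry carries name and type fields, and that each rows entry carries row_idx and row fields, as in the /rows format. The helper name and the sample payload are placeholders made up for illustration, not real output from the endpoint.

```python
def summarize_search_response(data):
    """Hypothetical helper: pull column info and matching rows out of a response dict."""
    # Assumed layout: each feature entry has "name" and "type" fields.
    columns = [(feature["name"], feature["type"]) for feature in data["features"]]
    # Assumed layout: each row entry has "row_idx" and "row" fields.
    rows = [(item["row_idx"], item["row"]) for item in data["rows"]]
    return columns, rows


# Made-up placeholder payload (NOT real endpoint output), shaped like the assumed format.
sample = {
    "features": [
        {"feature_idx": 0, "name": "title", "type": {"dtype": "string", "_type": "Value"}},
    ],
    "rows": [
        {"row_idx": 3, "row": {"title": "placeholder text"}, "truncated_cells": []},
    ],
}

columns, rows = summarize_search_response(sample)
for name, dtype in columns:
    print(name, dtype)
for row_idx, row in rows:
    print(row_idx, row)
```

Because the matched text is not highlighted, locating the query inside each returned row is left to the client.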

For example, here are the features and the slice 150-151 of matching rows of the duorc/SelfRC train split for the query dog:

Truncated responses

Unlike /first-rows, there is currently no truncation in /search. The truncated_cells field is still there but is always empty.
