Server infrastructure

The Datasets Server has two main components that work together to answer queries about a dataset instantly:

  • a user-facing web API for exploring and returning information about a dataset

  • a server that runs the queries ahead of time and caches them in a database

While most of the documentation is focused on the web API, the server is crucial because it performs all the time-consuming preprocessing and stores the results so the web API can retrieve and serve them to the user. This saves a user time because instead of generating the response every time it gets requested, Datasets Server can return the preprocessed results instantly from the cache.
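The split between the two components can be sketched as follows. This is a minimal illustration only, with invented names throughout; nothing here is the actual Datasets Server code, and a plain dictionary stands in for the Mongo cache database:

```python
# Minimal sketch of the two components described above. All names are
# illustrative, not the actual Datasets Server implementation.

cache = {}  # stands in for the Mongo cache database

def preprocessing_server(dataset):
    """Runs ahead of time: does the slow work and stores the result."""
    # Pretend this is expensive (refreshing the dataset, listing splits, ...).
    cache[("/splits", dataset)] = {"splits": ["train", "test"]}

def web_api(endpoint, dataset):
    """User-facing: only ever reads from the cache, so it answers instantly."""
    return cache.get((endpoint, dataset), {"error": "response not ready yet"})

preprocessing_server("my-dataset")
print(web_api("/splits", "my-dataset"))  # {'splits': ['train', 'test']}
```

The key design point is that the web API never does the expensive work itself: it either finds a precomputed response in the cache or reports that the result is not ready yet.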

There are three elements that keep the server running: the job queue, workers, and the cache.

Job queue

The job queue is a list of jobs, stored in a Mongo database, that the workers should complete. The jobs correspond directly to the endpoints the user calls; the only difference is that the server runs the jobs ahead of time, so the user gets precomputed results when they use the endpoint.

There are three jobs:

  • /splits corresponds to the /splits endpoint. It refreshes a dataset and then returns that dataset’s splits and configurations. For every split in the dataset, it’ll create a new job.

  • /first-rows corresponds to the /first-rows endpoint. It gets the first 100 rows and columns of a dataset split.

  • /parquet corresponds to the /parquet endpoint. It downloads the whole dataset, converts it to Parquet format, and publishes the Parquet files to the Hub.

You might’ve noticed the /rows and /search endpoints don’t have a job in the queue. The responses from these endpoints are generated on demand.
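The fan-out described above, where a /splits job spawns one follow-up job per split, can be sketched like this. The split names and the deque are illustrative stand-ins; the real queue lives in a Mongo database:

```python
from collections import deque

# Illustrative sketch of the job queue: a /splits job fans out into one
# /first-rows job per split. A deque stands in for the Mongo-backed queue.

queue = deque()

def enqueue(job_type, dataset, split=None):
    queue.append({"type": job_type, "dataset": dataset, "split": split})

def run_next_job():
    job = queue.popleft()
    if job["type"] == "/splits":
        # Refreshing the dataset yields its splits (hard-coded for the sketch);
        # each split gets its own follow-up job.
        for split in ["train", "validation", "test"]:
            enqueue("/first-rows", job["dataset"], split)
    return job

enqueue("/splits", "my-dataset")
run_next_job()
print([j["split"] for j in queue])  # one /first-rows job per split
```

Running the /splits job here leaves three /first-rows jobs in the queue, one per split, ready for the workers to pick up.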

Workers

Workers are responsible for executing the jobs in the queue. They perform the actual preprocessing, such as getting a list of a dataset’s splits and configurations. The workers can be controlled by configurable environment variables, such as the minimum or maximum number of rows returned by a worker, or the maximum number of jobs to start per dataset user or organization.
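One such limit, a cap on jobs started per dataset, could look like the sketch below. The variable name MAX_JOBS_PER_DATASET and the claiming logic are assumptions for illustration, not the server’s actual configuration keys:

```python
import os

# Hedged sketch of a worker reading a limit from an environment variable.
# MAX_JOBS_PER_DATASET is an illustrative name, not a real configuration key.

MAX_JOBS_PER_DATASET = int(os.environ.get("MAX_JOBS_PER_DATASET", "2"))

def claim_jobs(jobs):
    """Claim jobs from the queue without exceeding the per-dataset limit."""
    started = {}
    claimed = []
    for job in jobs:
        dataset = job["dataset"]
        if started.get(dataset, 0) < MAX_JOBS_PER_DATASET:
            started[dataset] = started.get(dataset, 0) + 1
            claimed.append(job)
    return claimed

jobs = [{"dataset": "a"}, {"dataset": "a"}, {"dataset": "a"}, {"dataset": "b"}]
print(len(claim_jobs(jobs)))  # 2 jobs for "a" (at the limit) + 1 for "b" = 3
```

Limits like this keep one very large dataset (or one heavy user) from monopolizing all the workers.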

Cache

Once the workers complete a job, the results are stored - or cached - in a Mongo database. When a user makes a request to an endpoint like /first-rows, Datasets Server retrieves the preprocessed response from the cache and serves it to the user. This eliminates the time a user would’ve spent waiting if the server hadn’t already completed the job and stored the response.
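The benefit can be made concrete with a small sketch: the expensive preprocessing runs once, after which every request is a plain lookup. The counter and key shapes are illustrative only:

```python
# Sketch of why caching eliminates the wait: preprocessing runs once (in the
# real server, when a worker completes the job), and every later request is a
# dictionary lookup. All names and shapes are illustrative.

calls = {"preprocess": 0}
cache = {}

def preprocess_first_rows(dataset, split):
    calls["preprocess"] += 1  # stands in for slow work on the real dataset
    return {"rows": [{"row_idx": i} for i in range(100)]}

def first_rows(dataset, split):
    key = ("/first-rows", dataset, split)
    if key not in cache:  # done ahead of time by a worker in the real server
        cache[key] = preprocess_first_rows(dataset, split)
    return cache[key]

for _ in range(5):
    first_rows("my-dataset", "train")
print(calls["preprocess"])  # 1 -- preprocessing ran only once for 5 requests
```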

As a result, users can get their requested information about a dataset (even large ones) nearly instantaneously!

Take a look at the workers configuration for a complete list of the environment variables if you’re interested in learning more.
