Data types

Last updated 1 year ago


Datasets supported by Datasets Server have a tabular format: each data point is a row, and its features are stored in columns. The /first-rows endpoint lets you preview the first 100 rows of a dataset, along with information about each feature. Within the features key, you’ll notice that each entry carries a _type field. This value describes the data type of the column, also known as the dataset’s Features.
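As a rough sketch of how such a request might be assembled, the snippet below builds a /first-rows URL. The host is a placeholder, and the dataset, config, and split query parameter names are assumed from the keys of the example response on this page; substitute the address and parameters of your own server.

```python
from urllib.parse import urlencode

# Placeholder host (hypothetical): replace with the address of your Datasets Server.
BASE_URL = "https://datasets-server.example.com"


def first_rows_url(dataset: str, config: str, split: str, base_url: str = BASE_URL) -> str:
    """Build a /first-rows request URL.

    The query parameter names are an assumption based on the keys
    that appear in the example response.
    """
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{base_url}/first-rows?{query}"


url = first_rows_url("rotten_tomatoes", "default", "train")
print(url)
```

The resulting URL can be passed to any HTTP client; the response body is JSON.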

There are several different Features for representing different data formats, such as Audio and Image for speech and image data respectively. Knowing a dataset’s Features gives you a better understanding of the data type you’re working with and how you can preprocess it.

For example, the /first-rows endpoint for the Rotten Tomatoes dataset returns the following:


{"dataset": "rotten_tomatoes",
 "config": "default",
 "split": "train",
 "features": [{"feature_idx": 0,
   "name": "text",
   "type": {"dtype": "string",
    "id": null,
    "_type": "Value"}},
  {"feature_idx": 1,
   "name": "label",
   "type": {"num_classes": 2,
    "names": ["neg", "pos"],
    "id": null,
    "_type": "ClassLabel"}}],
 ...
}
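To make the structure of this response concrete, here is a minimal sketch that reads the _type of each column. It parses a trimmed copy of the response (the elided part is left out) instead of calling the server, and column_types is a hypothetical helper name, not part of any library.

```python
import json

# Trimmed copy of the example response; the elided "..." portion is omitted.
RESPONSE_TEXT = """
{"dataset": "rotten_tomatoes",
 "config": "default",
 "split": "train",
 "features": [
   {"feature_idx": 0, "name": "text",
    "type": {"dtype": "string", "id": null, "_type": "Value"}},
   {"feature_idx": 1, "name": "label",
    "type": {"num_classes": 2, "names": ["neg", "pos"],
             "id": null, "_type": "ClassLabel"}}
 ]}
"""


def column_types(response: dict) -> dict:
    """Map each column name to the _type of its feature (hypothetical helper)."""
    return {feature["name"]: feature["type"]["_type"] for feature in response["features"]}


response = json.loads(RESPONSE_TEXT)
types = column_types(response)
print(types)
```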

This dataset has two columns, text and label:

The text column has a type of Value. The Value type is extremely versatile and represents scalar values such as strings, integers, dates, and even timestamp values.

The label column has a type of ClassLabel. The ClassLabel type represents the number of classes in a dataset and their label names. Naturally, this means you’ll frequently see ClassLabel used in classification datasets.
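The two feature types can be handled differently during preprocessing. The sketch below works on the type dictionaries copied from the example response; describe and label_name are hypothetical helpers, and it assumes that each label value in a row is an integer index into the names list.

```python
# Feature type dictionaries copied from the example response.
text_type = {"dtype": "string", "id": None, "_type": "Value"}
label_type = {"num_classes": 2, "names": ["neg", "pos"], "id": None, "_type": "ClassLabel"}


def describe(feature_type: dict) -> str:
    """Return a short, human-readable summary of a feature type."""
    kind = feature_type["_type"]
    if kind == "Value":
        return f"scalar column of dtype {feature_type['dtype']}"
    if kind == "ClassLabel":
        names = ", ".join(feature_type["names"])
        return f"{feature_type['num_classes']} classes: {names}"
    return f"unhandled feature type {kind}"


def label_name(feature_type: dict, class_id: int) -> str:
    """Look up a class name, assuming class_id indexes into the names list."""
    return feature_type["names"][class_id]


print(describe(text_type))
print(describe(label_type))
print(label_name(label_type, 1))
```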

For a complete list of available data types, take a look at the Features documentation.
