Datasets-server

Splits and configurations

Last updated 1 year ago

Machine learning datasets are commonly organized into splits, and they may also have configurations. These internal structures provide the scaffolding for building out a dataset and determine how it is divided and organized. Understanding a dataset’s structure can help you create your own dataset and know which subset of data to use during model training and evaluation.

Splits

Every processed and cleaned dataset contains splits, specific subsets of data reserved for specific needs. The most common splits are:

  • train: data used to train a model; this data is exposed to the model

  • validation: data reserved for evaluation and for tuning model hyperparameters; this data is hidden from the model

  • test: data reserved for evaluation only; this data is completely hidden from the model and ourselves

The validation and test sets are especially important for checking that a model is actually learning rather than overfitting, that is, just memorizing the training data.
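The three splits described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular dataset was built: the function name `split_indices`, the 80/10/10 proportions, and the fixed seed are all assumptions chosen for the example.

```python
import random


def split_indices(n_rows, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Partition row indices 0..n_rows-1 into train/validation/test splits.

    The fractions and the seed are illustrative defaults (assumptions),
    not values prescribed by any dataset or library.
    """
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("validation and test fractions must leave room for train")

    indices = list(range(n_rows))
    # Shuffle with a local, seeded generator so the split is reproducible.
    random.Random(seed).shuffle(indices)

    n_val = int(n_rows * val_fraction)
    n_test = int(n_rows * test_fraction)

    return {
        # Rows the model is trained on.
        "train": indices[n_val + n_test:],
        # Rows held out for tuning hyperparameters.
        "validation": indices[:n_val],
        # Rows held out for the final evaluation only.
        "test": indices[n_val:n_val + n_test],
    }


splits = split_indices(1000)
for name, rows in splits.items():
    print(name, len(rows))
```

Because the three lists are slices of one shuffled index list, no row can appear in more than one split, which is the property that keeps validation and test data hidden from training.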

Configurations

A configuration is a higher-level internal structure than a split, and a configuration contains splits. You can think of a configuration as a sub-dataset contained within a larger dataset. It is a useful structure for adding layers of organization to a dataset. For example, if you take a look at the Multilingual LibriSpeech (MLS) dataset, you’ll notice there are eight different languages. While you can create a dataset containing all eight languages, it’s probably neater to create a dataset with each language as a configuration. This way, users can directly load the data for their language of interest instead of preprocessing the whole dataset to filter for a specific language.

Configurations are flexible and can be used to organize a dataset along whatever dimension you’d like. For example, the SceneParse150 dataset uses configurations to organize the data by task. One configuration is dedicated to segmenting the whole image, while the other is for instance segmentation.
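To make the "configuration contains splits" relationship concrete, here is a small sketch that groups a flat list of (configuration, split) entries into a nested mapping. The payload shape (a `splits` list whose entries carry `dataset`, `config`, and `split` keys) is an assumption modeled on a splits listing, and the dataset name and configuration names in `sample_response` are made up for illustration; a real server response may contain additional fields.

```python
from collections import defaultdict

# Hypothetical payload: the dataset name, configs, and splits are
# illustrative values, not taken from a real dataset.
sample_response = {
    "splits": [
        {"dataset": "example/multilingual-speech", "config": "german", "split": "train"},
        {"dataset": "example/multilingual-speech", "config": "german", "split": "validation"},
        {"dataset": "example/multilingual-speech", "config": "german", "split": "test"},
        {"dataset": "example/multilingual-speech", "config": "polish", "split": "train"},
        {"dataset": "example/multilingual-speech", "config": "polish", "split": "test"},
    ]
}


def splits_by_config(response):
    """Return a mapping of configuration name to its list of split names."""
    grouped = defaultdict(list)
    for entry in response["splits"]:
        grouped[entry["config"]].append(entry["split"])
    # Convert to a plain dict so missing keys raise KeyError as usual.
    return dict(grouped)


structure = splits_by_config(sample_response)
for config, split_names in structure.items():
    print(config, split_names)
```

The nested mapping mirrors the hierarchy described above: a dataset holds configurations, and each configuration holds its own set of splits, which need not be identical across configurations.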

Datasets referenced on this page: Multilingual LibriSpeech (MLS), SceneParse150