Splits and configurations

Machine learning datasets are commonly organized into splits, and they may also have configurations. These internal structures provide the scaffolding for building out a dataset and determine how its data is divided and organized. Understanding a dataset’s structure can help you create your own dataset and know which subset of data to use at each stage of model training and evaluation.

Splits

Every processed and cleaned dataset contains splits, specific subsets of data reserved for specific needs. The most common splits are:

  • train: data used to train a model; this data is exposed to the model

  • validation: data reserved for evaluating the model during development and for tuning hyperparameters; the model is not trained on this data

  • test: data reserved for evaluation only; this data is completely hidden from the model and ourselves

The validation and test sets are especially important for checking that a model is actually learning patterns that generalize, rather than overfitting (that is, just memorizing the training data).
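As an illustration of the three splits described above, here is a minimal sketch that partitions a list of rows into train, validation, and test subsets. The function name `split_dataset`, its parameters, and the 80/10/10 proportions are assumptions made for this example, not something the dataset itself prescribes; published datasets usually ship with their splits already defined.

```python
import random


def split_dataset(rows, validation_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle rows and partition them into train/validation/test splits.

    Hypothetical helper for illustration only.
    """
    if validation_fraction < 0 or test_fraction < 0:
        raise ValueError("fractions must be non-negative")
    if validation_fraction + test_fraction >= 1:
        raise ValueError("validation and test fractions must leave room for train")

    shuffled = list(rows)
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)

    # Each row lands in exactly one split, so the held-out data
    # never overlaps with the data used for training.
    return {
        "test": shuffled[:n_test],
        "validation": shuffled[n_test:n_test + n_validation],
        "train": shuffled[n_test + n_validation:],
    }


splits = split_dataset(range(100))
print({name: len(rows) for name, rows in splits.items()})
```

Keeping the three subsets disjoint is the point of the exercise: a score measured on rows the model was trained on says little about how well it generalizes.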

Configurations

A configuration is a higher-level internal structure than a split: each configuration contains its own splits. You can think of a configuration as a sub-dataset contained within a larger dataset. It is a useful structure for adding another layer of organization to a dataset. For example, if you take a look at the Multilingual LibriSpeech (MLS) dataset, you’ll notice there are eight different languages. While you can create a single dataset containing all eight languages, it’s probably neater to create a dataset with each language as a configuration. This way, users can directly load the data for their language of interest instead of preprocessing the whole dataset to filter for a specific language.

Configurations are flexible and can be used to organize a dataset around whatever criterion you’d like. For example, the SceneParse150 dataset uses configurations to organize the dataset by task: one configuration is dedicated to segmenting the whole image, while the other is for instance segmentation.
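To make the relationship between configurations and splits concrete, here is a small sketch that models a dataset as a nested mapping: configuration name, then split name, then rows. Everything in it is an assumption for illustration: the `DATASET` contents are made-up placeholders (not real MLS data), and `load_subset` is a hypothetical helper, not a library function.

```python
# Toy stand-in for a dataset with configurations: each configuration
# (here, a language) holds its own splits. All rows are made-up placeholders.
DATASET = {
    "german": {
        "train": ["de-utterance-1", "de-utterance-2", "de-utterance-3"],
        "test": ["de-utterance-4"],
    },
    "polish": {
        "train": ["pl-utterance-1", "pl-utterance-2"],
        "test": ["pl-utterance-3"],
    },
}


def load_subset(dataset, config, split):
    """Return the rows for one split of one configuration (hypothetical helper)."""
    if config not in dataset:
        raise KeyError(f"unknown configuration {config!r}; available: {sorted(dataset)}")
    splits = dataset[config]
    if split not in splits:
        raise KeyError(f"unknown split {split!r}; available: {sorted(splits)}")
    return splits[split]


# Select one language directly instead of filtering the whole dataset.
polish_train = load_subset(DATASET, "polish", "train")
print(len(polish_train), "rows in polish/train")
```

The 🤗 Datasets library follows the same pattern: `load_dataset` accepts a configuration name alongside the dataset name, plus a `split` argument, so a single call can select one split of one configuration.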
