Last updated 1 year ago

Quickstart

This quickstart is intended for developers who are ready to dive into the code and see an example of how to integrate 🌍 Datasets into their model training workflow. If you’re a beginner, we recommend starting with our tutorials, where you’ll get a more thorough introduction.

Each dataset is unique, and depending on the task, some datasets may require additional steps to prepare them for training. But you can always use 🌍 Datasets tools to load and process a dataset. The fastest and easiest way to get started is by loading an existing dataset from the BOINC AI Hub. There are thousands of datasets to choose from, spanning many tasks. Choose the type of dataset you want to work with, and let’s get started!

Check out Chapter 5 of the BOINC AI course to learn more about other important topics such as loading remote or local datasets, tools for cleaning up a dataset, and creating your own dataset.

Start by installing 🌍 Datasets:

pip install datasets

🌍 Datasets also supports audio and image data formats:

  • To work with audio datasets, install the Audio feature:

    pip install datasets[audio]
  • To work with image datasets, install the Image feature:

    pip install datasets[vision]

Besides 🌍 Datasets, make sure your preferred machine learning framework is installed:

PyTorch

pip install torch

TensorFlow

pip install tensorflow

Audio

>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")

>>> from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

>>> model = AutoModelForAudioClassification.from_pretrained("facebook/wav2vec2-base")
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
         3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
 'path': '/root/.cache/boincai/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 16000}
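
To make the effect of resampling concrete, here is a minimal pure-Python sketch that upsamples a short signal from 8 kHz to 16 kHz by linear interpolation. The function name upsample_linear is hypothetical, and this is not how the library resamples audio internally (a real resampler also filters the signal). It only illustrates that doubling the sampling rate doubles the number of samples for the same duration.

```python
def upsample_linear(samples, orig_rate, target_rate):
    """Naively resample a list of floats by linear interpolation (illustration only)."""
    n_out = int(len(samples) * target_rate / orig_rate)
    out = []
    for i in range(n_out):
        # Position of output sample i expressed in input-sample units.
        pos = i * orig_rate / target_rate
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)  # clamp at the last input sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


signal_8k = [0.0, 1.0, 0.0, -1.0]  # four samples at 8 kHz
signal_16k = upsample_linear(signal_8k, orig_rate=8000, target_rate=16000)
print(len(signal_8k), len(signal_16k))
```

The duration is unchanged, so the 16 kHz version holds twice as many samples as the 8 kHz input.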

4. Create a function to preprocess the audio array with the feature extractor, and truncate and pad the sequences into tidy rectangular tensors. The most important thing to remember is to pass the audio array to the feature extractor, since the array (the actual speech signal) is the model input.

>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays,
...         sampling_rate=16000,
...         padding=True,
...         max_length=100000,
...         truncation=True,
...     )
...     return inputs

>>> dataset = dataset.map(preprocess_function, batched=True)
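
As a rough illustration of what "tidy rectangular tensors" means, the sketch below truncates each sequence to a maximum length and then pads the shorter ones to the longest remaining sequence. It assumes that padding=True pads to the longest sequence in the batch. The helper name pad_and_truncate and the pad value are hypothetical; this is plain Python, not the feature extractor's implementation.

```python
def pad_and_truncate(sequences, max_length, pad_value=0.0):
    """Truncate every sequence to max_length, then pad to the longest one left."""
    truncated = [list(seq[:max_length]) for seq in sequences]
    target = max(len(seq) for seq in truncated)
    return [seq + [pad_value] * (target - len(seq)) for seq in truncated]


batch = pad_and_truncate(
    [[0.1, 0.2, 0.3, 0.4, 0.5], [0.6, 0.7], [0.8]],
    max_length=4,
)
for row in batch:
    print(row)
```

Every row ends up with the same length, which is what lets a batch be stored as a single tensor.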

>>> dataset = dataset.rename_column("intent_class", "labels")

6. Set the dataset format according to the machine learning framework you’re using.

PyTorch

>>> from torch.utils.data import DataLoader

>>> dataset.set_format(type="torch", columns=["input_values", "labels"])
>>> dataloader = DataLoader(dataset, batch_size=4)

TensorFlow

>>> import tensorflow as tf

>>> tf_dataset = model.prepare_tf_dataset(
...     dataset,
...     batch_size=4,
...     shuffle=True,
... )

Vision

>>> from datasets import load_dataset, Image

>>> dataset = load_dataset("beans", split="train")

>>> from torchvision.transforms import Compose, ColorJitter, ToTensor

>>> jitter = Compose(
...     [ColorJitter(brightness=0.5, hue=0.5), ToTensor()]
... )

3. Create a function to apply your transform to the dataset and generate the model input: pixel_values.

>>> def transforms(examples):
...     examples["pixel_values"] = [jitter(image.convert("RGB")) for image in examples["image"]]
...     return examples

>>> dataset = dataset.with_transform(transforms)
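
The idea of applying a transform on-the-fly can be pictured with a small wrapper that runs the transform only when an item is accessed. The class LazyTransformed and the toy transform below are hypothetical and written in plain Python; they are a conceptual model of lazy transforms, not the library's implementation.

```python
class LazyTransformed:
    """Apply a batch-style transform at access time instead of up front."""

    def __init__(self, rows, transform):
        self.rows = rows
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        row = self.rows[index]
        # Present the single row as a batch of size one: a dict of lists.
        batch = {key: [value] for key, value in row.items()}
        out = self.transform(batch)
        return {key: values[0] for key, values in out.items()}


calls = []  # records how many times the transform actually ran


def double_pixels(batch):
    calls.append(len(batch["image"]))
    batch["pixel_values"] = [[2 * p for p in image] for image in batch["image"]]
    return batch


rows = [{"image": [1, 2], "labels": 0}, {"image": [3, 4], "labels": 1}]
lazy = LazyTransformed(rows, double_pixels)
first = lazy[0]  # the transform runs here, for this one row only
print(first)
```

Because nothing is computed until an example is requested, a random augmentation can produce a different result each time the same example is read, and the stored data stays unchanged.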

5. Set the dataset format according to the machine learning framework you’re using.

PyTorch

>>> import torch
>>> from torch.utils.data import DataLoader

>>> def collate_fn(examples):
...     # Gather the transformed image tensors and labels from each example.
...     images = []
...     labels = []
...     for example in examples:
...         images.append(example["pixel_values"])
...         labels.append(example["labels"])
...
...     # Stack the images into one batch tensor and convert the labels to a tensor.
...     pixel_values = torch.stack(images)
...     labels = torch.tensor(labels)
...     return {"pixel_values": pixel_values, "labels": labels}

>>> dataloader = DataLoader(dataset, collate_fn=collate_fn, batch_size=4)

TensorFlow

Before you start, make sure you have up-to-date versions of albumentations and cv2 installed:

pip install -U albumentations opencv-python

>>> import albumentations
>>> import numpy as np

>>> transform = albumentations.Compose([
...     albumentations.RandomCrop(width=256, height=256),
...     albumentations.HorizontalFlip(p=0.5),
...     albumentations.RandomBrightnessContrast(p=0.2),
... ])

>>> def transforms(examples):
...     examples["pixel_values"] = [
...         transform(image=np.array(image))["image"] for image in examples["image"]
...     ]
...     return examples

>>> dataset.set_transform(transforms)
>>> tf_dataset = model.prepare_tf_dataset(
...     dataset,
...     batch_size=4,
...     shuffle=True,
... )

NLP

>>> from datasets import load_dataset

>>> dataset = load_dataset("glue", "mrpc", split="train")

>>> from transformers import AutoModelForSequenceClassification, AutoTokenizer

>>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

3. Create a function to tokenize the dataset, truncating and padding the text into tidy rectangular tensors. The tokenizer generates three new columns in the dataset: input_ids, token_type_ids, and attention_mask. These are the model inputs.

>>> def encode(examples):
...     return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length")

>>> dataset = dataset.map(encode, batched=True)
>>> dataset[0]
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
'label': 1,
'idx': 0,
'input_ids': array([  101,  7277,  2180,  5303,  4806,  1117,  1711,   117,  2292, 1119,  1270,   107,  1103,  7737,   107,   117,  1104,  9938, 4267, 12223, 21811,  1117,  2554,   119,   102, 11336,  6732, 3384,  1106,  1140,  1112,  1178,   107,  1103,  7737,   107, 117,  7277,  2180,  5303,  4806,  1117,  1711,  1104,  9938, 4267, 12223, 21811,  1117,  2554,   119,   102]),
'token_type_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
'attention_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}
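
To show how the three columns relate to each other, here is a toy sketch that encodes a sentence pair with a whitespace tokenizer and a made-up vocabulary. The function encode_pair, the vocabulary, and the naive tail truncation are all hypothetical; the special-token ids are assumed values in the style of BERT vocabularies. A real tokenizer splits text into subwords and truncates more carefully.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token ids for this sketch


def encode_pair(first, second, vocab, max_length):
    """Encode two sentences as one padded sequence (illustration only)."""
    ids_a = [vocab[word] for word in first.split()]
    ids_b = [vocab[word] for word in second.split()]

    # Layout: [CLS] sentence A [SEP] sentence B [SEP]
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], A and the first [SEP]; segment 1 covers B and the last [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)

    # Naive truncation: simply cut off the tail.
    input_ids = input_ids[:max_length]
    token_type_ids = token_type_ids[:max_length]

    # Real tokens get mask 1, padding gets mask 0.
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        "token_type_ids": token_type_ids + [0] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }


toy_vocab = {"a": 5, "b": 6, "c": 7}  # hypothetical vocabulary
encoded = encode_pair("a b", "c", toy_vocab, max_length=8)
print(encoded)
```

All three lists have the same length, and the attention mask marks which positions hold real tokens rather than padding.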

>>> dataset = dataset.map(lambda examples: {"labels": examples["label"]}, batched=True)
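
With batched=True, the mapped function receives a dictionary of columns (each a list of values) rather than a single example, and the columns it returns are added to the dataset. The sketch below mimics that contract in plain Python, assuming a column-oriented dictionary as the dataset. The helper map_batched is hypothetical and leaves out everything the real method does beyond this.

```python
def map_batched(columns, function, batch_size=1000):
    """Apply function to slices of a dict of columns and merge the results."""
    num_rows = len(next(iter(columns.values())))
    result = {}
    for start in range(0, num_rows, batch_size):
        # Each batch is a dict of lists holding up to batch_size rows.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        # Returned columns are added to (or replace) the existing ones.
        updated = {**batch, **function(batch)}
        for name, values in updated.items():
            result.setdefault(name, []).extend(values)
    return result


data = {"label": [1, 0, 1], "idx": [0, 1, 2]}
mapped = map_batched(data, lambda examples: {"labels": examples["label"]}, batch_size=2)
print(mapped)
```

Note that copying a column this way keeps the original label column alongside the new labels column.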

5. Set the dataset format according to the machine learning framework you’re using.

PyTorch

>>> import torch

>>> dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
>>> dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

TensorFlow

>>> import tensorflow as tf

>>> tf_dataset = model.prepare_tf_dataset(
...     dataset,
...     batch_size=4,
...     shuffle=True,
... )

What's next?

This completes the 🌍 Datasets quickstart! You can load any text, audio, or image dataset with a single function and get it ready for your model to train on.

Audio datasets are loaded just like text datasets. However, an audio dataset is preprocessed a bit differently. Instead of a tokenizer, you’ll need a feature extractor. An audio input may also require resampling its sampling rate to match the sampling rate of the pretrained model you’re using. In this quickstart, you’ll prepare the MInDS-14 dataset for a model to train on and classify the banking issue a customer is having.

1. Load the MInDS-14 dataset by providing the load_dataset() function with the dataset name, dataset configuration (not all datasets will have a configuration), and a dataset split:

2. Next, load a pretrained Wav2Vec2 model and its corresponding feature extractor from the 🌍 Transformers library. It is totally normal to see a warning after you load the model about some weights not being initialized. This is expected because you are loading this model checkpoint for training with another task.

3. The MInDS-14 dataset card indicates the sampling rate is 8kHz, but the Wav2Vec2 model was pretrained on a sampling rate of 16kHz. You’ll need to upsample the audio column with the cast_column() function and Audio feature to match the model’s sampling rate.

Once you have a preprocessing function, use the map() function to speed up processing by applying the function to batches of examples in the dataset.

5. Use the rename_column() function to rename the intent_class column to labels, which is the expected input name in Wav2Vec2ForSequenceClassification:

Use the set_format() function to set the dataset format to torch and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in torch.utils.data.DataLoader:

Use the prepare_tf_dataset method from 🌍 Transformers to make the dataset compatible with TensorFlow and ready to train or fine-tune a model. It wraps a BOINC AI Dataset as a tf.data.Dataset with collation and batching, so you can pass it directly to Keras methods like fit() without further modification.

7. Start training with your machine learning framework! Check out the 🌍 Transformers audio classification guide for an end-to-end example of how to train a model on an audio dataset.

Image datasets are loaded just like text datasets. However, instead of a tokenizer, you’ll need a feature extractor to preprocess the dataset. Applying data augmentation to an image is common in computer vision to make the model more robust against overfitting. You’re free to use any data augmentation library you want, and then you can apply the augmentations with 🌍 Datasets. In this quickstart, you’ll load the Beans dataset and get it ready for the model to train on and identify disease from the leaf images.

1. Load the Beans dataset by providing the load_dataset() function with the dataset name and a dataset split:

2. Now you can add some data augmentations with any library (Albumentations, imgaug, Kornia) you like. Here, you’ll use torchvision to randomly change the color properties of an image:

4. Use the with_transform() function to apply the data augmentations on-the-fly:

Wrap the dataset in torch.utils.data.DataLoader. You’ll also need to create a collate function to collate the samples into batches:

Use the prepare_tf_dataset method from 🌍 Transformers to make the dataset compatible with TensorFlow and ready to train or fine-tune a model. It wraps a BOINC AI Dataset as a tf.data.Dataset with collation and batching, so you can pass it directly to Keras methods like fit() without further modification.

6. Start training with your machine learning framework! Check out the 🌍 Transformers image classification guide for an end-to-end example of how to train a model on an image dataset.

Text needs to be tokenized into individual tokens by a tokenizer. For the quickstart, you’ll load the Microsoft Research Paraphrase Corpus (MRPC) training dataset to train a model to determine whether a pair of sentences mean the same thing.

1. Load the MRPC dataset by providing the load_dataset() function with the dataset name, dataset configuration (not all datasets will have a configuration), and dataset split:

2. Next, load a pretrained BERT model and its corresponding tokenizer from the 🌍 Transformers library. It is totally normal to see a warning after you load the model about some weights not being initialized. This is expected because you are loading this model checkpoint for training with another task.

Use the map() function to speed up processing by applying your tokenization function to batches of examples in the dataset:

4. Rename the label column to labels, which is the expected input name in BertForSequenceClassification:

Use the set_format() function to set the dataset format to torch and specify the columns you want to format. This function applies formatting on-the-fly. After converting to PyTorch tensors, wrap the dataset in torch.utils.data.DataLoader:

Use the prepare_tf_dataset method from 🌍 Transformers to make the dataset compatible with TensorFlow and ready to train or fine-tune a model. It wraps a BOINC AI Dataset as a tf.data.Dataset with collation and batching, so you can pass it directly to Keras methods like fit() without further modification.

6. Start training with your machine learning framework! Check out the 🌍 Transformers text classification guide for an end-to-end example of how to train a model on a text dataset.

For your next steps, take a look at our How-to guides and learn how to do more specific things like loading different dataset formats, aligning labels, and streaming large datasets. If you’re interested in learning more about 🌍 Datasets core concepts, grab a cup of coffee and read our Conceptual Guides!
