Process audio data

Last updated 1 year ago

This guide shows specific methods for processing audio datasets. Learn how to:

  • Resample the sampling rate.

  • Use map() with audio datasets.

For a guide on how to process any type of dataset, take a look at the general process guide.

Cast

The cast_column() function is used to cast a column to another feature to be decoded. When you use this function with the Audio feature, you can resample the sampling rate:

>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

Audio files are decoded and resampled on-the-fly, so the next time you access an example, the audio file is resampled to 16kHz:

>>> dataset[0]["audio"]
{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
         3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 16000}
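
To make concrete what resampling changes, here is a minimal, standard-library sketch that upsamples a one-second tone from 8 kHz to 16 kHz by linear interpolation. The function name resample_linear is hypothetical, and this is not how the library resamples audio (real resamplers also apply filtering); it only illustrates that the sample count scales with the sampling rate while the duration stays the same.

```python
import math


def resample_linear(samples, orig_sr, target_sr):
    """Resample a mono signal by linear interpolation (illustration only)."""
    if orig_sr == target_sr or not samples:
        return list(samples)
    # The number of output samples scales with the ratio of the two rates.
    n_out = int(round(len(samples) * target_sr / orig_sr))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr      # position in source-sample units
        left = min(int(pos), last)         # nearest source sample at or before pos
        right = min(left + 1, last)        # next source sample, clamped at the end
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# One second of a 440 Hz tone sampled at 8 kHz.
tone_8k = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
tone_16k = resample_linear(tone_8k, orig_sr=8000, target_sr=16000)

print(len(tone_8k) / 8000, len(tone_16k) / 16000)  # both durations are 1.0 second
```

In practice you would let cast_column() and the Audio feature handle this step, as shown above, rather than resampling by hand.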

Map

The map() function helps preprocess your entire dataset at once. Depending on the type of model you’re working with, you’ll need to either load a feature extractor or a processor.

  • For pretrained speech recognition models, load a feature extractor and tokenizer and combine them in a processor:

    >>> from transformers import AutoFeatureExtractor, Wav2Vec2CTCTokenizer, Wav2Vec2Processor
    
    >>> model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
    # after defining a vocab.json file you can instantiate a tokenizer object
    # (the Auto* classes can only be created with from_pretrained(), so the concrete Wav2Vec2 classes are used here):
    >>> tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
    >>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
    # combine the feature extractor and tokenizer into a single processor:
    >>> processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

  • For fine-tuned speech recognition models, you only need to load a processor:

    >>> from transformers import AutoProcessor
    
    >>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")

When you use map() with your preprocessing function, include the audio column to ensure you’re actually resampling the audio data:

>>> def prepare_dataset(batch):
...     # accessing the "audio" column decodes and resamples the file
...     audio = batch["audio"]
...     batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
...     batch["input_length"] = len(batch["input_values"])
...     # "sentence" is assumed to be the transcription column; adjust the name to match your dataset
...     with processor.as_target_processor():
...         batch["labels"] = processor(batch["sentence"]).input_ids
...     return batch
>>> dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)
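
To show what a preprocessing function like prepare_dataset produces for a single example, the following sketch replaces the real processor with plain-Python stand-ins. Everything here is an assumption made for illustration: the toy VOCAB, the normalize and encode helpers, and the prepare_example name are hypothetical, and a real processor performs more work than this. The point is only the shape of the result: normalized input values, their length, and label ids.

```python
import math

# Hypothetical toy vocabulary; a real one comes from the tokenizer's vocab.json.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "|": 2, "a": 3, "b": 4, "c": 5}


def normalize(samples):
    """Stand-in for the feature extractor: scale to zero mean and unit variance."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return [(s - mean) / math.sqrt(var + 1e-7) for s in samples]


def encode(text):
    """Stand-in for the tokenizer: one id per character, "|" for spaces."""
    return [VOCAB.get("|" if ch == " " else ch, VOCAB["[UNK]"]) for ch in text]


def prepare_example(example):
    """Build model inputs from one example, dropping the original columns."""
    audio = example["audio"]
    input_values = normalize(audio["array"])
    return {
        "input_values": input_values,
        "input_length": len(input_values),
        "labels": encode(example["sentence"]),
    }


example = {
    "audio": {"array": [0.0, 0.5, -0.5, 1.0], "sampling_rate": 16000},
    "sentence": "ab c",
}
prepared = prepare_example(example)
print(prepared["input_length"])  # 4
print(prepared["labels"])        # [3, 4, 2, 5]
```

Returning only the new keys mirrors the effect of passing remove_columns to map(): the raw audio and text columns are no longer needed once the model inputs exist.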
