Process text data
This guide shows specific methods for processing text datasets. Learn how to:
Tokenize a dataset with map().
Align dataset labels with label ids for NLI datasets.
For a guide on how to process any type of dataset, take a look at the general process guide.
The map() function supports processing batches of examples at once, which speeds up tokenization.
Load a tokenizer from 🌍 Transformers.
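A minimal sketch of the step above; the "bert-base-uncased" checkpoint is an illustrative choice, not one prescribed by this guide:

```python
# Load a pretrained tokenizer; the checkpoint name is only an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The tokenizer wraps the text with the model's special tokens
encoded = tokenizer("hello world")
print(encoded["input_ids"])
```

Any checkpoint on the Hub (or a local tokenizer directory) works the same way with AutoTokenizer.from_pretrained().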
Set the batched parameter to True in the map() function to apply the tokenizer to batches of examples.
The map() function converts the returned values to a PyArrow-supported format. But explicitly returning the tensors as NumPy arrays is faster, because NumPy is a natively supported PyArrow format. Set return_tensors="np" when you tokenize your text.
The align_labels_with_mapping() function aligns a dataset label id with the label name. Not all 🌍 Transformers models follow the prescribed label mapping of the original dataset, especially for NLI datasets. For example, the MNLI dataset uses the label mapping entailment = 0, neutral = 1, contradiction = 2.
To align the dataset label mapping with the mapping used by a model, create a dictionary of the label names and ids to align on, for example {"contradiction": 0, "neutral": 1, "entailment": 2}.
Pass the dictionary of the label mappings to the align_labels_with_mapping() function, along with the column to align on.
You can also use this function to assign a custom mapping of labels to ids.