# Process text data

## Process text data

This guide shows specific methods for processing text datasets. Learn how to:

* Tokenize a dataset with [map()](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Dataset.map).
* Align dataset labels with label ids for NLI datasets.

For a guide on how to process any type of dataset, take a look at the [general process guide](https://huggingface.co/docs/datasets/process).

### Map

The [map()](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Dataset.map) function supports processing batches of examples at once which speeds up tokenization.

Load a tokenizer from 🌍 [Transformers](https://huggingface.co/transformers/):

Copied

```
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```

Set the `batched` parameter to `True` in the [map()](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Dataset.map) function to apply the tokenizer to batches of examples:

Copied

```
>>> dataset = dataset.map(lambda examples: tokenizer(examples["text"]), batched=True)
>>> dataset[0]
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 
 'label': 1, 
 'input_ids': [101, 1996, 2600, 2003, 16036, 2000, 2022, 1996, 7398, 2301, 1005, 1055, 2047, 1000, 16608, 1000, 1998, 2008, 2002, 1005, 1055, 2183, 2000, 2191, 1037, 17624, 2130, 3618, 2084, 7779, 29058, 8625, 13327, 1010, 3744, 1011, 18856, 19513, 3158, 5477, 4168, 2030, 7112, 16562, 2140, 1012, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```

The [map()](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Dataset.map) function converts the returned values to a PyArrow-supported format. But explicitly returning the tensors as NumPy arrays is faster because it is a natively supported PyArrow format. Set `return_tensors="np"` when you tokenize your text:

Copied

```
>>> dataset = dataset.map(lambda examples: tokenizer(examples["text"], return_tensors="np"), batched=True)
```

### Align

The [align\_labels\_with\_mapping()](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Dataset.align_labels_with_mapping) function aligns a dataset label id with the label name. Not all 🌍 Transformers models follow the prescribed label mapping of the original dataset, especially for NLI datasets. For example, the [MNLI](https://huggingface.co/datasets/glue) dataset uses the following label mapping:

Copied

```
>>> label2id = {"entailment": 0, "neutral": 1, "contradiction": 2}
```

To align the dataset label mapping with the mapping used by a model, create a dictionary of the label name and id to align on:

Copied

```
>>> label2id = {"contradiction": 0, "neutral": 1, "entailment": 2}
```

Pass the dictionary of the label mappings to the [align\_labels\_with\_mapping()](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Dataset.align_labels_with_mapping) function, and the column to align on:

Copied

```
>>> from datasets import load_dataset

>>> mnli = load_dataset("glue", "mnli", split="train")
>>> mnli_aligned = mnli.align_labels_with_mapping(label2id, "label")
```

You can also use this function to assign a custom mapping of labels to ids.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://boinc-ai.gitbook.io/datasets/how-to-guides/text/process-text-data.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
