# Training from memory

In the [Quicktour](https://huggingface.co/docs/tokenizers/quicktour), we saw how to build and train a tokenizer using text files, but we can actually train from any Python iterator. In this section we’ll see a few different ways of training our tokenizer.

For all the examples listed below, we’ll use the same [Tokenizer](https://huggingface.co/docs/tokenizers/v0.13.4.rc2/en/api/tokenizer#tokenizers.Tokenizer) and `Trainer`, built as follows:

```
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers
tokenizer = Tokenizer(models.Unigram())
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.UnigramTrainer(
    vocab_size=20000,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    special_tokens=["<PAD>", "<BOS>", "<EOS>"],
)
```

This tokenizer is based on the [Unigram](https://huggingface.co/docs/tokenizers/v0.13.4.rc2/en/api/models#tokenizers.models.Unigram) model. It takes care of normalizing the input using the NFKC Unicode normalization method, and uses a [ByteLevel](https://huggingface.co/docs/tokenizers/v0.13.4.rc2/en/api/pre-tokenizers#tokenizers.pre_tokenizers.ByteLevel) pre-tokenizer with the corresponding decoder.

For more information on the components used here, see the [components documentation](https://huggingface.co/docs/tokenizers/components).

### The most basic way

As you probably guessed already, the easiest way to train our tokenizer is by using a `list`:

```
# First few lines of the "Zen of Python" https://www.python.org/dev/peps/pep-0020/
data = [
    "Beautiful is better than ugly.",
    "Explicit is better than implicit.",
    "Simple is better than complex.",
    "Complex is better than complicated.",
    "Flat is better than nested.",
    "Sparse is better than dense.",
    "Readability counts.",
]
tokenizer.train_from_iterator(data, trainer=trainer)
```

Easy, right? You can use anything that works as an iterator here, be it a `list`, a `tuple`, or an `np.ndarray`. Anything works as long as it provides strings.
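
Since any iterable of strings is accepted, a lazily evaluated generator fits as well. The following is a minimal sketch, not part of the original tutorial: `non_empty_lines` is a hypothetical helper that strips surrounding whitespace and skips blank entries before they would reach the trainer.

```python
from typing import Iterable, Iterator


def non_empty_lines(texts: Iterable[str]) -> Iterator[str]:
    """Yield stripped, non-empty strings one at a time (hypothetical helper)."""
    for text in texts:
        stripped = text.strip()
        if stripped:
            yield stripped


# A tuple works as input just as well as a list.
samples = ("Beautiful is better than ugly.", "   ", "Readability counts.\n")

# Materialized here only to inspect the result; training would consume
# the generator directly.
cleaned = list(non_empty_lines(samples))
```

Passing `non_empty_lines(samples)` to `tokenizer.train_from_iterator(..., trainer=trainer)` would then train on the cleaned strings without building an intermediate list.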

### Using the 🤗 Datasets library

An awesome way to access one of the many datasets that exist out there is by using the 🤗 Datasets library. For more information about it, you should check [the official documentation here](https://huggingface.co/docs/datasets/).

Let’s start by loading our dataset:

```
import datasets
dataset = datasets.load_dataset("wikitext", "wikitext-103-raw-v1", split="train+test+validation")
```

The next step is to build an iterator over this dataset. The easiest way to do this is probably by using a generator:

```
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]
```

As you can see here, for improved efficiency we can provide batches of training examples instead of iterating over them one by one. By doing so, we can expect performance very similar to what we got while training directly from files.

With our iterator ready, we just need to launch the training. In order to improve the look of our progress bars, we can specify the total length of the dataset:

```
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))
```

And that’s it!
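
The `batch_iterator` above relies on `len(dataset)` and slicing. For a source that can only be iterated once, such as a stream or a generator, similar batching can be sketched with `itertools.islice`. This is an illustration under that assumption; `batched` is a hypothetical helper name and is not provided by either library.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(texts: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Group an arbitrary iterable of strings into lists of at most batch_size."""
    iterator = iter(texts)
    while True:
        # islice pulls up to batch_size items without needing len() or indexing.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Tiny stand-in for a real text stream.
batches = list(batched(["a", "b", "c", "d", "e"], batch_size=2))
```

Each yielded batch is a list of strings, the same shape that `batch_iterator` produces. When the total number of examples is not known in advance, the `length` argument can simply be left out.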

### Using gzip files

Since gzip files in Python can be used as iterators, it is extremely simple to train on such files:

```
import gzip
with gzip.open("data/my-file.0.gz", "rt") as f:
    tokenizer.train_from_iterator(f, trainer=trainer)
```

Now if we wanted to train from multiple gzip files, it wouldn’t be much harder:

```
files = ["data/my-file.0.gz", "data/my-file.1.gz", "data/my-file.2.gz"]
def gzip_iterator():
    for path in files:
        with gzip.open(path, "rt") as f:
            for line in f:
                yield line
tokenizer.train_from_iterator(gzip_iterator(), trainer=trainer)
```

And voilà!
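
To try the multi-file pattern without real data, the sketch below writes two small gzip files into a temporary directory and reads them back through a generator. The file names and contents are made up for illustration, and `gzip_lines` is a hypothetical variant of `gzip_iterator` that also strips trailing newlines and skips empty lines, which is an optional choice rather than something the approach requires.

```python
import gzip
import os
import tempfile
from typing import Iterable, Iterator


def gzip_lines(paths: Iterable[str]) -> Iterator[str]:
    """Yield non-empty lines from several gzip files, one file at a time."""
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    yield line


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    # Create two throwaway gzip files with illustrative content.
    for index, text in enumerate(["first line\nsecond line\n", "third line\n\n"]):
        path = os.path.join(tmp, f"sample.{index}.gz")
        with gzip.open(path, "wt", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)

    # Consume the generator while the temporary files still exist.
    lines = list(gzip_lines(paths))
```

In a real run, `gzip_lines(paths)` would be passed to `tokenizer.train_from_iterator(..., trainer=trainer)` in place of the `list(...)` call, so that only one file is open at any time.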

