# Load text data

## Load text data

This guide shows you how to load text datasets. To learn how to load any type of dataset, take a look at the [general loading guide](https://huggingface.co/docs/datasets/loading).&#x20;

Text files are one of the most common file types for storing a dataset. By default, 🌍 Datasets samples a text file line by line to build the dataset.

Copied

```
>>> from datasets import load_dataset
>>> dataset = load_dataset("text", data_files={"train": ["my_text_1.txt", "my_text_2.txt"], "test": "my_test_file.txt"})

# Load from a directory
>>> dataset = load_dataset("text", data_dir="path/to/text/dataset")
```

To sample a text file by paragraph or even an entire document, use the `sample_by` parameter:

Copied

```
# Sample by paragraph
>>> dataset = load_dataset("text", data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"}, sample_by="paragraph")

# Sample by document
>>> dataset = load_dataset("text", data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"}, sample_by="document")
```

You can also use grep patterns to load specific files:

Copied

```
>>> from datasets import load_dataset
>>> c4_subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")
```

To load remote text files via HTTP, pass the URLs instead:

Copied

```
>>> dataset = load_dataset("text", data_files="https://huggingface.co/datasets/lhoestq/test/resolve/main/some_text.txt")
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://boinc-ai.gitbook.io/datasets/how-to-guides/text/load-text-data.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
