Load tabular data

A tabular dataset is a generic dataset used to describe any data stored in rows and columns, where each row represents an example and each column represents a feature (which can be continuous or categorical). These datasets are commonly stored in CSV files, Pandas DataFrames, and database tables. This guide will show you how to load and create a tabular dataset from:

  • CSV files

  • Pandas DataFrames

  • Databases

CSV files

🌍 Datasets can read CSV files by specifying the generic csv dataset builder name in the load_dataset() method. To load more than one CSV file, pass them as a list to the data_files parameter:

>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="my_file.csv")

# load multiple CSV files
>>> dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv", "my_file_3.csv"])

You can also map specific CSV files to the train and test splits:

>>> dataset = load_dataset("csv", data_files={"train": ["my_train_file_1.csv", "my_train_file_2.csv"], "test": "my_test_file.csv"})

To load remote CSV files, pass the URLs instead:

>>> base_url = "https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/"
>>> dataset = load_dataset('csv', data_files={"train": base_url + "train.csv", "test": base_url + "test.csv"})

To load zipped CSV files:

>>> url = "https://domain.org/train_data.zip"
>>> data_files = {"train": url}
>>> dataset = load_dataset("csv", data_files=data_files)

Pandas DataFrames

🌍 Datasets also supports loading datasets from Pandas DataFrames with the from_pandas() method:

>>> from datasets import Dataset
>>> import pandas as pd

# create a Pandas DataFrame
>>> df = pd.read_csv("https://huggingface.co/datasets/imodels/credit-card/raw/main/train.csv")
# load Dataset from Pandas DataFrame
>>> dataset = Dataset.from_pandas(df)
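
The conversion also works in the other direction: to_pandas() turns a Dataset back into a DataFrame. A quick sketch with the dataset created above:

# convert the Dataset back into a Pandas DataFrame
>>> df_roundtrip = dataset.to_pandas()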

Use the split parameter to specify the name of the dataset split:

>>> train_ds = Dataset.from_pandas(train_df, split="train")
>>> test_ds = Dataset.from_pandas(test_df, split="test")

If the dataset doesn’t look as expected, you should explicitly specify your dataset features. A pandas.Series may not always carry enough information for Arrow to automatically infer a data type. For example, if a DataFrame is of length 0 or if the Series only contains None/NaN objects, the type is set to null.
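
As an illustration, here is a minimal sketch of passing explicit features to from_pandas(); the DataFrame and feature types here are hypothetical:

>>> from datasets import Dataset, Features, Value
>>> import pandas as pd

# an all-None column carries no type information, so Arrow would infer null for it
>>> df = pd.DataFrame({"text": ["a", "b"], "label": [None, None]})
# declare the feature types explicitly instead of relying on inference
>>> features = Features({"text": Value("string"), "label": Value("int64")})
>>> dataset = Dataset.from_pandas(df, features=features)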

Databases

Datasets stored in databases are typically accessed with SQL queries. With 🌍 Datasets, you can connect to a database, query for the data you need, and create a dataset out of it. Then you can use all the processing features of 🌍 Datasets to prepare your dataset for training.

SQLite

SQLite is a small, lightweight database that is fast and easy to set up. You can use an existing database if you’d like, or follow along and start from scratch.

Start by creating a quick SQLite database with this Covid-19 data from the New York Times:

>>> import sqlite3
>>> import pandas as pd

>>> conn = sqlite3.connect("us_covid_data.db")
>>> df = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
>>> df.to_sql("states", conn, if_exists="replace")

This creates a states table in the us_covid_data.db database, which you can now load into a dataset.
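
Before building the dataset, you can sanity-check the table with a quick query; a sketch reusing the conn object from above:

# count the rows that were written to the states table
>>> pd.read_sql("SELECT COUNT(*) FROM states", conn)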

To connect to the database, you’ll need the URI string that identifies your database. Connecting to a database with a URI caches the returned dataset. The URI string differs for each database dialect, so be sure to check the Database URLs documentation for whichever database you’re using. For SQLite, it is:

>>> uri = "sqlite:///us_covid_data.db"

Load the table by passing the table name and URI to from_sql():

>>> from datasets import Dataset

>>> ds = Dataset.from_sql("states", uri)
>>> ds
Dataset({
    features: ['index', 'date', 'state', 'fips', 'cases', 'deaths'],
    num_rows: 54382
})

Then you can use all of 🌍 Datasets’ processing features, like filter(), for example:

>>> ds.filter(lambda x: x["state"] == "California")

You can also load a dataset from a SQL query instead of an entire table, which is useful for querying and joining multiple tables.

Load the dataset by passing your query and URI to from_sql():

>>> from datasets import Dataset

>>> ds = Dataset.from_sql('SELECT * FROM states WHERE state="California";', uri)
>>> ds
Dataset({
    features: ['index', 'date', 'state', 'fips', 'cases', 'deaths'],
    num_rows: 1019
})

Then you can use all of 🌍 Datasets’ processing features, like filter(), for example:

>>> ds.filter(lambda x: x["cases"] > 10000)

PostgreSQL

You can also connect and load a dataset from a PostgreSQL database, but we won’t demonstrate it directly in the documentation because the example is only meant to be run in a notebook. Instead, take a look at how to install and set up a PostgreSQL server in this notebook!

After you’ve set up your PostgreSQL database, you can use the from_sql() method to load a dataset from a table or query.
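
For illustration only, a minimal sketch of what that can look like, assuming a hypothetical local server, credentials, database name, and table (a URI-based connection like this also requires the sqlalchemy package):

>>> from datasets import Dataset

# hypothetical connection URI; substitute your own user, password, host, port, and database
>>> uri = "postgresql://user:password@localhost:5432/mydb"
# load a table by name, or pass a SQL query string instead
>>> ds = Dataset.from_sql("states", uri)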
