Load tabular data

A tabular dataset is a generic dataset describing any data stored in rows and columns, where each row represents an example and each column represents a feature (which can be continuous or categorical). These datasets are commonly stored in CSV files, Pandas DataFrames, and database tables. This guide will show you how to load and create a tabular dataset from:

  • CSV files

  • Pandas DataFrames

  • Databases

CSV files

๐ŸŒ Datasets can read CSV files by specifying the generic csv dataset builder name in the load_dataset()arrow-up-right method. To load more than one CSV file, pass them as a list to the data_files parameter:


>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="my_file.csv")

# load multiple CSV files
>>> dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv", "my_file_3.csv"])

You can also map specific CSV files to the train and test splits:


>>> dataset = load_dataset("csv", data_files={"train": ["my_train_file_1.csv", "my_train_file_2.csv"], "test": "my_test_file.csv"})

To load remote CSV files, pass the URLs instead:


To load zipped CSV files:


Pandas DataFrames

๐ŸŒ Datasets also supports loading datasets from Pandas DataFramesarrow-up-right with the from_pandas()arrow-up-right method:


Use the split parameter to specify the name of the dataset split:


If the dataset doesn't look as expected, you should explicitly specify your dataset features. A pandas.Series may not always carry enough information for Arrow to automatically infer a data type. For example, if a DataFrame is of length 0 or if the Series only contains None/NaN objects, the type is set to null.

Databases

Datasets stored in databases are typically accessed with SQL queries. With 🤗 Datasets, you can connect to a database, query for the data you need, and create a dataset out of it. Then you can use all the processing features of 🤗 Datasets to prepare your dataset for training.

SQLite

SQLite is a small, lightweight database that is fast and easy to set up. You can use an existing database if you'd like, or follow along and start from scratch.

Start by creating a quick SQLite database with this Covid-19 data from the New York Times:

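A sketch using pandas and the standard-library sqlite3 module; it assumes network access to the NYT covid-19-data repository on GitHub:

```python
import sqlite3

import pandas as pd

# Download the NYT data and write it to a "states" table
conn = sqlite3.connect("us_covid_data.db")
df = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
df.to_sql("states", conn, if_exists="replace", index=False)
```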

This creates a states table in the us_covid_data.db database which you can now load into a dataset.

To connect to the database, you'll need the URI string that identifies your database. Connecting to a database with a URI caches the returned dataset. The URI string differs for each database dialect, so be sure to check the Database URLs for whichever database you're using.

For SQLite, it is:

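The sqlite dialect is followed by three slashes and a relative path to the database file:

```
sqlite:///us_covid_data.db
```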

Load the table by passing the table name and URI to from_sql():


Then you can use all of 🤗 Datasets' processing features, such as filter():


You can also load a dataset from a SQL query instead of an entire table, which is useful for querying and joining multiple tables.

Load the dataset by passing your query and URI to from_sql():


Then you can use all of 🤗 Datasets' processing features, such as filter():


PostgreSQL

You can also connect and load a dataset from a PostgreSQL database. However, we won't demonstrate it directly in the documentation because the example is only meant to be run in a notebook. Instead, take a look at how to install and set up a PostgreSQL server in this notebook!

After you've set up your PostgreSQL database, you can use the from_sql() method to load a dataset from a table or query.
