Load tabular data
Last updated
A tabular dataset is a generic dataset used to describe any data stored in rows and columns, where each row represents an example and each column represents a feature (continuous or categorical). These datasets are commonly stored in CSV files, Pandas DataFrames, and database tables. This guide shows you how to load and create a tabular dataset from:
CSV files
Pandas DataFrames
Databases
🤗 Datasets can read CSV files when you specify the generic csv dataset builder name in the load_dataset() method. To load more than one CSV file, pass them as a list to the data_files parameter:
You can also map specific CSV files to the train and test splits:
To load remote CSV files, pass their URLs to the data_files parameter instead of local paths.
To load zipped CSV files:
When you create a dataset from a Pandas DataFrame with Dataset.from_pandas(), use the split parameter to specify the name of the dataset split:
Datasets stored in databases are typically accessed with SQL queries. With 🤗 Datasets, you can connect to a database, query for the data you need, and create a dataset out of it. Then you can use all the processing features of 🤗 Datasets to prepare your dataset for training.
SQLite is a small, lightweight database that is fast and easy to set up. You can use an existing database if you'd like, or follow along and start from scratch.
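The SQLite example in this guide uses a states table in the us_covid_data.db database, which you can load into a dataset.
For SQLite, the database URI is sqlite:/// followed by the path to the database file, for example sqlite:///us_covid_data.db.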
You can also load a dataset from a SQL query instead of an entire table, which is useful for querying and joining multiple tables.
🤗 Datasets also supports loading datasets from Pandas DataFrames with the Dataset.from_pandas() method:
If the dataset doesn't look as expected, you should explicitly specify your dataset features. A pandas.Series may not always carry enough information for Arrow to automatically infer a data type. For example, if a DataFrame is of length 0 or if the Series only contains None/NaN objects, the type is set to null.
Start by creating a quick SQLite database from the Covid-19 data published by the New York Times.
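As an offline stand-in for that step, the sketch below builds the states table from a few made-up rows rather than downloading the real data. The column names and values are assumptions for illustration, and the database is written to a temporary directory.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Made-up rows standing in for the real Covid-19 data.
df = pd.DataFrame(
    {
        "date": ["2020-04-01", "2020-05-01", "2020-04-01"],
        "state": ["California", "California", "Washington"],
        "fips": [6, 6, 53],
        "cases": [5000, 15000, 12000],
        "deaths": [100, 300, 250],
    }
)

db_path = os.path.join(tempfile.mkdtemp(), "us_covid_data.db")
conn = sqlite3.connect(db_path)
# Write the DataFrame to a table named "states", replacing it if it exists.
df.to_sql("states", conn, if_exists="replace", index=False)
conn.close()
```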
To connect to the database, you'll need the URI string that identifies your database. Connecting to a database with a URI caches the returned dataset. The URI string differs for each database dialect, so be sure to check the URI format for whichever database you're using.
Load the table by passing the table name and URI to Dataset.from_sql(). You can then use any of the 🤗 Datasets processing features, such as filter().
Load the dataset by passing your query and URI to Dataset.from_sql(). You can then apply processing features such as filter() to the result.
You can also connect to and load a dataset from a PostgreSQL database. This guide doesn't demonstrate it directly because the example is only meant to be run in a notebook. Instead, take a look at how to install and set up a PostgreSQL server in this notebook.
After you've set up your PostgreSQL database, you can use the Dataset.from_sql() method to load a dataset from a table or query.