Polars is a fast DataFrame library written in Rust with Arrow as its foundation.
💡 Learn more about how to get the dataset URLs in the guide.
Let's start by grabbing the URLs to the train split of the blog_authorship_corpus dataset from Datasets Server:
import requests

r = requests.get("https://datasets-server.boincai.com/parquet?dataset=blog_authorship_corpus")
j = r.json()
urls = [f['url'] for f in j['parquet_files'] if f['split'] == 'train']
urls
['https://boincai.com/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet',
'https://boincai.com/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet']
To read from a single Parquet file, use the read_parquet function to read it into a DataFrame and then execute your query:
To read multiple Parquet files (for example, if the dataset is sharded), you'll need to use the concat function to concatenate the files into a single DataFrame:
Polars offers a lazy API that is more performant and memory-efficient for large Parquet files. The LazyFrame API keeps track of what you want to do, and it'll only execute the entire query when you're ready. This way, the lazy API doesn't load everything into RAM beforehand, and it allows you to work with datasets larger than your available RAM.
To lazily read a Parquet file, use the scan_parquet function instead. Then, execute the entire query with the collect function: