Polars

Polars is a fast DataFrame library written in Rust with Arrow as its foundation.

💡 Learn more about how to get the dataset URLs in the List Parquet files guide.

Let's start by grabbing the URLs to the train split of the blog_authorship_corpus dataset from Datasets Server:

import requests

# Ask the /parquet endpoint for the list of auto-converted Parquet files
r = requests.get("https://datasets-server.boincai.com/parquet?dataset=blog_authorship_corpus")
j = r.json()
# Keep only the files belonging to the train split
urls = [f['url'] for f in j['parquet_files'] if f['split'] == 'train']
urls
['https://boincai.com/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet',
 'https://boincai.com/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0001.parquet']
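
A dataset can also have several configurations, in which case you may want to filter on the configuration as well as the split. A minimal sketch, assuming each entry in parquet_files carries a 'config' field alongside 'split' and 'url':

# Sketch: keep only the train-split files of a single configuration
# (assumes a 'config' field in each entry, alongside 'split' and 'url')
urls = [
    f['url']
    for f in j['parquet_files']
    if f['config'] == 'blog_authorship_corpus' and f['split'] == 'train'
]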

To read from a single Parquet file, use the read_parquet function to read it into a DataFrame and then execute your query:

import polars as pl

df = (
    pl.read_parquet("https://boincai.com/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet")
    .groupby("horoscope")
    .agg(
        [
            pl.count(),
            pl.col("text").str.n_chars().mean().alias("avg_blog_length")
        ]
    )
    .sort("avg_blog_length", descending=True)
    .limit(5)
)
print(df)
shape: (5, 3)
┌───────────┬───────┬─────────────────┐
│ horoscope ┆ count ┆ avg_blog_length │
│ ---       ┆ ---   ┆ ---             │
│ str       ┆ u32   ┆ f64             │
╞═══════════╪═══════╪═════════════════╡
│ Aquarius  ┆ 34062 ┆ 1129.218836     │
│ Cancer    ┆ 41509 ┆ 1098.366812     │
│ Capricorn ┆ 33961 ┆ 1073.2002       │
│ Libra     ┆ 40302 ┆ 1072.071833     │
│ Leo       ┆ 40587 ┆ 1064.053687     │
└───────────┴───────┴─────────────────┘
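
Note that recent Polars releases renamed several of the APIs used above: groupby is now group_by, pl.count() is pl.len(), and str.n_chars() is str.len_chars(). If the snippet errors on a newer version, here is an equivalent sketch using the newer names:

import polars as pl

# Equivalent query for newer Polars versions; the alias keeps the
# output column named "count" as in the table above
df = (
    pl.read_parquet("https://boincai.com/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet")
    .group_by("horoscope")
    .agg(
        pl.len().alias("count"),
        pl.col("text").str.len_chars().mean().alias("avg_blog_length"),
    )
    .sort("avg_blog_length", descending=True)
    .limit(5)
)
print(df)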

To read multiple Parquet files - for example, if the dataset is sharded - you'll need to use the concat function to concatenate the files into a single DataFrame:

import polars as pl
df = (
    pl.concat([pl.read_parquet(url) for url in urls])
    .groupby("horoscope")
    .agg(
        [
            pl.count(),
            pl.col("text").str.n_chars().mean().alias("avg_blog_length")
        ]
    )
    .sort("avg_blog_length", descending=True)
    .limit(5)
)
print(df)
shape: (5, 3)
┌─────────────┬───────┬─────────────────┐
│ horoscope   ┆ count ┆ avg_blog_length │
│ ---         ┆ ---   ┆ ---             │
│ str         ┆ u32   ┆ f64             │
╞═════════════╪═══════╪═════════════════╡
│ Aquarius    ┆ 49568 ┆ 1125.830677     │
│ Cancer      ┆ 63512 ┆ 1097.956087     │
│ Libra       ┆ 60304 ┆ 1060.611054     │
│ Capricorn   ┆ 49402 ┆ 1059.555261     │
│ Sagittarius ┆ 50431 ┆ 1057.458984     │
└─────────────┴───────┴─────────────────┘
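
If reading over HTTPS directly is not an option in your environment (for example, when you need custom request headers), a minimal sketch that downloads each shard with requests first and then hands the bytes to Polars, reusing the urls list from above:

import io

import polars as pl
import requests

# Sketch: fetch each shard over HTTP ourselves, then parse the raw bytes
dfs = []
for url in urls:
    resp = requests.get(url)
    resp.raise_for_status()
    dfs.append(pl.read_parquet(io.BytesIO(resp.content)))
df = pl.concat(dfs)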

Lazy API

Polars offers a lazy API that is more performant and memory-efficient for large Parquet files. The LazyFrame API keeps track of what you want to do, and it'll only execute the entire query when you're ready. This way, the lazy API doesn't load everything into RAM beforehand, and it allows you to work with datasets larger than your available RAM.

To lazily read a Parquet file, use the scan_parquet function instead. Then, execute the entire query with the collect function:

import polars as pl

q = (
    pl.scan_parquet("https://boincai.com/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet")
    .groupby("horoscope")
    .agg(
        [
            pl.count(),
            pl.col("text").str.n_chars().mean().alias("avg_blog_length")
        ]
    )
    .sort("avg_blog_length", descending=True)
    .limit(5)
)
df = q.collect()
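
The lazy API also composes with sharded datasets: pl.concat accepts LazyFrames too, so you can scan every shard lazily and only materialize the final aggregation. A sketch, reusing the urls list from above:

import polars as pl

# Sketch: scan all train shards lazily; nothing is read until .collect()
q = (
    pl.concat([pl.scan_parquet(url) for url in urls])
    .groupby("horoscope")
    .agg(
        [
            pl.count(),
            pl.col("text").str.n_chars().mean().alias("avg_blog_length")
        ]
    )
    .sort("avg_blog_length", descending=True)
    .limit(5)
)
df = q.collect()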
