Datasets-server
  • 🌍GET STARTED
    • BOINC AI Datasets server
    • Quickstart
    • Analyze a dataset on the Hub
  • 🌍GUIDES
    • Check dataset validity
    • List splits and configurations
    • Get dataset information
    • Preview a dataset
    • Download slices of rows
    • Search text in a dataset
    • Filter rows in a dataset
    • List Parquet files
    • Get the number of rows and the bytes size
    • Explore dataset statistics
    • 🌍QUERY DATASETS FROM DATASETS SERVER
      • Overview
      • ClickHouse
      • DuckDB
      • Pandas
      • Polars
  • 🌍CONCEPTUAL GUIDES
    • Splits and configurations
    • Data types
    • Server infrastructure
Powered by GitBook
On this page
  1. GUIDES
  2. QUERY DATASETS FROM DATASETS SERVER

Pandas

PreviousDuckDBNextPolars

Last updated 1 year ago

Pandas

is a popular DataFrame library for data analysis.

To read from a single Parquet file, use the function to read it into a DataFrame:

Copied

import pandas as pd

df = (
    pd.read_parquet("https://boincai.com/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/train/0000.parquet")
    .groupby('horoscope')['text']
    .apply(lambda x: x.str.len().mean())
    .sort_values(ascending=False)
    .head(5)
)

To read multiple Parquet files - for example, if the dataset is sharded - you’ll need to use the function to concatenate the files into a single DataFrame:

Copied

df = (
      pd.concat([pd.read_parquet(url) for url in urls])
      .groupby('horoscope')['text']
      .apply(lambda x: x.str.len().mean())
      .sort_values(ascending=False)
      .head(5)
)
🌍
🌍
Pandas
read_parquet
concat