is a database that supports reading and querying Parquet files really fast. Begin by creating a connection to DuckDB, and then install and load the extension to read and write remote files:
Now you can write and execute your SQL query on the Parquet file:
PythonJavaScriptCopied
con.sql(f"SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM '{url}' GROUP BY horoscope ORDER BY avg_blog_length DESC LIMIT(5)")
┌───────────┬──────────────┬────────────────────┐
│ horoscope │ count_star() │ avg_blog_length │
│ varchar │ int64 │ double │
├───────────┼──────────────┼────────────────────┤
│ Aquarius │ 34062 │ 1129.218836239798 │
│ Cancer │ 41509 │ 1098.366812016671 │
│ Capricorn │ 33961 │ 1073.2002002296751 │
│ Libra │ 40302 │ 1072.0718326633914 │
│ Leo │ 40587 │ 1064.0536871412028 │
└───────────┴──────────────┴────────────────────┘
To query multiple files - for example, if the dataset is sharded:
PythonJavaScriptCopied
con.sql(f"SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM read_parquet({urls[:2]}) GROUP BY horoscope ORDER BY avg_blog_length DESC LIMIT(5)")
┌─────────────┬──────────────┬────────────────────┐
│ horoscope │ count_star() │ avg_blog_length │
│ varchar │ int64 │ double │
├─────────────┼──────────────┼────────────────────┤
│ Aquarius │ 49568 │ 1125.8306770497095 │
│ Cancer │ 63512 │ 1097.95608703867 │
│ Libra │ 60304 │ 1060.6110539931017 │
│ Capricorn │ 49402 │ 1059.5552609206104 │
│ Sagittarius │ 50431 │ 1057.4589835616982 │
└─────────────┴──────────────┴────────────────────┘
, a package powered by , is also available for running DuckDB in any browser. This could be useful, for instance, if you want to create a web app to query Parquet files from the browser!