ClickHouse
ClickHouse
ClickHouse is a fast and efficient column-oriented database for analytical workloads, making it easy to analyze Hub-hosted datasets with SQL. To get started quickly, use clickhouse-local to run SQL queries from the command line and avoid the need to fully install ClickHouse.
Check this blog for more details about how to analyze datasets on the Hub with ClickHouse.
To start, download and install clickhouse-local:
Copied
curl https://clickhouse.com/ | shFor this example, youβll analyze the maharshipandya/spotify-tracks-dataset which contains information about Spotify tracks. Datasets on the Hub are stored as Parquet files and you can access it with the /parquet endpoint:
Copied
import requests
r = requests.get("https://datasets-server.boincai.com/parquet?dataset=maharshipandya/spotify-tracks-dataset")
j = r.json()
url = [f['url'] for f in j['parquet_files']]
url
['https://boincai.com/datasets/maharshipandya/spotify-tracks-dataset/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet']Aggregate functions
Now you can begin to analyze the dataset. Use the -q argument to specify the query to execute, and the url function to create a table from the data in the Parquet file.
You should set enable_url_encoding to 0 to ensure the escape characters in the URL are preserved as intended, and max_https_get_redirects to 1 to redirect to the path of the Parquet file.
Letβs start by identifying the most popular artists:
Copied
ClickHouse also provides functions for visualizing your queries. For example, you can use the bar function to create a bar chart of the danceability of songs:
Copied
To get a deeper understanding about a dataset, ClickHouse provides statistical analysis functions for determining how your data is correlated, calculating statistical hypothesis tests, and more. Take a look at ClickHouseβs List of Aggregate Functions for a complete list of available aggregate functions.
User-defined function (UDFs)
A user-defined function (UDF) allows you to reuse custom logic. Many Hub datasets are often sharded into more than one Parquet file, so it can be easier and more efficient to create a UDF to list and query all the Parquet files of a given dataset from just the dataset name.
For this example, youβll need to run clickhouse-local in console mode so the UDF persists between queries:
Copied
Remember to set enable_url_encoding to 0 and max_https_get_redirects to 1 to redirect to the path of the Parquet files:
Copied
Letβs create a function to return a list of Parquet files from the blog_authorship_corpus:
Copied
You can make this even easier by creating another function that calls boincai_paths and outputs all the files based on the dataset name:
Copied
Now use the hf function to query any dataset by passing the dataset name:
Copied
Last updated