Beam Datasets
Some datasets are too large to be processed on a single machine. Instead, you can process them with Apache Beam, a library for parallel data processing. The processing pipeline is executed on a distributed processing backend such as Apache Flink, Apache Spark, or Google Cloud Dataflow.
We have already created Beam pipelines for some of the larger datasets, such as wikipedia and wiki40b. You can load these normally with load_dataset(). If you want to run your own Beam pipeline with Dataflow, here is how:
Specify the dataset and configuration you want to process:
DATASET_NAME=your_dataset_name # ex: wikipedia
CONFIG_NAME=your_config_name # ex: 20220301.en
Input your Google Cloud Platform information:
PROJECT=your_project
BUCKET=your_bucket
REGION=your_region
Specify your Python requirements:
echo "datasets" > /tmp/beam_requirements.txt
echo "apache_beam" >> /tmp/beam_requirements.txt
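Before launching a remote job, it can help to confirm that every variable from the steps above is set and that the requirements file is not empty. The following is a minimal sketch, not part of datasets-cli: the function name check_beam_env and its optional path argument are hypothetical and exist only for illustration.

```shell
# Hypothetical preflight helper (not provided by datasets-cli).
# $1: optional path to the requirements file (defaults to the one written above).
check_beam_env() {
  req_file=${1:-/tmp/beam_requirements.txt}
  missing=0
  for name in DATASET_NAME CONFIG_NAME PROJECT BUCKET REGION; do
    # Read the variable whose name is stored in $name; empty if unset.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  # -s is true only for a file that exists and is non-empty.
  if [ ! -s "$req_file" ]; then
    echo "missing or empty: $req_file" >&2
    missing=1
  fi
  return "$missing"
}
```

Calling check_beam_env before the run step returns a non-zero status and names each missing item on standard error, so a typo in a variable name is caught before any cloud resources are requested.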
Run the pipeline:
datasets-cli run_beam datasets/$DATASET_NAME \
--name $CONFIG_NAME \
--save_info \
--cache_dir gs://$BUCKET/cache/datasets \
--beam_pipeline_options=\
"runner=DataflowRunner,project=$PROJECT,job_name=$DATASET_NAME-gen,"\
"staging_location=gs://$BUCKET/binaries,temp_location=gs://$BUCKET/temp,"\
"region=$REGION,requirements_file=/tmp/beam_requirements.txt"
When you run your pipeline, you can adjust the pipeline options to change the runner (for example, Flink or Spark), the output location (for example, an S3 bucket or HDFS), and the number of workers.
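As one illustration of adjusting the number of workers, the option string can be assembled in a small helper that appends a worker cap on request. This is a sketch under two assumptions: that each comma-separated entry in --beam_pipeline_options is forwarded to Beam as a pipeline option, and that Beam's max_num_workers option is the setting you want for Dataflow. The function name build_beam_options is hypothetical.

```shell
# Hypothetical helper: print the comma-separated Beam option string
# used in the run command above.
# $1: optional maximum number of Dataflow workers (assumed Beam option: max_num_workers).
build_beam_options() {
  max_workers=${1:-}
  opts="runner=DataflowRunner,project=$PROJECT,job_name=$DATASET_NAME-gen"
  opts="$opts,staging_location=gs://$BUCKET/binaries,temp_location=gs://$BUCKET/temp"
  opts="$opts,region=$REGION,requirements_file=/tmp/beam_requirements.txt"
  # Append the worker cap only when a value was passed.
  if [ -n "$max_workers" ]; then
    opts="$opts,max_num_workers=$max_workers"
  fi
  printf '%s\n' "$opts"
}
```

With the variables from the earlier steps set, the result could then be passed as --beam_pipeline_options="$(build_beam_options 8)" in place of the multi-line string. Options for other runners or storage backends differ and are not covered by this sketch.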