Beam Datasets
Some datasets are too large to be processed on a single machine. Instead, you can process them with Apache Beam, a library for parallel data processing. The processing pipeline is executed on a distributed processing backend such as Apache Flink, Apache Spark, or Google Cloud Dataflow.
We have already created Beam pipelines for some of the larger datasets like `wikipedia` and `wiki40b`. You can load these normally with `load_dataset()`. But if you want to run your own Beam pipeline with Dataflow, here is how:
Specify the dataset and configuration you want to process:
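For example, as shell variables (the `wikipedia` dataset with its `20220301.en` configuration is just an illustrative choice):

```bash
# Dataset script to process and the configuration to build.
# Values are placeholders; substitute your own dataset and config.
DATASET_NAME=wikipedia
CONFIG_NAME=20220301.en
```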
Input your Google Cloud Platform information:
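A sketch with placeholder values; the pipeline needs a GCP project ID, a Cloud Storage bucket for staging and output, and a compute region:

```bash
# GCP settings used by the Dataflow job below.
PROJECT=your-gcp-project
BUCKET=your-gcs-bucket
REGION=us-central1
```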
Specify your Python requirements:
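Dataflow workers install their Python dependencies from a requirements file. A minimal sketch, assuming the file lives at `/tmp/beam_requirements.txt`:

```bash
# Packages the Dataflow workers need in order to build the dataset.
echo "datasets" > /tmp/beam_requirements.txt
echo "apache_beam" >> /tmp/beam_requirements.txt
```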
Run the pipeline:
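A sketch of the full invocation with the `datasets-cli run_beam` command, reusing the variables from the steps above; the job name and cache path are illustrative:

```bash
# Run the Beam pipeline on Dataflow, caching results in the GCS bucket.
datasets-cli run_beam datasets/$DATASET_NAME \
    --name $CONFIG_NAME \
    --save_info \
    --cache_dir gs://$BUCKET/cache/datasets \
    --beam_pipeline_options=\
"runner=DataflowRunner,project=$PROJECT,job_name=beam-job,"\
"staging_location=gs://$BUCKET/binaries,temp_location=gs://$BUCKET/temp,"\
"requirements_file=/tmp/beam_requirements.txt,region=$REGION"
```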
When you run your pipeline, you can adjust the parameters to change the runner (Flink or Spark), output location (S3 bucket or HDFS), and the number of workers.
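For instance, here is a hypothetical variation that targets Spark and writes the cache to HDFS instead of GCS; the host and path are placeholders, and the exact pipeline options available depend on the runner you choose (see the Apache Beam documentation):

```bash
# Same pipeline, different backend and output location (illustrative).
datasets-cli run_beam datasets/$DATASET_NAME \
    --name $CONFIG_NAME \
    --save_info \
    --cache_dir hdfs://namenode:9000/cache/datasets \
    --beam_pipeline_options="runner=SparkRunner"
```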