Beam Datasets


Some datasets are too large to be processed on a single machine. Instead, you can process them with Apache Beam, a library for parallel data processing. The processing pipeline is executed on a distributed processing backend such as Apache Flink, Apache Spark, or Google Cloud Dataflow.

We have already created Beam pipelines for some of the larger datasets like wikipedia and wiki40b. You can load these normally with load_dataset() (see the loading sketch after the steps below). But if you want to run your own Beam pipeline with Dataflow, here is how:

  1. Specify the dataset and configuration you want to process:

DATASET_NAME=your_dataset_name  # ex: wikipedia
CONFIG_NAME=your_config_name    # ex: 20220301.en

  2. Input your Google Cloud Platform information:

PROJECT=your_project
BUCKET=your_bucket
REGION=your_region

  3. Specify your Python requirements:

echo "datasets" > /tmp/beam_requirements.txt
echo "apache_beam" >> /tmp/beam_requirements.txt

  4. Run the pipeline:

datasets-cli run_beam datasets/$DATASET_NAME \
--name $CONFIG_NAME \
--save_info \
--cache_dir gs://$BUCKET/cache/datasets \
--beam_pipeline_options=\
"runner=DataflowRunner,project=$PROJECT,job_name=$DATASET_NAME-gen,"\
"staging_location=gs://$BUCKET/binaries,temp_location=gs://$BUCKET/temp,"\
"region=$REGION,requirements_file=/tmp/beam_requirements.txt"

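Datasets that already have a Beam-prepared version, such as the wikipedia example above, load like any other dataset, and your own pipeline output can be loaded the same way by pointing cache_dir at the location the pipeline wrote to. A minimal loading sketch, assuming the 20220301.en configuration used in the example and a release of Datasets that still ships the Beam-based builders:

from datasets import load_dataset

# The Beam pipeline has already done the heavy preprocessing, so this call
# only downloads and reads the prepared files.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
print(wiki[0]["title"])
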
When you run your pipeline, you can adjust the parameters to change the runner (Flink or Spark), output location (S3 bucket or HDFS), and the number of workers.
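
If you just want to try a pipeline locally on a small configuration, you can also let load_dataset() run it with Beam's local DirectRunner instead of a distributed backend. A minimal sketch, assuming a release of Datasets whose load_dataset() still accepts the beam_runner argument, and using 20220301.frr as an assumed example of a small configuration:

from datasets import load_dataset

# DirectRunner executes the whole Beam pipeline on the local machine, which is
# only practical for small configurations (you may run out of memory otherwise).
ds = load_dataset("wikipedia", "20220301.frr", beam_runner="DirectRunner")

# Some releases also accept beam_options= with an apache_beam PipelineOptions
# object for finer control over the runner; check the reference for your version.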
