
GCP Dataflow API Example in Python

The Apache Beam SDK is an open source programming model for data pipelines. In Google Cloud, you can define a pipeline with an Apache Beam program and then use Dataflow to run your pipeline.
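As a rough illustration of what "defining a pipeline with an Apache Beam program" looks like, here is a minimal sketch of a word-counting pipeline. The helper names split_words, format_count, and run are hypothetical and not taken from the SDK's own example. Calling run requires the apache-beam package to be installed, and with no runner specified Beam executes the pipeline locally.

```python
import re


def split_words(line):
    """Return the words found in one line of text."""
    return re.findall(r"[\w']+", line)


def format_count(word, count):
    """Render one (word, count) pair as a line of text."""
    return f"{word}: {count}"


def run(lines):
    """Count the words in `lines` and print one result per word."""
    # Imported here so the helpers above stay usable without Beam installed.
    import apache_beam as beam

    # No runner is specified, so Beam uses its local default runner.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(lines)
            | "Split" >> beam.FlatMap(split_words)
            | "Count" >> beam.combiners.Count.PerElement()
            | "Format" >> beam.MapTuple(format_count)
            | "Print" >> beam.Map(print)
        )
```

Calling run(["to be or not to be"]) prints one line per distinct word. The order of the printed lines is not guaranteed.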

Run the Python 3.9 Docker image in the Cloud Shell terminal:

docker run -it -e DEVSHELL_PROJECT_ID=$DEVSHELL_PROJECT_ID python:3.9 /bin/bash

This command pulls the python:3.9 Docker image, which tracks the latest Python 3.9 patch release, and opens a shell inside a container. Run the remaining commands in that shell.

Now install the Apache Beam SDK:

pip install 'apache-beam[gcp]==2.42.0'

To test the installation, run the wordcount example locally:

python -m apache_beam.examples.wordcount --output OUTPUT_FILE
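To inspect the result of the local run, a small helper along these lines can load the counts back into Python. This is a sketch under two assumptions that you should check against your own output: that the runner writes one or more shard files whose names start with the --output value followed by a hyphen (for example OUTPUT_FILE-00000-of-00001), and that each line has the form "word: count". The function name read_counts is hypothetical.

```python
import glob


def read_counts(output_prefix):
    """Parse 'word: count' lines from every shard that matches the prefix."""
    counts = {}
    # Assumed shard naming: <prefix>-00000-of-00001, <prefix>-00001-of-..., etc.
    pattern = glob.escape(output_prefix) + "-*"
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as shard:
            for line in shard:
                # Split on the last ': ' so the count is always the final field.
                word, sep, count = line.rstrip("\n").rpartition(": ")
                if sep:
                    counts[word] = int(count)
    return counts
```

For example, read_counts("OUTPUT_FILE") returns a dictionary that maps each word to its count, or an empty dictionary if no shard files match.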

Run a Dataflow pipeline

Create a Cloud Storage bucket and set an environment variable to its name:

BUCKET=gs://<bucket name provided earlier>

Now run the wordcount example remotely on Dataflow:

python -m apache_beam.examples.wordcount --project $DEVSHELL_PROJECT_ID \
  --runner DataflowRunner \
  --staging_location $BUCKET/staging \
  --temp_location $BUCKET/temp \
  --output $BUCKET/results/output \
  --region "filled in at lab start"
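If you launch the job from a Python script instead of typing the command, the same flags can be assembled programmatically. The sketch below is illustrative only: dataflow_flags is a hypothetical helper, and the project, bucket, and region values in the usage note are placeholders. It derives the staging, temp, and output locations from the bucket in the same way as the command above.

```python
def dataflow_flags(project, bucket, region):
    """Build the command-line flags for a remote wordcount run."""
    if not bucket.startswith("gs://"):
        raise ValueError("bucket must be a Cloud Storage URL such as gs://my-bucket")
    bucket = bucket.rstrip("/")  # avoid double slashes in the derived paths
    return [
        "--project", project,
        "--runner", "DataflowRunner",
        "--staging_location", f"{bucket}/staging",
        "--temp_location", f"{bucket}/temp",
        "--output", f"{bucket}/results/output",
        "--region", region,
    ]
```

For example, dataflow_flags("my-project", "gs://my-bucket", "my-region") returns a list that can be appended to ["python", "-m", "apache_beam.examples.wordcount"] and passed to subprocess.run.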

