GCP Dataflow API Example in Python
Apache Beam is an open source, unified programming model for data pipelines. In Google Cloud, you define a pipeline with an Apache Beam program and then use Dataflow to run it.
Run the Python 3.9 Docker image in Cloud Shell:
docker run -it -e DEVSHELL_PROJECT_ID=$DEVSHELL_PROJECT_ID python:3.9 /bin/bash
This command pulls a Docker image with the latest patch release of Python 3.9 and opens a shell so you can run the following commands inside the container.
Now install the Apache Beam SDK with the GCP extras:
pip install 'apache-beam[gcp]'==2.42.0
To test it, run the wordcount example locally:
python -m apache_beam.examples.wordcount --output OUTPUT_FILE
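Conceptually, wordcount tokenizes its input and tallies how often each word appears. This plain-Python sketch (the tokenizing regex is an assumption for illustration, not the example's exact one) shows the equivalent computation, minus the distributed execution:

```python
import re
from collections import Counter

def count_words(text: str) -> Counter:
    """Lower-case the text, extract word tokens, and tally them --
    roughly what the Beam wordcount pipeline computes."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

counts = count_words("To be, or not to be: that is the question.")
print(counts.most_common(3))  # -> [('to', 2), ('be', 2), ('or', 1)]
```

Beam expresses the same logic as parallel transforms over a PCollection, which is what lets Dataflow scale it across workers.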
Run a Dataflow pipeline
Create a bucket, and set an environment variable with its name:
BUCKET=gs://<bucket name provided earlier>
Let's run the wordcount.py example remotely:
python -m apache_beam.examples.wordcount --project $DEVSHELL_PROJECT_ID \
  --runner DataflowRunner \
  --staging_location $BUCKET/staging \
  --temp_location $BUCKET/temp \
  --output $BUCKET/results/output \
  --region "filled in at lab start"