Workflow Orchestration with Airflow
Data orchestration is the coordination and automation of data flow across various tools and systems to deliver quality data products and analytics.
ETL: Extract, Transform and Load
Airflow is an open-source tool for programmatically authoring, scheduling, and monitoring your data pipelines.
Airflow has become the de facto choice for orchestration for the following reasons:
- Airflow uses Python to create pipelines.
- Airflow is a community-driven open source project that improves every month.
- Airflow has good observability through an easy-to-use user interface.
- Airflow allows you to create data-aware scheduling, so DAGs can run when the data they depend on is updated.
- Airflow is extremely extensible with third-party integrations
There are four cases where data is business critical and where Airflow can help:
- data-powered applications: programs or systems that rely on data to function and provide value
- critical operational processes: the essential workflows that are crucial for a business to function
- analytics and reporting: systematic analysis of data to derive insights
- MLOps and AI: MLOps involves the deployment and management of machine learning models within operational workflows
Airflow is not a streaming solution; it can be combined with Kafka to provide this capability.
How does Airflow work?
Airflow is based on the DAG (Directed Acyclic Graph). A DAG is a single data pipeline and is made up of tasks. A task is a single unit of work in a DAG and can be represented as a node in the graph. An operator defines the work that a task does (see the sketch after the operator categories below).
Operators can be divided into three main categories:
- Action operators: operators that execute something, such as a Bash command or a Python function
- Transfer operators: move data from one system to another
- Sensor operators: wait for an event or condition to occur before downstream tasks run
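The sketch below is a minimal DAG, assuming a recent Airflow 2.x installation; the file name, file path, and task logic are illustrative. It shows a sensor operator and two action operators wired together as nodes of the graph.

```python
# example_etl.py - place this file in the DAG folder (illustrative names and paths).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def _transform():
    # Placeholder transform step; a real pipeline would process the extracted data here.
    print("transforming data")


with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Sensor operator: waits for a file to appear (uses the default fs_default connection).
    wait_for_file = FileSensor(task_id="wait_for_file", filepath="/tmp/input.csv", poke_interval=30)

    # Action operator: runs a shell command.
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting data'")

    # Action operator: runs a Python callable.
    transform = PythonOperator(task_id="transform", python_callable=_transform)

    # Dependencies are the edges of the graph: sensor, then extract, then transform.
    wait_for_file >> extract >> transform
```

Each task_id becomes a node in the UI's graph view, and the `>>` operator defines the edges between them.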
Core Components
- API Server - FastAPI server serving the UI and handling task execution requests.
- Scheduler - Schedules tasks when dependencies are fulfilled.
- DAG File Processor - Dedicated process for parsing DAGs.
- Metadata Database - A database where all Airflow metadata (DAGs, runs, task instances) is stored.
- Executor - Defines how and where tasks are executed; it does not execute the tasks itself (see the configuration sketch after this list).
- Queue - Holds tasks waiting to run and defines the order in which they are executed.
- Worker - Process that executes the tasks, as defined by the executor.
- Triggerer - Process running asyncio to support deferrable operators.
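The executor, the DAG folder, and most of the components above are wired together through Airflow's configuration rather than through pipeline code. A minimal sketch, assuming a local Airflow 2.x installation, that simply reads two of those settings:

```python
# Read Airflow settings to see how the components are configured
# (values can also be set via environment variables such as AIRFLOW__CORE__EXECUTOR).
from airflow.configuration import conf

# Which executor the scheduler hands task instances to (e.g. LocalExecutor, CeleryExecutor).
print(conf.get("core", "executor"))

# The DAG folder that the DAG File Processor scans for pipeline files.
print(conf.get("core", "dags_folder"))
```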
Lifecycle of a DAG
- You create the DAG and add it to the DAG folder
- Every 5 minutes by default, the DAG File Processor scans the DAG folder, then parses the DAGs and serializes them into the metadata database (the sketch after this list mimics that parsing step)
- The Scheduler reads from the metadata database to check for new workflows to run
- The scheduler creates and schedules new Task Instances and passes them to the Executor
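To see the parsing step in isolation, the sketch below, assuming a local Airflow 2.x installation, loads the DAG folder the same way the DAG File Processor does and reports what it found; any import errors listed here are what would keep a DAG from reaching the metadata database.

```python
# Parse the configured DAG folder, as the DAG File Processor does on its schedule.
from airflow.models import DagBag

dag_bag = DagBag()  # reads the DAG folder location from the Airflow configuration

# DAG ids that parsed successfully, and import errors keyed by file path.
print(sorted(dag_bag.dags))
print(dag_bag.import_errors)
```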
#data-engineering #study-plan #career-development #zoomcamp #astro #etl