What is it?

Workflow Orchestration with Airflow

Data orchestration is the coordination and automation of data flow across various tools and systems to deliver quality data products and analytics

ETL: Extract, Transform and Load

Airflow is an open-source tool for programmatically authoring, scheduling, and monitoring your data pipelines

Airflow has become the de facto choice for orchestration for the following reasons:

There are four cases where data is business-critical and where Airflow can help:

Airflow is not a streaming solution; it can be combined with Kafka to provide this capability.

How does Airflow work?

Airflow is built around the DAG (Directed Acyclic Graph). A DAG represents a single data pipeline and is made up of tasks. A task is a single unit of work in a DAG and can be represented as a node in the graph. An operator defines the work that a task does.
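
As a minimal sketch, assuming Airflow 2.x: the DAG below is one pipeline, each BashOperator defines a task, and the `>>` arrows are the edges of the graph (the dag_id, schedule, and commands are made up for illustration).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# One DAG = one pipeline; each operator below instantiates one task (node).
with DAG(
    dag_id="example_etl",            # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # >> defines the edges of the graph: extract, then transform, then load.
    extract >> transform >> load
```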

Operators can be divided into three main categories (a sketch follows the list):

  1. Action operators: any operator that executes something, such as a Bash command or a Python function
  2. Transfer operators: move data between two systems
  3. Sensor operators: wait for an event before executing the next task
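
A sketch of an action operator and a sensor operator, assuming Airflow 2.x (the dag_id, file path, and callable are made up). Transfer operators generally ship in provider packages (e.g. S3ToRedshiftOperator) and are omitted here to keep the example self-contained:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="operator_categories",    # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag:
    # Sensor operator: waits for a file to appear before downstream tasks run.
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/tmp/incoming/data.csv",  # hypothetical path
        poke_interval=30,
    )

    # Action operator: executes something (here, arbitrary Python code).
    process = PythonOperator(
        task_id="process",
        python_callable=lambda: print("processing"),
    )

    wait_for_file >> process
```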

Core Components

  1. API Server - FastAPI server serving the UI and handling task execution requests.
  2. Scheduler - Schedules tasks when dependencies are fulfilled.
  3. DAG File Processor - Dedicated process for parsing DAGs.
  4. Metadata Database - A database where all Airflow metadata is stored.
  5. Executor - Defines how and where tasks are executed (e.g., locally or on a Celery or Kubernetes cluster). It does not execute the tasks itself.
  6. Queue - Holds tasks waiting to run, defining the order in which they are executed.
  7. Worker - Process that actually executes the tasks, as defined by the executor.
  8. Triggerer - Process running an asyncio event loop to support deferrable operators (see the sketch after this list).
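
To make the Triggerer concrete, here is a minimal sketch of a deferrable sensor, assuming Airflow 2.x (the dag_id and delay are made up). Instead of holding a worker slot while it waits, the task defers itself to the Triggerer's asyncio event loop:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.time_delta import TimeDeltaSensorAsync

with DAG(
    dag_id="deferrable_example",     # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag:
    # While waiting, this task is suspended and handed to the Triggerer,
    # freeing its worker slot; a regular TimeDeltaSensor would occupy a
    # worker for the full wait.
    wait_ten_minutes = TimeDeltaSensorAsync(
        task_id="wait_ten_minutes",
        delta=timedelta(minutes=10),
    )
```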

Lifecycle of a DAG

  1. You create the DAG and add it to the DAG folder
  2. Every 5 minutes by default, the DAG processor scans the DAG folder, then parses and serializes new DAGs into the metadata database (a parsing sketch follows this list)
  3. The Scheduler reads from the metadata database to check for new workflows to run
  4. The Scheduler creates and schedules new Task Instances and passes them to the Executor
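
As a rough way to see step 2 for yourself, Airflow's DagBag parses the DAG folder much as the DAG processor does; a minimal sketch:

```python
from airflow.models import DagBag

# Parse the DAG folder the way the DAG processor would.
dagbag = DagBag(include_examples=False)

print(dagbag.dag_ids)        # DAGs that parsed successfully
print(dagbag.import_errors)  # files that failed to parse, with tracebacks
```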

#data-engineering #study-plan #career-development #zoomcamp #astro #etl