Apache Airflow Production Installation Guide

This guide provides comprehensive information for deploying Apache Airflow in a production environment, covering architecture, prerequisites, installation steps, high availability (HA) configurations, and best practices.

1. Apache Airflow Architecture Overview

Airflow’s architecture consists of multiple components, some required for a bare-minimum installation and others optional for extensibility, performance, and scalability.

Required Components:

  • Scheduler: Handles triggering scheduled workflows and submitting tasks to the executor. The executor is a configuration property of the scheduler and runs within its process.
  • DAG Processor: Parses DAG files and serializes them into the metadata database. Parsing runs inside the scheduler by default, but the DAG processor can also run as a standalone process to isolate DAG code from the scheduler.
  • Webserver: Provides a user interface for inspecting, triggering, and debugging DAGs and tasks.
  • DAG Files Folder: Contains the DAG files read by the scheduler to determine tasks and their schedules.
  • Metadata Database: Stores the state of workflows and tasks. This is a critical component for production deployments.

Optional Components:

  • Worker: Executes tasks given by the scheduler. In basic setups, the worker might be part of the scheduler. For distributed environments, workers can run as separate processes (e.g., with CeleryExecutor or KubernetesExecutor).
  • Triggerer: Executes deferred tasks in an asyncio event loop. Necessary only when using deferred tasks.
  • Plugins Folder: Extends Airflow’s functionality with custom operators, sensors, or other features.

Deployment Considerations:

Airflow components are Python applications that can be deployed using a variety of mechanisms. While a single-machine setup is sufficient for simple cases, Airflow is designed for distributed environments: components can run on different machines, within separate security perimeters, and can be scaled independently.

Separating components enhances security by isolating them and limiting what each one is allowed to do. For example, running a standalone DAG processor means the scheduler never parses DAG files directly, which prevents DAG authors from executing arbitrary code inside the scheduler.
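
As a minimal sketch, assuming an Airflow 2.x release with the standalone DAG processor feature available (check your version's documentation for the exact configuration key and CLI behavior), this separation might be enabled as follows:

    # Enable standalone DAG parsing (Airflow 2.x; verify the key for your version)
    export AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR=True

    # Run the DAG processor as its own process, separately from the scheduler
    airflow dag-processor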

Production deployments often involve different user roles:

  • Deployment Manager: Installs and configures Airflow.
  • DAG Author: Writes and submits DAGs.
  • Operations User: Triggers and monitors DAG execution.

2. Prerequisites and System Requirements

Before installing Apache Airflow, ensure your environment meets the following requirements:

  • Python: Versions 3.9, 3.10, 3.11, 3.12 are tested and supported.
  • Databases: For production, an external database is mandatory. Supported options include:
    • PostgreSQL: Versions 12, 13, 14, 15, 16
    • MySQL: Version 8.0
    • Warning: SQLite is for testing/development ONLY and should NOT be used in production. MariaDB is NOT supported.
  • Kubernetes: Versions 1.26, 1.27, 1.28, 1.29, 1.30 (if deploying on Kubernetes).
  • Memory: A minimum of 4GB of memory is recommended, but actual requirements depend heavily on your chosen deployment size and workload.
  • Operating System: Airflow runs on POSIX-compliant operating systems. For production, only Linux-based distributions are supported.

3. High Availability (HA) Configurations and Deployment Patterns

Achieving high availability in Apache Airflow is crucial for ensuring continuous operation and resilience. Here are key considerations and deployment patterns:

Database Backend:

  • Mandatory External Database: For production, always use a robust external database such as PostgreSQL or MySQL. The default SQLite backend is not suitable for production: it supports only the SequentialExecutor, handles concurrent access poorly, and does not scale.
  • Database Setup: Create an empty database and grant Airflow’s database user permission to create and alter objects in it (CREATE/ALTER). After changing the backend configuration, run airflow db migrate to initialize the database schema (see the sketch below).
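
As an illustration, a minimal PostgreSQL setup might look like the following; the database name, user, password, and host are placeholders, and your organization’s provisioning process may differ:

    # Create a dedicated database and user for Airflow (names and password are placeholders)
    psql -c "CREATE DATABASE airflow_db;"
    psql -c "CREATE USER airflow_user WITH PASSWORD 'change-me';"
    psql -c "GRANT ALL PRIVILEGES ON DATABASE airflow_db TO airflow_user;"

    # Point Airflow at the new database (the [database] section applies to Airflow 2.3+),
    # then initialize the schema
    export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow_user:change-me@db-host:5432/airflow_db"
    airflow db migrate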

Multi-Node Cluster:

  • Executors for Distributed Environments: For multi-node setups, replace the single-machine executors (SequentialExecutor or LocalExecutor) with the CeleryExecutor or the KubernetesExecutor.
  • DAG and Configuration Synchronization: In a distributed environment, DAG files and configuration must be synchronized across all nodes; Airflow only sends execution instructions, not the files themselves. Common synchronization mechanisms include a Git repository pulled on a schedule or a shared/distributed file system (see the sketch below).
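
As one possible approach, a cron-driven Git pull on every node keeps the DAGs folder in sync; the repository URL, paths, and user below are placeholders:

    # One-time setup on each node: clone the DAG repository into the DAGs folder (placeholder URL/path)
    git clone https://git.example.com/acme/airflow-dags.git /opt/airflow/dags

    # Cron entry for /etc/cron.d/airflow-dag-sync (runs as the airflow user, every minute):
    #   * * * * * airflow cd /opt/airflow/dags && git pull --ff-only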

Logging:

  • Distributed Log Storage: Configure log storage to a distributed file system (e.g., S3, GCS) or an external service (e.g., Stackdriver Logging, Elasticsearch, Amazon CloudWatch). This ensures logs persist even if nodes are ephemeral or fail (see the configuration sketch below).
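
For example, remote logging to S3 can be enabled with settings along the following lines, shown here as environment variables; the bucket path and connection ID are placeholders, and the Amazon provider package must be installed:

    # Ship task logs to S3 so they survive worker loss (bucket path and connection ID are placeholders)
    export AIRFLOW__LOGGING__REMOTE_LOGGING=True
    export AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER="s3://my-airflow-logs/prod"
    export AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID=aws_default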

Configuration Management:

  • Environment Variables: Use environment variables for configuration that changes between deployments (e.g., metadata DB connection strings, passwords). Airflow maps variables of the form AIRFLOW__{SECTION}__{KEY} onto configuration options for this purpose (examples below).
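
For instance, the following overrides illustrate the pattern; the connection string and broker URL are placeholders:

    # Override [database] sql_alchemy_conn, [core] executor, and [celery] broker_url without editing airflow.cfg
    export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow_user:change-me@db-host:5432/airflow_db"
    export AIRFLOW__CORE__EXECUTOR=CeleryExecutor
    export AIRFLOW__CELERY__BROKER_URL="redis://redis-host:6379/0"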

Scheduler Uptime and Resilience:

  • Health Checks: Implement health checks for the Airflow scheduler so that hangs or unresponsiveness are detected and acted on quickly. Monitor the scheduler heartbeat to confirm it is running continuously (see the sketch below).
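
Two common checks, assuming the webserver listens on its default port 8080, are the webserver’s /health endpoint (which reports the scheduler heartbeat status) and the airflow jobs check CLI command:

    # Query the health endpoint; the JSON response includes the scheduler's latest heartbeat
    curl -s http://localhost:8080/health

    # Exit non-zero if no scheduler job has heartbeated recently (usable as a liveness probe)
    airflow jobs check --job-type SchedulerJob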

Containerization and Orchestration:

  • Production Container Images: Utilize the official Apache Airflow Docker Image (OCI) for containerized deployments. This ensures consistent environments across different deployment stages.
  • Helm Chart for Kubernetes: For Kubernetes deployments, leverage the official Airflow Helm chart. It simplifies defining, installing, and upgrading Airflow deployments on Kubernetes clusters (see the sketch below).
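
A minimal installation with the official chart might look like the following; the release name, namespace, and values file are placeholders:

    # Add the official chart repository and install or upgrade a release into its own namespace
    helm repo add apache-airflow https://airflow.apache.org
    helm repo update
    helm upgrade --install airflow apache-airflow/airflow \
      --namespace airflow --create-namespace \
      -f values-prod.yaml   # site-specific overrides (placeholder file name)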

Live Upgrades:

  • Distributed Deployment for Zero Downtime: Live upgrades (without downtime) are possible in distributed Airflow deployments, especially for patch-level or minor version upgrades without significant metadata database schema changes. Thorough testing in a staging environment is crucial.
  • Rolling Restarts: For Webserver and Triggerer components, rolling restarts can be performed without downtime if multiple instances are running (see the sketch after this list).
  • Scheduler and Worker Upgrades: The upgrade process for schedulers and workers depends on the executor:
    • Local Executor: Requires pausing DAGs or accepting task interruption.
    • Celery Executor: Workers can be put into offline mode, allowed to finish tasks, then upgraded and restarted.
    • Kubernetes Executor: Schedulers, triggerers, and webservers can be upgraded with rolling restarts. Workers are managed by Kubernetes.
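
On Kubernetes, a rolling restart of the long-running components can be triggered as sketched below; the namespace and Deployment names assume a Helm release named airflow and may differ in your cluster:

    # Roll each long-running component in turn; Kubernetes keeps replicas available during the restart
    # (Deployment names are assumptions based on a Helm release named "airflow")
    kubectl -n airflow rollout restart deployment/airflow-webserver
    kubectl -n airflow rollout restart deployment/airflow-triggerer
    kubectl -n airflow rollout restart deployment/airflow-scheduler

    # Wait for a rollout to finish before moving on (repeat per component)
    kubectl -n airflow rollout status deployment/airflow-webserver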

4. Installation Steps (General Guidelines)

Detailed installation steps will vary based on your chosen deployment environment (e.g., bare metal, Docker, Kubernetes). However, the general process involves:

  1. Prepare Environment: Install Python, pip, and any necessary system dependencies.
  2. Set up Database: Install and configure your chosen external database (PostgreSQL or MySQL).
  3. Install Airflow: Install Apache Airflow via pip, using the constraint files published for your Airflow and Python versions (see the example after this list).
  4. Initialize Database: Configure airflow.cfg to point to your external database and run airflow db migrate.
  5. Start Components: Start the scheduler, webserver, and any workers/triggerers.
  6. DAG Synchronization: Set up a mechanism to synchronize DAG files across all Airflow components.
  7. Monitoring and Logging: Configure monitoring tools and distributed logging.
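
Steps 3-5 might look like the following on a single node; the Airflow version is only an example, and the extras you install depend on your database and executor:

    # Install Airflow with the matching constraint file (version numbers are examples)
    AIRFLOW_VERSION=2.10.5
    PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
    CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
    pip install "apache-airflow[postgres,celery]==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

    # Initialize the metadata database, then start each core component as its own
    # long-running process (in production, under systemd or a container orchestrator)
    airflow db migrate
    airflow scheduler &
    airflow webserver &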

5. Best Practices for Production Airflow

  • Version Control DAGs: Store DAGs in a version control system (e.g., Git) and automate their deployment.
  • Separate Environments: Maintain separate environments for development, staging, and production.
  • Resource Allocation: Allocate sufficient CPU, memory, and disk space based on your workload and expected growth.
  • Security: Implement strong authentication, authorization, and network security measures. Restrict access to the Airflow UI and metadata database.
  • Monitoring and Alerting: Set up comprehensive monitoring for Airflow components, tasks, and system resources. Configure alerts for failures or performance issues.
  • Regular Backups: Regularly back up your metadata database (see the backup sketch after this list).
  • Idempotent DAGs: Design DAGs to be idempotent, meaning they can be run multiple times without causing unintended side effects.
  • Small, Modular DAGs: Break down complex workflows into smaller, modular DAGs for easier management and debugging.
  • Error Handling and Retries: Implement robust error handling and retry mechanisms within your DAGs.
  • Clear Logging: Ensure your DAGs and tasks produce clear and informative logs.
  • Upgrade Strategy: Plan and test your Airflow upgrade strategy in a staging environment before applying it to production.
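
For a PostgreSQL backend, a scheduled logical dump is one minimal approach; the host, user, database, and output path are placeholders:

    # Nightly logical backup of the Airflow metadata database (all names are placeholders)
    pg_dump --host db-host --username airflow_user --format=custom \
      --file "/backups/airflow_db_$(date +%F).dump" airflow_db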

This guide provides a high-level overview. For detailed, environment-specific installation instructions, refer to the official Apache Airflow documentation and community resources.