Apache Airflow
What it is: Open-source workflow orchestration platform. Schedule and monitor data pipelines defined as Python DAGs (directed acyclic graphs).
What It Does Best
DAGs in Python. Define workflows as code, which enables version control, testing, and reuse (see the sketch after this list).
Rich UI. See pipeline status, logs, task dependencies. Debug failures visually.
Retry and alerting. Automatic retries, custom failure alerts, SLA monitoring.
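The sketch below shows these pieces together, assuming Airflow 2.x: a small DAG defined in Python, with retries and email alerting configured once in default_args. The DAG id, task commands, and alert address are placeholders, not part of any real pipeline.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Retry and alerting behaviour set once in default_args and inherited by every task
default_args = {
    "owner": "data-team",                    # placeholder owner
    "retries": 3,                            # rerun a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),     # wait between attempts
    "email": ["alerts@example.com"],         # placeholder alert address
    "email_on_failure": True,                # notify when retries are exhausted
}

with DAG(
    dag_id="example_etl",                    # placeholder name
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",              # newer releases spell this `schedule`
    catchup=False,                           # skip runs missed before first deploy
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # Upstream/downstream dependencies declared with the >> operator
    extract >> transform >> load
```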
Key Features
DAG scheduling: Cron-based scheduling with backfilling and catchup
Task dependencies: Define complex task graphs with upstream/downstream relationships
Operators: Pre-built integrations for common tasks (SQL, cloud services, APIs)
XCom: Pass small pieces of data between tasks in a workflow (stored in the metadata database, so not intended for large payloads)
UI and monitoring: Web interface for tracking runs, viewing logs, and debugging
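As a sketch of task dependencies and XCom together (assuming Airflow 2.x with the TaskFlow API; the DAG name and task bodies are made up), returning a value from one task and passing it into the next both wires the dependency and moves the data through XCom:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="xcom_example",            # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",               # `schedule_interval` on older 2.x releases
    catchup=False,
)
def xcom_example():
    @task
    def extract():
        # The return value is pushed to XCom automatically
        return [1, 2, 3]

    @task
    def transform(values):
        # Accepting the upstream output pulls it from XCom behind the scenes
        return sum(values)

    @task
    def load(total):
        print(f"loading total={total}")

    # Chaining the calls declares extract -> transform -> load
    load(transform(extract()))


xcom_example()
```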
Pricing
Open Source: Free, Apache 2.0 license (self-hosted)
AWS MWAA: $400+/month (managed service on AWS)
Google Cloud Composer: $300+/month (managed on GCP)
Astronomer: From $250/month (managed platform with extras)
When to Use It
✅ Complex data pipelines with dependencies
✅ Scheduled ETL jobs
✅ Need monitoring and alerting
✅ Team knows Python
✅ Batch processing workflows with multiple steps
When NOT to Use It
❌ Real-time streaming (use Kafka, Spark Streaming)
❌ Simple cron jobs (cron is simpler)
❌ Event-driven workflows (use event processors)
❌ Low-latency requirements (Airflow is batch-oriented)
❌ Non-technical team (steep learning curve)
Common Use Cases
ETL pipelines: Extract data from sources, transform, load to warehouse
Data warehouse refreshes: Scheduled updates of reporting tables and aggregations
ML model training: Orchestrate data prep, training, evaluation, deployment
Report generation: Daily/weekly reports with multiple data sources
Data quality checks: Automated testing and validation of data pipelines
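As one concrete example of the last item, a scheduled check can simply raise an exception when the data looks wrong; Airflow marks the task failed and the usual retry and alerting machinery takes over. This is only a sketch: the DAG name, schedule, and hard-coded row count are placeholders, and a real check would query the warehouse (for example through a provider hook) instead.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="daily_quality_checks",    # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
)
def daily_quality_checks():
    @task
    def row_count_check():
        # Placeholder value; a real check would query the warehouse here
        row_count = 42
        if row_count == 0:
            # Any uncaught exception fails the task and triggers retries/alerts
            raise ValueError("reporting table is empty after load")

    row_count_check()


daily_quality_checks()
```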
Airflow vs Alternatives
vs Prefect: Prefect has a more modern API; Airflow is more mature and more widely adopted
vs Luigi: Airflow has a richer UI and larger community; Luigi is simpler
vs Dagster: Dagster is better for data-asset-oriented pipelines; Airflow is better for task-centric workflows
Unique Strengths
Python-native: Full Python for task logic, not just configuration
Extensive operators: Hundreds of pre-built integrations
Strong community: Large ecosystem, lots of resources and support
Backfilling: Easily rerun pipelines for historical date ranges (e.g. with the airflow dags backfill CLI command or from the UI)
Bottom line: Industry standard for batch data pipelines. Setup complexity pays off once you have multiple dependent jobs. Essential for data engineering teams.