Apache Kafka
What it is: Distributed event streaming platform. Publish-subscribe messaging at massive scale. Backbone of modern data architectures.
What It Does Best
High throughput. Millions of events per second per cluster. Linear scaling with partitions.
Durability and replay. Persistent log storage. Reprocess historical events. Time travel through data.
Ecosystem. Kafka Connect (integrate everything), Kafka Streams (stream processing), Schema Registry. Complete platform.
Key Features
Partitioning: Scale horizontally by distributing across partitions
Replication: Fault tolerance with configurable replication factor
Kafka Connect: 200+ connectors for databases, cloud services, apps
Kafka Streams: Lightweight stream processing library
Exactly-once semantics: Idempotent producers and transactions eliminate duplicates within Kafka-to-Kafka pipelines
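The partitioning model above can be sketched in a few lines of Python. This is an illustration only: Kafka's default partitioner hashes the record key with murmur2, while the sketch below substitutes CRC32. The point it shows is real, though: records with the same key always land on the same partition, preserving per-key ordering while load spreads across partitions.

```python
# Toy sketch of Kafka-style key-based partitioning.
# NOTE: real Kafka uses murmur2; zlib.crc32 stands in here for illustration.
import zlib

NUM_PARTITIONS = 6

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition; same key -> same partition."""
    return zlib.crc32(key) % num_partitions

# Same key always routes to the same partition (per-key ordering)...
assert partition_for(b"user-42") == partition_for(b"user-42")
# ...while many keys spread across the available partitions.
spread = {partition_for(f"user-{i}".encode()) for i in range(1000)}
```

Because a key is pinned to one partition, adding partitions increases parallelism for new keys but reshuffles the key-to-partition mapping, which is why partition counts are usually chosen with headroom up front.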
Pricing
Open Source: Free, Apache 2.0 license (self-hosted)
Confluent Cloud: $0.11/GB ingress, $0.05/GB egress
AWS MSK: ~$0.21/hour per broker + storage costs
Azure Event Hubs: Kafka protocol support, tiered pricing
When to Use It
✅ Real-time data pipelines
✅ Event-driven microservices
✅ Activity tracking and monitoring
✅ Log aggregation and stream processing
✅ Building event sourcing architectures
When NOT to Use It
❌ Request-response patterns (use REST/gRPC)
❌ Small-scale messaging (RabbitMQ/Redis simpler)
❌ Need complex routing (use RabbitMQ)
❌ Simple pub-sub (cloud-native options simpler)
❌ Small team without ops expertise (complex to operate)
Common Use Cases
Data pipelines: Real-time ETL between systems
Event sourcing: Store all state changes as immutable events
Activity tracking: User actions, clickstreams, page views
Log aggregation: Centralize logs from distributed services
Microservices messaging: Asynchronous communication between services
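The event sourcing pattern above is essentially a fold over an immutable log. A minimal sketch, with made-up event names and a toy account domain for illustration: current state is never stored directly; it is rebuilt by replaying every state change in order.

```python
# Toy event sourcing sketch: state = fold(apply, events), never mutated in place.
from dataclasses import dataclass

@dataclass
class Account:
    balance: int = 0

# Immutable event log: state changes are appended, never updated.
events = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 5},
]

def apply(state: Account, event: dict) -> Account:
    """Pure transition function: one event advances the state."""
    if event["type"] == "deposited":
        return Account(state.balance + event["amount"])
    if event["type"] == "withdrawn":
        return Account(state.balance - event["amount"])
    return state

def replay(events) -> Account:
    """Rebuild current state from full history, Kafka-consumer style."""
    state = Account()
    for e in events:
        state = apply(state, e)
    return state

print(replay(events).balance)  # 75
```

With Kafka as the event store, a new consumer group can run this same replay from offset zero to bootstrap a fresh read model without touching the services that produced the events.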
Kafka vs Alternatives
vs RabbitMQ: Kafka better for high throughput, RabbitMQ better for routing complexity
vs AWS Kinesis: Kafka more features and control, Kinesis fully managed
vs Pulsar: Pulsar multi-tenancy and geo-replication, Kafka more mature
Unique Strengths
Replay capability: Reprocess historical data at any time
High throughput: Millions of messages per second per cluster
Massive ecosystem: Largest community and connector library
Log-based storage: Append-only log enables unique use cases
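The append-only log behind these strengths can be modeled in plain Python (a toy model, not the broker's actual API): every record gets a monotonically increasing offset, and a consumer is just a cursor that can be reset to any past offset, which is what makes replay cheap.

```python
# Toy model of Kafka's log-based storage and offset-based replay.
class Log:
    """Append-only log: records are immutable and addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int) -> list:
        """Return all records from `offset` onward (replay)."""
        return self._records[offset:]

log = Log()
for evt in ["created", "updated", "deleted"]:
    log.append(evt)

# A new consumer can start at offset 0 and reprocess full history...
assert log.read(0) == ["created", "updated", "deleted"]
# ...or seek past events it has already handled.
assert log.read(1) == ["updated", "deleted"]
```

Traditional message queues delete a message once it is acknowledged; because Kafka only advances consumer offsets and keeps the records, many independent consumers can read the same data at their own pace.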
Bottom line: The event streaming platform. Ubiquitous in modern data infrastructure. Not the easiest to operate, but nothing else matches the combination of throughput, durability, and ecosystem. Essential knowledge for data engineers.