Useful Data Tips

Apache Kafka

⏱️ 8 sec read 🗄️ Data Management

What it is: Distributed event streaming platform. Publish-subscribe messaging at massive scale. Backbone of modern data architectures.

What It Does Best

High throughput. Large clusters handle millions of events per second. Throughput scales near-linearly as partitions and brokers are added.

Durability and replay. Events persist in a replicated on-disk log for a configurable retention period. Consumers can rewind to an earlier offset and reprocess historical events.

Ecosystem. Kafka Connect (integration with external systems), Kafka Streams (stream processing), and Schema Registry (a Confluent component for managing message schemas). Together they form a complete platform.
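The replay idea above can be illustrated with a toy append-only log. This is a minimal in-memory sketch, not Kafka's client API: the `AppendOnlyLog` class and its `append` and `read_from` methods are hypothetical names chosen for illustration, standing in for a single partition.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Tuple


@dataclass
class AppendOnlyLog:
    """Toy stand-in for one partition (hypothetical, not a Kafka API)."""

    _records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Append a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int = 0) -> Iterator[Tuple[int, Any]]:
        """Yield (offset, record) pairs from `offset` on; reading deletes nothing."""
        for i in range(offset, len(self._records)):
            yield i, self._records[i]


log = AppendOnlyLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

first_pass = list(log.read_from(0))  # a consumer reads the whole log once
replayed = list(log.read_from(1))    # a later consumer rewinds to offset 1
```

Because reads do not remove records, any number of consumers can start from any retained offset, which is what makes reprocessing possible.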

Key Features

Partitioning: Scale horizontally by splitting each topic into partitions spread across brokers

Replication: Fault tolerance with configurable replication factor

Kafka Connect: 200+ connectors for databases, cloud services, apps

Kafka Streams: Lightweight stream processing library

Exactly-once semantics: Idempotent producers and transactions prevent duplicate processing within Kafka; writes to external systems still need to be idempotent
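As a rough illustration of the partitioning feature listed above, the sketch below routes records to partitions by hashing their key. The helper `choose_partition` is hypothetical, and CRC32 is used only as a simple deterministic stand-in: Kafka's default partitioner uses a different hash, so real partition numbers will not match these.

```python
import zlib
from collections import defaultdict
from typing import DefaultDict, List, Tuple


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition; the same key always lands on the same partition."""
    # Assumption: CRC32 stands in for the partitioner's real hash function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 3
partitions: DefaultDict[int, List[Tuple[str, str]]] = defaultdict(list)

events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]

for key, value in events:
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, value))

# All events for one key share a partition, so their relative order is kept.
user1_partition = choose_partition("user-1", NUM_PARTITIONS)
user1_events = [v for k, v in partitions[user1_partition] if k == "user-1"]
```

Ordering is only preserved within a partition, which is why related events are usually given the same key.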

Pricing

Open Source: Free, Apache 2.0 license (self-hosted)

Confluent Cloud: Usage-based, about $0.11/GB ingress and $0.05/GB egress (varies by cluster type and region)

AWS MSK: ~$0.21/hour per broker + storage costs

Azure Event Hubs: Kafka protocol support, tiered pricing
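For a back-of-the-envelope comparison, the rates quoted above can be plugged into a small calculator. This is only a sketch: the rates are copied from this page and may be out of date, the 730 hours per month figure and the example volumes are assumptions, and storage and other charges are left out.

```python
# Rates as quoted on this page (may be outdated; check the vendor's pricing).
INGRESS_USD_PER_GB = 0.11      # Confluent Cloud ingress
EGRESS_USD_PER_GB = 0.05       # Confluent Cloud egress
MSK_USD_PER_BROKER_HOUR = 0.21  # AWS MSK, approximate, excludes storage
HOURS_PER_MONTH = 730          # assumption: average hours in a month


def confluent_transfer_cost(ingress_gb: float, egress_gb: float) -> float:
    """Monthly data-transfer cost in USD, rounded to cents."""
    return round(ingress_gb * INGRESS_USD_PER_GB + egress_gb * EGRESS_USD_PER_GB, 2)


def msk_broker_cost(brokers: int) -> float:
    """Monthly broker cost in USD, rounded to cents (storage not included)."""
    return round(brokers * MSK_USD_PER_BROKER_HOUR * HOURS_PER_MONTH, 2)


# Hypothetical workload: 1,000 GB in and 3,000 GB out per month, 3 brokers.
confluent_monthly = confluent_transfer_cost(1000, 3000)
msk_monthly = msk_broker_cost(3)
```

The two figures are not directly comparable, since each leaves out different cost components, but the arithmetic shows how egress-heavy workloads shift the estimate.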

When to Use It

✅ Real-time data pipelines

✅ Event-driven microservices

✅ Activity tracking and monitoring

✅ Log aggregation and stream processing

✅ Building event sourcing architectures

When NOT to Use It

❌ Request-response patterns (use REST/gRPC)

❌ Small-scale messaging (RabbitMQ/Redis simpler)

❌ Complex routing requirements (use RabbitMQ)

❌ Simple pub-sub (cloud-native options simpler)

❌ Small team without ops expertise (complex to operate)

Common Use Cases

Data pipelines: Real-time ETL between systems

Event sourcing: Store all state changes as immutable events

Activity tracking: User actions, clickstreams, page views

Log aggregation: Centralize logs from distributed services

Microservices messaging: Asynchronous communication between services
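The event sourcing use case above can be sketched without any Kafka dependency: current state is never stored directly, it is rebuilt by folding over the immutable event history. The event shape `(account, kind, amount)` and the helpers `apply_event` and `rebuild` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: each event is an immutable (account, kind, amount) tuple.
Event = Tuple[str, str, int]


def apply_event(balances: Dict[str, int], event: Event) -> Dict[str, int]:
    """Return a new balances dict with one event applied; inputs are not mutated."""
    account, kind, amount = event
    delta = amount if kind == "deposit" else -amount
    updated = dict(balances)
    updated[account] = updated.get(account, 0) + delta
    return updated


def rebuild(events: List[Event]) -> Dict[str, int]:
    """Rebuild state from scratch by replaying every event in order."""
    state: Dict[str, int] = {}
    for event in events:
        state = apply_event(state, event)
    return state


history: List[Event] = [
    ("acct-1", "deposit", 100),
    ("acct-1", "withdraw", 30),
    ("acct-2", "deposit", 50),
]

current_state = rebuild(history)         # state after all events
earlier_state = rebuild(history[:2])     # state as of the second event
```

Replaying a prefix of the history gives the state at an earlier point, which is the property that a durable, ordered log makes practical.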

Kafka vs Alternatives

vs RabbitMQ: Kafka better for high throughput, RabbitMQ better for routing complexity

vs AWS Kinesis: Kafka more features and control, Kinesis fully managed

vs Pulsar: Pulsar has built-in multi-tenancy and geo-replication, Kafka is more mature

Unique Strengths

Replay capability: Reprocess historical data at any time within the retention period

High throughput: Millions of messages per second per cluster

Massive ecosystem: Largest community and connector library

Log-based storage: Append-only log lets many consumers read the same data independently and supports replay and event sourcing

Bottom line: The event streaming platform. Ubiquitous in modern data infrastructure. Not the easiest to operate, but nothing else matches the combination of throughput, durability, and ecosystem. Essential knowledge for data engineers.
